Author Name:  Pasquale De Luca
Poster Title:  First experiences on parallelizing Dynamics of Cellular Potts Model via GPU
Poster Abstract: 

In recent years, there has been increasing interest in developing in vitro models that predict the behavior of cells in living organisms. Mathematical models based on differential equations, and related numerical algorithms, have been developed to this end. In this work, we present first experiences in designing parallel strategies for accelerating an algorithm for behavior prediction based on the Cellular Potts Model (CPM). In particular, we exploit the computational power of Graphics Processing Units (GPUs) in the CUDA environment to accelerate the main low-level kernels involved. Tests and experiments complete the poster.
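
As a rough illustration of the kind of low-level kernel involved, the following is a minimal serial sketch of a single CPM Metropolis update on a 2D lattice. The Hamiltonian terms (adhesion plus a quadratic volume constraint) and all parameter values are generic textbook choices, not the authors' implementation:

```python
import numpy as np

L = 32
lattice = np.zeros((L, L), dtype=int)   # 0 = medium (no cell)
lattice[8:16, 8:16] = 1                 # a single square "cell" with ID 1
TARGET_VOLUME, LAMBDA_V, J, TEMPERATURE = 64, 1.0, 2.0, 10.0

def volume_energy(vol):
    # quadratic volume-constraint term of the CPM Hamiltonian
    return LAMBDA_V * (vol - TARGET_VOLUME) ** 2

def metropolis_step(lat, rng):
    """Attempt to copy a random site's ID onto one of its neighbors."""
    x, y = rng.integers(0, L, size=2)
    dx, dy = rng.choice([(-1, 0), (1, 0), (0, -1), (0, 1)])
    nx, ny = (x + dx) % L, (y + dy) % L
    src, dst = lat[x, y], lat[nx, ny]
    if src == dst:
        return
    def adhesion(site_id):
        # adhesion energy of the target site if it held site_id
        return sum(J for ddx, ddy in [(-1, 0), (1, 0), (0, -1), (0, 1)]
                   if lat[(nx + ddx) % L, (ny + ddy) % L] != site_id)
    dE = adhesion(src) - adhesion(dst)
    for cell_id, delta in ((src, +1), (dst, -1)):
        if cell_id != 0:
            vol = np.count_nonzero(lat == cell_id)
            dE += volume_energy(vol + delta) - volume_energy(vol)
    # Metropolis acceptance rule
    if dE <= 0 or rng.random() < np.exp(-dE / TEMPERATURE):
        lat[nx, ny] = src

rng = np.random.default_rng(0)
for _ in range(1000):
    metropolis_step(lattice, rng)
```

The interest for GPUs is that many such updates on well-separated sites can, with care, be attempted concurrently.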



Author Name:  Thiago Assumpcao
Poster Title:  Initial data for extreme binary black holes with the NRPyElliptic solver
Poster Abstract: 
The two-body problem in General Relativity (GR) involves studying the dynamics of two massive objects under the influence of gravity. To accurately model collisions of black holes or neutron stars, numerical relativity solves the full system of Einstein’s equations of GR. Simulating these systems is crucial for generating predictions of gravitational-wave signals, which can be compared against experimental data from laser interferometers.

To enable long-term dynamical evolutions, Einstein's equations are reformulated as an initial value problem. The state of the system at time t=0, the initial data, must satisfy a set of constraints, typically comprising four coupled, elliptic partial differential equations (PDEs). Once the initial data are found, the dynamical sector of Einstein's equations is used to march the system forward in time on a discretized numerical grid. During time evolution, the numerical domain is divided into smaller regions that are handled by different MPI processes. Each region communicates with its neighbors so that physical quantities are correctly set on their boundaries.

NRPyElliptic is an open-source, elliptic PDE solver for initial data in numerical relativity. Employing an optimized implementation of the hyperbolic relaxation method, NRPyElliptic transforms the elliptic system of equations into a hyperbolic one. In this work, a more sophisticated mathematical framework is being implemented to enable the setup of realistic initial configurations for black hole binaries. Notably, this approach addresses the crucial challenge of handling highly spinning black holes, which are potential sources of gravitational waves.
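
To illustrate the hyperbolic relaxation idea in the simplest setting (a 1D model problem, not NRPyElliptic's actual equations or code), the elliptic equation u'' = rho can be recast as a damped wave equation and evolved until it settles to the elliptic solution; all parameter values below are illustrative:

```python
import numpy as np

N = 101
x = np.linspace(0.0, 1.0, N)
dx = x[1] - x[0]
rho = -np.pi**2 * np.sin(np.pi * x)   # exact solution: u = sin(pi x)

u = np.zeros(N)            # field, with u(0) = u(1) = 0 enforced below
v = np.zeros(N)            # "velocity" du/dt of the relaxation
c, eta = 1.0, 2.0 * np.pi  # wave speed and damping coefficient
dt = 0.5 * dx / c          # CFL-limited time step

for _ in range(20000):
    lap = np.zeros(N)
    lap[1:-1] = (u[2:] - 2.0 * u[1:-1] + u[:-2]) / dx**2
    residual = lap - rho                      # -> 0 at the elliptic solution
    v[1:-1] += dt * (c**2 * residual[1:-1] - eta * v[1:-1])
    u += dt * v
    u[0] = u[-1] = 0.0                        # Dirichlet boundary conditions

error = np.max(np.abs(u - np.sin(np.pi * x)))  # small discretization error
```

Damping removes the wave-like transients, so the steady state of the hyperbolic system is the solution of the original elliptic problem.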

NRPyElliptic, similar to other initial data codes, is currently designed to run on single-node computers. To optimize computation, it utilizes OpenMP for parallelization, enabling efficient handling of lengthy expressions that require updates at each time step across the three-dimensional computational domain. However, exploring additional shared-memory parallelization techniques, such as GPU offloading, holds promise for further enhancing the code's performance.


Author Name:  Katie Worton
Poster Title:  Automation Development for Linux Kernel Functional Testing
Poster Abstract: 

In the Linux Kernel Functional Testing (LKFT) team at Linaro, we aim to improve Linux kernel quality on Arm-based devices by performing regression testing and reporting results on selected kernel branches [1].

This is a very large-scale task, requiring substantial computational resources and engineering time while simultaneously generating masses of data.

One challenging problem within this task is handling tests that cause catastrophic issues, in particular tests that hang or cause problems with the physical boards on which the testing runs. Running this kind of catastrophically failing test alongside the regular regression tests is undesirable, as it may cause infrastructure issues that negatively impact the rest of the testing (for example, taking a board offline). Consequently, tests that cause such problems are placed in a 'skiplist' that specifies combinations of boards, branches and tests known to cause these catastrophic issues. This list is then passed to the regression testing infrastructure, and tests on the list are skipped to avoid problems arising.

However, once tests have been placed in the skiplist, it becomes difficult to determine whether they still cause issues. As previously discussed, we don't want to run these tests in the regular regression testing, and it is also infeasible to rerun them manually, as there are too many test, board and branch combinations for this to be achievable. For this reason, I have been working on an automated solution for retesting skiplist entries and updating the skiplist based on the results.

To achieve this skiplist retesting, the following steps were automated:

  • Querying the data from previous kernel builds and runs to find appropriate test reproducer scripts for each of the board and branch combinations we wish to test.
  • Editing the fetched test reproducer scripts to run each of the skiplist tests.
  • Pushing the results from this testing back to the backend for storage.
  • Fetching the results from the backend to create a summary in a readable format.
  • Automatically creating git commits and pull requests when the retesting determines that a test no longer causes issues.
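
In outline, the retesting loop looks something like the following. The skiplist schema and the run_reproducer stub are invented here purely for illustration; LKFT's real tooling fetches reproducer scripts from previous builds, runs them on physical boards, and exchanges results with the backend:

```python
# Toy skiplist entries: each names a test/board/branch combination
# known (or once known) to cause catastrophic issues.
skiplist = [
    {"test": "example-hanging-test", "board": "example-board-a", "branch": "mainline"},
    {"test": "example-crashing-test", "board": "example-board-b", "branch": "stable"},
]

def run_reproducer(entry):
    """Stand-in for fetching and rerunning the test reproducer on hardware.

    Pretend the first entry no longer hangs the board, the second still does.
    """
    return entry["test"] == "example-hanging-test"

results = [(entry, run_reproducer(entry)) for entry in skiplist]

# Entries that now pass can be proposed for removal (via an automated
# git commit and pull request); the rest stay on the skiplist.
removable = [entry for entry, passed in results if passed]
updated_skiplist = [entry for entry, passed in results if not passed]
```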

Sources:

[1] https://lkft.linaro.org/



Author Name:  Marius Herget
Poster Title:  Design Space Exploration for Distributed Cyber-Physical Systems
Poster Abstract: 

Cyber-Physical Systems (CPS) comprise one of the largest information-technology sectors worldwide, driving innovation in other crucial industrial sectors such as health, industrial automation and robotics, avionics and space. Nowadays, the embedded compute infrastructure of complex CPS is based on heterogeneous multi-core or many-core systems, which are distributed and connected via complex networks. Our research project, "DSE2.0", addresses the question of how to perform efficient and effective Design Space Exploration (DSE) for such distributed CPS (dCPS).

Design space exploration (DSE) is a process for identifying and evaluating design alternatives for computer systems, often involving simulations to predict and compare their behaviour. However, scaling simulations to large, complex, distributed cyber-physical systems (dCPS), like ASML's TWINSCAN lithography machines, is incredibly challenging. The application workload (typically containing hundreds of software processes) and the various mappings of that workload onto these platforms already make the search space vast, and this is exacerbated by the fact that application workloads in dCPS are typically not static. For example, in the case of ASML lithography scanner machines, the application workload behaviour is highly dependent on factors such as the wafer size, recipe (mask) complexity, required accuracy, application configuration settings, external influences like customer or fab cronjobs, emergent dynamic behaviour of the system, etc. All of these factors complicate the modelling efforts and contribute to an ever-increasing number of design points.

From the perspective of High-Performance Computing (HPC), we are investigating various state-of-the-art scalability techniques for system-level simulation environments, including simulation campaigns, Parallel Discrete Event Simulation (PDES), and hardware accelerators. We aim to confront and overcome the scalability challenge in DSE for dCPS; hence, a suitable HPC architecture needs to be developed to provide an efficient and capable evaluation environment. Ultimately, our project is a foundational step towards exploring the largely uncharted territory of efficient DSE technology for dCPS.


Author Name:  Kalman Szenes
Poster Title:  Tensor Computations on GPUs: From Dense Contractions to Sparse Decompositions
Poster Abstract: 

Tensor algorithms are a rapidly growing field of research with applications in many scientific domains, ranging from machine learning to quantum physics. Their prevalence in modern scientific computing highlights the need for efficient implementations and motivates the use of Graphics Processing Units (GPUs), given their exceptional performance on linear algebra tasks.

This work evaluates the performance of current state-of-the-art GPU tensor algebra routines for dense and sparse data. We propose a set of guidelines for achieving optimal performance with cuTensor, NVIDIA's proprietary dense tensor library. Additionally, we demonstrate that our Tensor Times Matrix implementation matches, and in some cases exceeds, the performance of cuTensor while using 33% less memory. Furthermore, we show that our approach delivers up to a 2.5-fold increase in throughput for memory-bound tensor operations, such as the Khatri-Rao product. We also implement a GPU-accelerated sparse Tucker decomposition based on the Compressed Sparse Fiber tensor format. Compared to other GPU implementations, our approach achieves a 59-fold speedup in executing the Higher-Order Orthogonal Iteration algorithm used to compute the Tucker decomposition.
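
For reference, the Khatri-Rao product mentioned above is the column-wise Kronecker product of two matrices; a plain NumPy version (shapes and values illustrative, unrelated to the GPU implementation) can be written as:

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product of A (I x R) and B (J x R) -> (I*J x R)."""
    I, R = A.shape
    J, R2 = B.shape
    assert R == R2, "both factors must have the same number of columns"
    # einsum forms all pairwise products per column, then collapses the rows
    return np.einsum('ir,jr->ijr', A, B).reshape(I * J, R)

A = np.arange(6.0).reshape(3, 2)
B = np.arange(4.0).reshape(2, 2)
KR = khatri_rao(A, B)   # shape (6, 2); column r equals kron(A[:, r], B[:, r])
```

The operation reads two small matrices but writes a much larger one, which is why it is memory-bound on GPUs.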




Author Name:  Ho-Chun Lin
Poster Title:  Fast multi-source nanophotonic simulations using augmented partial factorization
Poster Abstract: 

Numerical solutions of Maxwell's equations play a crucial role in nanophotonics and electromagnetics. However, their applicability to large systems, particularly multi-channel configurations like disordered media, aperiodic metasurfaces, and densely packed photonic circuits, faces limitations due to the need for numerous large-scale simulations. Traditionally, these simulations involve solving Maxwell's equations for each element of a discretization basis set, producing an abundance of unnecessary information; moreover, performing simulations one input at a time is time-consuming and repetitive.

To address these challenges, we propose a novel approach that eliminates the need for full-basis solutions and enables direct computation of the quantities of interest. Our method involves augmenting the Maxwell operator with input source profiles and output projection profiles, followed by a single partial factorization that yields the entire generalized scattering matrix through the Schur complement, with no approximation beyond discretization. Our method applies to any linear partial differential equation. By harnessing the power of high-performance computing (HPC) systems and leveraging parallel programming techniques such as hybrid MPI/multithreading, we achieve orders-of-magnitude speed-ups compared to existing methods.
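
The Schur-complement identity at the heart of this approach can be shown with a dense toy example (random matrices stand in for the Maxwell operator, sources, and projections, and dense Gaussian elimination stands in for the sparse partial factorization used in practice): eliminating the A block of the augmented matrix K = [[A, B], [C, 0]] leaves -C A^{-1} B in the trailing block, i.e., minus the generalized scattering matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m_in, m_out = 50, 4, 3
A = rng.standard_normal((n, n)) + n * np.eye(n)  # stand-in "Maxwell operator"
B = rng.standard_normal((n, m_in))               # input source profiles
C = rng.standard_normal((m_out, n))              # output projection profiles

# Direct route: one linear solve per input column
S_direct = C @ np.linalg.solve(A, B)

# Augmented route: eliminate the A block of K once and read off
# the Schur complement from the trailing block.
K = np.block([[A, B], [C, np.zeros((m_out, m_in))]])
M = K.copy()
for k in range(n):  # Gaussian elimination (A is diagonally dominant, no pivoting)
    M[k + 1:, k:] -= np.outer(M[k + 1:, k] / M[k, k], M[k, k:])
S_schur = -M[n:, n:]  # trailing block is 0 - C A^{-1} B

assert np.allclose(S_schur, S_direct)  # both routes agree
```

The payoff in practice is that a sparse factorization of K touches the large operator only once, instead of once per input.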



Author Name:  Ghanendra Das
Poster Title:  Innovating Aerospace Structures with Digital Design using Topology Optimization
Poster Abstract: 

As we enter a new era characterized by urban air mobility, environmental sustainability, and electrification, the traditional approach to aircraft design, driven by engineering intuition and experience, must be expanded. The established design and manufacturing methods that have served us in the past need to be complemented with innovative and unconventional design approaches that go beyond learned biases and empirical assumptions. In this study, we aim to develop cutting-edge digital design tools and algorithms that are inspired by robust mathematical foundations and insights from nature.

To achieve this, we focus on employing topology optimization techniques, which allow us to explore novel aircraft designs that are not limited by traditional constraints. Our research involves the creation of a highly scalable aerostructural design optimization framework. This framework utilizes hybrid OpenMP-MPI parallelization to enable efficient analysis and optimization of a full-scale aircraft wing in transonic flight. Both high-fidelity computational fluid dynamics (CFD) and finite element analysis (FEA) are employed in this process.

Additionally, we explore the application of topology optimization in the design of an e/VTOL fuselage, specifically a unibody chassis capable of withstanding static rotor loads. Furthermore, we investigate the design of the fuselage subfloor structure to enhance its energy-absorbing capabilities in the event of an aircraft crash. These endeavors require the development of highly efficient and parallel transient crash-dynamics solvers.

By pushing the boundaries of traditional aircraft design methodologies and incorporating advanced digital tools, we aim to pave the way for the development of more efficient, sustainable, and safer aircraft structures.


Poster File URL:  View Poster File


Author Name:  Jianshu Zhao
Poster Title:  Ultra-Fast and Scalable Genome Search by Combining Probabilistic Data Structures and Graph-based Nearest Neighbor Search
Poster Abstract: 

Genome search and/or classification is a key step in microbiome studies, and it has become more challenging as the number of available (reference) genomes grows, because traditional methods do not scale well to large databases. We developed GSearch, a program that combines k-mer-based probabilistic data structures for estimating genomic distance (e.g., SuperMinHash and ProbMinHash, as well as set-cardinality sketches such as SetSketch, which fills the gap between HyperLogLog and MinHash) with a graph-based nearest-neighbor search algorithm, Hierarchical Navigable Small World graphs (HNSW). Thanks to its O(log(N)) time complexity, GSearch is orders of magnitude faster than alternative tools as database size continues to grow, and it is also memory efficient, while maintaining high accuracy in terms of ANI (Average Nucleotide Identity) and AAI (Average Amino Acid Identity). GSearch can identify/classify 8,000 query genomes against all available microbial and viral species with sequenced genome representatives (n=~318,000 and ~3,000,000, respectively) within several minutes on a personal laptop, using only ~6 GB of memory or less (e.g., 3.0 GB with SetSketch). Further, GSearch scales to billions of database genomes via a database-splitting strategy. We also developed a three-step classification pipeline that searches either genomes or their proteomes to maximize specificity and sensitivity for distantly related genomes. GSearch therefore removes a major bottleneck in microbiome studies that require genome search and/or classification. More importantly, GSearch provides a general framework for combining hashing-based sketching algorithms for distance/similarity estimation with graph-based nearest-neighbor search, which can be further applied in other fields such as string search and document/text search. GSearch is available here: https://github.com/jean-pierreBoth/gsearch
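
To illustrate the sketching side of this combination (classic MinHash only; GSearch's ProbMinHash and SetSketch variants are more involved), here is a toy example of estimating the k-mer Jaccard similarity between two short sequences, with all inputs and parameters invented for illustration:

```python
import hashlib

def kmers(seq, k=4):
    """Set of k-mers of a sequence (toy input, not a real genome)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def h(i, item):
    # deterministic 64-bit value of the i-th hash function
    return int.from_bytes(
        hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8).digest(), "big")

def minhash_sketch(items, num_hashes=128):
    # keep the minimum under each of num_hashes independent hash functions
    return [min(h(i, it) for it in items) for i in range(num_hashes)]

def jaccard_estimate(s1, s2):
    # the fraction of matching minima estimates the Jaccard index
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)

g1 = "ACGTACGTTGCAACGTACGGATTACG"
g2 = "ACGTACGTTGCAACGTACGCATTACG"
true_j = len(kmers(g1) & kmers(g2)) / len(kmers(g1) | kmers(g2))
est_j = jaccard_estimate(minhash_sketch(kmers(g1)), minhash_sketch(kmers(g2)))
```

The sketches are tiny and fixed-size regardless of genome length, which is what lets HNSW then search millions of them in memory.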





Author Name:  Yongzhong Li
Poster Title:  Multigrid Boundary Element Method for the Fast Electromagnetic Analysis of Multiscale Structures
Poster Abstract: 

Many electromagnetic structures of interest to industry and academia are deeply multiscale in nature. Antenna arrays, metasurfaces and metamaterials are prime examples. While containing deeply sub-wavelength features in their unit cells, they can be several wavelengths wide, and can be mounted on platforms that are even larger. Their multiscale nature, given by the disparity between their largest dimension and their tiniest geometrical feature, makes their design very challenging, since it makes a full-wave analysis extremely time-consuming and often unfeasible, even with state-of-the-art algorithms and supercomputing facilities. Similar challenges arise in the design of the interconnect networks found in integrated circuits, packages and printed circuit boards, where electromagnetic simulations are indispensable to manage interference, crosstalk, and signal and power integrity. Such networks can span several millimeters or centimeters, but contain fine features on the order of a micrometer or smaller. Combined with their extremely intricate layout, density and operating frequencies as high as several tens or even hundreds of gigahertz, these features often make a full-wave electromagnetic analysis prohibitive. Multiscale problems are known to reduce the efficiency of most algorithms for computational electromagnetics, including integral equation methods and the boundary element method (BEM). The latter are often advocated for antennas and interconnect networks due to their intrinsic ability to model open problems and layered substrates while requiring only a discretization of the scatterers.

To address this challenge, we propose an acceleration technique for the BEM that is suitable for the electromagnetic analysis of multiscale structures. The proposed method uses a hierarchy of coarser grids to accelerate the matrix-vector product between boundary element operators and unknown source vectors. Unlike other grid-based acceleration schemes, the proposed method adapts to multiscale structures regardless of the resolution of the uniform grid. This is achieved by adaptively adjusting the projection stencils to accommodate elements of different geometrical scales. Our method exhibits unmatched efficiency compared to existing formulations, demonstrating a speed-up of 7.1 to 19.2 times over the conventional method in the analysis of several realistic structures. To build upon the advantages offered by high-performance computing resources and advanced parallelization algorithms, our method is designed to scale with increasing computational power and multicore architectures.





Author Name:  Melissa Kozul
Poster Title:  Direct Numerical Simulation of Performance-Enhancing Riblets on Gas Turbine Compressor Blades
Poster Abstract: 

This research seeks to reduce the fuel usage and emissions of gas turbines, as used in aeroplanes and electricity generation, and to improve their stability and performance, through a better understanding of the fast and chaotic gas flows over their individual blades. My work represents the first direct numerical simulation (where the equations of fluid flow are solved directly on a very fine grid) of potentially drag-reducing streamwise microgrooves ('riblets') on an axial high-pressure compressor blade, using an immersed boundary method to resolve the riblets. The primary research tool, HiPSTAR, has been used to conduct world-first simulations of gas turbine components using realistic geometry and engine-relevant flow conditions. Due to the typical problem size, MPI domain decomposition is used: the in-plane directions are split into MPI subdomains, and the third direction is parallelized using OpenACC (with some backends in CUDA). The code scales very well, with typical runs using 924 V100 GPUs (Summit, Oak Ridge, USA). Good scaling of the core algorithms has also been demonstrated on Frontier at Oak Ridge, on AMD MI250 GPUs.
