@math.vt.edu
Professor of Mathematics
Virginia Tech
Elia Merzari, Steven Hamilton, Thomas Evans, Misun Min, Paul Fischer, Stefan Kerkemeier, Jun Fang, Paul Romano, Yu-Hsiang Lan, Malachi Phillips, et al.
ACM
ENRICO is a coupled application developed under the U.S. Department of Energy's Exascale Computing Project (ECP) targeting the modeling of advanced nuclear reactors. It couples radiation transport with heat and fluid simulation, including the high-fidelity, high-resolution Monte Carlo code Shift and the computational fluid dynamics code NekRS. NekRS is a highly-performant open-source code for simulation of incompressible and low-Mach fluid flow, heat transfer, and combustion with a particular focus on turbulent flows in complex domains. It is based on rapidly convergent high-order spectral element discretizations that feature minimal numerical dissipation and dispersion. State-of-the-art multilevel preconditioners, efficient high-order time-splitting methods, and runtime-adaptive communication strategies are built on a fast OCCA-based kernel library, libParanumal, to provide scalability and portability across the spectrum of current and future high-performance computing platforms. On Frontier, Nek5000/RS has recently achieved an unprecedented milestone, surpassing 1 billion spectral elements and 350 billion degrees of freedom. Shift has demonstrated the capability to transport upwards of 1 billion particles per second in full core nuclear reactor simulations featuring complete temperature-dependent, continuous-energy physics on Frontier. Shift achieved a weak-scaling efficiency of 97.8% on 8192 nodes of Frontier and calculated 6 reactions in 214,896 fuel pin regions to below 1% statistical error, yielding first-of-a-kind resolution for a Monte Carlo transport application.
Noel Chalmers, Abhishek Mishra, Damon McDougall, and Tim Warburton
SAGE Publications
We present hipBone, an open-source performance-portable proxy application for the Nek5000 (and NekRS) computational fluid dynamics applications. HipBone is a fully GPU-accelerated C++ implementation of the original NekBone CPU proxy application with several novel algorithmic and implementation improvements which optimize its performance on modern fine-grain parallel GPU accelerators. Our optimizations include a conversion to store the degrees of freedom of the problem in assembled form in order to reduce the amount of data moved during the main iteration and a portable implementation of the main Poisson operator kernel. We demonstrate near-roofline performance of the operator kernel on three different modern GPU accelerators from two different vendors. We present a novel algorithm for splitting the application of the Poisson operator on GPUs which aggressively hides MPI communication required for both halo exchange and assembly. Our implementation of nearest-neighbor MPI communication then leverages several different routing algorithms and GPU-Direct RDMA capabilities, when available, which improves scalability of the benchmark. We demonstrate the performance of hipBone on three different clusters housed at Oak Ridge National Laboratory, namely, the Summit supercomputer and the Frontier early-access clusters, Spock and Crusher. Our tests demonstrate both portability across different clusters and very good scaling efficiency, especially on large problems.
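For readers unfamiliar with the assembled-versus-unassembled distinction mentioned above, the following minimal C++ sketch (our illustration, not hipBone code; the two-element mesh and array names are invented) shows the gather and scatter operations that connect element-local storage, which duplicates shared nodes, to assembled global storage, which keeps one value per degree of freedom and is what the main iteration operates on.

// Minimal sketch (not hipBone code): unassembled (element-local) versus
// assembled (global) storage of continuous finite element degrees of freedom.
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
  // Two 1D linear elements sharing one node: local dofs {0,1 | 2,3} map to
  // global dofs {0,1,2}; the shared node appears twice in local storage.
  std::vector<int>    localToGlobal = {0, 1, 1, 2};     // hypothetical connectivity
  std::vector<double> qLocal  = {1.0, 2.0, 2.0, 3.0};   // element-local contributions
  std::vector<double> qGlobal(3, 0.0);                  // one entry per unique dof

  // "Gather" (assembly): sum element-local contributions into the global vector,
  // as is done after an element-by-element operator application.
  for (std::size_t n = 0; n < qLocal.size(); ++n)
    qGlobal[localToGlobal[n]] += qLocal[n];

  // "Scatter": copy assembled values back out to element-local storage before
  // the next element-by-element operation.
  for (std::size_t n = 0; n < qLocal.size(); ++n)
    qLocal[n] = qGlobal[localToGlobal[n]];

  // Vector updates and dot products in the iteration touch qGlobal (3 entries here)
  // rather than qLocal (4 entries); in 3D spectral element meshes the saving grows.
  for (double v : qGlobal) std::printf("%g ", v);
  std::printf("\n");
  return 0;
}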
Anders Melander, Emil Strøm, Finnur Pind, Allan P Engsig-Karup, Cheol-Ho Jeong, Tim Warburton, Noel Chalmers, and Jan S Hesthaven
SAGE Publications
We present a massively parallel and scalable nodal discontinuous Galerkin finite element method (DGFEM) solver for the time-domain linearized acoustic wave equations. The solver is implemented using the libParanumal finite element framework with extensions to handle curvilinear geometries and frequency dependent boundary conditions of relevance in practical room acoustics. The implementation is benchmarked on heterogeneous multi-device many-core computing architectures, and high performance and scalability are demonstrated for a problem that is considered expensive to solve in practical applications. In a benchmark study, scaling tests show that multi-GPU support gives the ability to simulate large rooms, over a broad frequency range, with realistic boundary conditions, both in terms of computing time and memory requirements. Furthermore, numerical simulations on two non-trivial geometries are presented, a star-shaped room with a dome and an auditorium. Overall, this shows the viability of using a multi-device accelerated DGFEM solver to enable realistic large-scale wave-based room acoustics simulations.
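For context, the time-domain linearized acoustic equations solved by such room-acoustics solvers are conventionally written (standard form, with our notation rather than the paper's) as \partial p/\partial t + \rho c^2 \nabla\cdot\mathbf{v} = 0 and \partial\mathbf{v}/\partial t + (1/\rho)\nabla p = 0, where p is the acoustic pressure, \mathbf{v} the particle velocity, \rho the ambient density, and c the speed of sound; frequency-dependent boundary conditions enter through the impedance relation imposed between the pressure and the normal velocity at the walls.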
Paul Fischer, Stefan Kerkemeier, Misun Min, Yu-Hsiang Lan, Malachi Phillips, Thilina Rathnayake, Elia Merzari, Ananias Tomboulides, Ali Karakus, Noel Chalmers, et al.
Elsevier BV
A. Karakus, N. Chalmers, and T. Warburton
Elsevier BV
Jesse Chan, Hendrik Ranocha, Andrés M. Rueda-Ramírez, Gregor Gassner, and Tim Warburton
Frontiers Media SA
High order entropy stable schemes provide improved robustness for computational simulations of fluid flows. However, additional stabilization and positivity preserving limiting can still be required for variable-density flows with under-resolved features. We demonstrate numerically that entropy stable Discontinuous Galerkin (DG) methods which incorporate an “entropy projection” are less likely to require additional limiting to retain positivity for certain types of flows. We conclude by investigating potential explanations for this observed improvement in robustness.
Misun Min, Yu-Hsiang Lan, Paul Fischer, Elia Merzari, Stefan Kerkemeier, Malachi Phillips, Thilina Rathnayake, April Novak, Derek Gaston, Noel Chalmers, et al.
IEEE
Nek5000/RS, a highly-performant open-source spectral element code, has recently achieved an unprecedented milestone in the simulation of nuclear reactors: the first full core computational fluid dynamics simulations of reactor cores, including pebble beds with 352,625 pebbles and 98M spectral elements (51 billion gridpoints), advanced in less than 0.25 seconds per Navier-Stokes timestep. The authors present performance and optimization considerations necessary to achieve this milestone when running on all of Summit. These optimizations led to a fourfold reduction in time-to-solution, making it possible to perform high-fidelity simulations of a single flow-through time in less than six hours for a full reactor core under prototypical conditions.
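As a rough consistency check on the quoted figures (our estimate, not a number from the paper): 51 billion gridpoints over 98 million elements gives 51\times10^9 / 98\times10^6 \approx 520 \approx 8^3 points per element, consistent with spectral elements of polynomial order N = 7, i.e. 8 points in each coordinate direction.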
Jesse Chan, Yimin Lin, and Tim Warburton
Elsevier BV
Ahmad Abdelfattah, Valeria Barra, Natalie Beams, Ryan Bleile, Jed Brown, Jean-Sylvain Camier, Robert Carson, Noel Chalmers, Veselin Dobrev, Yohann Dudouit, et al.
Elsevier BV
Tzanio Kolev, Paul Fischer, Misun Min, Jack Dongarra, Jed Brown, Veselin Dobrev, Tim Warburton, Stanimire Tomov, Mark S Shephard, Ahmad Abdelfattah, et al.
SAGE Publications
Efficient exploitation of exascale architectures requires rethinking of the numerical algorithms used in many large-scale applications. These architectures favor algorithms that expose ultra fine-grain parallelism and maximize the ratio of floating point operations to energy intensive data movement. One of the few viable approaches to achieve high efficiency in the area of PDE discretizations on unstructured grids is to use matrix-free/partially assembled high-order finite element methods, since these methods can increase the accuracy and/or lower the computational time due to reduced data motion. In this paper we provide an overview of the research and development activities in the Center for Efficient Exascale Discretizations (CEED), a co-design center in the Exascale Computing Project that is focused on the development of next-generation discretization software and algorithms to enable a wide range of finite element applications to run efficiently on future hardware. CEED is a research partnership involving more than 30 computational scientists from two US national labs and five universities, including members of the Nek5000, MFEM, MAGMA and PETSc projects. We discuss the CEED co-design activities based on targeted benchmarks, miniapps and discretization libraries and our work on performance optimizations for large-scale GPU architectures. We also provide a broad overview of research and development activities in areas such as unstructured adaptive mesh refinement algorithms, matrix-free linear solvers, high-order data visualization, and list examples of collaborations with several ECP and external applications.
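A rough operation and storage count makes the data-motion argument concrete (generic estimates in our notation, not figures from the paper): on a 3D hexahedral element of polynomial order p, a fully assembled element matrix couples all (p+1)^3 local degrees of freedom and therefore stores O(p^6) entries, whereas a matrix-free, sum-factorized application keeps only the O(p^3) nodal values plus small one-dimensional operators and applies the operator in O(p^4) floating point operations, so the bytes moved per degree of freedom remain essentially flat as the order p increases.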
Anthony P. Austin, Noel Chalmers, and Tim Warburton
Society for Industrial & Applied Mathematics (SIAM)
We consider several methods for generating initial guesses when iteratively solving sequences of linear systems, showing that they can be implemented efficiently in GPU-accelerated PDE solvers, specifically solvers for incompressible flow. We propose new initial guess methods based on stabilized polynomial extrapolation and compare them to the projection method of Fischer [15], showing that they are generally competitive with projection schemes despite requiring only half the storage and performing considerably less data movement and communication. Our implementations of these algorithms are freely available as part of the libParanumal collection of GPU-accelerated flow solvers.
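To make the extrapolation idea concrete, here is a minimal C++ sketch (our illustration, assuming uniformly spaced solves; it omits the stabilization the paper adds and is not the libParanumal implementation): an order-k polynomial fit through the last k+1 solutions, evaluated one step ahead, reduces to a signed-binomial combination of the stored solution vectors.

// Minimal sketch (not the paper's stabilized algorithm): form an initial guess
// for the next linear solve by polynomial extrapolation of previous solutions,
// assuming the solves occur at uniformly spaced time steps.
#include <cstddef>
#include <cstdio>
#include <vector>

// Extrapolate from the k+1 most recent solutions; history.back() is the newest,
// and history is assumed to hold at least k+1 entries.
// Coefficients are signed binomials: 2,-1 (linear); 3,-3,1 (quadratic); ...
std::vector<double> extrapolatedGuess(const std::vector<std::vector<double>> &history, int k) {
  const int m = k + 1;                         // number of previous solutions used
  const std::size_t n = history.back().size();
  std::vector<double> guess(n, 0.0);
  double coeff = m;                            // C(m,1)
  for (int j = 1; j <= m; ++j) {
    const std::vector<double> &xj = history[history.size() - j];
    const double sign = (j % 2 == 1) ? 1.0 : -1.0;
    for (std::size_t i = 0; i < n; ++i) guess[i] += sign * coeff * xj[i];
    coeff *= double(m - j) / double(j + 1);    // C(m,j) -> C(m,j+1)
  }
  return guess;
}

int main() {
  // Three previous scalar "solutions" 1, 2, 3: quadratic extrapolation predicts 4.
  std::vector<std::vector<double>> history = {{1.0}, {2.0}, {3.0}};  // oldest ... newest
  std::vector<double> g = extrapolatedGuess(history, 2);
  std::printf("guess = %g\n", g[0]);  // prints 4
  return 0;
}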
Paul Fischer, Misun Min, Thilina Rathnayake, Som Dutta, Tzanio Kolev, Veselin Dobrev, Jean-Sylvain Camier, Martin Kronbichler, Tim Warburton, Kasia Świrydowicz, et al.
SAGE Publications
Performance tests and analyses are critical to effective high-performance computing software development and are central components in the design and implementation of computational algorithms for achieving faster simulations on existing and future computing architectures for large-scale application problems. In this article, we explore performance and space-time trade-offs for important compute-intensive kernels of large-scale numerical solvers for partial differential equations (PDEs) that govern a wide range of physical applications. We consider a sequence of PDE-motivated bake-off problems designed to establish best practices for efficient high-order simulations across a variety of codes and platforms. We measure peak performance (degrees of freedom per second) on a fixed number of nodes and identify effective code optimization strategies for each architecture. In addition to peak performance, we identify the minimum time to solution at 80% parallel efficiency. The performance analysis is based on spectral and p-type finite elements but is equally applicable to a broad spectrum of numerical PDE discretizations, including finite difference, finite volume, and h-type finite elements.
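For reference, the 80% criterion uses the standard strong-scaling definition of parallel efficiency (our notation, not quoted from the article): starting from a base node count P_0 with runtime T(P_0), the efficiency on P nodes is \eta(P) = P_0 T(P_0) / (P T(P)), and the minimum time to solution at 80% parallel efficiency is the smallest T(P) over node counts P for which \eta(P) \geq 0.8.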
A. Karakus, N. Chalmers, J.S. Hesthaven, and T. Warburton
Elsevier BV
A. Karakus, N. Chalmers, K. Świrydowicz, and T. Warburton
Elsevier BV
Kasia Świrydowicz, Noel Chalmers, Ali Karakus, and Tim Warburton
SAGE Publications
This article is devoted to graphics processing unit (GPU) kernel optimization and performance analysis of three tensor-product operations arising in finite element methods. We provide a mathematical background to these operations and implementation details. Achieving close to peak performance for these operators requires extensive optimization because of the operators’ properties: low arithmetic intensity, tiered structure, and the need to store intermediate results during the kernel execution. We give a guided overview of optimization strategies and we present a performance model that allows us to compare the efficacy of these optimizations against an empirically calibrated roofline.
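The following minimal CPU sketch (our illustration in plain C++, not the optimized GPU kernels analyzed in the article) shows the structure of one such tensor-product operation: contracting a one-dimensional operator D against the first index of an element's (N+1)^3 nodal values, the building block that sum-factorized spectral element operators apply along each coordinate direction.

// Minimal CPU sketch (not the article's GPU kernels): apply a 1D operator D of
// size Nq x Nq along the i-direction of an element's Nq*Nq*Nq nodal values,
//   ur(i,j,k) = sum_m D(i,m) * u(m,j,k), with the i index fastest in memory.
#include <cstdio>
#include <vector>

void applyD(int Nq, const std::vector<double> &D,
            const std::vector<double> &u, std::vector<double> &ur) {
  for (int k = 0; k < Nq; ++k)
    for (int j = 0; j < Nq; ++j)
      for (int i = 0; i < Nq; ++i) {
        double s = 0.0;
        for (int m = 0; m < Nq; ++m)
          s += D[i * Nq + m] * u[(k * Nq + j) * Nq + m];
        ur[(k * Nq + j) * Nq + i] = s;
      }
}

int main() {
  const int Nq = 4;  // (N+1) points per direction, e.g. polynomial order N = 3
  std::vector<double> D(Nq * Nq, 0.0), u(Nq * Nq * Nq, 1.0), ur(Nq * Nq * Nq, 0.0);
  for (int i = 0; i < Nq; ++i) D[i * Nq + i] = 1.0;  // identity stand-in for a real D
  applyD(Nq, D, u, ur);
  std::printf("ur[0] = %g\n", ur[0]);  // prints 1 for the identity operator
  // Each direction costs O(Nq^4) flops on O(Nq^3) data per element, the tiered,
  // low-arithmetic-intensity pattern the article's GPU optimizations target.
  return 0;
}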
Daniel S Abdi, Francis X Giraldo, Emil M Constantinescu, Lester E Carr, Lucas C Wilcox, and Timothy C Warburton
SAGE Publications
We present the acceleration of an IMplicit–EXplicit (IMEX) nonhydrostatic atmospheric model on manycore processors such as graphic processing units (GPUs) and Intel’s Many Integrated Core (MIC) architecture. IMEX time integration methods sidestep the constraint imposed by the Courant–Friedrichs–Lewy condition on explicit methods through corrective implicit solves within each time step. In this work, we implement and evaluate the performance of IMEX on manycore processors relative to explicit methods. Using 3D-IMEX at Courant number C = 15, we obtained a speedup of about 4× relative to an explicit time stepping method run with the maximum allowable C = 1. Moreover, the unconditional stability of IMEX with respect to the fast waves means the speedup can increase significantly with the Courant number as long as the accuracy of the resulting solution is acceptable. We show a speedup of 100× at C = 150 using 1D-IMEX to demonstrate this point. Several improvements on the IMEX procedure were necessary in order to outperform our results with explicit methods: (a) reducing the number of degrees of freedom of the IMEX formulation by forming the Schur complement, (b) formulating a horizontally explicit vertically implicit 1D-IMEX scheme that has a lower workload and better scalability than 3D-IMEX, (c) using high-order polynomial preconditioners to reduce the condition number of the resulting system, and (d) using a direct solver for the 1D-IMEX method by performing and storing LU factorizations once to obtain a constant cost for any Courant number. Without all of these improvements, explicit time integration methods turned out to be difficult to beat. We discuss in detail the IMEX infrastructure required for formulating and implementing efficient methods on manycore processors. Several parametric studies are conducted to demonstrate the gain from each of the abovementioned improvements. Finally, we validate our results with standard benchmark problems in numerical weather prediction and evaluate the performance and scalability of the IMEX method using up to 4192 GPUs and 16 Knights Landing processors.
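The headline speedups can be read through a simple hedged estimate (our back-of-the-envelope model, not a formula from the paper): running IMEX at Courant number C takes roughly C times fewer time steps than an explicit method limited to C = 1, so speedup \approx C \times (cost of one explicit step) / (cost of one IMEX step). The quoted 4\times at C = 15 and 100\times at C = 150 then correspond to IMEX steps costing a few times and roughly 1.5 times an explicit step, respectively, which is what the Schur complement, 1D-IMEX, polynomial preconditioning, and stored LU improvements drive down.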
Arturo Vargas, Thomas Hagstrom, Jesse Chan, and Tim Warburton
Springer Science and Business Media LLC
Daniel S Abdi, Lucas C Wilcox, Timothy C Warburton, and Francis X Giraldo
SAGE Publications
We present a Graphics Processing Unit (GPU)-accelerated nodal discontinuous Galerkin method for the solution of the three-dimensional Euler equations that govern the motion and thermodynamic state of the atmosphere. Acceleration of the dynamical core of atmospheric models plays an important practical role in not only getting daily forecasts faster, but also in obtaining more accurate (high resolution) results within a given simulation time limit. We use algorithms suitable for the single instruction multiple thread architecture of GPUs to accelerate our model by two orders of magnitude relative to one core of a CPU. Tests on one node of the Titan supercomputer show a speedup of up to 15 times using the K20X GPU as compared to that on the 16-core AMD Opteron CPU. The scalability of the multi-GPU implementation is tested using 16,384 GPUs, which resulted in a weak scaling efficiency of about 90%. Finally, the accuracy and performance of our GPU implementation is verified using several benchmark problems representative of different scales of atmospheric dynamics.
Niklas Wintermeyer, Andrew R. Winters, Gregor J. Gassner, and Timothy Warburton
Elsevier BV
Ali Karakus, Tim Warburton, Mehmet Haluk Aksel, and Cuneyt Sert
Emerald
Purpose: This study focuses on the development of a high-order discontinuous Galerkin method for the solution of unsteady, incompressible, multiphase flows with a level set interface formulation. Design/methodology/approach: Nodal discontinuous Galerkin discretization is used for the incompressible Navier–Stokes, level set advection, and reinitialization equations on adaptive unstructured elements. Implicit systems arising from the semi-explicit time discretization of the flow equations are solved with a p-multigrid preconditioned conjugate gradient method, which minimizes the memory requirements and increases overall run-time performance. Computations are localized mostly near the interface location to reduce computational cost without sacrificing accuracy. Findings: The proposed method captures interface topology accurately in simulating a wide range of flow regimes with high density/viscosity ratios and offers good mass conservation even on relatively coarse grids, while keeping the simplicity of the level set interface modeling. Efficiency, local high-order accuracy, and mass conservation of the method are confirmed through distinct numerical test cases of sloshing, dam break, and Rayleigh–Taylor instability. Originality/value: A fully discontinuous Galerkin, high-order, adaptive method on unstructured grids is introduced where flow and interface equations are solved in discontinuous space.
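For context, the level set advection and reinitialization equations referenced above are conventionally written (standard forms in our notation, not quoted from the paper) as \partial\phi/\partial t + \mathbf{u}\cdot\nabla\phi = 0 for transport of the level set field \phi by the flow velocity \mathbf{u}, and \partial\phi/\partial\tau + \mathrm{sgn}(\phi_0)(|\nabla\phi| - 1) = 0 for reinitialization in pseudo-time \tau, which restores the signed-distance property |\nabla\phi| = 1 while holding the zero level set, i.e. the interface, in place.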
Noel Chalmers and T. Warburton
Society for Industrial & Applied Mathematics (SIAM)
We propose a new formulation of a low-order elliptic preconditioner for high-order triangular elements. In the preconditioner, the nodes of the low-order finite element problem do not necessarily c...
Jesse Chan and T. Warburton
Elsevier BV
A. Modave, A. Atle, J. Chan, and T. Warburton
Wiley
Discontinuous Galerkin finite element schemes exhibit attractive features for accurate large‐scale wave‐propagation simulations on modern parallel architectures. For many applications, these schemes must be coupled with nonreflective boundary treatments to limit the size of the computational domain without losing accuracy or computational efficiency, which remains a challenging task. In this paper, we present a combination of a nodal discontinuous Galerkin method with high‐order absorbing boundary conditions for cuboidal computational domains. Compatibility conditions are derived for high‐order absorbing boundary conditions intersecting at the edges and the corners of a cuboidal domain. We propose a GPU implementation of the computational procedure, which results in a multidimensional solver with equations to be solved on 0D, 1D, 2D, and 3D spatial regions. Numerical results demonstrate both the accuracy and the computational efficiency of our approach.
Arturo Vargas, Jesse Chan, Thomas Hagstrom, and Timothy Warburton
Global Science Press
Hermite methods, as introduced by Goodrich et al. in [15], combine Hermite interpolation and staggered (dual) grids to produce stable high order accurate schemes for the solution of hyperbolic PDEs. We introduce three variations of this Hermite method which do not involve time evolution on dual grids. Computational evidence is presented regarding stability, high order convergence, and dispersion/dissipation properties for each new method. Hermite methods may also be coupled to discontinuous Galerkin (DG) methods for additional geometric flexibility [4]. An example illustrates the simplification of this coupling for Hermite methods.
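As a one-dimensional illustration of the interpolation these methods build on (a standard textbook construction, not taken from the paper): matching function values and first derivatives at the cell endpoints x = 0 and x = 1 determines a unique cubic p(x) = u(0) h_{00}(x) + u'(0) h_{10}(x) + u(1) h_{01}(x) + u'(1) h_{11}(x), with Hermite basis h_{00} = 2x^3 - 3x^2 + 1, h_{10} = x^3 - 2x^2 + x, h_{01} = -2x^3 + 3x^2, h_{11} = x^3 - x^2; carrying m derivatives per node in the same way yields the degree-(2m+1) local interpolants that Hermite methods evolve.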