TY - GEN
T1 - Performance Evaluation of Lattice Boltzmann Method for Fluid Simulation on A64FX Processor and Supercomputer Fugaku
AU - Watanabe, Seiya
AU - Hu, Changhong
N1 - Publisher Copyright:
© 2022 ACM.
PY - 2022/1/7
Y1 - 2022/1/7
N2 - The lattice Boltzmann method has recently become popular as an alternative to Navier-Stokes solvers for large-scale fluid simulations. We conduct a performance study of the lattice Boltzmann method on the A64FX Arm-based processor of the supercomputer Fugaku. We compare four data layouts: SoA, AoS, Clustered SoA (CSoA), and CSoA2, and three algorithms for the LBM streaming step: the Pull, Push, and Swap schemes. Performance measurement on a single CMG (Core Memory Group) shows that the combination of the CSoA2 layout and the Swap scheme achieves the highest performance of 176 GFLOPS, which corresponds to 11.5% of the single-precision peak performance. Our simulations demonstrate good weak scaling up to 16,384 nodes and achieve a high performance of 10.9 PFLOPS in single precision. Strong scaling is also good, with parallel efficiencies of 63.9%, 68.3%, and 72.7% for the D3Q15, D3Q19, and D3Q27 velocity models, respectively, when scaling from 512 to 16,384 nodes.
AB - The lattice Boltzmann method has recently become popular as an alternative to Navier-Stokes solvers for large-scale fluid simulations. We conduct a performance study of the lattice Boltzmann method on the A64FX Arm-based processor of the supercomputer Fugaku. We compare four data layouts: SoA, AoS, Clustered SoA (CSoA), and CSoA2, and three algorithms for the LBM streaming step: the Pull, Push, and Swap schemes. Performance measurement on a single CMG (Core Memory Group) shows that the combination of the CSoA2 layout and the Swap scheme achieves the highest performance of 176 GFLOPS, which corresponds to 11.5% of the single-precision peak performance. Our simulations demonstrate good weak scaling up to 16,384 nodes and achieve a high performance of 10.9 PFLOPS in single precision. Strong scaling is also good, with parallel efficiencies of 63.9%, 68.3%, and 72.7% for the D3Q15, D3Q19, and D3Q27 velocity models, respectively, when scaling from 512 to 16,384 nodes.
KW - A64FX
KW - Fugaku
KW - data structure
KW - large scale CFD simulation
KW - lattice Boltzmann method
UR - http://www.scopus.com/inward/record.url?scp=85122642880&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85122642880&partnerID=8YFLogxK
U2 - 10.1145/3492805.3492811
DO - 10.1145/3492805.3492811
M3 - Conference contribution
AN - SCOPUS:85122642880
T3 - ACM International Conference Proceeding Series
SP - 1
EP - 9
BT - Proceedings of International Conference on High Performance Computing in Asia-Pacific Region, HPC Asia 2022
PB - Association for Computing Machinery
T2 - 5th International Conference on High Performance Computing in Asia-Pacific Region, HPC Asia 2022
Y2 - 12 January 2022 through 14 January 2022
ER -