TY - GEN
T1 - Directive-Based Auto-Tuning for the Finite Difference Method on the Xeon Phi
AU - Katagiri, Takahiro
AU - Ohshima, Satoshi
AU - Matsumoto, Masaharu
N1 - Funding Information:
Acknowledgments: This study was supported by the "ppOpen-HPC: Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications on Post-Peta-Scale Supercomputers with Automatic Tuning (AT)" program of Basic Research Programs: CREST, Development of System Software Technologies for post-Peta Scale High Performance Computing, Japan Science and Technology Agency (JST), Japan. We would like to thank all members of the ppOpen-HPC project, especially Professor Kengo Nakajima at the University of Tokyo, for supporting our study. We would also like to thank Professor Takashi Furumura and Dr. Futoshi Mori at The University of Tokyo for providing us with ppOpen-APPL/FDM.
Publisher Copyright:
© 2015 IEEE.
PY - 2015/9/29
Y1 - 2015/9/29
AB - In this paper, we present a directive-based auto-tuning (AT) framework, called ppOpen-AT, and demonstrate its effect using simulation code based on the Finite Difference Method (FDM). The framework utilizes well-known loop transformation techniques. However, the codes used are carefully designed to minimize the software stack in order to meet the requirements of a many-core architecture currently in operation. The results of evaluations conducted using ppOpen-AT indicate that maximum speedup factors greater than 550% are obtained when it is applied in eight nodes of the Intel Xeon Phi. Further, in the AT for data packing and unpacking, a 49% speedup factor for the whole application is achieved. By using it with strong scaling on 32 nodes in a cluster of the Xeon Phi, we also obtain 24% speedups for the overall execution.
UR - http://www.scopus.com/inward/record.url?scp=84962291305&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84962291305&partnerID=8YFLogxK
U2 - 10.1109/IPDPSW.2015.11
DO - 10.1109/IPDPSW.2015.11
M3 - Conference contribution
AN - SCOPUS:84962291305
T3 - Proceedings - 2015 IEEE 29th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2015
SP - 1221
EP - 1230
BT - Proceedings - 2015 IEEE 29th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2015
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 29th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2015
Y2 - 25 May 2015 through 29 May 2015
ER -