TY - GEN
T1 - Fast and scalable NUMA-based thread parallel breadth-first search
AU - Yasui, Yuichiro
AU - Fujisawa, Katsuki
N1 - Publisher Copyright:
© 2015 IEEE.
PY - 2015/9/2
Y1 - 2015/9/2
N2 - The breadth-first search (BFS) is one of the most central kernels in graph processing. Beamer's direction-optimizing BFS algorithm, which selects one of two traversal directions at each level, can reduce unnecessary edge traversals. In a previous paper, we presented an efficient BFS for a non-uniform memory access (NUMA)-based system, in which the NUMA architecture was carefully considered. In this paper, we investigate the locality of memory accesses in terms of the communication with remote memories in a BFS for a NUMA system, and describe a fast and highly scalable implementation. Our new implementation achieves performance rates of 174.704 billion edges per second for a Kronecker graph with 2^33 vertices and 2^37 edges on two racks of an SGI UV 2000 system with 1,280 threads. The implementations described in this paper achieved the fastest entries for a shared-memory system in the June 2014 and November 2014 Graph500 lists, and produced the most energy-efficient entries in the second, third, and fourth Green Graph500 lists (big data category).
AB - The breadth-first search (BFS) is one of the most central kernels in graph processing. Beamer's direction-optimizing BFS algorithm, which selects one of two traversal directions at each level, can reduce unnecessary edge traversals. In a previous paper, we presented an efficient BFS for a non-uniform memory access (NUMA)-based system, in which the NUMA architecture was carefully considered. In this paper, we investigate the locality of memory accesses in terms of the communication with remote memories in a BFS for a NUMA system, and describe a fast and highly scalable implementation. Our new implementation achieves performance rates of 174.704 billion edges per second for a Kronecker graph with 2^33 vertices and 2^37 edges on two racks of an SGI UV 2000 system with 1,280 threads. The implementations described in this paper achieved the fastest entries for a shared-memory system in the June 2014 and November 2014 Graph500 lists, and produced the most energy-efficient entries in the second, third, and fourth Green Graph500 lists (big data category).
UR - http://www.scopus.com/inward/record.url?scp=84948393040&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84948393040&partnerID=8YFLogxK
U2 - 10.1109/HPCSim.2015.7237065
DO - 10.1109/HPCSim.2015.7237065
M3 - Conference contribution
AN - SCOPUS:84948393040
T3 - Proceedings of the 2015 International Conference on High Performance Computing and Simulation, HPCS 2015
SP - 377
EP - 385
BT - Proceedings of the 2015 International Conference on High Performance Computing and Simulation, HPCS 2015
A2 - Smari, Waleed W.
A2 - Zeljkovic, Vesna
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 13th International Conference on High Performance Computing and Simulation, HPCS 2015
Y2 - 20 July 2015 through 24 July 2015
ER -