Dynamic optimization of load balance in MPI broadcast

Takesi Soga, Kouji Kurihara, Takeshi Nanri, Motoyoshi Kurokawa, Kazuaki Murakami

    Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)


    There are many algorithms that compose a broadcast from point-to-point communications, such as the binary tree and the binomial tree. Although many implementations of these algorithms exist in MPI libraries such as MPICH [1], most of them assume that all processes begin the broadcast at the same time. Consequently, the order of the point-to-point communications within the broadcast is fixed numerically, according to the rank of each process. In practice, however, each process starts the broadcast at a different time, mainly because of imbalance in the workload of each process, which causes unnecessary waiting time on some processes. Moreover, in a broadcast algorithm such as the binomial tree, the amount of communication differs between processes, so load imbalance is amplified when a heavy workload and heavy communication fall on the same process. Our method aims to solve these problems dynamically, by profiling the workload of each rank at runtime and adjusting the order of point-to-point communications according to that information.

    Among the various algorithms for MPI_Bcast, the binomial tree is one of the most popular. It broadcasts a message to all M processes in the group in log2 M steps of point-to-point communications. At each step, every process that has already received the data sends it to a process that has not yet received it. Because the number of send operations differs between ranks, if a heavy workload is assigned to a rank that performs many sends in the tree, the overall load imbalance lengthens the wait time at the ranks that receive the message from the heavily loaded rank. In addition, the performance of the broadcast depends on the time at which each rank starts the broadcast, even when the same algorithm is chosen. The differences in starting times are caused mainly by load imbalance, such as differences in instruction counts or cache efficiency between nodes.
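    The binomial-tree schedule described above can be sketched outside MPI. The following Python snippet (an illustration, not code from the paper) enumerates the (sender, receiver) pairs at each step of a broadcast rooted at rank 0 and counts how many sends each rank performs, which shows why early receivers carry more communication load:

    ```python
    def binomial_bcast_schedule(num_ranks):
        """Return the per-step (sender, receiver) pairs of a binomial-tree
        broadcast rooted at rank 0, plus the send count of each rank."""
        steps = []
        sends = [0] * num_ranks
        k = 0
        while (1 << k) < num_ranks:
            pairs = []
            # At step k, every rank below 2**k already holds the message
            # and forwards it to the rank 2**k positions above it.
            for sender in range(1 << k):
                receiver = sender + (1 << k)
                if receiver < num_ranks:
                    pairs.append((sender, receiver))
                    sends[sender] += 1
            steps.append(pairs)
            k += 1
        return steps, sends

    steps, sends = binomial_bcast_schedule(8)
    # steps: [[(0, 1)], [(0, 2), (1, 3)], [(0, 4), (1, 5), (2, 6), (3, 7)]]
    # sends: [3, 2, 1, 1, 0, 0, 0, 0] -- rank 0 sends at every step, while
    # ranks that receive last never send at all.
    ```

    The skewed send counts make the effect described in the abstract concrete: placing a heavily loaded process at an early position in the tree delays every rank in its subtree.
    
    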
    Generally, such differences are difficult to predict before a program runs. It is therefore important to adapt the implementation of the algorithm to the behavior of the program at runtime. To do so, we introduce a virtual rank and a virtual rank table. The virtual rank represents the position of a process in the collective communication, and the virtual rank table maps real ranks to virtual ranks. The implementation can then be adjusted to the load balance of each rank by changing the entries of the virtual rank table. The load of each rank is estimated from its waiting time in MPI_Bcast. Virtual ranks that receive the message in earlier steps are responsible for larger numbers of sends to other ranks; therefore, by mapping the real ranks with longer waiting times in previous MPI_Bcast calls to the virtual ranks that receive the message earlier, the total waiting time can be reduced.

    The dynamic optimization method changes the table entries as follows. First, the wait time of each rank is measured from the wait operation on the receive request in each MPI_Bcast. This wait time is then compared with that of the previous MPI_Bcast; if the difference exceeds a threshold, the rank calls MPI_Put to send its wait time to the optimizer rank. After every N calls to MPI_Bcast, the optimization is executed on the optimizer rank: from the information received so far, it finds the ranks with the minimum and maximum wait times and marks them as candidates for exchanging their entries in the virtual rank table. After this optimization phase, the application phase is executed on all ranks in the communicator, and the pair of ranks whose table entries are to be exchanged, together with the count N indicating the next optimization time, is propagated.
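    The entry exchange can be sketched as follows. This Python fragment (a sketch under our reading of the abstract, with hypothetical names and data layout) swaps the virtual-rank-table entries of the real ranks with the maximum and minimum measured wait times, so the rank that waited longest, i.e. the most lightly loaded one, moves to an earlier, send-heavy virtual position:

    ```python
    def optimize_step(vtable, wait_times):
        """One optimization pass: exchange the virtual-rank entries of the
        real ranks with the minimum and maximum wait times.
        vtable[r] is the virtual rank played by real rank r (a hypothetical
        representation of the table described in the abstract)."""
        slow = min(range(len(wait_times)), key=wait_times.__getitem__)  # heavy load, short wait
        fast = max(range(len(wait_times)), key=wait_times.__getitem__)  # light load, long wait
        # Swap only if the long-waiting rank currently sits at a
        # later-receiving (larger) virtual rank than the slow one.
        if vtable[fast] > vtable[slow]:
            vtable[fast], vtable[slow] = vtable[slow], vtable[fast]
        return vtable

    vt = optimize_step([0, 1, 2, 3], [0.1, 5.0, 2.0, 3.0])
    # Real rank 0 (almost no wait: heavily loaded) and real rank 1 (longest
    # wait) exchange positions, giving vt == [1, 0, 2, 3]: the loaded rank
    # moves to virtual rank 1, which performs fewer sends than virtual rank 0.
    ```

    In the actual method this exchange is computed on the optimizer rank and broadcast to all ranks, each of which then updates its local copy of the table.
    
    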
    On arrival of this information, each rank exchanges the corresponding entries of its own virtual rank table. This method has been built experimentally on the RSCC (RIKEN Super Combined Cluster) at RIKEN, Japan. The experiment uses an MPI program that invokes MPI_Bcast; in addition, a pseudo load is executed before each MPI_Bcast on the rank that is the first receiver of the message from the root rank in the original binomial tree, so that the extra load on this rank is critical to the overall performance of MPI_Bcast. The experiment shows that the overall execution time of MPI_Bcast can be reduced by around 40%.

    Original language: English
    Title of host publication: Recent Advances in Parallel Virtual Machine and Message Passing Interface - 14th European PVM/MPI Users' Group Meeting, Proceedings
    Publisher: Springer Verlag
    Number of pages: 2
    ISBN (Print): 9783540754152
    Publication status: Published - 2007
    Event: 14th European PVM/MPI Users' Group Meeting on Parallel Virtual Machine and Message Passing Interface - Paris, France
    Duration: Sept 30, 2007 - Oct 3, 2007

    Publication series

    Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    Volume: 4757 LNCS
    ISSN (Print): 0302-9743
    ISSN (Electronic): 1611-3349

    All Science Journal Classification (ASJC) codes

    • Theoretical Computer Science
    • General Computer Science
