Equation Solution  
    High Performance by Design

 



Parallel Performance of Skyline Solver on Soft Cores (3)


[Posted by Jenn-Ching Luo on Apr. 22, 2012]

        Co-array is a feature of the Fortran standard (introduced in Fortran 2008), in which cooperative tasks (e.g., images) communicate with each other by passing messages. Message passing is different from memory sharing, e.g., OpenMP, which allows cooperative tasks to access common data directly.
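        The difference between the two communication styles can be sketched in a few lines. The example below is only an illustration in Python, not co-array or OpenMP code, and not how neuLoop is implemented. The function names shared_memory_sum and message_passing_sum are hypothetical, and threads with queues stand in for cooperative tasks and their messages.

```python
import queue
import threading


def shared_memory_sum(data, n_workers):
    """Memory sharing: every worker reads the common list `data` directly."""
    partial = [0] * n_workers  # shared result buffer, one slot per worker

    def work(rank):
        # Index into the shared list; no private copy of the data is made.
        partial[rank] = sum(data[i] for i in range(rank, len(data), n_workers))

    threads = [threading.Thread(target=work, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)


def message_passing_sum(data, n_workers):
    """Message passing: a worker only sees what arrives in its inbox."""
    inboxes = [queue.Queue() for _ in range(n_workers)]
    results = queue.Queue()

    def work(rank):
        chunk = inboxes[rank].get()  # receive a private copy of the data
        results.put(sum(chunk))      # send the partial result back

    threads = [threading.Thread(target=work, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for rank in range(n_workers):
        # The slice is a new list: this duplication is the cost of the message.
        inboxes[rank].put(data[rank::n_workers])
    for t in threads:
        t.join()
    return sum(results.get() for _ in range(n_workers))


data = list(range(1000))
print(shared_memory_sum(data, 4), message_passing_sum(data, 4))
```

        Both functions return the same sum. The only difference is whether a worker reads the common data in place or receives its own copy first.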

        Some people claim that co-array, being based on message passing, is more efficient than OpenMP, which is implemented in a memory-sharing environment. One of their arguments is that a memory-sharing environment incurs an extra cost for maintaining cache coherence. However, we have not seen an actual comparison that supports this claim, so we do not know whether co-array is more efficient than OpenMP.

        As introduced before, neuLoop has two types of soft core: the homogeneous core, which is based on memory-sharing communication, and the heterogeneous core, which is based on message-passing communication. This post does not compare co-array with OpenMP directly; instead, it uses the laipe2 function laipe$decompose_vsp_8 to show the performance difference between message-passing and memory-sharing communication.

        It is easy for us to compare performance under message-passing and memory-sharing communication. First, we link the example against homogeneous cores and run it to get timing results; then we link it against heterogeneous cores and run it again. The timing results are as follows:

With Homogeneous Cores

  number of cores   elapsed time (sec.)   speedup   efficiency (%)
  ---------------   -------------------   -------   --------------
         1                 242.93           1.00        100.00
         2                 121.82           1.99         99.71
         3                  81.48           2.98         99.38
         4                  61.49           3.95         98.77
         5                  49.61           4.90         97.94
         6                  41.72           5.82         97.05
         7                  36.05           6.74         96.27
         8                  31.84           7.63         95.37


With Heterogeneous Cores

  number of cores   elapsed time (sec.)   speedup   efficiency (%)
  ---------------   -------------------   -------   --------------
         1                 248.91           1.00        100.00
         2                 125.02           1.99         99.55
         3                  83.77           2.97         99.05
         4                  63.32           3.93         98.27
         5                  51.11           4.87         97.40
         6                  43.06           5.78         96.34
         7                  37.38           6.66         95.13
         8                  33.31           7.47         93.41
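
        For readers who want to check the tables, the speedup and efficiency columns follow from the elapsed times alone, assuming the usual definitions: speedup on p cores is the one-core time divided by the p-core time, and efficiency is speedup divided by p, as a percentage. The short Python sketch below applies those definitions to the elapsed times above. The function name speedup_table is ours, for illustration only.

```python
def speedup_table(times):
    """Return (cores, elapsed, speedup, efficiency %) rows.

    times[0] is the one-core elapsed time; times[p - 1] is the time on p cores.
    """
    t1 = times[0]
    rows = []
    for p, tp in enumerate(times, start=1):
        speedup = t1 / tp                 # S(p) = T(1) / T(p)
        efficiency = 100.0 * speedup / p  # E(p) = S(p) / p, in percent
        rows.append((p, tp, round(speedup, 2), round(efficiency, 2)))
    return rows


# Elapsed times (sec.) copied from the two tables above.
homogeneous = [242.93, 121.82, 81.48, 61.49, 49.61, 41.72, 36.05, 31.84]
heterogeneous = [248.91, 125.02, 83.77, 63.32, 51.11, 43.06, 37.38, 33.31]

for label, times in (("homogeneous", homogeneous), ("heterogeneous", heterogeneous)):
    print(label)
    for p, tp, s, e in speedup_table(times):
        print(f"{p:>3} {tp:>8.2f} {s:>6.2f} {e:>7.2f}")
```

        Efficiency is computed from the unrounded speedup, which is why, for example, two homogeneous cores give 99.71% rather than 1.99 / 2 = 99.50%.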

        From the above data, we can see that direct memory access runs faster than message passing: the heterogeneous-core runs take about 2.5% longer on one core and about 4.6% longer on eight cores. This is understandable. In a memory-sharing environment, cooperative tasks can access common data directly, so passing messages, which duplicates data, is an extra burden that carries extra cost. Whether co-array is more efficient than OpenMP remains unclear. In this writer's experience, if the additional cost of maintaining cache coherence is not significant, direct memory access does not degrade parallel performance.
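
        The size of the gap between the two kinds of soft core can be worked out from the elapsed times in the tables. The Python sketch below computes, for each core count, how much longer the heterogeneous (message-passing) run takes relative to the homogeneous (memory-sharing) run. The function name relative_overhead is hypothetical and used here only for illustration.

```python
# Elapsed times (sec.) for 1 to 8 cores, copied from the tables in this post.
homogeneous = [242.93, 121.82, 81.48, 61.49, 49.61, 41.72, 36.05, 31.84]
heterogeneous = [248.91, 125.02, 83.77, 63.32, 51.11, 43.06, 37.38, 33.31]


def relative_overhead(shared, message):
    """Extra elapsed time of the message-passing runs, in percent."""
    return [100.0 * (m - s) / s for s, m in zip(shared, message)]


overhead = relative_overhead(homogeneous, heterogeneous)
for p, pct in enumerate(overhead, start=1):
    print(f"{p} cores: heterogeneous run takes {pct:.2f}% longer")
```

        In these measurements the overhead grows with every added core, which is consistent with message passing costing more as more tasks exchange data.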