Grandpa on Multicores (4)

[Posted by Jenn-Ching Luo on May 02, 2011 ]

This post shows how grandpa LAIPE decomposes a general dense matrix on multicores. A dense matrix is the simplest type; for example, it has the form:

       /                       \
       |  x  x  x  x  x  x  x  |
       |  x  x  x  x  x  x  x  |
       |  x  x  x  x  x  x  x  |
 [A] = |  x  x  x  x  x  x  x  |
       |  x  x  x  x  x  x  x  |
       |  x  x  x  x  x  x  x  |
       |  x  x  x  x  x  x  x  |
       \                       /

We are going to see how efficiently grandpa LAIPE decomposes the matrix [A] into [L][U] in parallel. Performance results for multicores are not often seen on the internet. This blog conducts a series of performance tests, which gives us an opportunity to see speedup and efficiency. In this post, we look at how efficiently multicores speed up the decomposition of a dense matrix into triangular matrices.
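For readers who want to see what "decomposing [A] into [L][U]" means in practice, here is a minimal serial sketch in Python. It is not LAIPE's parallel algorithm, which this post does not show; it is a plain Doolittle elimination without pivoting, and the names lu_decompose and matmul are made up for illustration.

```python
def lu_decompose(a):
    """Serial Doolittle LU without pivoting (illustrative, not LAIPE's method).

    Returns (L, U) where L is unit lower triangular, U is upper triangular,
    and L times U reproduces the input matrix.
    """
    n = len(a)
    # L starts as the identity; U starts as a float copy of the input.
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[float(v) for v in row] for row in a]
    for k in range(n):
        if U[k][k] == 0.0:
            raise ZeroDivisionError("zero pivot; this sketch does no pivoting")
        for i in range(k + 1, n):
            factor = U[i][k] / U[k][k]
            L[i][k] = factor  # store the multiplier in L
            for j in range(k, n):
                U[i][j] -= factor * U[k][j]  # eliminate below the pivot
    return L, U


def matmul(x, y):
    """Multiply two square matrices stored as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


A = [[2, 1, 1], [4, -6, 0], [-2, 7, 2]]
L, U = lu_decompose(A)
```

Multiplying the factors back together with matmul(L, U) recovers [A] for this small example. A production routine would normally add pivoting for numerical stability.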

The test matrix is of order 3000 (3000x3000) and is stored in 16-byte REAL variables. A pseudo-random process generated the matrix. The test program called the LAIPE subroutine decompose_DAG_16 to decompose the matrix. Grandpa LAIPE is ancient software. As mentioned previously, it was programmed with an ancient parallel concept, and its base language is Fortran 77. The testing platform remained the same as in the first post, "Grandpa on Multicores (1)". We obtained the following timing results.

core: 1
Elapsed Time (Seconds): 2739.05
CPU Time in User Mode (Seconds): 2738.96
CPU Time in Kernel Mode (Seconds): 0.09
Total CPU Time (Seconds): 2739.05

cores: 2
Elapsed Time (Seconds): 1373.54
CPU Time in User Mode (Seconds): 2735.79
CPU Time in Kernel Mode (Seconds): 0.19
Total CPU Time (Seconds): 2735.98

cores: 3
Elapsed Time (Seconds): 929.64
CPU Time in User Mode (Seconds): 2766.10
CPU Time in Kernel Mode (Seconds): 0.19
Total CPU Time (Seconds): 2766.29

cores: 4
Elapsed Time (Seconds): 704.70
CPU Time in User Mode (Seconds): 2784.48
CPU Time in Kernel Mode (Seconds): 0.23
Total CPU Time (Seconds): 2784.71

cores: 5
Elapsed Time (Seconds): 568.64
CPU Time in User Mode (Seconds): 2797.36
CPU Time in Kernel Mode (Seconds): 0.19
Total CPU Time (Seconds): 2797.55

cores: 6
Elapsed Time (Seconds): 476.54
CPU Time in User Mode (Seconds): 2801.04
CPU Time in Kernel Mode (Seconds): 0.12
Total CPU Time (Seconds): 2801.17

cores: 7
Elapsed Time (Seconds): 410.95
CPU Time in User Mode (Seconds): 2807.19
CPU Time in Kernel Mode (Seconds): 0.19
Total CPU Time (Seconds): 2807.38

cores: 8
Elapsed Time (Seconds): 361.45
CPU Time in User Mode (Seconds): 2809.34
CPU Time in Kernel Mode (Seconds): 0.27
Total CPU Time (Seconds): 2809.61

A quick look at the timing results shows that elapsed time dropped almost in proportion to the number of cores. We summarize the timing data as speedup and efficiency:

number of cores	elapsed time (sec.)	speedup	efficiency (%)
1	2739.05	1.00	100.00
2	1373.54	1.99	99.71
3	929.64	2.95	98.21
4	704.70	3.89	97.17
5	568.64	4.82	96.34
6	476.54	5.75	95.80
7	410.95	6.67	95.22
8	361.45	7.58	94.72

On 2 cores, the speedup reached 1.99x, equivalent to an efficiency of 99.71%; on 8 cores, grandpa reached a speedup of 7.58x and an efficiency of 94.72%.
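The speedup and efficiency figures can be rechecked from the elapsed times alone. The short Python sketch below assumes the usual definitions, speedup = T(1) / T(p) and efficiency = speedup / p expressed as a percentage; the function name speedup_table is made up for illustration.

```python
# Elapsed times in seconds, copied from the timing results in this post.
elapsed = {
    1: 2739.05,
    2: 1373.54,
    3: 929.64,
    4: 704.70,
    5: 568.64,
    6: 476.54,
    7: 410.95,
    8: 361.45,
}


def speedup_table(times):
    """Return (cores, speedup, efficiency %) rows, rounded to two decimals.

    Assumes speedup = T(1) / T(p) and efficiency = 100 * speedup / p.
    """
    t1 = times[1]
    rows = []
    for p in sorted(times):
        s = t1 / times[p]
        e = 100.0 * s / p
        rows.append((p, round(s, 2), round(e, 2)))
    return rows


rows = speedup_table(elapsed)
for cores, s, e in rows:
    print(f"{cores}\t{s:.2f}\t{e:.2f}")
```

With these inputs the computed values agree with the summary table, for example 1.99x and 99.71% on 2 cores and 7.58x and 94.72% on 8 cores.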

The ancient software, grandpa LAIPE, showed us an almost perfect speedup again.