Equation Solution  
    High Performance by Design
 
 


Implementation of Constant-Bandwidth Solver of LAIPE2 on 48 Cores


[Posted by Jenn-Ching Luo on Nov. 20, 2014]

      This post shows the parallel performance of a LAIPE2 solver on a 48-core computer. Multiple cores can speed up a computation, and readers may wonder how large the improvement can be. This post presents a set of timing results to show the performance.

      The test was run on a Dell PowerEdge R815 with four 1.9 GHz Opterons, each of which has 12 cores. Only the basic core of Microsoft Windows Server 2008 was installed on the computer, without a graphical user interface, i.e., a command-driven environment.

      The test problem was a sparse, symmetric, positive definite system of equations, [A]{X}={B}, where [A] is of order (50,000x50,000) with a constant bandwidth of 7,000.
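To put that problem size in perspective, a rough back-of-the-envelope storage estimate for a symmetric matrix of this order and bandwidth (this is only a sketch of packed band storage in general; LAIPE2's actual storage scheme may differ):

```python
# Rough storage estimate for a symmetric banded matrix in packed form.
# For a symmetric matrix, only the diagonal plus one side of the band
# needs to be stored.
n = 50_000           # order of [A]
bandwidth = 7_000    # constant (half-)bandwidth
bytes_per_entry = 8  # 64-bit floating point

packed_entries = n * (bandwidth + 1)  # diagonal plus band
packed_gib = packed_entries * bytes_per_entry / 2**30

dense_entries = n * n                 # full dense storage, for comparison
dense_gib = dense_entries * bytes_per_entry / 2**30

print(f"packed band: {packed_gib:.2f} GiB")  # about 2.6 GiB
print(f"full dense:  {dense_gib:.2f} GiB")   # about 18.6 GiB
```

Even in packed form the factorization works through roughly 2.6 GiB of matrix data, which is why the single-core run below takes hours.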

      The solution invoked laipe$decompose_CSP_10 and laipe$substitute_CSP_10. The subroutine laipe$decompose_CSP_10 decomposes matrix [A] into a product of triangular matrices; the subroutine laipe$substitute_CSP_10 performs the substitutions. Substitution is not a time-consuming procedure and is of less interest, so this post shows only the time spent in the subroutine laipe$decompose_CSP_10.
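The internals of laipe$decompose_CSP_10 are not public, so purely as an illustration of the two stages named above (decompose [A] into triangular factors, then substitute), here is a minimal serial banded Cholesky sketch in Python. It has no relation to LAIPE2's actual code; it only shows the kind of work the solver performs:

```python
import math

def band_cholesky(A, n, hb):
    """Decompose a symmetric positive definite [A] (stored dense here for
    clarity) into L * L^T, exploiting a constant half-bandwidth hb:
    entries with |i - j| > hb are zero and are never touched."""
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(j, min(j + hb + 1, n)):  # only rows inside the band
            s = sum(L[i][k] * L[j][k] for k in range(max(0, i - hb), j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def substitute(L, b, n, hb):
    """Solve L*y = b (forward substitution), then L^T*x = y (backward)."""
    y = [0.0] * n
    for i in range(n):
        s = sum(L[i][k] * y[k] for k in range(max(0, i - hb), i))
        y[i] = (b[i] - s) / L[i][i]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(L[k][i] * x[k] for k in range(i + 1, min(i + hb + 1, n)))
        x[i] = (y[i] - s) / L[i][i]
    return x

# Demo on a tiny tridiagonal SPD system (half-bandwidth 1):
A = [[4.0, 1, 0, 0], [1, 4, 1, 0], [0, 1, 4, 1], [0, 0, 1, 4]]
x = substitute(band_cholesky(A, 4, 1), [1.0, 1, 1, 1], 4, 1)
```

The decomposition dominates the cost because every column update sweeps the band below it, while each substitution pass touches each entry of the factor only once.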


QUICK LOOK AT THE IMPRESSIVE PERFORMANCE

      From the timing results below, it can be seen that one core took 13878.71 seconds to solve the test problem, which is about 3 hours and 51 minutes: a long computation. Complex engineering or scientific problems can be formulated as systems with more than 100,000 unknowns, which would require even more time than this test problem. Efficient parallel solvers that speed up the computation are therefore always desirable.

      An interesting question is how fast 48 cores can solve the same problem. We are going to see.

      From the timing results below, we can see that 48 cores took only 332.84 seconds, about five and a half minutes. Wow! A computation of nearly four hours on one core was done in about five minutes on 48 cores. There is clearly no reason to dismiss parallel computing; we can undoubtedly benefit from multicore hardware and parallel programming.

      One caveat is worth mentioning: parallel computing is not a cure-all, either. Sometimes it imposes extra overhead without yielding any benefit. This can happen when the problem size is small and there is not enough computation to distribute among the cooperating cores. In such a situation, parallel computing gains nothing and may, on the contrary, slow the computation down.

TIMING RESULTS

      Detailed timing results are listed in the following. The results include the elapsed time, the CPU time in user mode and in kernel mode, and the total CPU time. While solving the test problem, the computer ran no jobs other than the operating system. Under that circumstance, the computer can be regarded as standalone, and the elapsed time can be used to measure speedup and efficiency.

      In general, the total CPU time should increase when more cores are used: more cores incur relatively higher overhead and more synchronization. However, a stroke of "good luck" may reduce the total CPU time. Before an instruction is executed, its data must be loaded into cache memory. When one core demands a specific piece of data and loads it into cache, other lucky cores that also demand that data may access it without having to load it themselves. The total CPU time may shrink if such "good luck" occurs most of the time, and when it does, we can see "super-linear performance".

      Besides "good luck", we may also experience "bad luck": cache-coherency traffic that degrades parallel performance.

      With the brief description above, the following timing results are understandable; in particular, the total CPU time is not in a strictly increasing order.

# of    Elapsed       CPU Time in      CPU Time in        Total CPU
Cores   Time (sec.)   User Mode (sec.) Kernel Mode (sec.) Time (sec.)
1 13878.71 13877.04 1.67 13878.71
2 6317.14 12629.31 1.98 12631.29
3 4234.10 12693.50 2.11 12695.61
4 3193.19 12759.43 2.31 12761.74
5 2567.09 12818.01 2.57 12820.58
6 2148.12 12867.32 2.96 12870.29
7 1852.99 12942.69 2.84 12945.52
8 1633.27 13030.03 3.01 13033.04
9 1462.84 13124.40 3.09 13127.48
10 1326.31 13215.48 3.04 13218.53
11 1216.07 13313.34 3.29 13316.64
12 1122.52 13410.52 3.56 13414.07
13 1045.47 13515.02 3.95 13518.97
14 979.37 13628.78 3.62 13632.40
15 923.06 13752.11 3.70 13755.81
16 873.25 13873.36 3.96 13877.32
17 833.70 14036.14 3.90 14040.04
18 797.43 14208.98 4.41 14213.39
19 748.90 14097.19 4.32 14101.51
20 714.03 14133.80 4.23 14138.03
21 682.24 14151.40 4.66 14156.06
22 651.62 14145.34 4.35 14149.70
23 622.76 14140.23 4.63 14144.86
24 596.58 14129.90 4.65 14134.55
25 572.91 14132.85 4.87 14137.72
26 555.04 14184.20 5.37 14189.57
27 538.87 14235.92 4.91 14240.83
28 522.49 14277.27 5.12 14282.39
29 505.71 14300.58 5.43 14306.01
30 488.50 14321.48 4.84 14326.32
31 470.47 14286.46 5.21 14291.67
32 456.58 14280.63 5.71 14286.34
33 443.61 14274.22 5.26 14279.47
34 430.88 14273.89 6.01 14279.89
35 418.61 14289.91 5.58 14295.50
36 407.52 14307.51 5.49 14313.00
37 400.33 14361.81 6.02 14367.83
38 392.30 14492.29 5.88 14498.17
39 384.26 14460.33 6.26 14466.58
40 378.27 14539.96 6.12 14546.08
41 372.47 14620.93 6.46 14627.39
42 366.40 14678.45 6.35 14684.80
43 360.80 14743.30 7.10 14750.39
44 355.57 14836.85 6.21 14843.06
45 349.60 14898.11 6.69 14904.80
46 343.76 14971.68 6.33 14978.02
47 337.95 15031.01 6.97 15037.98
48 332.84 15135.08 6.97 15142.05

SPEEDUP AND EFFICIENCY

      The above timing results are summarized into the following table.

Number    Elapsed
of Cores  Time (Sec.)   Speedup   Efficiency (%)
1 13878.71 1.0000 100.00
2 6317.14 2.1970 109.85
3 4234.10 3.2778 109.26
4 3193.19 4.3463 108.66
5 2567.09 5.4064 108.13
6 2148.12 6.4609 107.68
7 1852.99 7.4899 107.00
8 1633.27 8.4975 106.22
9 1462.84 9.4875 105.42
10 1326.31 10.4642 104.64
11 1216.07 11.4128 103.75
12 1122.52 12.3639 103.03
13 1045.47 13.2751 102.12
14 979.37 14.1711 101.22
15 923.06 15.0355 100.24
16 873.25 15.8932 99.33
17 833.70 16.6471 97.92
18 797.43 17.4043 96.69
19 748.90 18.5321 97.54
20 714.03 19.4371 97.19
21 682.24 20.3429 96.87
22 651.62 21.2988 96.81
23 622.76 22.2858 96.89
24 596.58 23.2638 96.93
25 572.91 24.2249 96.90
26 555.04 25.0049 96.17
27 538.87 25.7552 95.39
28 522.49 26.5626 94.87
29 505.71 27.4440 94.63
30 488.50 28.4109 94.70
31 470.47 29.4997 95.16
32 456.58 30.3971 94.99
33 443.61 31.2858 94.81
34 430.88 32.2102 94.74
35 418.61 33.1543 94.73
36 407.52 34.0565 94.60
37 400.33 34.6682 93.70
38 392.30 35.3778 93.10
39 384.26 36.1180 92.61
40 378.27 36.6900 91.73
41 372.47 37.2613 90.88
42 366.40 37.8786 90.19
43 360.80 38.4665 89.46
44 355.57 39.0323 88.71
45 349.60 39.6988 88.22
46 343.76 40.3733 87.77
47 337.95 41.0673 87.38
48 332.84 41.6978 86.87

      The first column is the number of cores that solved the test problem. The second column is the elapsed time in seconds; the elapsed time decreases steadily, so the computation was indeed sped up by adding cores. The third column is the speedup, which shows super-linear performance in the range of 2 to 15 cores; in super-linear performance, the speedup is greater than the number of physical cores. For example, 2 cores achieve a speedup of 2.1970x. That looks illogical, but it happens, and it comes from caching. The fourth column is the efficiency, which likewise shows super-linear performance (efficiency above 100%) in the range of 2 to 15 cores. On this test problem, 48 cores still yield an efficiency of 86.87%, which is highly efficient.
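The speedup and efficiency columns follow directly from the elapsed times: speedup(p) = T(1)/T(p) and efficiency(p) = speedup(p)/p. A short check against a few rows of the table:

```python
# Speedup and efficiency computed from the measured elapsed times (seconds),
# using a few rows of the table above.
elapsed = {1: 13878.71, 2: 6317.14, 15: 923.06, 16: 873.25, 48: 332.84}
t1 = elapsed[1]  # single-core baseline

for p, tp in sorted(elapsed.items()):
    speedup = t1 / tp                   # T(1) / T(p)
    efficiency = 100.0 * speedup / p    # percent of ideal linear scaling
    print(f"{p:2d} cores: speedup {speedup:7.4f}, efficiency {efficiency:6.2f}%")
```

Running this reproduces the tabulated values, e.g. a 2.1970x speedup (109.85% efficiency) on 2 cores and a 41.6978x speedup (86.87% efficiency) on 48 cores.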