Equation Solution | High Performance by Design
Implementation of Constant-Bandwidth Solver of LAIPE2 on 48 Cores
[Posted by Jenn-Ching Luo on Nov. 20, 2014]
This post shows the parallel performance of a LAIPE2 solver on a 48-core computer. Multicore hardware can speed up a computation, and many readers may wonder how much improvement a multicore computer can actually deliver. This post presents a set of timing results to show the performance.
The test was run on a Dell PowerEdge R815 with four 1.9 GHz Opterons, each of which has 12 cores, for 48 cores in total. Only the Server Core installation of Microsoft Windows Server 2008 was on the computer, i.e., a command-driven environment without a graphical user interface. The testing problem was a sparse, symmetric, positive definite system of equations, [A]{X}={B}, where [A] is of order (50,000x50,000) and has a constant bandwidth of 7,000. The solution invoked laipe$decompose_CSP_10 and laipe$substitute_CSP_10. The subroutine laipe$decompose_CSP_10 decomposes the matrix [A] into a product of triangular matrices; the subroutine laipe$substitute_CSP_10 then performs the substitutions. Substitution is not a time-consuming procedure and is of less interest, so this post shows only the time spent in the subroutine laipe$decompose_CSP_10.

QUICK LOOK AT THE IMPRESSIVE PERFORMANCE
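The decompose-then-substitute approach can be illustrated with a small serial sketch. The internals of laipe$decompose_CSP_10 are not published, so the code below is only a plain banded Cholesky factorization ([A] = [L][L]^T) with forward and back substitution for a constant-bandwidth SPD matrix; the function names are invented for this illustration and are not part of LAIPE2.

```python
# Illustrative serial version of the factor/substitute split for an SPD
# matrix with constant half-bandwidth m. The band limits every inner loop,
# which is what makes constant-bandwidth solvers cheaper than dense ones.
import math

def cholesky_banded(A, m):
    """Return the banded Cholesky factor L (A = L L^T) of SPD matrix A
    with half-bandwidth m; A is a dense list-of-lists for clarity."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = A[j][j] - sum(L[j][k] ** 2 for k in range(max(0, j - m), j))
        L[j][j] = math.sqrt(s)
        # only rows inside the band below the diagonal are nonzero
        for i in range(j + 1, min(n, j + m + 1)):
            t = A[i][j] - sum(L[i][k] * L[j][k]
                              for k in range(max(0, i - m), j))
            L[i][j] = t / L[j][j]
    return L

def substitute(L, b, m):
    """Solve L y = b (forward), then L^T x = y (backward)."""
    n = len(L)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k]
                           for k in range(max(0, i - m), i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k]
                           for k in range(i + 1, min(n, i + m + 1)))) / L[i][i]
    return x
```

The post's solver distributes this work across cores; the sketch above only shows the numerical skeleton, where the decomposition dominates the cost and the substitutions are comparatively cheap.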
From the timing results below, it can be seen that one core took 13878.71 seconds to solve the testing problem, which is about 3 hours and 51 minutes. That is a long computation. Complex engineering or scientific problems may be formulated as systems with more than 100,000 unknowns, which could require even more time than this testing problem. Efficient parallel solvers to speed up such computations are always desirable.
An interesting question is how fast 48 cores can solve the testing problem. From the timing results below, we can see that 48 cores took only 332.84 seconds, about 5 minutes and 33 seconds. Wow! A computation of nearly four hours on one core could be done in about five and a half minutes on 48 cores. It is very clear that there is no reason to dismiss parallel computing; undoubtedly, we can benefit from multicore hardware and parallel programming.

One caveat is worth mentioning: parallel computing is not a cure-all. Sometimes it is an extra burden and yields no benefit. That can happen when the problem size is small and there is not enough computation to distribute among the cooperating cores. In such a situation, parallel computing gains nothing and may, on the contrary, slow the computation down.

TIMING RESULTS

Detailed timing results are listed in the following. The results include the elapsed time, the CPU time in user mode and in kernel mode, and the total CPU time. While solving the testing problem, the computer ran no other jobs except the operating system; under that circumstance the computer can be regarded as standalone, and the elapsed time can be used to measure speedup and efficiency. In general, the total CPU time increases when more cores are used, because more cores incur relatively higher overhead and more synchronization. However, a stroke of "good luck" may reduce the total CPU time: before an instruction is executed, its data must be loaded into cache memory, and when one core demands a specific piece of data and loads it into cache, other lucky cores that also demand that data may access it without having to load it themselves. The total CPU time may drop if such "good luck" occurs most of the time, and when it does, we can see "super-linear performance". Besides "good luck", we may also experience "bad luck" from cache coherency, which can degrade parallel performance.
With the above brief description, the following timing results should be understandable. Note that the total CPU times are not in increasing order.
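The distinction between elapsed (wall-clock) time and CPU time drawn above can be reproduced with Python's standard timers. This is a minimal single-process sketch, unrelated to the Windows tooling behind the post's measurements:

```python
# Elapsed time counts everything that passes on the clock, including
# waiting; CPU time counts only the time the processor actually worked
# for this process.
import time

wall0 = time.perf_counter()
cpu0 = time.process_time()

total = sum(i * i for i in range(200_000))  # real work: consumes CPU time
time.sleep(0.2)                             # waiting: elapses, uses no CPU

wall = time.perf_counter() - wall0
cpu = time.process_time() - cpu0
# wall includes the 0.2 s sleep, cpu does not, so wall > cpu here
print(f"elapsed: {wall:.3f} s, cpu: {cpu:.3f} s")
```

On a multicore run like the one in the post, the relationship flips: the cooperating cores all accumulate CPU time simultaneously, so the total CPU time typically exceeds the elapsed time.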
SPEEDUP AND EFFICIENCY

The above timing results are summarized in the following table.
The first column is the number of cores that solved the testing problem. The second column is the elapsed time in seconds; from the table, we can see the elapsed time decreases as cores are added, i.e., the computation was sped up by more cores. The third column is the speedup, which shows super-linear performance in the range of 2 to 15 cores. In super-linear performance, the speedup is greater than the number of physical cores; for example, 2 cores yield a speedup of 2.1970x. That looks illogical, but it happens, and it comes from caching. The fourth column is the efficiency, which likewise shows super-linear performance (efficiency higher than 100%) in the range of 2 to 15 cores. In this testing problem, 48 cores yield an efficiency of 86.87%, which is highly efficient.
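The speedup and efficiency figures can be recomputed from the two elapsed times quoted earlier (13878.71 s on one core, 332.84 s on 48 cores). The helper names below are ours, not part of LAIPE2:

```python
# Speedup = T(1) / T(p); efficiency = speedup / p, as a percentage.

def speedup(t_one_core, t_p_cores):
    return t_one_core / t_p_cores

def efficiency_pct(t_one_core, t_p_cores, p):
    return speedup(t_one_core, t_p_cores) / p * 100.0

s48 = speedup(13878.71, 332.84)             # about 41.70x on 48 cores
e48 = efficiency_pct(13878.71, 332.84, 48)  # about 86.87 %
print(f"speedup: {s48:.2f}x, efficiency: {e48:.2f}%")
```

Super-linear entries in the table are simply rows where `speedup(...)` exceeds the core count, i.e., where `efficiency_pct(...)` exceeds 100.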