Equation Solution High Performance by Design
Parallel Dense Solver on 64 Cores
[Posted by Jenn-Ching Luo on Mar. 23, 2016]
This post shares the performance of a LAIPE2 parallel dense solver on 64 cores. Parallel computing is a trend. In 2016, who still uses a single-core computer? Possibly no one: nearly every computer is equipped with a multicore processor. The question is how much a multicore processor can speed up a computation, and not everyone has a feel for that. If you do not, this post shows that 64 cores ran this computation about 53 times faster than one core did. The detailed timing results are included in this post.
The computation ran on a Dell PowerEdge R815 with four 16-core Opteron 6276 processors, a total of 64 cores, under Windows Server 2008 R2. Because the Opteron 6276 can run at a higher frequency when 8 or fewer cores are in use, the processor's turbo boost was disabled so that parallel performance could be measured on equal terms. The timing results were obtained from the LAIPE2 subroutine laipe$Decompose_DAG_z16, which is a parallel dense solver for systems of equations. The example matrix is a 32-byte complex matrix of order 4,000-by-4,000, i.e., both the real and imaginary parts of each complex number are 16-byte variables. 32-byte complex arithmetic is extremely slow on current computers. Don't be surprised that one core took about 4 hours and 22 minutes to decompose this "small" matrix. It is extremely slow. First, let us see the timing result in the following.
Timing Result
Cores   Elapsed Time (s)   CPU Time, User Mode (s)   CPU Time, Kernel Mode (s)   Total CPU Time (s)
    1           15714.46                  15713.43                        1.00             15714.43
    2            7883.67                  15742.51                        0.98             15743.50
    3            5519.60                  16503.66                        1.39             16505.05
    4            4252.40                  16934.08                        1.31             16935.39
    5            3356.05                  16683.14                        1.48             16684.62
    6            2845.16                  16947.50                        1.39             16948.88
    7            2416.47                  16769.44                        1.26             16770.70
    8            2143.41                  16972.97                        2.09             16975.06
    9            1892.39                  16831.20                        1.68             16832.88
   10            1705.95                  16834.52                        1.78             16836.30
   11            1558.08                  16888.95                        1.51             16890.46
   12            1441.40                  17019.33                        1.97             17021.30
   13            1325.51                  16927.25                        1.72             16928.96
   14            1240.40                  17030.44                        2.22             17032.66
   15            1154.30                  16954.66                        1.87             16956.53
   16            1089.39                  17049.07                        1.76             17050.83
   17            1023.05                  16979.24                        1.84             16981.08
   18             970.87                  17039.15                        2.37             17041.52
   19             919.42                  17001.32                        2.15             17003.47
   20             878.44                  17072.17                        2.62             17074.79
   21             834.67                  17004.97                        2.25             17007.21
   22             800.75                  17061.24                        2.81             17064.04
   23             765.17                  17013.89                        2.20             17016.09
   24             736.03                  17064.64                        2.51             17067.15
   25             707.09                  17036.21                        2.54             17038.76
   26             680.84                  17034.64                        2.85             17037.49
   27             656.79                  17039.01                        2.89             17041.89
   28             636.31                  17089.71                        2.84             17092.55
   29             613.72                  17042.36                        2.82             17045.19
   30             595.14                  17071.75                        2.73             17074.48
   31             575.69                  17039.79                        2.92             17042.71
   32             558.67                  17044.44                        3.06             17047.49
   33             542.85                  17048.38                        2.96             17051.35
   34             528.19                  17056.03                        3.15             17059.18
   35             513.77                  17045.25                        3.26             17048.51
   36             500.64                  17051.61                        3.20             17054.81
   37             487.21                  17038.32                        3.68             17042.00
   38             475.32                  17034.22                        3.28             17037.49
   39             463.18                  17015.95                        3.49             17019.44
   40             452.53                  17015.58                        3.70             17019.27
   41             442.43                  17028.63                        4.13             17032.77
   42             433.28                  17047.23                        3.45             17050.68
   43             423.54                  17029.97                        3.62             17033.59
   44             414.92                  17052.77                        3.46             17056.23
   45             405.81                  17037.18                        3.93             17041.11
   46             397.91                  17029.04                        3.95             17032.99
   47             390.13                  17041.39                        3.51             17044.90
   48             382.56                  17033.44                        4.43             17037.87
   49             375.64                  17068.04                        3.87             17071.91
   50             369.22                  17049.27                        4.38             17053.65
   51             362.67                  17062.06                        4.46             17066.53
   52             356.31                  17070.77                        4.01             17074.78
   53             349.83                  17054.36                        4.37             17058.72
   54             344.06                  17064.67                        3.96             17068.63
   55             338.35                  17066.43                        4.27             17070.71
   56             333.06                  17057.20                        4.10             17061.30
   57             327.98                  17092.94                        4.43             17097.37
   58             323.33                  17082.69                        5.10             17087.79
   59             318.15                  17081.35                        4.73             17086.07
   60             313.28                  17071.66                        4.06             17075.71
   61             308.91                  17077.71                        4.96             17082.67
   62             304.55                  17101.91                        5.09             17106.99
   63             300.51                  17109.16                        4.52             17113.68
   64             296.82                  17136.60                        4.85             17141.45
The first thing to note from the timing results above is the time one core took to decompose the matrix: 15714.46 seconds, i.e., about 4 hours and 22 minutes. That is extremely slow.
Second, let us examine how much time 64 cores required for the same decomposition. The timing result shows that 64 cores took 296.82 seconds, so the solution is available in less than 5 minutes instead of more than 4 hours on one core. This result gives no reason to reject multicore applications, even though they are relatively difficult to develop.
Third, the elapsed time has not reached a limit yet. In parallel computing, it is possible that beyond some number of cores the elapsed time stops improving, or even gets worse, as more cores are added. That did not happen in this example: over the whole range from 1 to 64 cores, the elapsed time decreased with every additional core. This suggests that if more cores were available, the computing speed could be improved further. The detailed speedup and efficiency are as follows.
Speedup and Efficiency
The speedup and efficiency table has four columns: the number of cores, the elapsed time in seconds, the speedup, and the efficiency. Speedup is the one-core elapsed time divided by the elapsed time on the given number of cores, and efficiency is the speedup divided by the number of cores. Our interest is in speedup and efficiency, and two points are worth noting.
First, the example shows unusual behavior. Normally, efficiency decreases as cores are added. Here, however, the efficiency is not strictly decreasing. For example, the efficiency of four cores is 92.39%, so five cores would be expected to yield a lower efficiency. Instead, the efficiency of five cores is 93.65%, which is higher than that of four cores. That is unusual. At this moment, the actual cause is uncertain. One possible explanation is the cost of memory access.
Second, 64 cores improved the computing speed by about 53x and yielded an efficiency of 83%. This example gives one answer to the question of how much a multicore machine can speed up a computation.
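The speedup and efficiency figures quoted in this post can be recomputed from the elapsed times in the timing results. The following is a minimal Python sketch, not part of LAIPE2; the names ELAPSED, speedup, efficiency, and report are made up for illustration, and only a handful of core counts are copied from the timing results to keep it short.

```python
# Elapsed times in seconds, copied from the timing results in this post.
# Only a few core counts are included; add more entries as needed.
ELAPSED = {
    1: 15714.46,
    2: 7883.67,
    4: 4252.40,
    5: 3356.05,
    8: 2143.41,
    16: 1089.39,
    32: 558.67,
    64: 296.82,
}


def speedup(cores, elapsed=ELAPSED):
    """One-core elapsed time divided by the elapsed time on `cores` cores."""
    return elapsed[1] / elapsed[cores]


def efficiency(cores, elapsed=ELAPSED):
    """Speedup divided by the number of cores (1.0 would be ideal scaling)."""
    return speedup(cores, elapsed) / cores


def report(elapsed=ELAPSED):
    """Return (cores, elapsed time, speedup, efficiency) rows, sorted by cores."""
    return [
        (cores, elapsed[cores], speedup(cores, elapsed), efficiency(cores, elapsed))
        for cores in sorted(elapsed)
    ]


if __name__ == "__main__":
    print(f"{'Cores':>5} {'Elapsed (s)':>12} {'Speedup':>8} {'Efficiency':>10}")
    for cores, seconds, s, e in report():
        print(f"{cores:>5} {seconds:>12.2f} {s:>8.2f} {e:>10.2%}")
```

Run as a script, this prints the same four columns described above. For four and five cores it reproduces the 92.39% and 93.65% efficiencies discussed in this post, including the small rise from four to five cores.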