Equation Solution High Performance by Design
Parallel Performance of laipe$Decompose_DAG_16 on 48 Cores
[Posted by Jenn-Ching Luo on Mar. 16, 2016 ]
This post shares a set of parallel performance results for the LAIPE2 subroutine laipe$Decompose_DAG_16, a parallel dense solver for systems of equations. In this set of results, the computation was sped up by about 36x when using 48 cores.
The example was implemented with neuLoop on homogeneous cores. The computation ran on a Dell PowerEdge R815 with four 12-core 1.9 GHz Opterons, 48 cores in total, under Windows Server 2008 R2. The example decomposed a matrix [A] of 16-byte real elements into [L][U]. Because 16-byte real arithmetic is very slow on this computer, the example matrix was kept to order 5,000 (5,000-by-5,000). The timing results are as follows:
Timing Result
Cores   Elapsed Time (s)   CPU Time, User Mode (s)   CPU Time, Kernel Mode (s)   Total CPU Time (s)
    1            5786.84                   5786.64                        0.20              5786.84
    2            2896.97                   5776.67                        0.09              5776.76
    3            1941.53                   5775.16                        0.27              5775.42
    4            1459.73                   5770.43                        0.45              5770.88
    5            1174.95                   5767.79                        0.76              5768.56
    6             983.13                   5764.30                        0.92              5765.22
    7             848.18                   5770.41                        0.87              5771.29
    8             745.15                   5778.98                        1.17              5780.15
    9             667.59                   5779.76                        1.29              5781.05
   10             604.25                   5788.20                        1.14              5789.34
   11             552.85                   5795.03                        1.28              5796.31
   12             508.73                   5795.78                        1.53              5797.31
   13             473.29                   5797.20                        0.97              5798.17
   14             441.97                   5805.08                        1.98              5807.06
   15             415.27                   5809.15                        2.36              5811.51
   16             390.83                   5809.98                        2.95              5812.92
   17             370.89                   5816.64                        2.29              5818.93
   18             352.17                   5818.68                        2.32              5821.01
   19             335.81                   5822.27                        2.68              5824.95
   20             320.49                   5829.04                        2.42              5831.46
   21             307.60                   5831.08                        2.68              5833.77
   22             295.53                   5834.83                        2.06              5836.89
   23             284.52                   5840.93                        2.54              5843.47
   24             273.85                   5845.76                        3.63              5849.40
   25             264.95                   5849.16                        3.04              5852.21
   26             256.39                   5852.08                        4.34              5856.42
   27             248.62                   5856.34                        4.15              5860.49
   28             240.91                   5862.89                        4.38              5867.28
   29             234.45                   5862.64                        4.46              5867.10
   30             228.04                   5867.53                        4.77              5872.30
   31             222.25                   5877.56                        4.57              5882.13
   32             216.19                   5882.50                        5.85              5888.35
   33             211.37                   5884.03                        4.04              5888.07
   34             206.36                   5891.78                        4.54              5896.32
   35             201.90                   5895.85                        3.90              5899.75
   36             197.36                   5902.24                        5.54              5907.77
   37             193.61                   5909.47                        6.19              5915.67
   38             189.71                   5912.98                        4.71              5917.70
   39             186.00                   5918.41                        5.69              5924.11
   40             182.32                   5924.47                        5.12              5929.58
   41             179.28                   5923.31                        5.40              5928.71
   42             175.95                   5928.37                        4.90              5933.26
   43             173.08                   5932.58                        5.76              5938.33
   44             170.29                   5943.11                        6.46              5949.57
   45             167.65                   5942.84                        7.47              5950.32
   46             165.21                   5948.08                        5.87              5953.95
   47             162.93                   5953.33                        7.64              5960.97
   48             160.28                   5953.36                        6.85              5960.21
As the table shows, elapsed time fell as more cores were enabled. One core took 5786.84 seconds to complete the computation; two cores cut the elapsed time to 2896.97 seconds, almost half of the single-core time; three cores finished in 1941.53 seconds, almost one third of the single-core time. With 48 cores, it took only 160.28 seconds to decompose the matrix. This example took advantage of multicore.
A job that takes about 96 minutes on one core (5786.84 seconds) can be completed in under 3 minutes (160.28 seconds) on 48 cores. That is what multicore is designed for: we can use more cores to speed up computing. The speedup and efficiency are listed in the following.
Table: Speedup and Efficiency
The above table has four columns. The first column is the number of cores and the second is the elapsed time in seconds. The third column is the speedup, that is, the single-core elapsed time divided by the elapsed time on the given number of cores; this is the result of main interest. The example yielded almost perfect speedup up to about 10 cores. For example, two cores yielded a speedup of 1.9975, three cores 2.9806, and four cores 3.9643. 48 cores achieved a speedup of 36.1046. This example yielded no super-linear speedup. The fourth column is the efficiency, the speedup divided by the number of cores. Up to 20 cores, the example achieved an efficiency over 90%. With 48 cores, it reached about 75% parallel efficiency. This example shows that efficient parallel programs can be developed to take advantage of multicore. The LAIPE2 subroutines are programmed in neuLoop.
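The speedup and efficiency figures quoted above can be re-derived from the elapsed times in the timing table. The following is a minimal Python sketch, not part of LAIPE2 or neuLoop; the names `elapsed`, `speedup`, `efficiency`, and `scaling_table` are illustrative assumptions, and only a handful of core counts from the timing table are included.

```python
# Elapsed times in seconds, copied from the timing table for selected core counts.
# The dictionary name and the helper functions below are illustrative only.
elapsed = {
    1: 5786.84,
    2: 2896.97,
    3: 1941.53,
    4: 1459.73,
    10: 604.25,
    20: 320.49,
    48: 160.28,
}


def speedup(t1, tp):
    """Speedup: single-core elapsed time divided by elapsed time on p cores."""
    return t1 / tp


def efficiency(t1, tp, cores):
    """Efficiency: speedup divided by the number of cores used."""
    return speedup(t1, tp) / cores


def scaling_table(times):
    """Return rows of (cores, elapsed, speedup, efficiency), sorted by core count."""
    t1 = times[1]
    return [
        (p, tp, speedup(t1, tp), efficiency(t1, tp, p))
        for p, tp in sorted(times.items())
    ]


print(f"{'Cores':>5} {'Elapsed (s)':>12} {'Speedup':>9} {'Efficiency':>11}")
for cores, tp, s, e in scaling_table(elapsed):
    print(f"{cores:5d} {tp:12.2f} {s:9.4f} {e:11.1%}")
```

For 48 cores this gives 5786.84 / 160.28, which is a speedup of about 36.10 and an efficiency of about 75%, in line with the figures reported in the post.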