Equation Solution High Performance by Design
Chunk Size and Parallel Performance
[Posted by Jenn-Ching Luo on Feb. 12, 2016 ]
This post is a follow-up to the previous post, "Parallel Performance of laipe$Decompose_DAG_4 on 48 cores". In that post, adding cores sped up the computation only up to 35 cores; beyond 35 cores, no further speedup could be seen. The previous post mentioned that this problem could be improved by a "tune-up". There are several ways to improve parallel performance, one of which is to optimize the chunk size.
The previous post was implemented with a chunk size of 64x64, i.e., the dimension of a subblock. This post uses a smaller chunk, 48x48, to compare parallel performance and to show that chunk size contributes to it. The testing example and computing environment are the same as in the previous post. The testing problem is a 4-byte dense matrix of order 20,000-by-20,000. The computing environment is a Dell PowerEdge R815 with quad 1.9 GHz 12-core Opteron processors on Windows Server 2008. The only difference is the chunk size: in this post, the dense matrix was decomposed in 48-by-48 subblocks, while in the previous post it was decomposed in 64-by-64 subblocks. First, let us see the timing result from decomposing the example matrix in 48-by-48 subblocks.
Timing Result
All times are in seconds.
Cores  Elapsed Time  CPU Time in User Mode  CPU Time in Kernel Mode  Total CPU Time
    1       4899.73                4899.23                     0.47         4899.69
    2       2447.19                4881.71                     0.31         4882.02
    3       1648.24                4911.44                     0.44         4911.88
    4       1245.68                4934.02                     0.45         4934.47
    5       1004.41                4951.46                     0.53         4951.99
    6        842.45                4963.97                     0.90         4964.87
    7        729.46                4998.33                     0.75         4999.08
    8        644.35                5027.66                     0.86         5028.52
    9        578.69                5058.16                     1.40         5059.56
   10        525.90                5087.61                     1.36         5088.97
   11        482.59                5114.99                     1.67         5116.66
   12        445.98                5141.61                     1.78         5143.38
   13        414.79                5154.96                     1.81         5156.77
   14        387.58                5169.62                     2.32         5171.95
   15        364.61                5191.42                     2.01         5193.43
   16        345.01                5217.64                     2.23         5219.87
   17        327.82                5241.24                     3.26         5244.50
   18        312.70                5276.11                     2.67         5278.78
   19        298.32                5291.66                     2.04         5293.71
   20        285.43                5311.15                     2.75         5313.89
   21        274.80                5340.32                     2.92         5343.24
   22        264.86                5367.48                     3.14         5370.62
   23        256.48                5411.18                     2.78         5413.95
   24        249.30                5459.74                     3.73         5463.47
   25        242.96                5519.03                     3.96         5523.00
   26        236.72                5570.83                     3.63         5574.46
   27        229.76                5584.18                     4.54         5588.72
   28        225.31                5650.54                     4.18         5654.72
   29        220.96                5712.43                     3.85         5716.28
   30        215.38                5727.14                     3.28         5730.42
   31        211.66                5793.49                     3.95         5797.43
   32        206.87                5811.74                     4.43         5816.17
   33        202.37                5822.33                     4.77         5827.11
   34        198.98                5867.65                     4.54         5872.19
   35        195.98                5922.11                     5.24         5927.35
   36        192.66                5952.03                     4.95         5956.98
   37        189.99                6009.89                     5.74         6015.63
   38        187.92                6081.98                     7.02         6089.00
   39        185.49                6137.48                     6.05         6143.54
   40        183.99                6196.00                     7.61         6203.61
   41        181.24                6235.17                     7.85         6243.02
   42        179.90                6299.68                     7.36         6307.04
   43        180.02                6433.48                     6.82         6440.30
   44        179.09                6499.17                     7.82         6506.99
   45        177.61                6595.05                     7.38         6602.43
   46        177.93                6691.76                     9.14         6700.90
   47        176.17                6738.68                     8.81         6747.50
   48        175.80                6830.43                     7.30         6837.73
Elapsed time dropped nearly every time one more core was enabled; the exceptions were 43 and 46 cores. For example, 42 cores took 179.90 seconds. Logically, 43 cores should take less than 179.90 seconds; however, 43 cores took 180.02 seconds. The extra core did not speed up the computation but slowed it down slightly. One potential cause of this illogical result is the memory cache. From the above results, it can also be seen that in most situations the total CPU time increased with the number of cores enabled.
For example, one core took 4899.69 seconds of total CPU time to decompose the matrix, while 48 cores took 6837.73 seconds for the same job. The additional CPU time is the price paid for cache coherence, which is the main factor degrading parallel performance in shared-memory environments. Speedup and efficiency are listed in the following.
Speedup and Efficiency
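Speedup and efficiency follow directly from the elapsed times in the timing result. The sketch below is only an illustration, not part of LAIPE2: it assumes the usual definitions, speedup = T(1) / T(p) and efficiency = speedup / p, and the names `ELAPSED_48`, `speedup`, and `efficiency` are made up for this example. Only a few core counts are copied from the timing result to keep it short.

```python
# Elapsed times (seconds) for 48x48 subblocks, copied from the timing result.
# Keys are the number of cores enabled; only a few rows are included here.
ELAPSED_48 = {
    1: 4899.73,
    14: 387.58,
    15: 364.61,
    32: 206.87,
    48: 175.80,
}


def speedup(elapsed, cores):
    """Speedup relative to the one-core run: T(1) / T(p)."""
    return elapsed[1] / elapsed[cores]


def efficiency(elapsed, cores):
    """Parallel efficiency: speedup divided by the number of cores."""
    return speedup(elapsed, cores) / cores


if __name__ == "__main__":
    print(f"{'Cores':>5} {'Speedup':>8} {'Efficiency':>11}")
    for p in sorted(ELAPSED_48):
        s = speedup(ELAPSED_48, p)
        e = efficiency(ELAPSED_48, p)
        print(f"{p:>5} {s:>8.2f} {e:>11.1%}")
```

With these definitions, 14 cores give an efficiency just above 90% and 15 cores fall just below it, and 48 cores give about 58%.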
From the above table, it can be seen that the LAIPE2 subroutine laipe$Decompose_DAG_4 yields an efficiency above 90% up to 14 cores in this example, while the efficiency on 48 cores dropped to 58%. In the following, we compare parallel performance for different chunk sizes.
Comparison in Different Chunk Sizes
The above table lists elapsed time for chunk sizes 64x64 and 48x48. From the above, we can see that the bigger chunk size, 64x64, ran faster than the smaller chunk size, 48x48, when fewer than 32 cores were enabled. For example, on one core, it took 4501.97 seconds to decompose the matrix in 64x64 subblocks and 4899.73 seconds in 48x48 subblocks; with 31 cores, it took 208.68 seconds in 64x64 subblocks and 211.66 seconds in 48x48 subblocks. With more than 31 cores, the result is different: the smaller chunk, 48x48, ran faster. For example, with 32 cores, the smaller chunk finished in 206.87 seconds, while 64x64 subblocks required 207.08 seconds to complete the decomposition. With 48 cores, it took only 175.80 seconds to decompose in 48x48 subblocks, while 64x64 subblocks required 208.88 seconds. This post presents a comparison to show that chunk size may affect parallel performance.
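The four pairs of timings quoted above can be compared directly. The following is a small sketch using only those quoted numbers; the names `QUOTED`, `faster_chunk`, and `relative_gap` are hypothetical and chosen for this illustration.

```python
# Elapsed times (seconds) quoted in the text: cores -> (64x64, 48x48).
QUOTED = {
    1: (4501.97, 4899.73),
    31: (208.68, 211.66),
    32: (207.08, 206.87),
    48: (208.88, 175.80),
}


def faster_chunk(t64, t48):
    """Return the chunk size with the shorter elapsed time."""
    return "64x64" if t64 < t48 else "48x48"


def relative_gap(t64, t48):
    """Gap of 48x48 relative to 64x64; negative means 48x48 is faster."""
    return (t48 - t64) / t64


if __name__ == "__main__":
    for cores, (t64, t48) in sorted(QUOTED.items()):
        print(f"{cores:>2} cores: {faster_chunk(t64, t48)} is faster, "
              f"48x48 relative to 64x64: {relative_gap(t64, t48):+.1%}")
```

The crossover between 31 and 32 cores shows up as a sign change in the relative gap, and at 48 cores the 48x48 run is roughly 16% shorter than the 64x64 run.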