Equation Solution
High Performance by Design

Chunk Size and Parallel Performance

[Posted by Jenn-Ching Luo on Feb. 12, 2016 ]

This post is a follow-up to the previous post, "Parallel Performance of laipe$Decompose_DAG_4 on 48 cores". In that post, adding cores sped up the computation up to 35 cores; beyond 35 cores, no further speedup could be seen. The previous post mentioned that this could be improved by a "tune-up". There are several ways to improve parallel performance, one of which is to optimize the chunk size.

The previous post used a chunk size of 64x64, i.e., the dimension of a subblock. This post uses a smaller chunk, 48x48, to compare parallel performance. The results show that chunk size contributes to parallel performance.

The test example and computing environment are the same as in the previous post. The test problem is a dense matrix of 4-byte elements, of order 20,000 (20,000-by-20,000). The computing environment is a Dell PowerEdge R815 with four 1.9 GHz 12-core Opteron processors running Windows Server 2008. The only difference is the chunk size: in this post, the dense matrix was decomposed in 48-by-48 subblocks, while in the previous post it was decomposed in 64-by-64 subblocks.
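To make the two chunk sizes concrete, the short Python sketch below works out how many subblocks each one produces for this test matrix and how many bytes one subblock occupies. The matrix order and element size come from the post; the ceiling division for a partial edge block and the full square grid of blocks are assumptions for illustration, since the post does not describe how LAIPE2 lays out its subblocks internally.

```python
import math

N = 20_000       # matrix order, from the post
ELEM_BYTES = 4   # 4-byte elements, from the post


def block_stats(chunk: int) -> dict:
    """Subblock counts and sizes for a chunk-by-chunk tiling.

    Assumption: a partial block at the edge counts as one block
    (ceiling division), and the whole square grid is counted.
    """
    per_dim = math.ceil(N / chunk)  # blocks along one dimension
    return {
        "blocks_per_dim": per_dim,
        "total_blocks": per_dim * per_dim,
        "block_bytes": chunk * chunk * ELEM_BYTES,
    }


matrix_bytes = N * N * ELEM_BYTES  # storage for the full dense matrix

for chunk in (64, 48):
    print(chunk, block_stats(chunk))
print("matrix bytes:", matrix_bytes)
```

Under these assumptions, the smaller chunk gives more, smaller units of work (417 blocks per dimension of 9,216 bytes each, against 313 blocks of 16,384 bytes each), which is the trade-off the timings in this post explore.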

First, let us look at the timing results from decomposing the example matrix in 48-by-48 subblocks.

Timing Result

Cores: 1
Elapsed Time (Seconds): 4899.73
CPU Time in User Mode (Seconds): 4899.23
CPU Time in Kernel Mode (Seconds): 0.47
Total CPU Time (Seconds): 4899.69

Cores: 2
Elapsed Time (Seconds): 2447.19
CPU Time in User Mode (Seconds): 4881.71
CPU Time in Kernel Mode (Seconds): 0.31
Total CPU Time (Seconds): 4882.02

Cores: 3
Elapsed Time (Seconds): 1648.24
CPU Time in User Mode (Seconds): 4911.44
CPU Time in Kernel Mode (Seconds): 0.44
Total CPU Time (Seconds): 4911.88

Cores: 4
Elapsed Time (Seconds): 1245.68
CPU Time in User Mode (Seconds): 4934.02
CPU Time in Kernel Mode (Seconds): 0.45
Total CPU Time (Seconds): 4934.47

Cores: 5
Elapsed Time (Seconds): 1004.41
CPU Time in User Mode (Seconds): 4951.46
CPU Time in Kernel Mode (Seconds): 0.53
Total CPU Time (Seconds): 4951.99

Cores: 6
Elapsed Time (Seconds): 842.45
CPU Time in User Mode (Seconds): 4963.97
CPU Time in Kernel Mode (Seconds): 0.90
Total CPU Time (Seconds): 4964.87

Cores: 7
Elapsed Time (Seconds): 729.46
CPU Time in User Mode (Seconds): 4998.33
CPU Time in Kernel Mode (Seconds): 0.75
Total CPU Time (Seconds): 4999.08

Cores: 8
Elapsed Time (Seconds): 644.35
CPU Time in User Mode (Seconds): 5027.66
CPU Time in Kernel Mode (Seconds): 0.86
Total CPU Time (Seconds): 5028.52

Cores: 9
Elapsed Time (Seconds): 578.69
CPU Time in User Mode (Seconds): 5058.16
CPU Time in Kernel Mode (Seconds): 1.40
Total CPU Time (Seconds): 5059.56

Cores: 10
Elapsed Time (Seconds): 525.90
CPU Time in User Mode (Seconds): 5087.61
CPU Time in Kernel Mode (Seconds): 1.36
Total CPU Time (Seconds): 5088.97

Cores: 11
Elapsed Time (Seconds): 482.59
CPU Time in User Mode (Seconds): 5114.99
CPU Time in Kernel Mode (Seconds): 1.67
Total CPU Time (Seconds): 5116.66

Cores: 12
Elapsed Time (Seconds): 445.98
CPU Time in User Mode (Seconds): 5141.61
CPU Time in Kernel Mode (Seconds): 1.78
Total CPU Time (Seconds): 5143.38

Cores: 13
Elapsed Time (Seconds): 414.79
CPU Time in User Mode (Seconds): 5154.96
CPU Time in Kernel Mode (Seconds): 1.81
Total CPU Time (Seconds): 5156.77

Cores: 14
Elapsed Time (Seconds): 387.58
CPU Time in User Mode (Seconds): 5169.62
CPU Time in Kernel Mode (Seconds): 2.32
Total CPU Time (Seconds): 5171.95

Cores: 15
Elapsed Time (Seconds): 364.61
CPU Time in User Mode (Seconds): 5191.42
CPU Time in Kernel Mode (Seconds): 2.01
Total CPU Time (Seconds): 5193.43

Cores: 16
Elapsed Time (Seconds): 345.01
CPU Time in User Mode (Seconds): 5217.64
CPU Time in Kernel Mode (Seconds): 2.23
Total CPU Time (Seconds): 5219.87

Cores: 17
Elapsed Time (Seconds): 327.82
CPU Time in User Mode (Seconds): 5241.24
CPU Time in Kernel Mode (Seconds): 3.26
Total CPU Time (Seconds): 5244.50

Cores: 18
Elapsed Time (Seconds): 312.70
CPU Time in User Mode (Seconds): 5276.11
CPU Time in Kernel Mode (Seconds): 2.67
Total CPU Time (Seconds): 5278.78

Cores: 19
Elapsed Time (Seconds): 298.32
CPU Time in User Mode (Seconds): 5291.66
CPU Time in Kernel Mode (Seconds): 2.04
Total CPU Time (Seconds): 5293.71

Cores: 20
Elapsed Time (Seconds): 285.43
CPU Time in User Mode (Seconds): 5311.15
CPU Time in Kernel Mode (Seconds): 2.75
Total CPU Time (Seconds): 5313.89

Cores: 21
Elapsed Time (Seconds): 274.80
CPU Time in User Mode (Seconds): 5340.32
CPU Time in Kernel Mode (Seconds): 2.92
Total CPU Time (Seconds): 5343.24

Cores: 22
Elapsed Time (Seconds): 264.86
CPU Time in User Mode (Seconds): 5367.48
CPU Time in Kernel Mode (Seconds): 3.14
Total CPU Time (Seconds): 5370.62

Cores: 23
Elapsed Time (Seconds): 256.48
CPU Time in User Mode (Seconds): 5411.18
CPU Time in Kernel Mode (Seconds): 2.78
Total CPU Time (Seconds): 5413.95

Cores: 24
Elapsed Time (Seconds): 249.30
CPU Time in User Mode (Seconds): 5459.74
CPU Time in Kernel Mode (Seconds): 3.73
Total CPU Time (Seconds): 5463.47

Cores: 25
Elapsed Time (Seconds): 242.96
CPU Time in User Mode (Seconds): 5519.03
CPU Time in Kernel Mode (Seconds): 3.96
Total CPU Time (Seconds): 5523.00

Cores: 26
Elapsed Time (Seconds): 236.72
CPU Time in User Mode (Seconds): 5570.83
CPU Time in Kernel Mode (Seconds): 3.63
Total CPU Time (Seconds): 5574.46

Cores: 27
Elapsed Time (Seconds): 229.76
CPU Time in User Mode (Seconds): 5584.18
CPU Time in Kernel Mode (Seconds): 4.54
Total CPU Time (Seconds): 5588.72

Cores: 28
Elapsed Time (Seconds): 225.31
CPU Time in User Mode (Seconds): 5650.54
CPU Time in Kernel Mode (Seconds): 4.18
Total CPU Time (Seconds): 5654.72

Cores: 29
Elapsed Time (Seconds): 220.96
CPU Time in User Mode (Seconds): 5712.43
CPU Time in Kernel Mode (Seconds): 3.85
Total CPU Time (Seconds): 5716.28

Cores: 30
Elapsed Time (Seconds): 215.38
CPU Time in User Mode (Seconds): 5727.14
CPU Time in Kernel Mode (Seconds): 3.28
Total CPU Time (Seconds): 5730.42

Cores: 31
Elapsed Time (Seconds): 211.66
CPU Time in User Mode (Seconds): 5793.49
CPU Time in Kernel Mode (Seconds): 3.95
Total CPU Time (Seconds): 5797.43

Cores: 32
Elapsed Time (Seconds): 206.87
CPU Time in User Mode (Seconds): 5811.74
CPU Time in Kernel Mode (Seconds): 4.43
Total CPU Time (Seconds): 5816.17

Cores: 33
Elapsed Time (Seconds): 202.37
CPU Time in User Mode (Seconds): 5822.33
CPU Time in Kernel Mode (Seconds): 4.77
Total CPU Time (Seconds): 5827.11

Cores: 34
Elapsed Time (Seconds): 198.98
CPU Time in User Mode (Seconds): 5867.65
CPU Time in Kernel Mode (Seconds): 4.54
Total CPU Time (Seconds): 5872.19

Cores: 35
Elapsed Time (Seconds): 195.98
CPU Time in User Mode (Seconds): 5922.11
CPU Time in Kernel Mode (Seconds): 5.24
Total CPU Time (Seconds): 5927.35

Cores: 36
Elapsed Time (Seconds): 192.66
CPU Time in User Mode (Seconds): 5952.03
CPU Time in Kernel Mode (Seconds): 4.95
Total CPU Time (Seconds): 5956.98

Cores: 37
Elapsed Time (Seconds): 189.99
CPU Time in User Mode (Seconds): 6009.89
CPU Time in Kernel Mode (Seconds): 5.74
Total CPU Time (Seconds): 6015.63

Cores: 38
Elapsed Time (Seconds): 187.92
CPU Time in User Mode (Seconds): 6081.98
CPU Time in Kernel Mode (Seconds): 7.02
Total CPU Time (Seconds): 6089.00

Cores: 39
Elapsed Time (Seconds): 185.49
CPU Time in User Mode (Seconds): 6137.48
CPU Time in Kernel Mode (Seconds): 6.05
Total CPU Time (Seconds): 6143.54

Cores: 40
Elapsed Time (Seconds): 183.99
CPU Time in User Mode (Seconds): 6196.00
CPU Time in Kernel Mode (Seconds): 7.61
Total CPU Time (Seconds): 6203.61

Cores: 41
Elapsed Time (Seconds): 181.24
CPU Time in User Mode (Seconds): 6235.17
CPU Time in Kernel Mode (Seconds): 7.85
Total CPU Time (Seconds): 6243.02

Cores: 42
Elapsed Time (Seconds): 179.90
CPU Time in User Mode (Seconds): 6299.68
CPU Time in Kernel Mode (Seconds): 7.36
Total CPU Time (Seconds): 6307.04

Cores: 43
Elapsed Time (Seconds): 180.02
CPU Time in User Mode (Seconds): 6433.48
CPU Time in Kernel Mode (Seconds): 6.82
Total CPU Time (Seconds): 6440.30

Cores: 44
Elapsed Time (Seconds): 179.09
CPU Time in User Mode (Seconds): 6499.17
CPU Time in Kernel Mode (Seconds): 7.82
Total CPU Time (Seconds): 6506.99

Cores: 45
Elapsed Time (Seconds): 177.61
CPU Time in User Mode (Seconds): 6595.05
CPU Time in Kernel Mode (Seconds): 7.38
Total CPU Time (Seconds): 6602.43

Cores: 46
Elapsed Time (Seconds): 177.93
CPU Time in User Mode (Seconds): 6691.76
CPU Time in Kernel Mode (Seconds): 9.14
Total CPU Time (Seconds): 6700.90

Cores: 47
Elapsed Time (Seconds): 176.17
CPU Time in User Mode (Seconds): 6738.68
CPU Time in Kernel Mode (Seconds): 8.81
Total CPU Time (Seconds): 6747.50

Cores: 48
Elapsed Time (Seconds): 175.80
CPU Time in User Mode (Seconds): 6830.43
CPU Time in Kernel Mode (Seconds): 7.30
Total CPU Time (Seconds): 6837.73

Elapsed time decreased almost every time one more core was enabled; the exceptions were 43 and 46 cores. For example, 42 cores took 179.90 seconds, so 43 cores would be expected to take less than 179.90 seconds. However, 43 cores took 180.02 seconds. The extra core did not speed up the computation, but slowed it down slightly. One potential cause of this counterintuitive result is memory cache behavior.
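The exceptions can be picked out mechanically. The following Python sketch scans the elapsed times from the timing results above and reports every core count that was slower than the one before it. The list is copied from the results for 48-by-48 subblocks; the function name `regressions` is only an illustrative choice.

```python
# Elapsed times (seconds) for 1..48 cores, copied from the timing results above.
elapsed = [
    4899.73, 2447.19, 1648.24, 1245.68, 1004.41, 842.45, 729.46, 644.35,
    578.69, 525.90, 482.59, 445.98, 414.79, 387.58, 364.61, 345.01,
    327.82, 312.70, 298.32, 285.43, 274.80, 264.86, 256.48, 249.30,
    242.96, 236.72, 229.76, 225.31, 220.96, 215.38, 211.66, 206.87,
    202.37, 198.98, 195.98, 192.66, 189.99, 187.92, 185.49, 183.99,
    181.24, 179.90, 180.02, 179.09, 177.61, 177.93, 176.17, 175.80,
]


def regressions(times):
    """Return (cores, previous_time, current_time) wherever adding a core was slower.

    times[i] holds the elapsed time for i + 1 cores.
    """
    return [
        (i + 1, times[i - 1], times[i])
        for i in range(1, len(times))
        if times[i] > times[i - 1]
    ]


for cores, before, after in regressions(elapsed):
    print(f"{cores} cores: {after:.2f} s, slower than {before:.2f} s on {cores - 1} cores")
```

Both slowdowns are small (0.12 and 0.32 seconds), so run-to-run variation may also play a part; the post reports a single timing per core count.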

The above results also show that, in most cases, total CPU time increased with the number of cores enabled. For example, one core used 4899.69 seconds of CPU time to decompose the matrix, while 48 cores used 6837.73 seconds for the same job. The extra CPU time is the price paid for cache coherence, which is the main factor degrading parallel performance in shared-memory environments. Speedup and efficiency are listed below.

Speedup and Efficiency

Number of Cores   Elapsed Time (sec)   Speedup   Efficiency (%)
       1               4899.73          1.0000       100.00
       2               2447.19          2.0022       100.11
       3               1648.24          2.9727        99.09
       4               1245.68          3.9334        98.33
       5               1004.41          4.8782        97.56
       6                842.45          5.8160        96.93
       7                729.46          6.7169        95.96
       8                644.35          7.6041        95.05
       9                578.69          8.4669        94.08
      10                525.90          9.3168        93.17
      11                482.59         10.1530        92.30
      12                445.98         10.9864        91.55
      13                414.79         11.8126        90.87
      14                387.58         12.6419        90.30
      15                364.61         13.4382        89.59
      16                345.01         14.2017        88.76
      17                327.82         14.9464        87.92
      18                312.70         15.6691        87.05
      19                298.32         16.4244        86.44
      20                285.43         17.1661        85.83
      21                274.80         17.8302        84.91
      22                264.86         18.4993        84.09
      23                256.48         19.1038        83.06
      24                249.30         19.6540        81.89
      25                242.96         20.1668        80.67
      26                236.72         20.6984        79.61
      27                229.76         21.3254        78.98
      28                225.31         21.7466        77.67
      29                220.96         22.1747        76.46
      30                215.38         22.7492        75.83
      31                211.66         23.1491        74.67
      32                206.87         23.6851        74.02
      33                202.37         24.2117        73.37
      34                198.98         24.6242        72.42
      35                195.98         25.0012        71.43
      36                192.66         25.4320        70.64
      37                189.99         25.7894        69.70
      38                187.92         26.0735        68.61
      39                185.49         26.4151        67.73
      40                183.99         26.6304        66.58
      41                181.24         27.0345        65.94
      42                179.90         27.2359        64.85
      43                180.02         27.2177        63.30
      44                179.09         27.3590        62.18
      45                177.61         27.5870        61.30
      46                177.93         27.5374        59.86
      47                176.17         27.8125        59.18
      48                175.80         27.8710        58.06
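The speedup and efficiency figures are consistent with the usual definitions: speedup is the one-core elapsed time divided by the elapsed time on p cores, and efficiency is speedup divided by p, expressed as a percentage. The post does not spell these formulas out, so the Python sketch below is a reader's check against a few rows of the table, not the author's own tooling.

```python
def speedup_and_efficiency(t1: float, tp: float, cores: int) -> tuple[float, float]:
    """Return (speedup, efficiency in percent) for an elapsed time tp on `cores` cores.

    t1 is the elapsed time on one core.
    """
    speedup = t1 / tp
    efficiency = 100.0 * speedup / cores
    return speedup, efficiency


T1 = 4899.73  # one-core elapsed time (seconds), from the table

# A few (cores, elapsed seconds) rows taken from the table.
samples = [(2, 2447.19), (14, 387.58), (48, 175.80)]

for cores, tp in samples:
    s, e = speedup_and_efficiency(T1, tp, cores)
    print(f"{cores:2d} cores: speedup {s:.4f}, efficiency {e:.2f}%")
```

An efficiency slightly above 100% at 2 cores, as in the table, is not an error in the formula; it simply means the two-core run took a little less than half the one-core time.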

The above table shows that the LAIPE2 subroutine laipe$Decompose_DAG_4 yields an efficiency above 90% with up to 14 cores in this example, while the efficiency at 48 cores drops to 58%. Next, we compare parallel performance for the two chunk sizes.

Comparison in Different Chunk Sizes

                    Elapsed Time (sec)
Number of Cores      64x64       48x48
       1            4501.97     4899.73
       2            2136.72     2447.19
       3            1433.56     1648.24
       4            1079.81     1245.68
       5             870.33     1004.41
       6             729.94      842.45
       7             638.90      729.46
       8             567.95      644.35
       9             513.43      578.69
      10             469.56      525.90
      11             434.03      482.59
      12             402.70      445.98
      13             375.71      414.79
      14             352.75      387.58
      15             333.64      364.61
      16             317.20      345.01
      17             302.72      327.82
      18             288.65      312.70
      19             277.84      298.32
      20             266.92      285.43
      21             258.49      274.80
      22             248.96      264.86
      23             242.85      256.48
      24             247.93      249.30
      25             230.57      242.96
      26             226.73      236.72
      27             221.71      229.76
      28             216.84      225.31
      29             214.83      220.96
      30             209.79      215.38
      31             208.68      211.66
      32             207.08      206.87
      33             206.95      202.37
      34             203.94      198.98
      35             201.44      195.98
      36             203.92      192.66
      37             201.41      189.99
      38             202.69      187.92
      39             203.55      185.49
      40             205.38      183.99
      41             203.72      181.24
      42             203.78      179.90
      43             207.47      180.02
      44             208.18      179.09
      45             208.14      177.61
      46             206.81      177.93
      47             210.04      176.17
      48             208.88      175.80

The above table lists elapsed time for chunk sizes 64x64 and 48x48. It shows that the bigger chunk size, 64x64, ran faster than the smaller chunk size, 48x48, when fewer than 32 cores were enabled. For example, on one core, it took 4501.97 seconds to decompose the matrix in 64x64 subblocks and 4899.73 seconds in 48x48 subblocks; with 31 cores, it took 208.68 seconds in 64x64 subblocks and 211.66 seconds in 48x48 subblocks.

With more than 31 cores, the result reverses: the smaller chunk, 48x48, ran faster. For example, with 32 cores, the smaller chunk finished in 206.87 seconds, while 64x64 subblocks required 207.08 seconds to complete the decomposition. With 48 cores, it took only 175.80 seconds to decompose in 48x48 subblocks, while 64x64 subblocks required 208.88 seconds.
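The crossover point can be read directly off the two columns of the comparison table. As a rough illustration, the Python sketch below finds the first core count at which the 48x48 run beats the 64x64 run, and the core count at which each chunk size reached its best time. The two lists are copied from the table; the variable and function names are illustrative only.

```python
# Elapsed times (seconds) for 1..48 cores, copied from the comparison table.
t64 = [
    4501.97, 2136.72, 1433.56, 1079.81, 870.33, 729.94, 638.90, 567.95,
    513.43, 469.56, 434.03, 402.70, 375.71, 352.75, 333.64, 317.20,
    302.72, 288.65, 277.84, 266.92, 258.49, 248.96, 242.85, 247.93,
    230.57, 226.73, 221.71, 216.84, 214.83, 209.79, 208.68, 207.08,
    206.95, 203.94, 201.44, 203.92, 201.41, 202.69, 203.55, 205.38,
    203.72, 203.78, 207.47, 208.18, 208.14, 206.81, 210.04, 208.88,
]
t48 = [
    4899.73, 2447.19, 1648.24, 1245.68, 1004.41, 842.45, 729.46, 644.35,
    578.69, 525.90, 482.59, 445.98, 414.79, 387.58, 364.61, 345.01,
    327.82, 312.70, 298.32, 285.43, 274.80, 264.86, 256.48, 249.30,
    242.96, 236.72, 229.76, 225.31, 220.96, 215.38, 211.66, 206.87,
    202.37, 198.98, 195.98, 192.66, 189.99, 187.92, 185.49, 183.99,
    181.24, 179.90, 180.02, 179.09, 177.61, 177.93, 176.17, 175.80,
]


def best_cores(times):
    """Return the core count (1-based) with the smallest elapsed time."""
    return min(range(1, len(times) + 1), key=lambda c: times[c - 1])


# First core count at which the smaller chunk is faster than the larger one.
crossover = next(
    cores for cores, (big, small) in enumerate(zip(t64, t48), start=1) if small < big
)

print("48x48 first wins at", crossover, "cores")
print("64x64 best at", best_cores(t64), "cores:", t64[best_cores(t64) - 1], "s")
print("48x48 best at", best_cores(t48), "cores:", t48[best_cores(t48) - 1], "s")
```

On this data the 64x64 run stops improving in the mid-30s of cores, while the 48x48 run is still improving at 48 cores, which is where the gap between the two chunk sizes is widest.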

This post presents a comparison showing that chunk size can affect parallel performance.