Equation Solution
High Performance by Design

Parallel Performance of 8-Byte Matrix Multiplication

[Posted by Jenn-Ching Luo on June 14, 2016]

This post presents parallel performance results for an 8-byte (double precision) matrix product.

There are two related posts. One reported results for a 16-byte (quad precision) matrix product, and the other for a 10-byte (extended precision) matrix product. The parallel performance of those two cases was inconsistent: the 16-byte matrix product was sped up to 60 times on 64 cores, while the 10-byte matrix product showed no improvement when using more than 20 cores.

In this post, we look at another variable type, the 8-byte (double precision) matrix product. Its computation was sped up 40.5x on 48 cores, which is also inconsistent with the 10-byte result. Why different variable types lead to different performance can be explained, but this post does not address those technical issues; it only reports the parallel performance of an 8-byte (double precision) matrix product.

The test example, the computing environment, and the timing results behind the 40.5x speedup on 48 cores are as follows.

TESTING EXAMPLE

Compute [C]=[A][B], where [A], [B] and [C] are 8-byte real matrices. Matrix [A] is 15000-by-11000, matrix [B] is 11000-by-12000, and matrix [C] is 15000-by-12000.
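
To give a sense of the problem size, the storage needed for the three matrices can be worked out directly from the dimensions and the 8-byte element size. The sketch below is only arithmetic on the figures stated in the test description; the helper name footprint_bytes is made up for illustration, and dense storage with no padding is an assumption.

```python
# Dimensions from the test description; 8 bytes per element (double precision).
BYTES_PER_ELEMENT = 8

shapes = {
    "A": (15000, 11000),
    "B": (11000, 12000),
    "C": (15000, 12000),
}


def footprint_bytes(rows, cols, elem_bytes=BYTES_PER_ELEMENT):
    """Storage for a dense rows-by-cols matrix (assumes no padding)."""
    return rows * cols * elem_bytes


sizes = {name: footprint_bytes(r, c) for name, (r, c) in shapes.items()}
total_bytes = sum(sizes.values())

for name, nbytes in sizes.items():
    print(f"[{name}]: {nbytes / 1e9:.3f} GB")
print(f"total: {total_bytes / 1e9:.3f} GB")
```

Under these assumptions the three matrices together occupy a little under 4 GB.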

COMPUTING ENVIRONMENT

Computer: a Dell PowerEdge R815 with quad Opteron 6168, a total of 48 cores.
Operating System: Windows Server 2008 R2
Compiler: gfortran with optimization -O3. The application links against neuloop4 for parallel processing.
Subroutine: laipe\$matmul_8 which performs matrix multiplication in parallel

COMPARISON WITH GFORTRAN INTRINSIC FUNCTION MATMUL

gfortran provides the intrinsic function matmul for matrix multiplication. The intrinsic function matmul runs sequentially and cannot take advantage of multiple cores. Before showing the parallel performance of laipe\$matmul_8, we first compare laipe\$matmul_8, on one core, with the intrinsic function matmul.

First, let us see the performance of the intrinsic function matmul. The timing result is as follows:

Elapsed Time (Seconds): 7265.82
CPU Time in User Mode (Seconds): 7264.33
CPU Time in Kernel Mode (Seconds): 1.50
Total CPU Time (Seconds): 7265.82

The intrinsic function matmul took 7265.82 seconds to compute the matrix multiplication. Next, let us see the performance of the parallel subroutine laipe\$matmul_8 on one core. We have the following timing result:

Elapsed Time (Seconds): 7086.91
CPU Time in User Mode (Seconds): 7084.63
CPU Time in Kernel Mode (Seconds): 1.95
Total CPU Time (Seconds): 7086.58

With one core enabled, the subroutine laipe\$matmul_8 ran faster than the intrinsic function matmul. laipe\$matmul_8 is a parallel subroutine and carries extra code for parallel processing, so one would expect it to be slower than matmul. Nevertheless, with only one core enabled, laipe\$matmul_8 took 7086.91 seconds to compute the matrix multiplication, against 7265.82 seconds for matmul.
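
The size of the one-core gap can be quantified from the two elapsed times quoted above. The following is a small sketch using only those two figures; the variable names are illustrative.

```python
# Elapsed times (seconds) quoted in the post for the two one-core runs.
t_matmul = 7265.82  # gfortran intrinsic matmul
t_laipe = 7086.91   # laipe\$matmul_8 with one core enabled

saved = t_matmul - t_laipe   # seconds saved by the parallel subroutine
ratio = t_matmul / t_laipe   # how many times faster it ran

print(f"time saved: {saved:.2f} s")
print(f"elapsed-time ratio (matmul / laipe): {ratio:.4f}")
print(f"reduction in elapsed time: {100 * saved / t_matmul:.2f} %")
```

The difference is about 179 seconds, roughly 2.5% of the matmul time. This is a single measurement per routine, so the figure should be read as indicative only.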

TIMING RESULT

Timing results include "elapsed time", "CPU time in user mode", "CPU time in kernel mode", and "total CPU time". The timing results for 1 to 48 cores are as follows:

number of cores: 1
Elapsed Time (Seconds): 7086.91
CPU Time in User Mode (Seconds): 7084.63
CPU Time in Kernel Mode (Seconds): 1.95
Total CPU Time (Seconds): 7086.58

number of cores: 2
Elapsed Time (Seconds): 3538.93
CPU Time in User Mode (Seconds): 7076.46
CPU Time in Kernel Mode (Seconds): 0.67
Total CPU Time (Seconds): 7077.13

number of cores: 3
Elapsed Time (Seconds): 2360.84
CPU Time in User Mode (Seconds): 7080.57
CPU Time in Kernel Mode (Seconds): 0.89
Total CPU Time (Seconds): 7081.46

number of cores: 4
Elapsed Time (Seconds): 1776.20
CPU Time in User Mode (Seconds): 7085.67
CPU Time in Kernel Mode (Seconds): 0.80
Total CPU Time (Seconds): 7086.47

number of cores: 5
Elapsed Time (Seconds): 1419.08
CPU Time in User Mode (Seconds): 7092.35
CPU Time in Kernel Mode (Seconds): 0.70
Total CPU Time (Seconds): 7093.05

number of cores: 6
Elapsed Time (Seconds): 1183.78
CPU Time in User Mode (Seconds): 7099.17
CPU Time in Kernel Mode (Seconds): 0.83
Total CPU Time (Seconds): 7100.00

number of cores: 7
Elapsed Time (Seconds): 1021.89
CPU Time in User Mode (Seconds): 7103.77
CPU Time in Kernel Mode (Seconds): 0.72
Total CPU Time (Seconds): 7104.49

number of cores: 8
Elapsed Time (Seconds): 890.70
CPU Time in User Mode (Seconds): 7105.56
CPU Time in Kernel Mode (Seconds): 0.87
Total CPU Time (Seconds): 7106.44

number of cores: 9
Elapsed Time (Seconds): 796.12
CPU Time in User Mode (Seconds): 7111.38
CPU Time in Kernel Mode (Seconds): 0.98
Total CPU Time (Seconds): 7112.37

number of cores: 10
Elapsed Time (Seconds): 716.06
CPU Time in User Mode (Seconds): 7123.57
CPU Time in Kernel Mode (Seconds): 0.86
Total CPU Time (Seconds): 7124.43

number of cores: 11
Elapsed Time (Seconds): 654.52
CPU Time in User Mode (Seconds): 7125.08
CPU Time in Kernel Mode (Seconds): 0.92
Total CPU Time (Seconds): 7126.00

number of cores: 12
Elapsed Time (Seconds): 597.61
CPU Time in User Mode (Seconds): 7123.61
CPU Time in Kernel Mode (Seconds): 0.81
Total CPU Time (Seconds): 7124.43

number of cores: 13
Elapsed Time (Seconds): 559.47
CPU Time in User Mode (Seconds): 7201.68
CPU Time in Kernel Mode (Seconds): 0.97
Total CPU Time (Seconds): 7202.64

number of cores: 14
Elapsed Time (Seconds): 522.09
CPU Time in User Mode (Seconds): 7224.69
CPU Time in Kernel Mode (Seconds): 0.80
Total CPU Time (Seconds): 7225.48

number of cores: 15
Elapsed Time (Seconds): 486.16
CPU Time in User Mode (Seconds): 7213.00
CPU Time in Kernel Mode (Seconds): 0.66
Total CPU Time (Seconds): 7213.66

number of cores: 16
Elapsed Time (Seconds): 458.07
CPU Time in User Mode (Seconds): 7242.91
CPU Time in Kernel Mode (Seconds): 0.92
Total CPU Time (Seconds): 7243.83

number of cores: 17
Elapsed Time (Seconds): 436.55
CPU Time in User Mode (Seconds): 7353.48
CPU Time in Kernel Mode (Seconds): 0.94
Total CPU Time (Seconds): 7354.42

number of cores: 18
Elapsed Time (Seconds): 414.09
CPU Time in User Mode (Seconds): 7353.29
CPU Time in Kernel Mode (Seconds): 0.81
Total CPU Time (Seconds): 7354.11

number of cores: 19
Elapsed Time (Seconds): 394.50
CPU Time in User Mode (Seconds): 7373.03
CPU Time in Kernel Mode (Seconds): 0.75
Total CPU Time (Seconds): 7373.78

number of cores: 20
Elapsed Time (Seconds): 368.94
CPU Time in User Mode (Seconds): 7259.41
CPU Time in Kernel Mode (Seconds): 0.95
Total CPU Time (Seconds): 7260.36

number of cores: 21
Elapsed Time (Seconds): 350.57
CPU Time in User Mode (Seconds): 7265.81
CPU Time in Kernel Mode (Seconds): 0.76
Total CPU Time (Seconds): 7266.57

number of cores: 22
Elapsed Time (Seconds): 336.23
CPU Time in User Mode (Seconds): 7298.26
CPU Time in Kernel Mode (Seconds): 1.05
Total CPU Time (Seconds): 7299.30

number of cores: 23
Elapsed Time (Seconds): 320.69
CPU Time in User Mode (Seconds): 7262.85
CPU Time in Kernel Mode (Seconds): 1.05
Total CPU Time (Seconds): 7263.89

number of cores: 24
Elapsed Time (Seconds): 308.52
CPU Time in User Mode (Seconds): 7318.74
CPU Time in Kernel Mode (Seconds): 1.00
Total CPU Time (Seconds): 7319.74

number of cores: 25
Elapsed Time (Seconds): 297.68
CPU Time in User Mode (Seconds): 7321.63
CPU Time in Kernel Mode (Seconds): 0.81
Total CPU Time (Seconds): 7322.44

number of cores: 26
Elapsed Time (Seconds): 286.96
CPU Time in User Mode (Seconds): 7322.41
CPU Time in Kernel Mode (Seconds): 1.00
Total CPU Time (Seconds): 7323.40

number of cores: 27
Elapsed Time (Seconds): 276.98
CPU Time in User Mode (Seconds): 7329.86
CPU Time in Kernel Mode (Seconds): 0.69
Total CPU Time (Seconds): 7330.55

number of cores: 28
Elapsed Time (Seconds): 267.32
CPU Time in User Mode (Seconds): 7333.12
CPU Time in Kernel Mode (Seconds): 0.80
Total CPU Time (Seconds): 7333.92

number of cores: 29
Elapsed Time (Seconds): 258.70
CPU Time in User Mode (Seconds): 7339.24
CPU Time in Kernel Mode (Seconds): 0.94
Total CPU Time (Seconds): 7340.17

number of cores: 30
Elapsed Time (Seconds): 249.40
CPU Time in User Mode (Seconds): 7342.81
CPU Time in Kernel Mode (Seconds): 0.78
Total CPU Time (Seconds): 7343.59

number of cores: 31
Elapsed Time (Seconds): 240.38
CPU Time in User Mode (Seconds): 7350.33
CPU Time in Kernel Mode (Seconds): 1.06
Total CPU Time (Seconds): 7351.39

number of cores: 32
Elapsed Time (Seconds): 233.81
CPU Time in User Mode (Seconds): 7365.81
CPU Time in Kernel Mode (Seconds): 0.67
Total CPU Time (Seconds): 7366.48

number of cores: 33
Elapsed Time (Seconds): 229.56
CPU Time in User Mode (Seconds): 7393.50
CPU Time in Kernel Mode (Seconds): 0.56
Total CPU Time (Seconds): 7394.06

number of cores: 34
Elapsed Time (Seconds): 221.99
CPU Time in User Mode (Seconds): 7401.73
CPU Time in Kernel Mode (Seconds): 0.83
Total CPU Time (Seconds): 7402.56

number of cores: 35
Elapsed Time (Seconds): 216.19
CPU Time in User Mode (Seconds): 7422.51
CPU Time in Kernel Mode (Seconds): 1.05
Total CPU Time (Seconds): 7423.56

number of cores: 36
Elapsed Time (Seconds): 211.94
CPU Time in User Mode (Seconds): 7456.69
CPU Time in Kernel Mode (Seconds): 0.80
Total CPU Time (Seconds): 7457.49

number of cores: 37
Elapsed Time (Seconds): 206.97
CPU Time in User Mode (Seconds): 7478.70
CPU Time in Kernel Mode (Seconds): 0.90
Total CPU Time (Seconds): 7479.61

number of cores: 38
Elapsed Time (Seconds): 203.18
CPU Time in User Mode (Seconds): 7513.54
CPU Time in Kernel Mode (Seconds): 0.75
Total CPU Time (Seconds): 7514.29

number of cores: 39
Elapsed Time (Seconds): 198.26
CPU Time in User Mode (Seconds): 7555.64
CPU Time in Kernel Mode (Seconds): 0.87
Total CPU Time (Seconds): 7556.52

number of cores: 40
Elapsed Time (Seconds): 195.20
CPU Time in User Mode (Seconds): 7592.27
CPU Time in Kernel Mode (Seconds): 0.84
Total CPU Time (Seconds): 7593.11

number of cores: 41
Elapsed Time (Seconds): 190.12
CPU Time in User Mode (Seconds): 7647.62
CPU Time in Kernel Mode (Seconds): 0.98
Total CPU Time (Seconds): 7648.60

number of cores: 42
Elapsed Time (Seconds): 187.48
CPU Time in User Mode (Seconds): 7686.03
CPU Time in Kernel Mode (Seconds): 1.03
Total CPU Time (Seconds): 7687.06

number of cores: 43
Elapsed Time (Seconds): 185.58
CPU Time in User Mode (Seconds): 7776.79
CPU Time in Kernel Mode (Seconds): 0.95
Total CPU Time (Seconds): 7777.74

number of cores: 44
Elapsed Time (Seconds): 183.57
CPU Time in User Mode (Seconds): 7835.40
CPU Time in Kernel Mode (Seconds): 0.89
Total CPU Time (Seconds): 7836.29

number of cores: 45
Elapsed Time (Seconds): 178.98
CPU Time in User Mode (Seconds): 7895.26
CPU Time in Kernel Mode (Seconds): 0.95
Total CPU Time (Seconds): 7896.21

number of cores: 46
Elapsed Time (Seconds): 178.65
CPU Time in User Mode (Seconds): 8006.25
CPU Time in Kernel Mode (Seconds): 1.25
Total CPU Time (Seconds): 8007.50

number of cores: 47
Elapsed Time (Seconds): 178.11
CPU Time in User Mode (Seconds): 8093.89
CPU Time in Kernel Mode (Seconds): 0.86
Total CPU Time (Seconds): 8094.75

number of cores: 48
Elapsed Time (Seconds): 174.81
CPU Time in User Mode (Seconds): 8228.79
CPU Time in Kernel Mode (Seconds): 1.05
Total CPU Time (Seconds): 8229.83
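
Because every record in the listing has the same five-line layout, it is easy to load into a script for further analysis. The sketch below parses three records copied from the listing, checks that user plus kernel time reproduces the reported total CPU time (up to rounding), and prints the ratio of total CPU time to cores times elapsed time. The function parse_log and the short field names are hypothetical helpers, not part of any tool used in the post.

```python
# Three records copied from the listing; every core count uses the same layout.
LOG = """\
number of cores: 1
Elapsed Time (Seconds): 7086.91
CPU Time in User Mode (Seconds): 7084.63
CPU Time in Kernel Mode (Seconds): 1.95
Total CPU Time (Seconds): 7086.58

number of cores: 24
Elapsed Time (Seconds): 308.52
CPU Time in User Mode (Seconds): 7318.74
CPU Time in Kernel Mode (Seconds): 1.00
Total CPU Time (Seconds): 7319.74

number of cores: 48
Elapsed Time (Seconds): 174.81
CPU Time in User Mode (Seconds): 8228.79
CPU Time in Kernel Mode (Seconds): 1.05
Total CPU Time (Seconds): 8229.83
"""

# Map the labels printed in the log to short dictionary keys.
FIELDS = {
    "Elapsed Time (Seconds)": "elapsed",
    "CPU Time in User Mode (Seconds)": "user",
    "CPU Time in Kernel Mode (Seconds)": "kernel",
    "Total CPU Time (Seconds)": "total_cpu",
}


def parse_log(text):
    """Return one dict per 'number of cores' record."""
    records = []
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if not sep:
            continue  # blank line between records
        label, value = label.strip(), value.strip()
        if label == "number of cores":
            records.append({"cores": int(value)})
        elif label in FIELDS and records:
            records[-1][FIELDS[label]] = float(value)
    return records


records = parse_log(LOG)
for r in records:
    # User plus kernel time should match the reported total, up to rounding.
    assert abs(r["user"] + r["kernel"] - r["total_cpu"]) < 0.02
    busy = r["total_cpu"] / (r["cores"] * r["elapsed"])
    print(f'{r["cores"]:2d} cores: CPU time / (cores x elapsed) = {busy:.3f}')
```

Pasting the full listing into LOG would process all 48 records the same way.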

The timing results above show that elapsed time fell almost in proportion to the reciprocal of the number of cores used. For example, one core took 7086.91 seconds to finish the matrix product, two cores cut the elapsed time to 3538.93 seconds, three cores took 2360.84 seconds, and so on.

On 48 cores, the matrix product took 174.81 seconds. One core took 7086.91 seconds, i.e., about 1 hour and 58 minutes. In other words, 48 cores finished a roughly 2-hour job in less than 3 minutes. In the following, we are going to see parallel speedup and efficiency.
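
The elapsed times can also be expressed as a sustained arithmetic rate. The sketch below assumes the conventional count of 2*m*k*n floating-point operations for a classical matrix product (one multiply and one add per inner-loop step); the post does not state which algorithm laipe\$matmul_8 uses, so the rates are estimates under that assumption. It also converts the one-core time to hours and minutes.

```python
# Dimensions from the test description: [A] is m-by-k, [B] is k-by-n.
m, k, n = 15000, 11000, 12000

# Assumption: classical product, one multiply and one add per (i, j, l) triple.
flops = 2 * m * k * n

# Elapsed times in seconds, taken from the timing results.
elapsed = {1: 7086.91, 48: 174.81}

rates = {cores: flops / t / 1e9 for cores, t in elapsed.items()}
for cores, gflops in rates.items():
    print(f"{cores:2d} core(s): {gflops:.2f} GFLOP/s")

# Convert the one-core elapsed time to hours, minutes, and seconds.
hours, rem = divmod(elapsed[1], 3600)
minutes, seconds = divmod(rem, 60)
print(f"one core: {int(hours)} h {int(minutes)} min {seconds:.0f} s")
```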

SPEEDUP AND EFFICIENCY

Speedup and efficiency are summarized in the following table. The first column is the number of cores, the second column is elapsed time in seconds, and the third column is parallel speedup. The table shows an almost linear speedup; on 48 cores, the computation was sped up 40.5x.

The fourth column is parallel efficiency, in percent. Efficiency stayed at or above 90% up to 42 cores. The table follows.

Number of Cores   Elapsed Time (sec)   Speedup   Efficiency (%)
       1               7086.91          1.0000       100.00
       2               3538.93          2.0026       100.13
       3               2360.84          3.0019       100.06
       4               1776.20          3.9899        99.75
       5               1419.08          4.9940        99.88
       6               1183.78          5.9867        99.78
       7               1021.89          6.9351        99.07
       8                890.70          7.9566        99.46
       9                796.12          8.9018        98.91
      10                716.06          9.8971        98.97
      11                654.52         10.8276        98.43
      12                597.61         11.8588        98.82
      13                559.47         12.6672        97.44
      14                522.09         13.5741        96.96
      15                486.16         14.5773        97.18
      16                458.07         15.4712        96.70
      17                436.55         16.2339        95.49
      18                414.09         17.1144        95.08
      19                394.50         17.9643        94.55
      20                368.94         19.2088        96.04
      21                350.57         20.2154        96.26
      22                336.23         21.0776        95.81
      23                320.69         22.0989        96.08
      24                308.52         22.9707        95.71
      25                297.68         23.8071        95.23
      26                286.96         24.6965        94.99
      27                276.98         25.5864        94.76
      28                267.32         26.5110        94.68
      29                258.70         27.3943        94.46
      30                249.40         28.4158        94.72
      31                240.38         29.4821        95.10
      32                233.81         30.3106        94.72
      33                229.56         30.8717        93.55
      34                221.99         31.9245        93.90
      35                216.19         32.7809        93.66
      36                211.94         33.4383        92.88
      37                206.97         34.2412        92.54
      38                203.18         34.8800        91.79
      39                198.26         35.7455        91.66
      40                195.20         36.3059        90.76
      41                190.12         37.2760        90.92
      42                187.48         37.8009        90.00
      43                185.58         38.1879        88.81
      44                183.57         38.6060        87.74
      45                178.98         39.5961        87.99
      46                178.65         39.6692        86.24
      47                178.11         39.7895        84.66
      48                174.81         40.5406        84.46
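
The speedup and efficiency figures are consistent with the usual definitions: speedup is the one-core elapsed time divided by the elapsed time on p cores, and efficiency is speedup divided by p, expressed in percent. The post does not spell these formulas out, so treat the sketch below as an assumption about how the columns were derived. It recomputes a few rows from the elapsed times.

```python
# Elapsed times (seconds) for a few core counts, copied from the timing results.
elapsed = {1: 7086.91, 2: 3538.93, 12: 597.61, 24: 308.52, 42: 187.48, 48: 174.81}


def speedup(t1, tp):
    """One-core elapsed time divided by elapsed time on p cores."""
    return t1 / tp


def efficiency(t1, tp, cores):
    """Speedup per core, in percent."""
    return 100.0 * speedup(t1, tp) / cores


t1 = elapsed[1]
rows = [
    (p, tp, speedup(t1, tp), efficiency(t1, tp, p))
    for p, tp in sorted(elapsed.items())
]

print("cores   elapsed    speedup   efficiency")
for p, tp, s, e in rows:
    print(f"{p:5d}  {tp:8.2f}  {s:9.4f}  {e:10.2f}")
```

Efficiencies slightly above 100% on 2 and 3 cores follow directly from the measured times: the elapsed time on 2 cores is a little less than half of the one-core time.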