Equation Solution
High Performance by Design

Parallel Performance of 8-Byte Matrix Multiplication

[Posted by Jenn-Ching Luo on June 14, 2016 ]

Two related posts preceded this one. One reported the performance of multiplying two 16-byte (quad precision) matrices, and the other reported the performance of a 10-byte (extended precision) matrix product. The efficiency was inconsistent between the two: 64 cores made the product of two 16-byte matrices up to 60 times faster, while the product of two 10-byte matrices gained no further speed beyond 20 cores.

This post shows a parallel performance of an 8-byte matrix product.

Forty-eight cores sped up the 8-byte matrix product by up to 40.5x. This result also differs from that of the 10-byte matrix product. There are reasons why a different variable type leads to a different efficiency, but this post does not address those technical issues; it only reports the parallel performance of an 8-byte (double precision) matrix product.

The details of the test and its parallel performance are as follows.

TESTING EXAMPLE

Perform [C]=[A][B], where [A], [B], and [C] are 8-byte real matrices. Matrix [A] is 15000-by-11000, matrix [B] is 11000-by-12000, and matrix [C] is 15000-by-12000.
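To put the problem size in perspective, the following minimal sketch estimates the storage of the three matrices and the number of floating-point operations. It assumes the classical row-by-column algorithm (one multiply and one add per inner-loop step); the post does not say which algorithm laipe\$matmul_8 uses, so the operation count is an assumption rather than a property of the subroutine. The names in the code are illustrative only.

```python
# Dimensions from the testing example: [A] is m-by-k, [B] is k-by-n, [C] is m-by-n.
m, k, n = 15000, 11000, 12000
BYTES_PER_ELEMENT = 8  # 8-byte (double precision) reals


def matrix_bytes(rows, cols):
    """Storage needed for a dense rows-by-cols matrix of 8-byte reals."""
    return rows * cols * BYTES_PER_ELEMENT


bytes_a = matrix_bytes(m, k)
bytes_b = matrix_bytes(k, n)
bytes_c = matrix_bytes(m, n)
total_bytes = bytes_a + bytes_b + bytes_c

# Assumption: classical algorithm, about 2*m*k*n floating-point operations.
flops = 2 * m * k * n

print(f"[A]: {bytes_a / 1e9:.3f} GB")
print(f"[B]: {bytes_b / 1e9:.3f} GB")
print(f"[C]: {bytes_c / 1e9:.3f} GB")
print(f"total storage: {total_bytes / 1e9:.3f} GB")
print(f"operation count (classical algorithm): {flops:.3e}")
```

Under these assumptions the three matrices occupy about 3.8 GB in total, and the product requires about 3.96e12 floating-point operations.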

COMPUTING ENVIRONMENT

Computer: a Dell PowerEdge R815 with quad Opteron 6168, a total of 48 cores.
Operating System: Windows Server 2008 R2
Compiler: gfortran with optimization -O3. The application links against neuloop4 for parallel processing.
Subroutine: laipe\$matmul_8 which performs matrix multiplication in parallel

COMPARISON WITH GFORTRAN INTRINSIC FUNCTION MATMUL

gfortran provides an intrinsic function, matmul, for matrix multiplication. The intrinsic matmul runs sequentially and cannot take advantage of multiple cores. Before showing the parallel performance of laipe\$matmul_8, we first compare laipe\$matmul_8 on one core with the intrinsic function matmul.

First, let us see the performance of the intrinsic function, matmul. The timing result is as follows:

Elapsed Time (Seconds): 7265.82
CPU Time in User Mode (Seconds): 7264.33
CPU Time in Kernel Mode (Seconds): 1.50
Total CPU Time (Seconds): 7265.82

The intrinsic function, matmul, took 7265.82 seconds to compute the matrix multiplication. Next, let us see the performance of the parallel subroutine laipe\$matmul_8 on one core. We have the following timing result:

Elapsed Time (Seconds): 7086.91
CPU Time in User Mode (Seconds): 7084.63
CPU Time in Kernel Mode (Seconds): 1.95
Total CPU Time (Seconds): 7086.58

With one core enabled, the subroutine laipe\$matmul_8 ran faster than the intrinsic function matmul. laipe\$matmul_8 is a parallel subroutine and carries extra code for parallel processing, so one might expect that overhead to make it slower than matmul. Nevertheless, on a single core it took only 7086.91 seconds to perform the matrix multiplication, compared with 7265.82 seconds for matmul.
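The size of the single-core difference can be worked out directly from the two elapsed times reported above. The short sketch below does only that arithmetic; the variable names are illustrative.

```python
# Single-core elapsed times (seconds) reported in this post.
t_matmul = 7265.82          # gfortran intrinsic matmul
t_parallel_one_core = 7086.91  # parallel subroutine restricted to one core

# Absolute saving, ratio of the two times, and relative saving.
saved_seconds = t_matmul - t_parallel_one_core
ratio = t_matmul / t_parallel_one_core
percent_less_time = 100.0 * saved_seconds / t_matmul

print(f"time saved: {saved_seconds:.2f} s")
print(f"matmul time / one-core parallel time: {ratio:.4f}")
print(f"elapsed time reduced by: {percent_less_time:.2f} %")
```

On one core the parallel subroutine finished about 179 seconds sooner, roughly 2.5% less elapsed time than the intrinsic function.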

TIMING RESULT

Timing results include "elapsed time," "CPU time in user mode," "CPU time in kernel mode," and "total CPU time." The timing results for 1 to 48 cores are as follows:

number of cores: 1
Elapsed Time (Seconds): 7086.91
CPU Time in User Mode (Seconds): 7084.63
CPU Time in Kernel Mode (Seconds): 1.95
Total CPU Time (Seconds): 7086.58

number of cores: 2
Elapsed Time (Seconds): 3538.93
CPU Time in User Mode (Seconds): 7076.46
CPU Time in Kernel Mode (Seconds): 0.67
Total CPU Time (Seconds): 7077.13

number of cores: 3
Elapsed Time (Seconds): 2360.84
CPU Time in User Mode (Seconds): 7080.57
CPU Time in Kernel Mode (Seconds): 0.89
Total CPU Time (Seconds): 7081.46

number of cores: 4
Elapsed Time (Seconds): 1776.20
CPU Time in User Mode (Seconds): 7085.67
CPU Time in Kernel Mode (Seconds): 0.80
Total CPU Time (Seconds): 7086.47

number of cores: 5
Elapsed Time (Seconds): 1419.08
CPU Time in User Mode (Seconds): 7092.35
CPU Time in Kernel Mode (Seconds): 0.70
Total CPU Time (Seconds): 7093.05

number of cores: 6
Elapsed Time (Seconds): 1183.78
CPU Time in User Mode (Seconds): 7099.17
CPU Time in Kernel Mode (Seconds): 0.83
Total CPU Time (Seconds): 7100.00

number of cores: 7
Elapsed Time (Seconds): 1021.89
CPU Time in User Mode (Seconds): 7103.77
CPU Time in Kernel Mode (Seconds): 0.72
Total CPU Time (Seconds): 7104.49

number of cores: 8
Elapsed Time (Seconds): 890.70
CPU Time in User Mode (Seconds): 7105.56
CPU Time in Kernel Mode (Seconds): 0.87
Total CPU Time (Seconds): 7106.44

number of cores: 9
Elapsed Time (Seconds): 796.12
CPU Time in User Mode (Seconds): 7111.38
CPU Time in Kernel Mode (Seconds): 0.98
Total CPU Time (Seconds): 7112.37

number of cores: 10
Elapsed Time (Seconds): 716.06
CPU Time in User Mode (Seconds): 7123.57
CPU Time in Kernel Mode (Seconds): 0.86
Total CPU Time (Seconds): 7124.43

number of cores: 11
Elapsed Time (Seconds): 654.52
CPU Time in User Mode (Seconds): 7125.08
CPU Time in Kernel Mode (Seconds): 0.92
Total CPU Time (Seconds): 7126.00

number of cores: 12
Elapsed Time (Seconds): 597.61
CPU Time in User Mode (Seconds): 7123.61
CPU Time in Kernel Mode (Seconds): 0.81
Total CPU Time (Seconds): 7124.43

number of cores: 13
Elapsed Time (Seconds): 559.47
CPU Time in User Mode (Seconds): 7201.68
CPU Time in Kernel Mode (Seconds): 0.97
Total CPU Time (Seconds): 7202.64

number of cores: 14
Elapsed Time (Seconds): 522.09
CPU Time in User Mode (Seconds): 7224.69
CPU Time in Kernel Mode (Seconds): 0.80
Total CPU Time (Seconds): 7225.48

number of cores: 15
Elapsed Time (Seconds): 486.16
CPU Time in User Mode (Seconds): 7213.00
CPU Time in Kernel Mode (Seconds): 0.66
Total CPU Time (Seconds): 7213.66

number of cores: 16
Elapsed Time (Seconds): 458.07
CPU Time in User Mode (Seconds): 7242.91
CPU Time in Kernel Mode (Seconds): 0.92
Total CPU Time (Seconds): 7243.83

number of cores: 17
Elapsed Time (Seconds): 436.55
CPU Time in User Mode (Seconds): 7353.48
CPU Time in Kernel Mode (Seconds): 0.94
Total CPU Time (Seconds): 7354.42

number of cores: 18
Elapsed Time (Seconds): 414.09
CPU Time in User Mode (Seconds): 7353.29
CPU Time in Kernel Mode (Seconds): 0.81
Total CPU Time (Seconds): 7354.11

number of cores: 19
Elapsed Time (Seconds): 394.50
CPU Time in User Mode (Seconds): 7373.03
CPU Time in Kernel Mode (Seconds): 0.75
Total CPU Time (Seconds): 7373.78

number of cores: 20
Elapsed Time (Seconds): 368.94
CPU Time in User Mode (Seconds): 7259.41
CPU Time in Kernel Mode (Seconds): 0.95
Total CPU Time (Seconds): 7260.36

number of cores: 21
Elapsed Time (Seconds): 350.57
CPU Time in User Mode (Seconds): 7265.81
CPU Time in Kernel Mode (Seconds): 0.76
Total CPU Time (Seconds): 7266.57

number of cores: 22
Elapsed Time (Seconds): 336.23
CPU Time in User Mode (Seconds): 7298.26
CPU Time in Kernel Mode (Seconds): 1.05
Total CPU Time (Seconds): 7299.30

number of cores: 23
Elapsed Time (Seconds): 320.69
CPU Time in User Mode (Seconds): 7262.85
CPU Time in Kernel Mode (Seconds): 1.05
Total CPU Time (Seconds): 7263.89

number of cores: 24
Elapsed Time (Seconds): 308.52
CPU Time in User Mode (Seconds): 7318.74
CPU Time in Kernel Mode (Seconds): 1.00
Total CPU Time (Seconds): 7319.74

number of cores: 25
Elapsed Time (Seconds): 297.68
CPU Time in User Mode (Seconds): 7321.63
CPU Time in Kernel Mode (Seconds): 0.81
Total CPU Time (Seconds): 7322.44

number of cores: 26
Elapsed Time (Seconds): 286.96
CPU Time in User Mode (Seconds): 7322.41
CPU Time in Kernel Mode (Seconds): 1.00
Total CPU Time (Seconds): 7323.40

number of cores: 27
Elapsed Time (Seconds): 276.98
CPU Time in User Mode (Seconds): 7329.86
CPU Time in Kernel Mode (Seconds): 0.69
Total CPU Time (Seconds): 7330.55

number of cores: 28
Elapsed Time (Seconds): 267.32
CPU Time in User Mode (Seconds): 7333.12
CPU Time in Kernel Mode (Seconds): 0.80
Total CPU Time (Seconds): 7333.92

number of cores: 29
Elapsed Time (Seconds): 258.70
CPU Time in User Mode (Seconds): 7339.24
CPU Time in Kernel Mode (Seconds): 0.94
Total CPU Time (Seconds): 7340.17

number of cores: 30
Elapsed Time (Seconds): 249.40
CPU Time in User Mode (Seconds): 7342.81
CPU Time in Kernel Mode (Seconds): 0.78
Total CPU Time (Seconds): 7343.59

number of cores: 31
Elapsed Time (Seconds): 240.38
CPU Time in User Mode (Seconds): 7350.33
CPU Time in Kernel Mode (Seconds): 1.06
Total CPU Time (Seconds): 7351.39

number of cores: 32
Elapsed Time (Seconds): 233.81
CPU Time in User Mode (Seconds): 7365.81
CPU Time in Kernel Mode (Seconds): 0.67
Total CPU Time (Seconds): 7366.48

number of cores: 33
Elapsed Time (Seconds): 229.56
CPU Time in User Mode (Seconds): 7393.50
CPU Time in Kernel Mode (Seconds): 0.56
Total CPU Time (Seconds): 7394.06

number of cores: 34
Elapsed Time (Seconds): 221.99
CPU Time in User Mode (Seconds): 7401.73
CPU Time in Kernel Mode (Seconds): 0.83
Total CPU Time (Seconds): 7402.56

number of cores: 35
Elapsed Time (Seconds): 216.19
CPU Time in User Mode (Seconds): 7422.51
CPU Time in Kernel Mode (Seconds): 1.05
Total CPU Time (Seconds): 7423.56

number of cores: 36
Elapsed Time (Seconds): 211.94
CPU Time in User Mode (Seconds): 7456.69
CPU Time in Kernel Mode (Seconds): 0.80
Total CPU Time (Seconds): 7457.49

number of cores: 37
Elapsed Time (Seconds): 206.97
CPU Time in User Mode (Seconds): 7478.70
CPU Time in Kernel Mode (Seconds): 0.90
Total CPU Time (Seconds): 7479.61

number of cores: 38
Elapsed Time (Seconds): 203.18
CPU Time in User Mode (Seconds): 7513.54
CPU Time in Kernel Mode (Seconds): 0.75
Total CPU Time (Seconds): 7514.29

number of cores: 39
Elapsed Time (Seconds): 198.26
CPU Time in User Mode (Seconds): 7555.64
CPU Time in Kernel Mode (Seconds): 0.87
Total CPU Time (Seconds): 7556.52

number of cores: 40
Elapsed Time (Seconds): 195.20
CPU Time in User Mode (Seconds): 7592.27
CPU Time in Kernel Mode (Seconds): 0.84
Total CPU Time (Seconds): 7593.11

number of cores: 41
Elapsed Time (Seconds): 190.12
CPU Time in User Mode (Seconds): 7647.62
CPU Time in Kernel Mode (Seconds): 0.98
Total CPU Time (Seconds): 7648.60

number of cores: 42
Elapsed Time (Seconds): 187.48
CPU Time in User Mode (Seconds): 7686.03
CPU Time in Kernel Mode (Seconds): 1.03
Total CPU Time (Seconds): 7687.06

number of cores: 43
Elapsed Time (Seconds): 185.58
CPU Time in User Mode (Seconds): 7776.79
CPU Time in Kernel Mode (Seconds): 0.95
Total CPU Time (Seconds): 7777.74

number of cores: 44
Elapsed Time (Seconds): 183.57
CPU Time in User Mode (Seconds): 7835.40
CPU Time in Kernel Mode (Seconds): 0.89
Total CPU Time (Seconds): 7836.29

number of cores: 45
Elapsed Time (Seconds): 178.98
CPU Time in User Mode (Seconds): 7895.26
CPU Time in Kernel Mode (Seconds): 0.95
Total CPU Time (Seconds): 7896.21

number of cores: 46
Elapsed Time (Seconds): 178.65
CPU Time in User Mode (Seconds): 8006.25
CPU Time in Kernel Mode (Seconds): 1.25
Total CPU Time (Seconds): 8007.50

number of cores: 47
Elapsed Time (Seconds): 178.11
CPU Time in User Mode (Seconds): 8093.89
CPU Time in Kernel Mode (Seconds): 0.86
Total CPU Time (Seconds): 8094.75

number of cores: 48
Elapsed Time (Seconds): 174.81
CPU Time in User Mode (Seconds): 8228.79
CPU Time in Kernel Mode (Seconds): 1.05
Total CPU Time (Seconds): 8229.83

The results above show that the elapsed time decreases roughly in inverse proportion to the number of cores used. For example, one core took 7086.91 seconds to finish the matrix product, two cores cut the elapsed time to 3538.93 seconds, three cores took 2360.84 seconds, and so on.
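One way to check the inverse-proportion claim is to compare each measured elapsed time with the ideal time, taken here as the one-core time divided by the number of cores. The sketch below does this for a handful of core counts copied from the timing results; the choice of core counts and the helper name ideal_time are illustrative.

```python
# Elapsed times (seconds) copied from the timing results for selected core counts.
elapsed = {1: 7086.91, 2: 3538.93, 3: 2360.84, 12: 597.61, 24: 308.52, 48: 174.81}
t1 = elapsed[1]


def ideal_time(cores):
    """Elapsed time if the one-core time were divided perfectly among the cores."""
    return t1 / cores


for cores, measured in elapsed.items():
    ideal = ideal_time(cores)
    # Positive values mean the run was slower than the ideal time.
    excess_percent = 100.0 * (measured - ideal) / ideal
    print(f"{cores:2d} cores: ideal {ideal:8.2f} s, "
          f"measured {measured:8.2f} s, difference {excess_percent:+6.2f} %")
```

For two and three cores the measured time is marginally below the ideal time, and the gap widens as more cores are added: on 48 cores the measured time is about 18% above the ideal 147.64 seconds.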

Forty-eight cores completed the computation in 174.81 seconds, whereas one core took 7086.91 seconds, i.e., about 1 hour and 58 minutes. In other words, 48 cores completed a roughly 2-hour job in under 3 minutes. Parallel speedup and efficiency are examined next.

SPEEDUP AND EFFICIENCY

The following table summarizes speedup and efficiency. The first column is the number of cores, the second is the elapsed time, the third is the parallel speedup, and the fourth is the parallel efficiency. The speedup is almost linear: 48 cores reached 40.5x. Efficiency stayed at or above 90% up to 42 cores and was 84.46% on 48 cores.

Number of Cores   Elapsed Time (sec)   Speedup   Efficiency (%)
              1              7086.91    1.0000           100.00
              2              3538.93    2.0026           100.13
              3              2360.84    3.0019           100.06
              4              1776.20    3.9899            99.75
              5              1419.08    4.9940            99.88
              6              1183.78    5.9867            99.78
              7              1021.89    6.9351            99.07
              8               890.70    7.9566            99.46
              9               796.12    8.9018            98.91
             10               716.06    9.8971            98.97
             11               654.52   10.8276            98.43
             12               597.61   11.8588            98.82
             13               559.47   12.6672            97.44
             14               522.09   13.5741            96.96
             15               486.16   14.5773            97.18
             16               458.07   15.4712            96.70
             17               436.55   16.2339            95.49
             18               414.09   17.1144            95.08
             19               394.50   17.9643            94.55
             20               368.94   19.2088            96.04
             21               350.57   20.2154            96.26
             22               336.23   21.0776            95.81
             23               320.69   22.0989            96.08
             24               308.52   22.9707            95.71
             25               297.68   23.8071            95.23
             26               286.96   24.6965            94.99
             27               276.98   25.5864            94.76
             28               267.32   26.5110            94.68
             29               258.70   27.3943            94.46
             30               249.40   28.4158            94.72
             31               240.38   29.4821            95.10
             32               233.81   30.3106            94.72
             33               229.56   30.8717            93.55
             34               221.99   31.9245            93.90
             35               216.19   32.7809            93.66
             36               211.94   33.4383            92.88
             37               206.97   34.2412            92.54
             38               203.18   34.8800            91.79
             39               198.26   35.7455            91.66
             40               195.20   36.3059            90.76
             41               190.12   37.2760            90.92
             42               187.48   37.8009            90.00
             43               185.58   38.1879            88.81
             44               183.57   38.6060            87.74
             45               178.98   39.5961            87.99
             46               178.65   39.6692            86.24
             47               178.11   39.7895            84.66
             48               174.81   40.5406            84.46
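The speedup and efficiency columns are consistent with the usual definitions: speedup is the one-core elapsed time divided by the elapsed time on p cores, and efficiency is the speedup divided by p, expressed as a percentage. The post does not state these formulas, so the sketch below is an assumption that happens to reproduce the tabulated values; the function names are illustrative.

```python
T1 = 7086.91  # elapsed time on one core (seconds)


def speedup(elapsed_seconds):
    """Parallel speedup relative to the one-core elapsed time."""
    return T1 / elapsed_seconds


def efficiency_percent(elapsed_seconds, cores):
    """Parallel efficiency: speedup per core, as a percentage."""
    return 100.0 * speedup(elapsed_seconds) / cores


# A few (cores, elapsed time) pairs copied from the table.
samples = [(2, 3538.93), (24, 308.52), (42, 187.48), (48, 174.81)]

for cores, elapsed_seconds in samples:
    print(f"{cores:2d} cores: speedup {speedup(elapsed_seconds):7.4f}, "
          f"efficiency {efficiency_percent(elapsed_seconds, cores):6.2f} %")
```

Applying the same two functions to every row reproduces the third and fourth columns from the first two.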