Parallel Performance of Skyline Solver on Soft Cores (I)

[Posted by Jenn-Ching Luo on Mar. 17, 2012 ]

This post reports the parallel performance of the following LAIPE2 functions:

      laipe$decompose_VSP_4
      laipe$decompose_VSP_8
      laipe$decompose_VSP_10
      laipe$decompose_VSP_16

These functions decompose, in parallel on soft-core technology, a sparse, symmetric, positive definite matrix [A] stored in skyline form into [L][L]^T. The trailing number in each function name gives the variable size: 4-byte single precision, 8-byte double precision, 10-byte extended precision, and 16-byte quad precision, respectively.

A skyline solver works on a matrix of the following form, for example,

    /                          \
    |  x  x     x              |
    |     x  x  x     x        |
    |        x  x  x  x        |
    |           x  x  x     x  |
    |              x  x     x  |
    |     sym.        x  x  x  |
    |                    x  x  |
    |                       x  |
    \                          /

which resembles a city skyline. The skyline solver is one of the most important tools in engineering and scientific computing, and a parallel skyline solver is hard to program. This post presents the parallel performance of such a solver.
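
To make the [L][L]^T decomposition concrete, here is a minimal serial sketch in Python of a Cholesky factorization that only touches entries inside the skyline. It is not LAIPE2's implementation and says nothing about how the parallel functions are organized. The function name skyline_cholesky, the dense list-of-lists storage, and the small 4x4 test matrix are all assumptions made for illustration. The sketch works on the lower triangle row by row, which by symmetry is the same envelope as the column skyline drawn above.

```python
import math

def skyline_cholesky(A):
    """Factor a symmetric positive definite matrix A (list of lists) as L * L^T,
    visiting only entries inside the skyline (envelope) of A.
    Hypothetical helper for illustration, not a LAIPE2 routine."""
    n = len(A)
    # first[i]: column of the first nonzero in row i of the lower triangle.
    first = [next(j for j in range(i + 1) if A[i][j] != 0.0) for i in range(n)]
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(first[i], i + 1):
            # Only columns inside both rows' envelopes contribute to the sum.
            k0 = max(first[i], first[j])
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(k0, j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][i] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L, first

# Small illustrative matrix (diagonally dominant, hence positive definite).
# A[3][2] is zero but lies inside the skyline of row 3, so L[3][2] fills in.
A = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 4.0, 1.0, 1.0],
     [0.0, 1.0, 4.0, 0.0],
     [0.0, 1.0, 0.0, 4.0]]

L, first = skyline_cholesky(A)
```

Entries outside the skyline stay zero in [L], while zeros inside it may fill in. That is why a skyline solver stores and processes the whole profile rather than only the nonzero entries of [A].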

TEST PROGRAM AND PLATFORM

        The test program, written in FORTRAN, uses a pseudo-random process to generate its data. For the test runs, the program was compiled with optimization option -O3, linked against homogeneous soft cores, and run on Windows Server 2008 R2 with four dual-core Opteron 870 processors, a total of 8 cores. The FORTRAN compiler was GFORTRAN 4.8.0 (experimental version).

TIMING RESULTS WITH 8-BYTE VARIABLES

        We start with the performance for 8-byte variables.

        The example matrix is of order (10,000x10,000). The skyline profile is generated by a pseudo-random process and has a size of 25387391. The timing results are as follows:

Processors: 1
     Elapsed Time (Seconds): 242.93
     CPU Time in User Mode (Seconds): 242.63
     CPU Time in Kernel Mode (Seconds): 0.30
     Total CPU Time (Seconds): 242.92

Processors: 2
     Elapsed Time (Seconds): 121.82
     CPU Time in User Mode (Seconds): 242.63
     CPU Time in Kernel Mode (Seconds): 0.50
     Total CPU Time (Seconds): 243.13

Processors: 3
     Elapsed Time (Seconds): 81.48
     CPU Time in User Mode (Seconds): 243.14
     CPU Time in Kernel Mode (Seconds): 0.62
     Total CPU Time (Seconds): 243.77

Processors: 4
     Elapsed Time (Seconds): 61.49
     CPU Time in User Mode (Seconds): 243.70
     CPU Time in Kernel Mode (Seconds): 0.75
     Total CPU Time (Seconds): 244.45

Processors: 5
     Elapsed Time (Seconds): 49.61
     CPU Time in User Mode (Seconds): 244.20
     CPU Time in Kernel Mode (Seconds): 0.78
     Total CPU Time (Seconds): 244.98

Processors: 7
     Elapsed Time (Seconds): 36.05
     CPU Time in User Mode (Seconds): 246.58
     CPU Time in Kernel Mode (Seconds): 0.80
     Total CPU Time (Seconds): 247.37

Processors: 8
     Elapsed Time (Seconds): 31.84
     CPU Time in User Mode (Seconds): 248.53
     CPU Time in Kernel Mode (Seconds): 0.94
     Total CPU Time (Seconds): 249.46

The elapsed time drops almost linearly as more cores are used. The following table summarizes the timing results together with speedup and efficiency.

number of cores	elapsed time (sec.)	speedup	efficiency (%)
1	242.93	1.00	100.00
2	121.82	1.99	99.71
3	81.48	2.98	99.38
4	61.49	3.95	98.77
5	49.61	4.90	97.94
6	41.72	5.82	97.05
7	36.05	6.74	96.27
8	31.84	7.63	95.37

        On two soft cores the speedup is 1.99x, on three cores 2.98x, on four cores 3.95x, and on eight cores 7.63x. The last column, efficiency, shows the near-perfect scaling more clearly: every efficiency is above 95%.
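
        The speedup and efficiency columns follow directly from the elapsed times. The short Python sketch below recomputes them, assuming the usual definitions: speedup is the one-core elapsed time divided by the p-core elapsed time, and efficiency is speedup divided by p, as a percentage. The function name speedup_and_efficiency is hypothetical; the elapsed times are copied from the 8-byte table.

```python
# Elapsed times (seconds) from the 8-byte table, keyed by number of cores.
elapsed = {1: 242.93, 2: 121.82, 3: 81.48, 4: 61.49,
           5: 49.61, 6: 41.72, 7: 36.05, 8: 31.84}

def speedup_and_efficiency(elapsed):
    """Return {cores: (speedup, efficiency in percent)} relative to the 1-core run."""
    t1 = elapsed[1]
    result = {}
    for p, tp in sorted(elapsed.items()):
        s = t1 / tp                      # speedup on p cores
        result[p] = (s, 100.0 * s / p)   # efficiency as a percentage
    return result

results = speedup_and_efficiency(elapsed)
for p, (s, e) in results.items():
    print(f"{p}\t{elapsed[p]:.2f}\t{s:.2f}\t{e:.2f}")
```

        The same function applies unchanged to the elapsed times of the other variable sizes.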

        The results above are for the function laipe$decompose_VSP_8, which handles double precision variables.

        The following sections show the performance with the other variable types. The 4-byte and 10-byte examples use a matrix of the same order (10,000x10,000) with the same profile size, 25387391; the 16-byte example uses a smaller matrix of order (2,500x2,500), whose profile, also generated by pseudo-random numbers, has a size of 1567946. All of them show similar parallel performance, with efficiency staying above 94% on up to eight cores.
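
        To put the two profile sizes in perspective, the sketch below compares each with the size of the full triangle of a matrix of the same order. It assumes that "profile size" means the number of entries stored inside the skyline, diagonal included; the post does not define the term, so this reading is an assumption. The helper name full_triangle_size is hypothetical.

```python
def full_triangle_size(n):
    """Entries in one triangle (diagonal included) of an n x n matrix."""
    return n * (n + 1) // 2

# (order, profile size) pairs quoted in this post.
cases = {
    "4-, 8- and 10-byte examples": (10_000, 25_387_391),
    "16-byte example": (2_500, 1_567_946),
}

for name, (n, profile) in cases.items():
    fraction = profile / full_triangle_size(n)   # share of the full triangle
    mean_height = profile / n                    # average entries per column
    print(f"{name}: {100 * fraction:.1f}% of the full triangle, "
          f"mean column height {mean_height:.1f}")
```

        Under this reading, both test matrices keep roughly half of the full triangle inside the skyline.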

TIMING RESULTS WITH 4-BYTE VARIABLES

number of cores	elapsed time (sec.)	speedup	efficiency (%)
1	240.51	1.00	100.00
2	120.50	2.00	99.80
3	80.54	2.99	99.54
4	60.70	3.96	99.06
5	48.78	4.93	98.61
6	40.84	5.89	98.15
7	35.22	6.83	97.55
8	31.03	7.75	96.89

TIMING RESULTS WITH 10-BYTE VARIABLES

number of cores	elapsed time (sec.)	speedup	efficiency (%)
1	538.81	1.00	100.00
2	269.45	2.00	99.98
3	180.27	2.99	99.63
4	136.05	3.96	99.01
5	109.64	4.91	98.29
6	92.31	5.84	97.28
7	80.23	6.71	95.94
8	71.26	7.56	94.51

TIMING RESULTS WITH 16-BYTE VARIABLES

number of cores	elapsed time (sec.)	speedup	efficiency (%)
1	201.62	1.00	100.00
2	101.37	1.99	99.45
3	67.91	2.97	98.96
4	51.17	3.94	98.50
5	41.08	4.91	98.16
6	34.43	5.86	97.60
7	29.69	6.79	97.01
8	26.13	7.72	96.45