Grandpa on Multicores (2)

[Posted by Jenn-Ching Luo on Apr. 11, 2011 ]

This second post continuously shows how grandpa runs on multicores. In this post, we are going to see parallel performances of a skyline solver of grandpa, package LAIPE. As introduced in the first post, grandpa was programmed in ancient parallel concepts, whose base language is fortran-77. We are going to see how grandpa runs on multicores. A question that some people may ask is "what skyline solver is."

Skyline solvers do not deal with any problem with city skylines, but solve a special type of system equations, [A]{X}={B}, where matrix [A] has a form, for example, as:

    /                   \
    | x x               |
    |   x x   x   x     |
    |     x x x   x     |
    |       x x x x     |
    |         x x x x   |
    |           x x x   |
    |             x x x |
    |    sym.       x x |
    |                 x |
    \                   /

The non-zero fill-ins (e.g.,"x" in the above matrix) look as a skyline in a city. That gets the name "skyline solver".

        Sparse solvers are usually divided into two categories, constant band solver and variable band solver. The first post, "Grandpa on Multicores (1)", showed parallel performances of constant band solvers. Skyline line solver is a "variable band solver". This post will show parallel performances of variable band solvers.

        Parallelizing variable band solver (e.g., a skyline solver) is more difficult than constant band solver. There are several factors that lead to the difference, one of which is that sparsity of variable band matrix is not uniform. That increases difficulties to balance workloads among cooperating tasks (or threads). In fact, we don't see many parallel skyline solvers. This post provides a unique opportunity for us to see performances of some parallel skyline solvers.

        Is the grandpa package LAIPE still running? We are going to see it. The testing platform is the same as the first post ("Grandpa on Multicores (1)").

TEST RESULTS 1 (16-byte REAL)

        In this test, The LAIPE subroutine, decompose_VSP_16, (e.g., a skyline solver) is called to decompose the matrix [A] into triangular matrices. The family of VSP solvers is for variable band, symmetric and positive definite system. We know many scientific and engineering problems are formulated into a sparse, symmetric and positive definite system. The VSP family is for such computing. This post collects timing results to show parallel performances.

        There are variable band solvers other than VSP family, for example, with an asymmetric or indefinite matrix [A], which will be shown in other coming posts. In this test, the sparsity (e.g., the profile of skyline) of matrix [A] is determined by a random process.

        Coefficients of matrix [A] are initially generated by random numbers. Because matrix [A] must be ensured to be positive definite in this test, the coefficients of [A] have been further processed such that [A] is positive definite. Those coefficients for test do not have any particular meaning, but only for test.

        We got a set of timing results in the following.

Processor: 1
Elapsed Time (Seconds): 1588.65
CPU Time in User mode (Seconds): 1588.50
CPU Time in Kernel mode (Seconds): 0.12
Total CPU Time (Seconds): 1588.62

Processors: 2
Elapsed Time (Seconds): 795.04
CPU Time in User mode (Seconds): 1585.80
CPU Time in Kernel mode (Seconds): 0.12
Total CPU Time (Seconds): 1585.92

Processors: 3
Elapsed Time (Seconds): 531.32
CPU Time in User mode (Seconds): 1584.88
CPU Time in Kernel mode (Seconds): 0.09
Total CPU Time (Seconds): 1584.97

Processors: 4
Elapsed Time (Seconds): 399.61
CPU Time in User mode (Seconds): 1584.63
CPU Time in Kernel mode (Seconds): 0.12
Total CPU Time (Seconds): 1584.75

Processors: 5
Elapsed Time (Seconds): 320.27
CPU Time in User mode (Seconds): 1585.81
CPU Time in Kernel mode (Seconds): 0.08
Total CPU Time (Seconds): 1585.89

Processors: 6
Elapsed Time (Seconds): 268.23
CPU Time in User mode (Seconds): 1584.61
CPU Time in Kernel mode (Seconds): 0.27
Total CPU Time (Seconds): 1584.88

Processors: 7
Elapsed Time (Seconds): 231.91
CPU Time in User mode (Seconds): 1585.50
CPU Time in Kernel mode (Seconds): 0.28
Total CPU Time (Seconds): 1585.78

Processors: 8
Elapsed Time (Seconds): 202.46
CPU Time in User mode (Seconds): 1585.72
CPU Time in Kernel mode (Seconds): 0.22
Total CPU Time (Seconds): 1585.94

At a glance at the timing result, we can find the performance is super. The speedup is almost perfect. We can summarize the above timing results into the following table.

number of cores	elapsed time (sec.)	speedup	efficiency (%)
1	1588.65	1.00	100.00
2	795.04	2.00	99.91
3	531.32	2.99	99.67
4	399.61	3.98	99.39
5	320.27	4.96	99.21
6	268.23	5.93	98.71
7	231.91	6.85	97.86
8	202.46	7.85	98.08

        Speedup is almost perfect. 8 cores can speed up to 7.85X. The test is on a parallel skyline solver, not just parallelizing a loop. Efficiency on 8 cores can reach up to 98%. It is not easy to have a parallel skyline solver that could outperform the above performance. Indeed, there is no one was reported. The ancient package, Grandpa LAIPE whose base language is fortran-77, efficiently run on modern multicores.

        Suppose 7 cores can show efficiency better than 8 cores. However, this test shows a converse result when employing 7 cores.

TEST RESULTS 2 (16-byte COMPLEX)

        The second test uses the LAIPE subroutine, decompose_VSP_z16 (e.g., 16-byte COMPLEX version of decompose_VP_16), with 16-byte COMPLEX variables. The matrix order is 4000x4000. Similar to the first test, the coefficients of matrix [A] and the profile (e.g., skyline) of matrix [A] were determined by a process of random numbers.

        The test got a set of timing results as follows:

Processor: 1
Elapsed Time (Seconds): 3116.71
CPU Time in User mode (Seconds): 3116.63
CPU Time in Kernel mode (Seconds): 0.06
Total CPU Time (Seconds): 3116.70

Processors: 2
Elapsed Time (Seconds): 1559.53
CPU Time in User mode (Seconds): 3114.54
CPU Time in Kernel mode (Seconds): 0.09
Total CPU Time (Seconds): 3114.64

Processors: 3
Elapsed Time (Seconds): 1047.75
CPU Time in User mode (Seconds): 3114.58
CPU Time in Kernel mode (Seconds): 0.08
Total CPU Time (Seconds): 3114.65

Processors: 4
Elapsed Time (Seconds): 790.80
CPU Time in User mode (Seconds): 3114.86
CPU Time in Kernel mode (Seconds): 0.11
Total CPU Time (Seconds): 3114.97

Processors: 5
Elapsed Time (Seconds): 638.82
CPU Time in User mode (Seconds): 3114.45
CPU Time in Kernel mode (Seconds): 0.20
Total CPU Time (Seconds): 3114.65

Processors: 6
Elapsed Time (Seconds): 532.10
CPU Time in User mode (Seconds): 3115.23
CPU Time in Kernel mode (Seconds): 0.06
Total CPU Time (Seconds): 3115.29

Processors: 7
Elapsed Time (Seconds): 458.07
CPU Time in User mode (Seconds): 3117.91
CPU Time in Kernel mode (Seconds): 0.22
Total CPU Time (Seconds): 3118.13

Processors: 8
Elapsed Time (Seconds): 404.93
CPU Time in User mode (Seconds): 3116.76
CPU Time in Kernel mode (Seconds): 0.05
Total CPU Time (Seconds): 3116.81

We summarize the timing results into the following table.

number of cores	elapsed time (sec.)	speedup	efficiency (%)
1	3116.71	1.00	100.00
2	1559.53	2.00	99.93
3	1047.75	2.97	99.16
4	790.80	3.94	98.53
5	638.82	4.88	97.58
6	532.10	5.86	97.62
7	458.07	6.80	97.20
8	404.93	7.70	96.21

        The second test with 16-byte COMPLEX variables also showed an outstanding parallel performance. Efficiency of 8 cores reached up to 96%, and speedup of 8 cores could run up to 7.70X.

        As compared with the first test that was with 16-byte REAL variables, the first test shows a better speedup 7.85X on 8 cores. One possibility why the second test slowed down speedup, a little bit, to 7.70X on 8 cores is from memory bus. The second test was with 16-byte COMPLEX variables that process more data. One possibility was memory bus did not feed hungry CPUs.

        Parallelizing a skyline solver (or variable band solver) is not an easy task. This test provides an opportunity for us to see some parallel performances of skyline solvers. Grandpa LAIPE, ancient software, still shows a strong performance on modern multicores. How are newly developed multicore applications? We haven't heard a comparative performance.

        In the coming posts, we are going to see parallel performances of other LAIPE parallel solvers.