Equation Solution  
    High Performance by Design

 
Grandpa on Multicores (1)


[Posted by Jenn-Ching Luo on Mar. 28, 2011]

        GFORTRAN (version 4.6) supports quad-math, i.e., 16-byte REAL and COMPLEX variables. This writer starts a series of performance tests of gfortran quad-math on the parallel numerical package LAIPE. This post is the first of the series.

        LAIPE is a library of parallel direct solvers, released in 1995. In terms of a software life cycle, LAIPE is a grandpa. We are going to see how grandpa runs on multicores.

        When LAIPE was developed, the world did not have multicores. LAIPE was programmed in MTASK with ancient parallel concepts, which include only three parallel functions: task manipulation, parallel lock, and parallel event. Compared with MTASK, a modern parallel programming language (e.g., OpenMP), with its plenty of functions for parallel computing, is a monster.

        How old is grandpa LAIPE? You may be surprised. Some concepts programmed into LAIPE have already been abandoned by modern parallel computing. For example, the parallel event, which was programmed into LAIPE, is not seen in modern parallel programming languages. This writer did find the ancient concept of the parallel event to be a useful means of synchronizing tasks, especially when parallelism is asynchronous. At the critical moment when a task synchronizes with others, a cooperating task can be guided, based on the status of an event, to perform a particular function instead of sitting idle. A parallel event thus allows tasks to do extra work at a synchronization point, which a "parallel lock" or "barrier" cannot provide. However, the majority do not favor the parallel event in modern parallel computing, and modern parallel programming languages do not have it. Grandpa uses "ancient" terminologies. It is interesting to see whether ancient terminologies work on modern hardware.

        Most LAIPE programs are in Fortran-77, with one distinguishing exception: local variables (e.g., loop indices) in subroutines are declared with an "automatic" attribute, whereas Fortran-77 itself allocates "static" variables only. That is the only distinguishing exception programmed in LAIPE. It is also interesting to see whether grandpa LAIPE, programmed in Fortran-77, can yield a speedup on modern multicores.

        Grandpa has all kinds of parallel direct solvers for scientific and engineering computing. The first test is on the parallel band solver (CSP) for the system of equations [A]{X}={B}, where the matrix [A] is symmetric and positive definite with a constant bandwidth, e.g., in the following form:
 /                                  \
 |  x                               |
 |  x  x                            |
 |  x  x  x              sym.       |
 |     x  x  x                      |
 |        x  x  x                   |
 |           x  x  x                |
 |              x  x  x             |
 |                 x  x  x          |
 |                                  |
 |                         ....     |
 \                               x  /

TESTING PLATFORM

        The compiler is gfortran 4.6, with the optimization option -O3. The test also ran on a grandpa computer: a Sun Fire V40z with four dual-core AMD Opteron 870 processors, a total of 8 cores, and 16 GB of RAM. Windows Server 2008 R2 was installed.

TIMING RESULTS 1 (16-BYTE REAL)

        The first test is on 16-byte REAL variables. The matrix [A] is of order 10,000x10,000 with a half bandwidth of 300. This test calls the LAIPE subroutine decompose_CSP_16 to decompose the matrix [A] into triangular matrices; the CSP family is a parallel version of the Cholesky decomposition. The test example was implemented first on 1 core, then on 2 cores, ..., and finally on 8 cores. Timing results are as follows:

Processor: 1
      Elapsed Time (Seconds): 154.44
      CPU Time in User mode (Seconds): 154.41
      CPU Time in Kernel mode (Seconds): 0.03
      Total CPU Time (Seconds): 154.44

Processors: 2
      Elapsed Time (Seconds): 79.83
      CPU Time in User mode (Seconds): 159.37
      CPU Time in Kernel mode (Seconds): 0.05
      Total CPU Time (Seconds): 159.42

Processors: 3
      Elapsed Time (Seconds): 54.30
      CPU Time in User mode (Seconds): 161.57
      CPU Time in Kernel mode (Seconds): 0.03
      Total CPU Time (Seconds): 161.60

Processors: 4
      Elapsed Time (Seconds): 41.15
      CPU Time in User mode (Seconds): 162.43
      CPU Time in Kernel mode (Seconds): 0.05
      Total CPU Time (Seconds): 162.48

Processors: 5
      Elapsed Time (Seconds): 33.23
      CPU Time in User mode (Seconds): 163.32
      CPU Time in Kernel mode (Seconds): 0.05
      Total CPU Time (Seconds): 163.36

Processors: 6
      Elapsed Time (Seconds): 28.84
      CPU Time in User mode (Seconds): 163.41
      CPU Time in Kernel mode (Seconds): 0.06
      Total CPU Time (Seconds): 163.47

Processors: 7
      Elapsed Time (Seconds): 25.10
      CPU Time in User mode (Seconds): 163.93
      CPU Time in Kernel mode (Seconds): 0.11
      Total CPU Time (Seconds): 164.04

Processors: 8
      Elapsed Time (Seconds): 21.40
      CPU Time in User mode (Seconds): 164.57
      CPU Time in Kernel mode (Seconds): 0.06
      Total CPU Time (Seconds): 164.63

        While the test ran, the computer was "stand-alone", i.e., no other user application was running. But that is not the same as the computer having only one job: Windows Server 2008 keeps many services active, which also consume CPU time. The actual performance is therefore likely better than the timing results show. It can also be seen from the timing results that the elapsed time is reduced almost linearly. The following table shows the speedup and efficiency.

number of cores   elapsed time (sec.)   speedup   efficiency (%)
       1                154.44            1.00        100.00
       2                 79.83            1.93         96.73
       3                 54.30            2.84         94.81
       4                 41.15            3.75         93.83
       5                 33.23            4.65         92.95
       6                 28.84            5.36         89.25
       7                 25.10            6.15         87.90
       8                 21.40            7.22         90.21

        The first column is the number of cores, and the second column is the elapsed time in seconds; it can be seen that the elapsed time is reduced almost linearly as more cores are employed. The third column is the speedup, the ratio of the elapsed time on 1 core to the elapsed time on multiple cores. The fourth column is the efficiency, the ratio of the speedup to the number of cores.

        From the above table, we can see grandpa ran efficiently on multicores. The software was developed in the early 1990s, before multicores were invented, and was programmed with ancient parallel concepts, using only three parallel functions (task manipulation, parallel lock, and parallel event) and Fortran-77. Grandpa LAIPE has no problem running on a modern computer: it reduces the elapsed time almost linearly as more cores are employed.

        The fourth column also shows grandpa ran efficiently on multicores. When 8 cores were employed, the efficiency was up to 90%. The above timing is for decomposing a matrix into triangular matrices, not merely for parallelizing a loop. Grandpa can run efficiently on multicores.

        The result is a little anomalous. Normally, efficiency decreases: more cores lead to lower efficiency. In the above table, however, the efficiency on 8 cores is higher than on 6 or 7 cores. This writer has seen this strange behavior on this computer in other tests. The actual cause is unclear.

TIMING RESULTS 2 (16-BYTE COMPLEX)

        In the second test, the coefficients of the matrix [A] were declared as 16-byte COMPLEX variables. The order and half bandwidth of the matrix [A] remain the same: 10,000x10,000 and 300, respectively. The second test calls the subroutine decompose_CSP_16z to decompose the matrix [A]. Timing results are as follows:

Processor: 1
      Elapsed Time (Seconds): 588.22
      CPU Time in User Mode (Seconds): 588.14
      CPU Time in Kernel Mode (Seconds): 0.06
      Total CPU Time (Seconds): 588.20

Processors: 2
      Elapsed Time (Seconds): 298.65
      CPU Time in User Mode (Seconds): 596.53
      CPU Time in Kernel Mode (Seconds): 0.09
      Total CPU Time (Seconds): 596.63

Processors: 3
      Elapsed Time (Seconds): 200.73
      CPU Time in User Mode (Seconds): 598.67
      CPU Time in Kernel Mode (Seconds): 0.03
      Total CPU Time (Seconds): 598.70

Processors: 4
      Elapsed Time (Seconds): 152.29
      CPU Time in User Mode (Seconds): 600.32
      CPU Time in Kernel Mode (Seconds): 0.08
      Total CPU Time (Seconds): 600.40

Processors: 5
      Elapsed Time (Seconds): 121.07
      CPU Time in User Mode (Seconds): 601.57
      CPU Time in Kernel Mode (Seconds): 0.12
      Total CPU Time (Seconds): 601.70

Processors: 6
      Elapsed Time (Seconds): 107.09
      CPU Time in User Mode (Seconds): 601.98
      CPU Time in Kernel Mode (Seconds): 0.09
      Total CPU Time (Seconds): 602.07

Processors: 7
      Elapsed Time (Seconds): 93.49
      CPU Time in User Mode (Seconds): 603.18
      CPU Time in Kernel Mode (Seconds): 0.09
      Total CPU Time (Seconds): 603.27

Processors: 8
      Elapsed Time (Seconds): 80.32
      CPU Time in User Mode (Seconds): 603.22
      CPU Time in Kernel Mode (Seconds): 0.16
      Total CPU Time (Seconds): 603.38

        COMPLEX variables require more operations than REAL variables, taking about 3.8X as much time here. Our interest is the speedup. The elapsed times again show a nearly linear speedup, and are summarized in the following table.

number of cores   elapsed time (sec.)   speedup   efficiency (%)
       1                588.22            1.00        100.00
       2                298.65            1.97         98.48
       3                200.73            2.93         97.68
       4                152.29            3.86         96.56
       5                121.07            4.86         97.17
       6                107.09            5.49         91.55
       7                 93.49            6.29         89.88
       8                 80.32            7.32         91.54

        Everything works well: the elapsed time is reduced almost linearly as more cores are employed.

        The above test also shows that grandpa can run efficiently on modern multicores; the efficiency on 8 cores is up to 91%. Most grandpa programs were written in Fortran-77 with three basic parallel functions (task manipulation, parallel lock, and parallel event). Grandpa shows us one thing: ancient parallel programming technologies are sufficient for parallel computing, and it is unnecessary to keep developing new parallel programming languages. Parallel computing should focus on parallel algorithms, not on programming languages.

        In this test, COMPLEX variables show a better speedup than REAL variables, but that is not always true. The anomalous efficiency also appears here when 7 cores are compared with 8: one would suppose 7 cores to have a better efficiency than 8 cores, but the test shows the opposite.

        Grandpa was programmed in Fortran-77 with three basic parallel functions (task manipulation, parallel lock, and parallel event), and it can efficiently take advantage of modern multicores. Is it necessary to keep developing parallel programming languages? That is a question.