Equation Solution  
    High Performance by Design
Grandpa on Multicores (6)


[Posted by Jenn-Ching Luo on Aug. 07, 2011]

        This post continues the series on the parallel performance of grandpa LAIPE on multicores. Grandpa LAIPE is an old numerical analysis package. Can such old software take advantage of modern multicores? This writer is not going to put on a magic show; we are going to look at actual performance data.

        In this post, the subroutine Decompose_VAG_16 of grandpa LAIPE was chosen to demonstrate performance. The VAG subroutine decomposes a variable-bandwidth matrix [A] into [L][U]. It handles a general sparse matrix, e.g., asymmetric (also called unsymmetric or nonsymmetric) or indefinite. Matrix [A] has a form such as:
 /                                \
 |  x  x  x     x                 |
 |  x  x  x     x                 |
 |  x     x  x  x              x  |
 |  x     x  x  x  x           x  |
 |        x  x  x  x  x        x  |
 |        x  x     x  x        x  |
 |        x        x  x  x  x  x  |
 |        x        x  x  x  x  x  |
 |        x        x  x  x  x  x  |
 |        x              x  x  x  |
 \                                /
where the symbol x marks a location whose value is non-zero either initially or during decomposition and therefore must be stored, while the blanks are coefficients that remain zero throughout, so no space needs to be allocated for them. That is a general type of sparsity.
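The storage idea above can be sketched in a few lines of Python. This is not LAIPE's actual data structure, just a minimal illustration of variable-bandwidth (profile) storage: each row keeps only the span from its first non-zero column through its last, so no space is spent on the leading and trailing blanks. (As a simplification, this sketch does allocate any zeros that fall inside a row's span.)

```python
def pack_variable_bandwidth(dense):
    """Pack a dense matrix into (first_col, row_data) pairs, one per row."""
    packed = []
    for row in dense:
        nz = [j for j, v in enumerate(row) if v != 0]
        if not nz:
            packed.append((0, []))  # an all-zero row stores nothing
            continue
        first, last = nz[0], nz[-1]
        packed.append((first, row[first:last + 1]))  # contiguous span only
    return packed

def get(packed, i, j):
    """Read entry (i, j) back out of the packed form."""
    first, data = packed[i]
    if first <= j < first + len(data):
        return data[j - first]
    return 0  # outside the stored span: always zero, never allocated

# A small matrix with the same kind of sparsity as the sketch above.
dense = [
    [1, 2, 3, 0, 4, 0],
    [0, 5, 6, 7, 0, 0],
    [0, 0, 8, 9, 1, 2],
]
packed = pack_variable_bandwidth(dense)
```

Only the spans are allocated; reading any position outside a row's span simply returns zero.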

        Decomposing a general sparse matrix [A] into [L][U] in parallel is one of the more difficult computations. One reason is that the parallel computing flow is unpredictable. How to balance the workload among cores is a hidden question that cannot be answered before the decomposition. The difficulty does not come from examining the sparsity; in fact, examining the sparsity is trivial and costs very little. The real difficulty is that the complexity of the sparsity assigned to a core cannot be estimated until the decomposition actually takes place.
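A small illustration (not LAIPE code) of why per-core work is hard to predict: LU decomposition creates "fill-in", entries that start as zero but become non-zero during elimination, so the true cost of processing a row is unknown until the decomposition actually reaches it. The sketch below runs a plain Doolittle LU without pivoting on a tiny matrix and counts zeros before and after.

```python
def lu_in_place(a):
    """Doolittle LU without pivoting; overwrites a with L (below) and U."""
    n = len(a)
    for k in range(n):
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]                # multiplier, stored in L
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]  # update; may turn a 0 non-zero

a = [
    [4.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 1.0, 5.0, 1.0],
    [1.0, 0.0, 0.0, 2.0],
]
zeros_before = sum(v == 0 for row in a for v in row)
lu_in_place(a)
zeros_after = sum(v == 0 for row in a for v in row)
# zeros_after < zeros_before: fill-in appeared at positions that were
# zero in the original sparsity pattern.
```

For example, entry (1,2) starts as zero but becomes non-zero at the first elimination step; which entries fill in, and therefore how much work each row costs, only becomes known as the elimination proceeds.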

        LAIPE is at a grandpa age. It took this writer no extra time to prepare it for a run on modern multicores: just let it run, and we have the performance result.

Example

        We need to prepare a matrix [A] for the test. The matrix [A] in this example is of order 25,000-by-25,000. The coefficients of matrix [A] were generated by a pseudo-random process, and the sparsity pattern was also determined by a pseudo-random process. The example had a total of 38,863,244 non-zero fill-ins.

Testing Platform

        The example was run on a SunFire V40z with four dual-core AMD Opteron 870 processors, a total of 8 cores. Coefficients were declared as 16-byte variables. The compiler was gfortran 4.6, with the optimization option -O3. Windows Server 2008 R2 was installed on the computer.

Timing Result

        The computer was otherwise idle, with no other user applications running. That does not mean the computer ran only the example: during the test, Windows Server 2008 R2 had other active processes (e.g., services) that may consume CPU time or other system resources, so actual performance should be slightly better than the timing results. We obtained the following timing results:
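The listing below reports two quantities for each run: elapsed (wall-clock) time and CPU time consumed by the process, split into user and kernel mode. The distinction can be sketched in Python (the actual test was a Fortran program on Windows; this is only illustrative): CPU time accumulates across all of a process's threads, which is why total CPU time grows past the elapsed time as more cores are used.

```python
import time

def busy(n):
    """A purely CPU-bound loop, so CPU time tracks elapsed time."""
    s = 0
    for i in range(n):
        s += i * i
    return s

wall0, cpu0 = time.perf_counter(), time.process_time()
busy(1_000_000)
wall = time.perf_counter() - wall0   # elapsed (wall-clock) seconds
cpu = time.process_time() - cpu0     # CPU seconds charged to this process
```

For a single-threaded CPU-bound run the two are nearly equal; with several worker threads the process's CPU time can be several times the elapsed time, exactly as in the measurements below.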

    Processors: 1
        Elapsed Time (seconds): 351.81
        CPU Time in User Mode (seconds): 350.28
        CPU Time in Kernel Mode (seconds): 1.51
        Total CPU Time (seconds): 351.80
   
    Processors: 2
        Elapsed Time (seconds): 181.63
        CPU Time in User Mode (seconds): 358.29
        CPU Time in Kernel Mode (seconds): 3.10
        Total CPU Time (seconds): 361.39
   
    Processors: 3
        Elapsed Time (seconds): 125.71
        CPU Time in User Mode (seconds): 366.59
        CPU Time in Kernel Mode (seconds): 4.71
        Total CPU Time (seconds): 371.30
   
    Processors: 4
        Elapsed Time (seconds): 97.20
        CPU Time in User Mode (seconds): 371.17
        CPU Time in Kernel Mode (seconds): 5.79
        Total CPU Time (seconds): 376.96
   
    Processors: 5
        Elapsed Time (seconds): 81.14
        CPU Time in User Mode (seconds): 375.07
        CPU Time in Kernel Mode (seconds): 7.55
        Total CPU Time (seconds): 382.62
   
    Processors: 6
        Elapsed Time (seconds): 69.53
        CPU Time in User Mode (seconds): 377.87
        CPU Time in Kernel Mode (seconds): 7.72
        Total CPU Time (seconds): 385.59
   
    Processors: 7
        Elapsed Time (seconds): 62.60
        CPU Time in User Mode (seconds): 381.31
        CPU Time in Kernel Mode (seconds): 9.33
        Total CPU Time (seconds): 390.64
   
    Processors: 8
        Elapsed Time (seconds): 56.83
        CPU Time in User Mode (seconds): 384.34
        CPU Time in Kernel Mode (seconds): 9.94
        Total CPU Time (seconds): 394.28

From the above timing results, we can see that the elapsed time was reduced when using more cores. The first impression is that grandpa LAIPE can speed up on multicores. To see more detail, we summarize the timing results in the following table:

    number of cores   elapsed time (sec.)   speedup   efficiency (%)
                  1                351.81      1.00           100.00
                  2                181.63      1.94            96.85
                  3                125.71      2.80            93.29
                  4                 97.20      3.62            90.49
                  5                 81.14      4.34            86.72
                  6                 69.53      5.06            84.33
                  7                 62.60      5.62            80.29
                  8                 56.83      6.19            77.38
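The speedup and efficiency columns are derived from the elapsed times in the usual way: speedup = T(1) / T(p) and efficiency = speedup / p. A few lines of Python reproduce the table from the measured elapsed times:

```python
# Measured elapsed times (seconds) from the listing above, keyed by core count.
elapsed = {1: 351.81, 2: 181.63, 3: 125.71, 4: 97.20,
           5: 81.14, 6: 69.53, 7: 62.60, 8: 56.83}

# speedup = T(1)/T(p); efficiency (%) = 100 * speedup / p
table = {p: (round(elapsed[1] / t, 2), round(100 * (elapsed[1] / t) / p, 2))
         for p, t in elapsed.items()}
```

Running this reproduces every row of the table, e.g., 6.19X speedup and 77.38% efficiency on 8 cores.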

On 8 cores, the grandpa LAIPE subroutine Decompose_VAG_16 speeds up to 6.19X. Is that a good speedup? If we refer to the performance of the LAIPE skyline solver, e.g., Decompose_VSP_16, shown in "Grandpa on Multicores (2)", the skyline solver speeds up to 7.85X on 8 cores. The 6.19X speedup in this test is lower than the skyline solver's; this post's performance is not LAIPE's best.

        If we compare with SuperLU, which sped up to 3.7X on 8 cores as cited in the post "Is Parallel Computing Easy", the 6.19X speedup in this post is much better. Even a less-than-best performance of grandpa LAIPE still outperforms other numerical packages.