Grandpa on Multicores (3)


[Posted by Jenn-Ching Luo on Apr. 25, 2011]

        This post shows how grandpa LAIPE decomposes a dense, symmetric and positive definite matrix into triangular matrices on multicores.

        Decomposing a dense matrix is not a main task in scientific and engineering applications, because most scientific and engineering problems are approximated by sparse systems. In that sense, a sparse solver is more important than a dense solver in scientific and engineering computing.

        Even though decomposing a dense matrix is of relatively less importance in scientific and engineering computing, there are applications that need a dense solver for a system of symmetric and positive definite equations. Grandpa LAIPE has functions for that purpose. In this post we are going to see how efficiently grandpa decomposes a dense, symmetric and positive definite matrix on multicores.

        A dense, symmetric matrix has the following form, for example:
       /                             \
       |  x  x  x  x  x  x  x  x  x  |
       |     x  x  x  x  x  x  x  x  |
       |        x  x  x  x  x  x  x  |
       |           x  x  x  x  x  x  |
 [A] = |              x  x  x  x  x  |
       |                 x  x  x  x  |
       |    sym.            x  x  x  |
       |                       x  x  |
       |                          x  |
       \                             /
Solvers decompose matrix [A] into the product [U]^T[U], where [U] is an upper triangular matrix. This post does not show how to do it. The algorithm programmed in grandpa is not the traditional one illustrated in algebra textbooks; grandpa has its own parallel algorithm to decompose matrix [A].
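
        For reference only, the following is a minimal sketch of the textbook serial [U]^T[U] (Cholesky) factorization in Fortran, applied to a small made-up positive definite matrix. It illustrates the classical algorithm, not grandpa's parallel algorithm, whose details are not given in this post.

! A minimal sketch of the textbook [U]^T[U] (Cholesky) factorization of a
! dense, symmetric, positive definite matrix.  This is the classical serial
! algorithm, not grandpa LAIPE's parallel algorithm; the 4x4 matrix is a
! made-up example.
program cholesky_textbook
    use iso_fortran_env, only: wp => real64
    implicit none
    integer, parameter :: n = 4
    real(wp) :: a(n,n), u(n,n)
    integer :: i, j

    ! A small symmetric, positive definite (diagonally dominant) matrix.
    a = reshape([4.0_wp, 1.0_wp, 1.0_wp, 1.0_wp, &
                 1.0_wp, 5.0_wp, 2.0_wp, 1.0_wp, &
                 1.0_wp, 2.0_wp, 6.0_wp, 2.0_wp, &
                 1.0_wp, 1.0_wp, 2.0_wp, 7.0_wp], [n, n])

    u = 0.0_wp
    do i = 1, n
        ! diagonal entry of row i
        u(i,i) = sqrt(a(i,i) - dot_product(u(1:i-1,i), u(1:i-1,i)))
        ! entries to the right of the diagonal
        do j = i + 1, n
            u(i,j) = (a(i,j) - dot_product(u(1:i-1,i), u(1:i-1,j))) / u(i,i)
        end do
    end do

    ! [U]^T[U] should reproduce [A] up to round-off.
    print '(4f10.5)', matmul(transpose(u), u) - a
end program cholesky_textbook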

        The testing platform is the same as in the first post, "Grandpa on Multicores (1)": a quad dual-core Opteron 870 server (i.e., a SunFire v40z) running Windows Server 2008 R2. The test matrix, of order (2500x2500), was stored in 16-byte REAL variables and decomposed into triangular matrices with the LAIPE subroutine decompose_CSP_16; the test program was compiled with gfortran 4.6.0 and option -O3.
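
        As a rough sketch of how such a test could be timed (this is not the actual LAIPE test program, whose source is not shown in this post), the standard Fortran intrinsics system_clock and cpu_time can bracket the factorization call. The order 2500 and the 16-byte REAL kind match the test example; the decompose_CSP_16 call itself is only marked by a comment because its argument list is not given here, and the post's user-mode/kernel-mode CPU times presumably come from OS counters rather than cpu_time.

! A sketch of a timing harness: build a 2500x2500 symmetric, positive
! definite matrix in 16-byte REALs and measure elapsed (wall-clock) time
! and CPU time around the factorization.  The LAIPE call is only marked
! by a comment; its interface is not reproduced here.
program timing_sketch
    use iso_fortran_env, only: wp => real128   ! 16-byte REAL
    implicit none
    integer, parameter :: n = 2500
    real(wp), allocatable :: a(:,:)
    integer :: i, j, count0, count1, rate
    real :: cpu0, cpu1

    allocate(a(n,n))
    ! symmetric fill plus a dominant diagonal, so [A] is positive definite
    do j = 1, n
        do i = 1, n
            a(i,j) = 1.0_wp / real(i + j, wp)
        end do
        a(j,j) = a(j,j) + real(n, wp)
    end do

    call system_clock(count0, rate)
    call cpu_time(cpu0)

    ! call decompose_CSP_16(...)   ! the LAIPE subroutine used in the test;
    !                              ! its argument list is not shown in this post

    call cpu_time(cpu1)
    call system_clock(count1)

    print *, 'Elapsed time (seconds):', real(count1 - count0) / real(rate)
    print *, 'CPU time (seconds):    ', cpu1 - cpu0
end program timing_sketch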

Timing Results

        We got a set of timing results as follows:

core: 1
      Elapsed Time (seconds): 911.79
      CPU Time in User Mode (seconds): 911.73
      CPU Time in Kernel Mode (seconds): 0.05
      Total CPU Time (seconds): 911.78

cores: 2
      Elapsed Time (seconds): 483.54
      CPU Time in User Mode (seconds): 962.15
      CPU Time in Kernel Mode (seconds): 0.03
      Total CPU Time (seconds): 962.18

cores: 3
      Elapsed Time (seconds): 331.75
      CPU Time in User Mode (seconds): 983.37
      CPU Time in Kernel Mode (seconds): 0.05
      Total CPU Time (seconds): 983.41

cores: 4
      Elapsed Time (seconds): 253.74
      CPU Time in User Mode (seconds): 994.07
      CPU Time in Kernel Mode (seconds): 0.08
      Total CPU Time (seconds): 994.15

cores: 5
      Elapsed Time (seconds): 206.25
      CPU Time in User Mode (seconds): 1001.32
      CPU Time in Kernel Mode (seconds): 0.02
      Total CPU Time (seconds): 1001.34

cores: 6
      Elapsed Time (seconds): 174.77
      CPU Time in User Mode (seconds): 1008.05
      CPU Time in Kernel Mode (seconds): 0.05
      Total CPU Time (seconds): 1008.09

cores: 7
      Elapsed Time (seconds): 152.05
      CPU Time in User Mode (seconds): 1014.24
      CPU Time in Kernel Mode (seconds): 0.08
      Total CPU Time (seconds): 1014.32

cores: 8
      Elapsed Time (seconds): 134.72
      CPU Time in User Mode (seconds): 1016.35
      CPU Time in Kernel Mode (seconds): 0.09
      Total CPU Time (seconds): 1016.44

We summarize the above timing results into speedup and efficiency as follows:

number of cores   elapsed time (sec.)   speedup   efficiency (%)
              1                911.79      1.00           100.00
              2                483.54      1.89            94.28
              3                331.75      2.75            91.61
              4                253.74      3.59            89.84
              5                206.25      4.42            88.42
              6                174.77      5.22            86.95
              7                152.05      6.00            85.67
              8                134.72      6.77            84.61
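
The speedup and efficiency columns follow the usual definitions: speedup on p cores is the one-core elapsed time divided by the p-core elapsed time, and efficiency is that speedup divided by p. The short sketch below reproduces the table from the measured elapsed times.

! Reproduce the speedup and efficiency columns from the measured elapsed
! times: speedup(p) = T(1)/T(p), efficiency(p) = 100 * speedup(p)/p.
program speedup_table
    implicit none
    real, parameter :: t(8) = [911.79, 483.54, 331.75, 253.74, &
                               206.25, 174.77, 152.05, 134.72]
    integer :: p

    print '(a)', ' cores  elapsed (sec.)  speedup  efficiency (%)'
    do p = 1, 8
        print '(i5, f15.2, f10.2, f14.2)', p, t(p), t(1)/t(p), &
              100.0 * t(1) / (t(p) * real(p))
    end do
end program speedup_table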

        Compared with its performance on sparse systems, grandpa LAIPE did not decompose the dense, symmetric and positive definite matrix as efficiently. For example, in "Grandpa on Multicores (2)", grandpa sped up to 7.85x on 8 cores, an efficiency of about 98%. In this dense example, 8 cores only sped up to 6.77x, equivalent to about 85% efficiency. A dense matrix is the simplest type of matrix, so it may sound odd that speedup dropped on the simplest type.

        Even though the efficiency dropped in this dense example, 85% efficiency on 8 cores is still considered efficient in peer research. The following shows a citation.

A Comparison with Peer Research

        The following cites the paper "A DAG-based parallel Cholesky factorization for multicore systems" by Dr. J. D. Hogg to show the performance of peer research.

        Dr. Hogg's paper describes similar research: decomposing a dense, symmetric and positive definite matrix for direct solvers. Dr. Hogg's paper can be read here.

        Table 6.4 of Dr. Hogg's paper reports speedup on multicores. The following copies the speedups from Dr. Hogg's paper (Table 6.4, n=2500, the same order as our test example), so we can compare the performance in decomposing a dense, symmetric and positive definite matrix.

number of cores   Dr. Hogg's speedup   Speedup of Grandpa LAIPE
              1                  1.0                       1.00
              2                  1.9                       1.89
              4                  3.5                       3.59
              8                  5.7                       6.77

Compared with the peer research, grandpa's performance in this example is still strong. On 8 cores, grandpa sped up to 6.77x, while Dr. Hogg's speedup is 5.7. Even without an almost perfect speedup, grandpa LAIPE still runs very strongly on this dense example.

        The ancient software, grandpa LAIPE, which was developed in the early 1990s, still runs efficiently on modern multicores. Dr. Hogg's paper is dated 2009. Grandpa LAIPE, even if not perfect, still outperforms a newly developed solver in this example.