Equation Solution High Performance by Design 




Grandpa on Multicores (3)
[Posted by JennChing Luo on Apr. 25, 2011]
This post shows how grandpa LAIPE decomposes a dense, symmetric and positive definite matrix into triangular matrices on multicores.
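The decomposition named above — factoring a symmetric, positive definite matrix into triangular matrices — is the classic Cholesky factorization, A = L·L^T with L lower triangular. LAIPE's parallel Fortran internals are not shown in this post, so the following is only a minimal serial sketch in Python of the same mathematical operation, not LAIPE's implementation:

```python
import math

def cholesky(a):
    """Factor a symmetric positive definite matrix into L with A = L * L^T.

    a: n x n matrix as a list of lists; returns the lower-triangular L.
    (Illustrative serial sketch only; LAIPE's decompose_CSP_16 is a
    parallel Fortran routine whose source is not shown in this post.)
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: remove contributions of the columns already done.
        s = a[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(s)
        for i in range(j + 1, n):
            # Entries below the diagonal in column j.
            L[i][j] = (a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

# A small symmetric positive definite example.
A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
L = cholesky(A)   # L = [[2,0,0], [1,2,0], [1,1,2]]
```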
Decomposing a dense matrix is not a main task in scientific and engineering applications, because most scientific and engineering problems are approximated as sparse systems. In that sense, sparse solvers matter more than dense solvers in scientific and engineering computing. Even though dense decomposition is of relatively less importance, there are applications that need dense solvers for a system of symmetric and positive definite equations, and grandpa LAIPE has functions for that purpose. In this post we are going to see how efficiently grandpa decomposes a dense, symmetric and positive definite matrix on multicores.

A dense and symmetric matrix has a form, for example, as:

              / x x x x x x x x x \
              |   x x x x x x x x |
              |     x x x x x x x |
              |       x x x x x x |
        [A] = |         x x x x x |
              |           x x x x |
              |  sym.       x x x |
              |               x x |
              \                 x /

The testing platform is the same as in the first post, "Grandpa on Multicores (1)": a quad dual-core Opteron 870 (e.g., a SunFire v40z) running Windows Server 2008 R2. The test example, of order (2500x2500), was in 16-byte REAL variables and called the LAIPE subroutine decompose_CSP_16 to decompose the test matrix into triangular matrices. It was compiled with gfortran 4.6.0 and option -O3.

Timing Result

We got a set of timing results as follows:
    cores    Elapsed (s)    User CPU (s)    Kernel CPU (s)    Total CPU (s)
      1        911.79         911.73            0.05             911.78
      2        483.54         962.15            0.03             962.18
      3        331.75         983.37            0.05             983.41
      4        253.74         994.07            0.08             994.15
      5        206.25        1001.32            0.02            1001.34
      6        174.77        1008.05            0.05            1008.09
      7        152.05        1014.24            0.08            1014.32
      8        134.72        1016.35            0.09            1016.44

We summarize the above timing results into speedup and efficiency as follows:
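The speedup and efficiency figures quoted below follow directly from the elapsed times above: speedup on p cores is T(1)/T(p), and efficiency is speedup divided by p. A quick check in Python:

```python
# Elapsed times (seconds) reported above, keyed by core count.
elapsed = {1: 911.79, 2: 483.54, 3: 331.75, 4: 253.74,
           5: 206.25, 6: 174.77, 7: 152.05, 8: 134.72}

t1 = elapsed[1]
for p in sorted(elapsed):
    speedup = t1 / elapsed[p]      # speedup relative to a single core
    efficiency = speedup / p       # fraction of ideal linear speedup
    print(f"cores: {p}   speedup: {speedup:.2f}x   efficiency: {efficiency:.0%}")
```

On 8 cores this gives 911.79 / 134.72, roughly 6.77x, i.e. about 85% efficiency.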
Compared with its performance on sparse solvers, grandpa LAIPE did not decompose the dense, symmetric and positive definite matrix as efficiently as sparse matrices. For example, in "Grandpa on Multicores (2)", grandpa sped up to 7.85x on 8 cores, an efficiency of 98%. In this dense example, 8 cores only sped up to 6.77x, which is equivalent to 85% efficiency. A dense matrix is the simplest matrix type, so it may sound odd that speedup dropped for the simplest case. Even so, 85% efficiency on 8 cores is considered efficient in peer research. The following shows a citation.

A Performance of Peer Research

The following cites the paper "A DAG-based parallel Cholesky factorization for multicore systems" by Dr. J. D. Hogg to show the performance of peer research. Dr. Hogg's paper reports similar research on decomposing a dense, symmetric and positive definite matrix for direct solvers. Dr. Hogg's paper can be read here. Table 6.4 of Dr. Hogg's paper includes speedup on multicores. The following copies the speedup from Dr. Hogg's paper (Table 6.4, n=2500, the same order as the test example), so we can compare the performance in decomposing a dense, symmetric and positive definite matrix.
Compared with peer research, grandpa's performance in this example is still strong. On 8 cores, grandpa speeds up to 6.77x, while Dr. Hogg's speedup is 5.7. Even though it falls short of near-perfect speedup, grandpa LAIPE still runs very strongly in this dense example. The ancient software, grandpa LAIPE, developed in the early 1990s, still runs efficiently on modern multicores; Dr. Hogg's paper is dated 2009. Grandpa LAIPE, though not perfect, still outperforms newly developed solvers.
