Grandpa on Multicores (5)

[Posted by Jenn-Ching Luo on May 30, 2011 ]

        This post continues the series of parallel performance of grandpa LAIPE on multicore computer. In this post, we are going to see a special type of direct solver, multi-entry solver, for the solution of system equations [A]{X}={B}.

        What is multi-entry solver? Here does not describe the solver in mathematical derivation.

        What is it? The name speaks itself. Multi-entry solver can begin with multiple entries. That fits parallel environments in nature. For example, on a two-core computer, the solver starts at two entries; on an 8-core computer, the solver starts at 8 indices simultaneously. Each core has an individual entry point. Multi-entry solver maps onto physical cores in nature, which is well suitable for multicore computing.

        Some people may ask if it is difficult to convert a direct method into a multi-entry solver. The answer could be a "yes". We usually did not see a direct method that is allowed to simultaneously start at multiple indices. The reason is that procedure of direct method has a fixed operation, in order, not based on convergent rate. How to break a fixed operation to be able to simultaneously start at different indices is a mathematical question, and is not always possible. For example, when decomposing [A] into [L][U], we usually start at the coefficient A₁₁. Is it practical to decompose [A] into [L][U] that simultaneously starts at multiple indices, e.g., A₃₃, A₇₇, A₉₉? That is a mathematical question. It is not always possible to break down a direct method into a multi-entry procedure.

        But, it is easy to convert an iterative method (e.g., Jacobi iteration) into a multi-entry procedure. However, that may create a negative effect, for example, slow rate of convergence. Multi-entry solvers of Grandpa LAIPE are direct method, and have no such negative effect.

        Multi-entry solvers of grandpa LAIPE are a special type of direct solver, which is most suitable for small bandwidth system, e.g., matrix [A] is as:

 /                                \
 |  x  x  x                       |
 |  x  x  x  x                    |
 |  x  x  x  x  x                 |
 |  x  x  x  x  x  x              |
 |     x  x  x  x  x  x           |
 |        x  x  x  x  x  x        |
 |           x  x  x  x  x  x     |
 |              x  x  x  x  x  x  |
 |                 x  x  x  x  x  |
 |                    x  x  x  x  |
 \                                /

In this performance test, the LAIPE subroutine, msSolution_CSP_16, is called.

Similar to previous performance tests, coefficients of the test example were generated by a pseudo random procedure, and were declared as 16-byte REAL variable, and had a size of order (2,000,000x2,000,000) with a small bandwidth 8. The test platform remains the same as the first post "Grandpa on Multicores (1)". We obtained a set of timing results as follows:

Core: 1
Elapsed Time (seconds): 152.07
CPU Time in User mode (seconds): 151.37
CPU Time in Kernel mode (seconds): 0.70
Total CPU Time (seconds): 152.07

Cores: 2
Elapsed Time (seconds): 76.72
CPU Time in User mode (seconds): 152.55
CPU Time in Kernel mode (seconds): 0.73
Total CPU Time (seconds): 153.29

Cores: 3
Elapsed Time (seconds): 51.40
CPU Time in User mode (seconds): 153.30
CPU Time in Kernel mode (seconds): 0.84
Total CPU Time (seconds): 154.14

Cores: 4
Elapsed Time (seconds): 38.75
CPU Time in User mode (seconds): 153.96
CPU Time in Kernel mode (seconds): 0.83
Total CPU Time (seconds): 154.78

Cores: 5
Elapsed Time (seconds): 31.09
CPU Time in User mode (seconds): 154.07
CPU Time in Kernel mode (seconds): 1.05
Total CPU Time (seconds): 155.11

Cores: 6
Elapsed Time (seconds): 26.08
CPU Time in User mode (seconds): 154.55
CPU Time in Kernel mode (seconds): 1.34
Total CPU Time (seconds): 155.89

Cores: 7
Elapsed Time (seconds): 22.31
CPU Time in User mode (seconds): 154.64
CPU Time in Kernel mode (seconds): 1.08
Total CPU Time (seconds): 155.72

Cores: 8
Elapsed Time (seconds): 19.63
CPU Time in User mode (seconds): 155.00
CPU Time in Kernel mode (seconds): 1.08
Total CPU Time (seconds): 156.08

From the above timing result, we can see the elapsed time was almost linearly reduced when employing more cores. We summarize the data to have speedup and efficiency in the following:

number of cores	elapsed time (sec.)	speedup	efficiency (%)
1	152.07	1.00	100.00
2	76.72	1.98	99.11
3	51.40	2.96	98.62
4	38.75	3.92	98.11
5	31.09	4.89	97.83
6	26.08	5.83	97.19
7	22.31	6.82	97.37
8	19.63	7.75	96.84

On 8 cores, grandpa LAIPE can efficiently speed up to 7.75x, which is equivalent to a 96.84% of efficiency. The ancient software, grandpa LAIPE, in fortran-77 and ancient parallel tools, can efficiently run on modern multicore computers. We have not seen a newly developed solver that could yield a compatible speedup on 8 cores.