Equation Solution High Performance by Design 




Grandpa on Multicores (5)
[Posted by JennChing Luo on May 30, 2011 ]
This post continues the series of parallel performance of grandpa LAIPE on multicore computer. In this post, we are going to see a special type of direct solver, multientry solver, for the solution of system equations [A]{X}={B}.
What is multientry solver? Here does not describe the solver in mathematical derivation. What is it? The name speaks itself. Multientry solver can begin with multiple entries. That fits parallel environments in nature. For example, on a twocore computer, the solver starts at two entries; on an 8core computer, the solver starts at 8 indices simultaneously. Each core has an individual entry point. Multientry solver maps onto physical cores in nature, which is well suitable for multicore computing. Some people may ask if it is difficult to convert a direct method into a multientry solver. The answer could be a "yes". We usually did not see a direct method that is allowed to simultaneously start at multiple indices. The reason is that procedure of direct method has a fixed operation, in order, not based on convergent rate. How to break a fixed operation to be able to simultaneously start at different indices is a mathematical question, and is not always possible. For example, when decomposing [A] into [L][U], we usually start at the coefficient A_{11}. Is it practical to decompose [A] into [L][U] that simultaneously starts at multiple indices, e.g., A_{33}, A_{77}, A_{99}? That is a mathematical question. It is not always possible to break down a direct method into a multientry procedure. But, it is easy to convert an iterative method (e.g., Jacobi iteration) into a multientry procedure. However, that may create a negative effect, for example, slow rate of convergence. Multientry solvers of Grandpa LAIPE are direct method, and have no such negative effect. Multientry solvers of grandpa LAIPE are a special type of direct solver, which is most suitable for small bandwidth system, e.g., matrix [A] is as: / \  x x x   x x x x   x x x x x   x x x x x x   x x x x x x   x x x x x x   x x x x x x   x x x x x x   x x x x x   x x x x  \ / Similar to previous performance tests, coefficients of the test example were generated by a pseudo random procedure, and were declared as 16byte REAL variable, and had a size of order (2,000,000x2,000,000) with a small bandwidth 8. The test platform remains the same as the first post "Grandpa on Multicores (1)". We obtained a set of timing results as follows:
Core: 1
Elapsed Time (seconds): 152.07 CPU Time in User mode (seconds): 151.37 CPU Time in Kernel mode (seconds): 0.70 Total CPU Time (seconds): 152.07 Cores: 2 Elapsed Time (seconds): 76.72 CPU Time in User mode (seconds): 152.55 CPU Time in Kernel mode (seconds): 0.73 Total CPU Time (seconds): 153.29 Cores: 3 Elapsed Time (seconds): 51.40 CPU Time in User mode (seconds): 153.30 CPU Time in Kernel mode (seconds): 0.84 Total CPU Time (seconds): 154.14 Cores: 4 Elapsed Time (seconds): 38.75 CPU Time in User mode (seconds): 153.96 CPU Time in Kernel mode (seconds): 0.83 Total CPU Time (seconds): 154.78 Cores: 5 Elapsed Time (seconds): 31.09 CPU Time in User mode (seconds): 154.07 CPU Time in Kernel mode (seconds): 1.05 Total CPU Time (seconds): 155.11 Cores: 6 Elapsed Time (seconds): 26.08 CPU Time in User mode (seconds): 154.55 CPU Time in Kernel mode (seconds): 1.34 Total CPU Time (seconds): 155.89 Cores: 7 Elapsed Time (seconds): 22.31 CPU Time in User mode (seconds): 154.64 CPU Time in Kernel mode (seconds): 1.08 Total CPU Time (seconds): 155.72 Cores: 8 Elapsed Time (seconds): 19.63 CPU Time in User mode (seconds): 155.00 CPU Time in Kernel mode (seconds): 1.08 Total CPU Time (seconds): 156.08 From the above timing result, we can see the elapsed time was almost linearly reduced when employing more cores. We summarize the data to have speedup and efficiency in the following:
On 8 cores, grandpa LAIPE can efficiently speed up to 7.75x, which is equivalent to a 96.84% of efficiency. The ancient software, grandpa LAIPE, in fortran77 and ancient parallel tools, can efficiently run on modern multicore computers. We have not seen a newly developed solver that could yield a compatible speedup on 8 cores. 


