Equation Solution | High Performance by Design
Implementation of Constant-Bandwidth Solver of LAIPE2 on 48 Cores
[Posted by Jenn-Ching Luo on Nov. 20, 2014]
This post shows the parallel performance of a LAIPE2 solver on a 48-core computer. Multicore hardware can speed up a computation, and many readers may wonder how much improvement a multicore computer can actually deliver. This post presents a set of timing results to show the performance.
The test was run on a Dell PowerEdge R815 with four 1.9 GHz Opterons, each of which has 12 cores, for 48 cores in total. Only the Server Core installation of Microsoft Windows Server 2008 was on the computer, i.e., a command-driven environment without a graphical user interface. The testing problem was a sparse, symmetric, positive definite system of equations, [A]{X}={B}, where [A] is of order (50,000x50,000) and has a constant bandwidth of 7,000. The solution invoked laipe$decompose_CSP_10 and laipe$substitute_CSP_10. The subroutine laipe$decompose_CSP_10 decomposes the matrix [A] into a product of triangular matrices; the subroutine laipe$substitute_CSP_10 then performs the substitutions. Substitution is not a time-consuming procedure and is of less interest, so this post shows only the time spent in the subroutine laipe$decompose_CSP_10.

QUICK LOOK AT THE IMPRESSIVE PERFORMANCE
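The decompose-then-substitute approach can be illustrated with a small serial sketch. The internals of laipe$decompose_CSP_10 are not published, so the code below is only a plain banded Cholesky factorization ([A] = [L][L]^T) with forward and back substitution for a constant-bandwidth SPD matrix; the function names are invented for this illustration and are not part of LAIPE2.

```python
# Illustrative serial version of the factor/substitute split for an SPD
# matrix with constant half-bandwidth m. The band limits every inner loop,
# which is what makes constant-bandwidth solvers cheaper than dense ones.
import math

def cholesky_banded(A, m):
    """Return the banded Cholesky factor L (A = L L^T) of SPD matrix A
    with half-bandwidth m; A is a dense list-of-lists for clarity."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = A[j][j] - sum(L[j][k] ** 2 for k in range(max(0, j - m), j))
        L[j][j] = math.sqrt(s)
        # only rows inside the band below the diagonal are nonzero
        for i in range(j + 1, min(n, j + m + 1)):
            t = A[i][j] - sum(L[i][k] * L[j][k]
                              for k in range(max(0, i - m), j))
            L[i][j] = t / L[j][j]
    return L

def substitute(L, b, m):
    """Solve L y = b (forward), then L^T x = y (backward)."""
    n = len(L)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k]
                           for k in range(max(0, i - m), i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k]
                           for k in range(i + 1, min(n, i + m + 1)))) / L[i][i]
    return x
```

The post's solver distributes this work across cores; the sketch above only shows the numerical skeleton, where the decomposition dominates the cost and the substitutions are comparatively cheap.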
From the timing results below, it can be seen that one core took 13878.71 seconds to solve the testing problem, which is about 3 hours and 51 minutes. That is a long computation. Complex engineering or scientific problems may be formulated as systems with more than 100,000 unknowns, which could require even more time than this testing problem. Efficient parallel solvers to speed up such computations are always desirable.
An interesting question is how fast 48 cores can solve the testing problem. From the timing results below, we can see that 48 cores took only 332.84 seconds, about 5 minutes and 33 seconds. Wow! A computation of nearly four hours on one core could be done in about five and a half minutes on 48 cores. It is very clear that there is no reason to dismiss parallel computing; undoubtedly, we can benefit from multicore hardware and parallel programming.

One caveat is worth mentioning: parallel computing is not a cure-all. Sometimes it is an extra burden and yields no benefit. That can happen when the problem size is small and there is not enough computation to distribute among the cooperating cores. In such a situation, parallel computing gains nothing and may, on the contrary, slow the computation down.

TIMING RESULTS

Detailed timing results are listed in the following. The results include the elapsed time, the CPU time in user mode and in kernel mode, and the total CPU time. While solving the testing problem, the computer ran no other jobs except the operating system; under that circumstance the computer can be regarded as standalone, and the elapsed time can be used to measure speedup and efficiency. In general, the total CPU time increases when more cores are used, because more cores incur relatively higher overhead and more synchronization. However, a stroke of "good luck" may reduce the total CPU time: before an instruction is executed, its data must be loaded into cache memory, and when one core demands a specific piece of data and loads it into cache, other lucky cores that also demand that data may access it without having to load it themselves. The total CPU time may drop if such "good luck" occurs most of the time, and when it does, we can see "super-linear performance". Besides "good luck", we may also experience "bad luck" from cache coherency, which can degrade parallel performance.
With the above brief description, the following timing results should be understandable. Note that the total CPU times are not in increasing order.
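The distinction between elapsed (wall-clock) time and CPU time drawn above can be reproduced with Python's standard timers. This is a minimal single-process sketch, unrelated to the Windows tooling behind the post's measurements:

```python
# Elapsed time counts everything that passes on the clock, including
# waiting; CPU time counts only the time the processor actually worked
# for this process.
import time

wall0 = time.perf_counter()
cpu0 = time.process_time()

total = sum(i * i for i in range(200_000))  # real work: consumes CPU time
time.sleep(0.2)                             # waiting: elapses, uses no CPU

wall = time.perf_counter() - wall0
cpu = time.process_time() - cpu0
# wall includes the 0.2 s sleep, cpu does not, so wall > cpu here
print(f"elapsed: {wall:.3f} s, cpu: {cpu:.3f} s")
```

On a multicore run like the one in the post, the relationship flips: the cooperating cores all accumulate CPU time simultaneously, so the total CPU time typically exceeds the elapsed time.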
SPEEDUP AND EFFICIENCY

The above timing results are summarized in the following table.
The first column is the number of cores that solved the testing problem. The second column is the elapsed time in seconds; from the table, we can see the elapsed time decreases as cores are added, i.e., the computation was sped up by more cores. The third column is the speedup, which shows super-linear performance in the range of 2 to 15 cores. In super-linear performance, the speedup is greater than the number of physical cores; for example, 2 cores yield a speedup of 2.1970x. That looks illogical, but it happens, and it comes from caching. The fourth column is the efficiency, which likewise shows super-linear performance (efficiency higher than 100%) in the range of 2 to 15 cores. In this testing problem, 48 cores yield an efficiency of 86.87%, which is highly efficient.
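The speedup and efficiency figures can be recomputed from the two elapsed times quoted earlier (13878.71 s on one core, 332.84 s on 48 cores). The helper names below are ours, not part of LAIPE2:

```python
# Speedup = T(1) / T(p); efficiency = speedup / p, as a percentage.

def speedup(t_one_core, t_p_cores):
    return t_one_core / t_p_cores

def efficiency_pct(t_one_core, t_p_cores, p):
    return speedup(t_one_core, t_p_cores) / p * 100.0

s48 = speedup(13878.71, 332.84)             # about 41.70x on 48 cores
e48 = efficiency_pct(13878.71, 332.84, 48)  # about 86.87 %
print(f"speedup: {s48:.2f}x, efficiency: {e48:.2f}%")
```

Super-linear entries in the table are simply rows where `speedup(...)` exceeds the core count, i.e., where `efficiency_pct(...)` exceeds 100.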