Parallel Matrix Multiplication on Soft Cores
[Posted by Jenn-Ching Luo on Mar. 01, 2012]
Starting with this post, this writer begins introducing the parallel performance of LAIPE2 on multicore computers.
LAIPE2 follows grandpa LAIPE in providing a library to "Link and in Parallel Execute" programs. The big difference between grandpa LAIPE and LAIPE2 is that grandpa LAIPE is programmed in MTASK, while LAIPE2 is based on soft-core computing (e.g., neuLoop). This post gives a first look at LAIPE2 performance, presenting the parallel performance of the function laipe$matmul, which performs matrix multiplication (i.e., [c]=[a][b]). The function laipe$matmul implements standard matrix multiplication, not in its original form but in a new form suited to parallel processing.

TEST PROGRAM AND PLATFORM

The test program gets the number of physical cores on the computer, initializes matrices [a] and [b], and then repeatedly executes the function laipe$matmul_8, collecting timing results with 1 soft core, 2 soft cores, and so on. The function laipe$matmul_8 is for 8-byte REAL variables, i.e., double precision. After timing laipe$matmul_8, the program also times the FORTRAN intrinsic function MATMUL for comparison. The FORTRAN program follows; interested users can link it against LAIPE2 and see the performance on their own computers.
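The driver below is a minimal sketch only, not the author's original listing: the matrix dimensions, the laipe$use routine, and the argument list of laipe$matmul_8 are assumptions, so those calls are left as comments to be filled in from the LAIPE2 reference, and the user/kernel CPU-time breakdown reported later in this post is omitted. Note that gfortran needs -fdollar-ok to accept the '$' in LAIPE2 names.

! Minimal sketch of the timing driver (not the original program).
! Assumed build line:  gfortran -O3 -fdollar-ok driver.f90 -llaipe2
program driver
   implicit none
   integer, parameter :: nra = 1000, nca = 1000, ncb = 1000  ! sizes assumed; not stated in this post
   real(kind=8), allocatable :: a(:,:), b(:,:), c(:,:)
   integer(kind=8) :: t0, t1, rate
   integer :: ncores

   allocate (a(nra,nca), b(nca,ncb), c(nra,ncb))
   call random_number(a)          ! initialize [a] and [b]
   call random_number(b)

   do ncores = 1, 8               ! 1 soft core, 2 soft cores, and so on
      ! call laipe$use (ncores)                 ! (assumed routine) set the number of soft cores
      call system_clock(t0, rate)
      ! call laipe$matmul_8 (a, b, c, ...)      ! (assumed arguments) parallel [c]=[a][b], 8-byte REAL
      call system_clock(t1)
      write (*,'(A,I2)')   'number of cores: ', ncores
      write (*,'(A,F8.2)') 'Elapsed Time (Seconds): ', real(t1 - t0, 8) / real(rate, 8)
   end do

   ! The FORTRAN intrinsic MATMUL, for comparison, on one core
   call system_clock(t0, rate)
   c = matmul(a, b)
   call system_clock(t1)
   write (*,'(A,F8.2)') 'MATMUL Elapsed Time (Seconds): ', real(t1 - t0, 8) / real(rate, 8)
end program driver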
The test platform remains the same as in the previous post, "Programming Language is Not a Drug to Treat Poor Parallelism": a Windows Server 2008 R2 machine with eight Opteron 870 cores. The hardware this writer uses is outdated. GFORTRAN is applied with optimization (-O3).

TIMING RESULTS

We collected a set of timing results, as follows, including elapsed time, CPU time in user mode, and CPU time in kernel mode.
Cores   Elapsed (s)   CPU User (s)   CPU Kernel (s)   CPU Total (s)
  1        19.89          19.87           0.02            19.89
  2         9.98          19.84           0.09            19.94
  3         6.76          20.14           0.06            20.20
  4         5.15          20.40           0.08            20.48
  5         4.21          20.61           0.06            20.67
  6         3.59          21.20           0.09            21.29
  7         3.17          21.90           0.06            21.96
  8         2.92          22.90           0.08            22.98

Timing of the FORTRAN intrinsic MATMUL (one core):
Elapsed Time (Seconds): 78.42
CPU Time in User Mode (Seconds): 78.42
CPU Time in Kernel Mode (Seconds): 0.00
Total CPU Time (Seconds): 78.42

First, look at the FORTRAN intrinsic function MATMUL: it required 78.42 seconds, running on one core. Next, look at laipe$matmul_8 on one core: it took only 19.89 seconds to complete the identical job. Even on a single core, laipe$matmul_8 significantly outperforms the FORTRAN intrinsic MATMUL. The timing results are summarized below to show the speedup and efficiency of laipe$matmul_8.
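Here speedup = T(1 core)/T(p cores) and efficiency = speedup/p, computed from the elapsed times above:

Cores   Elapsed (s)   Speedup   Efficiency
  1        19.89        1.00      100.00%
  2         9.98        1.99       99.65%
  3         6.76        2.94       98.08%
  4         5.15        3.86       96.55%
  5         4.21        4.72       94.49%
  6         3.59        5.54       92.34%
  7         3.17        6.27       89.63%
  8         2.92        6.81       85.15%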
From the above table, we can see that the elapsed time decreases almost linearly with the number of soft cores. For example, the elapsed time drops from 19.89 seconds to 9.98 seconds with 2 cores, a 1.99x speedup at 99.65% efficiency; 4 soft cores cut the elapsed time to 5.15 seconds, a 3.86x speedup at 96.55% efficiency. The speedup is nearly linear in the number of cores. The above example uses 8-byte REAL variables, i.e., double precision. The following shows the parallel performance of laipe$matmul for other data types; all of them consistently show efficient parallel performance.
LAIPE$MATMUL_4 (FOR 4-BYTE REAL VARIABLE)
LAIPE$MATMUL_16 (FOR 16-BYTE REAL VARIABLE)
WITH A SMALLER PROBLEM SIZE (NRA=960, NCA=740, NCB=760)
LAIPE$MATMUL_Z16 (FOR 32-BYTE COMPLEX VARIABLE)
WITH A SMALLER PROBLEM SIZE (NRA=960, NCA=740, NCB=760)
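The suffix in each variant name above gives the byte size of its data type. As a small sketch (assuming gfortran, which supports 16-byte REAL via kind=16 on this platform), the corresponding Fortran declarations are:

! Fortran kinds matching the laipe$matmul variants (sketch; gfortran kind numbers assumed)
program kinds
   implicit none
   real(kind=4)     :: a4          ! laipe$matmul_4  :  4-byte REAL
   real(kind=8)     :: a8          ! laipe$matmul_8  :  8-byte REAL (double precision)
   real(kind=16)    :: a16         ! laipe$matmul_16 : 16-byte REAL
   complex(kind=16) :: z16         ! laipe$matmul_z16: 32-byte COMPLEX (two 16-byte parts)
   ! storage_size returns bits; dividing by 8 prints 4, 8, 16, 32 bytes
   print *, storage_size(a4)/8, storage_size(a8)/8, storage_size(a16)/8, storage_size(z16)/8
end program kinds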