Equation Solution High Performance by Design
OpenMP: Parallel Computing for Dummies
[Posted by Jenn-Ching Luo on Dec. 03, 2008]
The word "dummies" is used here to emphasize the OpenMP Architecture Review Board's motivation to make parallel computing simple and easy for the user. OpenMP preselects a set of base-language constructs, for example the do-construct, as a basis for parallel computing. The user identifies supported constructs and manually inserts directives that help the compiler reconstruct those constructs into parallel form. The user does not need to create threads or to consider what work is assigned to each thread. OpenMP thus allows users who do not have sufficient knowledge of parallel computing to explore it.
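The workflow described above can be illustrated with a minimal sketch in C. The routine name scale_add and its arguments are hypothetical and chosen only for illustration; the one OpenMP-specific line is the hand-inserted directive, and no thread is created or assigned work by the user.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example routine (not from the article): y[i] = a*x[i] + y[i].
 * The loop is a counted loop, the C counterpart of a Fortran do-construct,
 * and every iteration is independent of the others. */
void scale_add(int n, double a, const double *x, double *y)
{
    assert(n >= 0 && x != NULL && y != NULL);

    /* The single hand-inserted directive. With OpenMP enabled (for example
     * gcc -fopenmp) the iterations are divided among threads; without it
     * the pragma is ignored and the loop simply runs serially. */
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Whether such a loop actually runs faster depends on the amount of work per iteration and on the hardware; the directive only permits parallel execution.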
OpenMP SUPPORTS ONLY PART OF THE BASE-LANGUAGE CONSTRUCTS
Parallel computing with OpenMP requires hand-inserted directives; in effect, OpenMP turns the compiler into a manual parallelizer. An auto-parallelizer is available with most compilers, so why do we need a manual one? One reason is that the auto-parallelizer is ineffective. That ineffectiveness is not entirely the fault of the auto-parallelizer, because the program itself also contributes to it. Indeed, an auto-parallelizer cannot compete with a hand-coded parallel program. How effective a manual parallelizer can be has no definite answer. From time to time the same question comes up: why does a program not speed up after OpenMP directives are inserted into it? Like an auto-parallelizer, a manual parallelizer with OpenMP does not guarantee a speedup. This article does not compare the manual parallelizer with the auto-parallelizer; it addresses issues that could push skilled users away from OpenMP if they have another choice.
OpenMP's interfaces help the compiler parallelize base-language constructs. For example, most OpenMP applications rely heavily on parallelizing the do-construct. An underlying question is whether OpenMP supports all base-language constructs. The answer is certainly NO. With OpenMP, only a portion of the base language can be used for parallelism.
OpenMP HAS A DEFINITION THAT MAY CONTRADICT THE BASE LANGUAGE
Some base-language constructs are not supported by OpenMP, for example the do while construct.
In Fortran,
    do while (logical expression)
       :
       :
    end do
In C,
    while (logical expression) { ...... }
OpenMP does not have a compiler directive to parallelize the do while construct [Note: a do while construct can be parallelized]. This is one example of what OpenMP does not support. When a manual parallelizer is applied with OpenMP, the parallelism must fit the constructs OpenMP preselects, and use of the base language is restricted. Would you rather enjoy the freedom of a parallel programming tool other than OpenMP, or restrict your use of the base language? Users at different levels may choose differently.
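As a hedged illustration of how a while loop has to be fitted into what OpenMP does provide, the sketch below uses the task construct that was added in OpenMP 3.0. No directive is applied to the while statement itself: one thread still walks the loop sequentially, and only the loop body is handed out as tasks. The node type and the process and process_list routines are hypothetical names introduced for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node, introduced only for this illustration. */
typedef struct node {
    double value;
    double result;
    struct node *next;
} node;

/* Independent work on one node. */
static void process(node *p)
{
    assert(p != NULL);
    p->result = p->value * p->value;
}

void process_list(node *head)
{
    #pragma omp parallel
    {
        /* One thread executes the while loop; the others wait for tasks. */
        #pragma omp single
        {
            node *p = head;
            while (p != NULL) {
                /* Each body becomes a task with its own copy of p. */
                #pragma omp task firstprivate(p)
                process(p);
                p = p->next;
            }
        }
        /* The implicit barrier at the end of the single construct
         * guarantees that all tasks are finished at this point. */
    }
}
```

This works only when the loop bodies are independent of each other and the loop condition does not depend on their results, which is a restriction on how the base-language loop may be written.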
OpenMP helps a compiler reconstruct a program into parallel form. For a safe reconstruction, OpenMP sets forth its own definitions. Unfortunately, some of OpenMP's definitions contradict the base language. For example, the OpenMP Application Program Interface defines a variable as
"A named data storage block, whose value can be defined and redefined during the execution of a program. Array sections and substrings are not considered variables." (Version 3.0 May 2008).
In Fortran, array sections can be used as variables, for example in the so-called dynamic distribution of memory: a caller can pass a section of an array to a dummy argument of a subroutine. OpenMP's definition therefore contradicts Fortran. Why does OpenMP need a definition that contradicts the base language? One reason is safe reconstruction. When a parallelizable region is reconstructed, variables within the region can be declared shared or private. What happens when a variable is declared private? According to the OpenMP Application Program Interface,
"For each private variable referenced in the structure block, a new version of the original variable (of the same type and size) is created in memory for each task that contains code associated with the directives."
A new version of the original variable is thus created for each private variable. However, a dummy argument is not required to declare the full extent of its dimensions. A dummy variable can be declared, for example, as REAL :: A(1,1), where the dummy variable A is a sky-line sparse matrix. If an array section is passed to the dummy variable, OpenMP has no safe way to create a new version of the dummy variable of the same size. OpenMP has no other choice, so an array section cannot be considered a variable. When applying a manual parallelizer with OpenMP, users may need to change their programming practice, and they bear the burden of distinguishing OpenMP's definitions from those of the base language.
MISDIRECTING PARALLEL COMPUTING
As mentioned previously, OpenMP provides simple and easy interfaces for users to explore parallel computing. Users identify parallelizable constructs so that the compiler can reconstruct the program into parallel form. That misdirects users into believing that identifying parallelizable constructs in a program is parallel computing. Most OpenMP applications are misdirected into parallelizing do-loops.
CONCLUSION
Parallel computing is not limited to parallelizing do-loops. On the contrary, many efficient parallel applications never parallelize do-loops. For example, the benchmark that solves a system of band equations is programmed with an asynchronous parallel algorithm; it can be downloaded at the following link: http://www.equation.com/servlet/equation.cmd?fa=laipebenchmark The performance of the asynchronous parallelism is worth seeing. The following is a run of the benchmark (compiled with gfortran) on an ACER laptop with an AMD X2, model 5420-5038:
C:\temp>bench1_gfortran
number of equations: 2000000
Half bandwidth: 8

Processor: 1
Elapsed Time (Seconds): 8.64
CPU Time in User Mode (Seconds): 8.52
CPU Time in Kernel Mode (Seconds): 0.08
Total CPU Time (Seconds): 8.60

Processors: 2
Elapsed Time (Seconds): 4.32
CPU Time in User Mode (Seconds): 8.53
CPU Time in Kernel Mode (Seconds): 0.11
Total CPU Time (Seconds): 8.64

Elapsed time is cut in half when two cores are employed. This is a highly efficient parallel program, and no special hardware is needed to achieve this speedup. Note that the timing above is for a complete solution of a system of equations, not just for a parallelized do-loop. If we apply OpenMP to parallelize do-loops in a solver of system equations, what kind of speedup can be achieved is an open question. Parallel computing should emphasize the development of parallel algorithms, not the identification of parallelizable segments in a program for the purpose of reconstruction.
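The statement that elapsed time is cut in half can be put into numbers with the usual definitions of speedup (one-processor elapsed time divided by p-processor elapsed time) and parallel efficiency (speedup divided by p). The helper names below are hypothetical; the inputs in the usage note are the elapsed times reported above.

```c
#include <assert.h>

/* Speedup: elapsed time on one processor over elapsed time on p processors. */
double speedup(double t_one, double t_p)
{
    assert(t_one > 0.0 && t_p > 0.0);
    return t_one / t_p;
}

/* Parallel efficiency: speedup divided by the number of processors used. */
double efficiency(double t_one, double t_p, int procs)
{
    assert(procs > 0);
    return speedup(t_one, t_p) / (double)procs;
}
```

With the elapsed times from the run above, speedup(8.64, 4.32) is 2 and efficiency(8.64, 4.32, 2) is 1, that is, 100 percent on two cores. The total CPU time rises only from 8.60 to 8.64 seconds, so the second core adds almost no extra work.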
OpenMP helps a compiler reconstruct a program into parallel form. The biggest question is how much speed can be gained by reconstructing a sequential program into a parallel one. There is a golden principle: once a program is written with a sequential algorithm, the freedom to reconstruct it into parallel form is limited by that sequential algorithm. That explains why the auto-parallelizer is ineffective. When applying a manual parallelizer with OpenMP, users need to insert compiler directives that expose as much parallelism as possible; otherwise the manual parallelizer cannot speed up the program. Clearly, the key issue is the parallel algorithm, whether the tool is an auto-parallelizer or a manual parallelizer.
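A small sketch shows how a sequential algorithm limits reconstruction. The routine below is a hypothetical first-order recurrence, not code from the benchmark. Each iteration needs the value produced by the previous one, so there is no independent work for a directive to distribute.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical recurrence: x[i] = a * x[i-1] + b[i], with x[0] given. */
void recurrence(int n, double a, const double *b, double *x)
{
    assert(n >= 1 && b != NULL && x != NULL);

    /* No OpenMP directive is placed here on purpose. Iteration i reads
     * x[i-1], which iteration i-1 writes (a loop-carried dependence).
     * Putting "#pragma omp parallel for" on this loop would let a thread
     * read values that another thread has not computed yet. */
    for (int i = 1; i < n; ++i) {
        x[i] = a * x[i - 1] + b[i];
    }
}
```

Speeding up a computation of this kind means replacing the algorithm with a parallel formulation rather than inserting a directive into the existing loop.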
It is understandable that an auto-parallelizer or a manual parallelizer works if an efficient parallel algorithm is coded in the program. However, if a parallel algorithm is available, skilled users can develop a hand-coded parallel program directly. Who needs a manual parallelizer that limits their programming freedom to the set of constructs OpenMP preselects?