Programming Language is Not a Drug to Treat Poor Parallelism

[Posted by Jenn-Ching Luo on Feb. 14, 2012 ]

        In multicore computing, we usually expect more cores to speed up an application. However, that is not always true. Every parallelism has its limit, beyond which more cores not only fail to speed up the computation but make it slower.

        In the post "Parallelizing Loops on the Face of a Program is not enough for Multicore Computing", we saw a poor parallelism with which OpenMP could not speed up the optimized code beyond three cores. That example shows that a poor parallelism can have a limit as low as three cores; beyond that limit, more cores cannot further improve the speed.

        Furthermore, experience shows that the limit of a parallelism does not vary with the programming language. If we implement the same parallelism in different programming languages, the limit remains the same. Poor parallelism leads to poor performance regardless of the programming language; a programming language is not a drug to treat poor parallelism.

        We are going to see the limit of the poor example when it is implemented with different programming languages.

THE LIMIT WITH OPENMP

        First, we see the limit with OpenMP. The following OpenMP timing results are copied from the post "Parallelizing Loops on the Face of a Program is not enough for Multicore Computing":

[With Option -O3]
number of cores	elapsed time (sec.)	speedup	efficiency (%)
1	44.44	1.00	100.00
2	25.44	1.75	87.34
3	20.23	2.20	73.22
4	21.96	2.02	50.59
5	22.73	1.96	39.10
6	26.38	1.68	28.08
7	29.25	1.52	21.70
8	29.97	1.48	18.53

From the above results, we can see that in the range of one to three cores the elapsed time decreases as the number of cores increases. Beyond three cores, more cores not only fail to speed up the computation but make it slower. With this poor parallelism, OpenMP has no way to speed up the optimized code beyond three cores: the useful range is limited to three cores.
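The speedup and efficiency columns can be recomputed from the elapsed times. The sketch below assumes the usual definitions, which the post does not state explicitly: speedup is the one-core time divided by the n-core time, and efficiency is the speedup divided by the number of cores, as a percentage. The function name `speedup_table` is made up for illustration. Because the printed times are rounded to two decimals, the recomputed values may differ from the table in the last digit.

```python
# Elapsed times (sec.) copied from the OpenMP -O3 table above.
OPENMP_TIMES = {1: 44.44, 2: 25.44, 3: 20.23, 4: 21.96,
                5: 22.73, 6: 26.38, 7: 29.25, 8: 29.97}


def speedup_table(times):
    """Return {cores: (speedup, efficiency %)} relative to the one-core time.

    Assumed definitions: speedup = T(1) / T(n), efficiency = 100 * speedup / n.
    """
    base = times[1]
    table = {}
    for cores, elapsed in sorted(times.items()):
        speedup = base / elapsed
        efficiency = 100.0 * speedup / cores
        table[cores] = (speedup, efficiency)
    return table


if __name__ == "__main__":
    print("cores\telapsed\tspeedup\tefficiency (%)")
    for cores, (speedup, efficiency) in speedup_table(OPENMP_TIMES).items():
        print(f"{cores}\t{OPENMP_TIMES[cores]:.2f}\t{speedup:.2f}\t{efficiency:.2f}")
```

The same function can be applied to the elapsed times of any other run, as long as the dictionary contains a one-core entry to serve as the baseline.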

THE LIMIT WITH NEULOOP

Next, we are going to see the limit with neuLoop.

We parallelize the example program with neuLoop in the same environment as OpenMP, with the same option -O3. As introduced in a previous post, neuLoop has homogeneous and heterogeneous cores. First, we link against heterogeneous cores. The timing results are as follows:

[Heterogeneous cores]
number of cores	elapsed time (sec.)	speedup	efficiency (%)
1	45.49	1.00	100.00
2	25.88	1.76	87.89
3	19.58	2.32	77.44
4	20.11	2.26	56.55
5	22.26	2.04	40.87
6	26.85	1.69	28.23
7	28.19	1.61	23.05
8	30.95	1.47	18.37

Consistent with the OpenMP implementation, neuLoop cannot speed up the optimized code beyond three cores; the parallelism limits neuLoop to three cores as well. With four or more cores, the elapsed time is not reduced but increases. For example, four cores take 20.11 seconds, five cores take 22.26 seconds, and eight cores take 30.95 seconds. We would expect more cores to shorten the elapsed time, but the poor parallelism limits neuLoop's improvement to three cores.

Next, we link the example program against homogeneous cores, and get the following timing results:

[Homogeneous cores]
number of cores	elapsed time (sec.)	speedup	efficiency (%)
1	45.46	1.00	100.00
2	24.12	1.88	94.24
3	18.97	2.40	79.88
4	21.14	2.15	53.76
5	23.34	1.95	38.95
6	25.57	1.78	29.63
7	26.04	1.75	24.94
8	26.80	1.70	21.20

There is no surprise. Homogeneous cores show the same result: the poor parallelism limits neuLoop's improvement to three cores.
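To put the three sets of measurements side by side, the following sketch finds the core count with the smallest elapsed time in each table. The data are copied from this post; the helper name `best_core_count` is hypothetical and only serves the illustration.

```python
# Elapsed times (sec.) for 1..8 cores, copied from the three tables in this post.
TIMES = {
    "OpenMP": [44.44, 25.44, 20.23, 21.96, 22.73, 26.38, 29.25, 29.97],
    "neuLoop (heterogeneous)": [45.49, 25.88, 19.58, 20.11, 22.26, 26.85, 28.19, 30.95],
    "neuLoop (homogeneous)": [45.46, 24.12, 18.97, 21.14, 23.34, 25.57, 26.04, 26.80],
}


def best_core_count(elapsed):
    """Return the 1-based core count whose elapsed time is the smallest."""
    fastest_index = min(range(len(elapsed)), key=elapsed.__getitem__)
    return fastest_index + 1


if __name__ == "__main__":
    for name, elapsed in TIMES.items():
        print(f"{name}: fastest at {best_core_count(elapsed)} cores")
```

For all three data sets the fastest run is the three-core run, which is the limit discussed above.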

WHAT DECIDES PERFORMANCE?

We know the answer: it is not the programming language but the parallelism. A parallel language cannot treat a poor parallelism. The key to parallel computing is to develop an efficient parallelism that allows more cores to speed up the computation.