
Is Co-Array the Future of Parallel Computing?


[Posted by Jenn-Ching Luo on Nov. 08, 2008]
[A follow-up link appended Jan. 02, 2009]


        In May 2005, the ISO Fortran Committee decided to include co-arrays in the Fortran Standard. Recently, there have been concerns about whether "co-array" is the future of parallel computing, and about why the Fortran Committee made co-array part of the Standard.

        The co-array official website, www.co-array.org, introduces co-array as follows:

"Co-array Fortran is a small extension to Fortran 95. It is a simple, explicit notation for data decomposition, such as that often used in message-passing models, expressed in a natural Fortran-like syntax. The syntax is architecture-independent and may be implemented not only on distributed memory machines but also on shared memory machines and even on clustered machines."

Co-array sets a beautiful goal: architecture independence. However, every architecture has its own distinctive features and advantages. If we attempt to use one syntax to cover all architectures, we may lose the distinctive benefits each architecture provides.

        Based on the introduction on the co-array official website, co-array is by nature a message-passing model in Fortran syntax. In that syntax, a message is passed by a direct memory copy. Copying memory from one cooperating task to another makes sense on a message-passing machine. On a memory-sharing machine, for example a multi-core computer, the distinctive feature is that cooperating threads can access memory directly. Why bother passing a message from one thread to another, as co-array suggests? Co-array never answers that question, so whether co-array is the future of parallel computing on memory-sharing machines is in doubt.
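To make the contrast concrete, here is a minimal coarray sketch (a hypothetical example, not from the post): moving data between images is written as a co-indexed assignment, which is a memory copy, whereas threads on a memory-sharing machine could simply read the one shared copy without any such transfer.

    ! Minimal sketch (hypothetical example): each image owns its own x, and
    ! "passing a message" between images is written as a co-indexed
    ! assignment, i.e. a memory copy from one image to another.
    program coarray_copy_sketch
      implicit none
      real :: x[*]                      ! one copy of x on every image

      x = real(this_image())            ! each image defines its own copy
      sync all                          ! make sure every image's x is defined

      if (this_image() == 1 .and. num_images() >= 2) then
         x = x[2]                       ! copy image 2's x into image 1's x
      end if
      sync all

      print *, 'image', this_image(), ': x =', x
    end program coarray_copy_sketch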

CONCERN 1

        First, let us check how co-array creates cooperating tasks. The co-array official website states:
"First, consider work distribution. A single program is replicated a fixed number of times, each replication having its own set of data objects. Each replication of the program is called an image".

An image, a cooperating task for parallel computing, is a process. On a message-passing machine, a process is the natural choice. However, on a memory-sharing machine we have both threads and processes. Which one is better for parallel computing?

        Computing hardware did not have much support for multi-threading in the early days of parallel computing. In those days we had no choice but to use processes to speed up computing. Processes cannot access common data without extra system calls or interprocess communication. Threads can directly access data within a process, and so provide a more efficient way for cooperating tasks to share data. The thread has proved itself a replacement for the process in parallel computing. Now co-array suggests that we not use "cheap" threads but switch back to "expensive" processes. It is uncertain whether co-array is the future of parallel computing, but implementing co-array on a memory-sharing machine brings us back to pre-thread parallel computing. The future of parallel computing should look forward, not turn back.

CONCERN 2

        Second, let us consider memory usage. For example, co-array declares a matrix [K] of order N-by-N as:
       REAL, DIMENSION(N,N)[*] :: K
Each image has its own local space for matrix [K]. If we have M images, co-array requires a total of M*N*N memory locations to replicate matrix [K]. If we instead use threads to directly access matrix [K], the application requires only N*N memory locations. The difference in memory demand between co-array and direct memory access is huge.
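For contrast, below is a minimal sketch of the thread-based alternative described above, assuming OpenMP as the threading model (the post does not name a specific one). Matrix [K] is allocated once, and every thread works directly on the same N*N storage, so the memory demand does not grow with the number of threads.

    ! Minimal sketch, assuming OpenMP (not specified in the post): one copy
    ! of [K] is allocated, and all threads directly access that same N*N
    ! storage, so memory stays at N*N regardless of the thread count.
    program shared_k_sketch
      use omp_lib
      implicit none
      integer, parameter :: n = 2000
      real, allocatable  :: k(:,:)
      integer :: j

      allocate(k(n,n))                 ! the only copy of [K]
      k = 0.0

      !$omp parallel do                ! threads split the columns of that copy
      do j = 1, n
         k(:,j) = k(:,j) + real(j)
      end do
      !$omp end parallel do

      print *, 'threads available:', omp_get_max_threads(), &
               '  storage for K:', n, 'x', n, 'reals'
      deallocate(k)
    end program shared_k_sketch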

        Because co-array replicates everything, it is not a good model for memory-sharing machines. The following simple example illustrates this. Suppose matrix [K] requires 1 gigabyte of memory, and we have an 8-core computer with 2 gigabytes of RAM. Also assume the computer has a 2-gigabyte paging file, so the computer has a total of 4 gigabytes of memory space.
  1. When implementing co-array with one image, matrix [K] requires 1 gigabyte of space. Memory space is sufficient, and no problem occurs.
  2. Because co-array replicates everything, each additional image requires an additional 1 gigabyte of space for its copy of matrix [K]. When implementing co-array with 2 images, it takes 2 gigabytes of space to hold the copies of matrix [K]. Memory space is still sufficient, and no problem occurs, but the memory demand is clearly going up.
  3. When implementing co-array with 8 images, the memory demand for replicating matrix [K] goes up to 8 gigabytes. Because the computer is assumed to have only 4 gigabytes of space, it crashes for lack of memory.
The above uses only matrix [K] to demonstrate the memory demand; replicating an image also consumes additional memory. The illustration shows that co-array may not fully use the hardware: on this 8-core computer, co-array can employ at most 4 cores, as the sketch below shows. Certainly, that is not a good model for parallel computing. A good model should allow the hardware to reach its maximal efficiency.
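The arithmetic behind the illustration can be sketched as follows (the 1-gigabyte matrix and the 4-gigabyte total of RAM plus paging file are the figures assumed above):

    ! Sketch of the memory arithmetic above: demand grows linearly with the
    ! number of images, so only 4 of the 8 cores fit in the assumed 4 GB.
    program memory_demand_sketch
      implicit none
      real, parameter :: matrix_gb = 1.0   ! space for one copy of [K]
      real, parameter :: limit_gb  = 4.0   ! 2 GB RAM + 2 GB paging file
      integer :: m

      do m = 1, 8
         print '(a,i2,a,f5.1,a,l2)', ' images:', m,  &
               '   demand (GB):', m*matrix_gb,       &
               '   fits in 4 GB:', (m*matrix_gb <= limit_gb)
      end do
    end program memory_demand_sketch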

        On the contrary, if we use threads to directly access matrix [K], which takes only 1 gigabyte, we can fully employ the 8 cores in this example. Threads with direct memory access are therefore a better model than co-array. This simple illustration shows that replicating everything, as co-array suggests, is not a good model for a memory-sharing machine. Co-array may be good for toy examples on memory-sharing machines, but the purpose of parallel computing is to speed up real work, especially large-scale problems, in which a single matrix may require more than 1 gigabyte of space. We are not solving toy problems in parallel computing, and the future of parallel computing certainly does not lie in toy examples.

        The illustration continues. If we implement co-array on 32 cores, the replicated matrix takes 32 gigabytes. If we implement it on 128 cores, co-array replicates matrix [K] into 128 gigabytes. If the application has 5 other matrices of the same size, an implementation of co-array requires a further 5x128=640 gigabytes of memory. Co-array, by replicating everything, may demand an unreasonable amount of memory on a memory-sharing machine. It is uncertain whether co-array is the future of parallel computing, but it can be imagined that replicating everything imposes real limitations on memory-sharing computers.

      A follow-up post is available here.