Equation Solution  
    High Performance by Design

Follow-Up (I):
Is Co-Array the Future of Parallel Computing?


[Posted by Jenn-Ching Luo on Jan. 02, 2009 ]

        On November 8, 2008, I raised two concerns regarding co-array in the initial post. Afterwards, the article was discussed on the newsgroup comp.lang.fortran. I did not participate in the newsgroup discussion. The discussion raised two points that were not in the initial post: one is to implement an image as a thread, and the other is to decompose an array and distribute it to images. This follow-up post addresses both.

FOLLOW-UP CONCERN 1

        The first concern in the initial post is how co-array creates cooperating tasks for parallel processing. I cited the definition of image given on the co-array website (www.co-array.org):

"First, consider work distribution. A single program is replicated a fixed number of times, each replication having its own set of data objects. Each replication of the program is called an image".

Under the most natural reading of this definition, an image is a process, because the definition says each image has its own set of data objects, whereas threads share data objects. The initial post asked why co-array suggests that we abandon cheap threads and go back to expensive processes for parallel processing.

        In the discussion, a co-array supporter mentioned that an image of co-array can be replicated as a thread. I will not argue here whether the definition of image is clear. According to the definition on the (present) co-array website, an image is a replication of a single program. If an image is implemented as a thread, then the entire program must be thread-safe. A Fortran program that uses, for example, SAVE variables or COMMON blocks is not thread-safe. The co-array website should inform users that the program must be completely thread-safe. However, I could not find such a requirement on the (present) co-array website.

        If an image is a thread, co-array is not even worth implementing. Co-array shares data by "memory copy", while a thread can access memory directly, without spending system resources on a copy. If an image is a thread, images can share data by direct memory access, which is more efficient than the memory copy that co-array provides. There is no point in implementing the less efficient co-array if an image is a thread. Furthermore, implementing co-array is "relatively" troublesome. A simple example illustrates this. Assume we have a parallel program in which a matrix aw, of dimension (8,8), is declared as
  real :: aw(8,8)
and assume the program is executed by 4 threads. With direct memory access, Thread 1 can efficiently calculate element aw(1,1) with the following statement:
  aw(1,1) = aw(1,1)+aw(1,3)
It is easily done. With direct memory access, the computation is trouble-free. Now consider the same computation with co-array, to see how much trouble it involves. As suggested by co-array, we map the matrix aw of dimension (8,8) onto 4 images with the following declaration:
  real :: a(8,2)[4]
Each image has two columns of matrix aw. [Assume Image 1 is Thread 1, and so on.] Let us examine how much trouble it takes to calculate aw(1,1)=aw(1,1)+aw(1,3) with co-array:
  • First, Image 1 must determine which image has element aw(1,3). Since each image has 2 columns, Image 2 has aw(1,3).
  • Second, Image 1 needs to determine which local column in Image 2 has element aw(1,3). Local column 1 has aw(1,3).
  • Third, Image 1 makes a local copy of a(1,1)[2] in a variable, temp. For example,
            temp = a(1,1)[2]
    where a(1,1)[2] is aw(1,3).
  • Fourth, Image 1 calculates a(1,1)=a(1,1)+temp which is equivalent to aw(1,1)=aw(1,1)+aw(1,3).
Implementing co-array is too much trouble: we need to determine which image and which local column hold the element aw(1,3). With direct memory access, a thread can calculate aw(1,1)=aw(1,1)+aw(1,3) more efficiently and without that trouble. This simple illustration shows that co-array offers no benefit here but creates a headache.
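
The lookup in the first two steps can be sketched in plain Fortran, with no co-array features. This is only an illustration, assuming a column-block distribution in which each image holds cols_per_image consecutive columns of aw. The module name block_map and the functions owner_image and local_column are hypothetical names introduced for this sketch.

```fortran
module block_map
  implicit none
contains
  ! Image that holds global column j when each image stores
  ! cols_per_image consecutive columns (assumed distribution).
  pure integer function owner_image(j, cols_per_image)
    integer, intent(in) :: j, cols_per_image
    owner_image = (j - 1) / cols_per_image + 1
  end function owner_image

  ! Local column index of global column j inside the image that holds it.
  pure integer function local_column(j, cols_per_image)
    integer, intent(in) :: j, cols_per_image
    local_column = mod(j - 1, cols_per_image) + 1
  end function local_column
end module block_map

program locate_element
  use block_map
  implicit none
  integer, parameter :: cols_per_image = 8 / 4   ! aw(8,8) mapped onto 4 images
  integer :: j

  j = 3                                          ! global column of aw(1,3)
  print '(a,i0,a,i0,a,i0,a)', 'aw(1,', j, ') is a(1,', &
        local_column(j, cols_per_image), ')[', owner_image(j, cols_per_image), ']'
end program locate_element
```

For j = 3 the program prints aw(1,3) is a(1,1)[2], which matches the third step above. With direct memory access none of this bookkeeping is needed.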

        As stated in the initial post, it is uncertain whether co-array is the future of parallel computing. Co-array has never demonstrated what benefit we would gain if we stopped using present parallel models and switched to co-array.

FOLLOW-UP CONCERN 2

        In the initial post, I used
   REAL, DIMENSION(N,N)[*] :: K
on M images as an example to demonstrate that co-array may demand an unreasonable amount of memory. A co-array supporter mentioned distributing matrix K to cooperating tasks with, for example, the following statement
   REAL, DIMENSION(N,I)[M] :: K
where I=N/M, so that when the program is replicated M times, the total memory requested for matrix K remains (N,N). His comment assumes that an ideal data decomposition for each image is possible. Generally, data cannot be arbitrarily decomposed for each image. Decomposition depends on the nature of the problem and the algorithm, and some algorithms may make a data decomposition very difficult or even impossible. This post does not take up a specific problem to illustrate the issue.
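
The memory arithmetic behind the two declarations can be checked with a short sketch in plain Fortran. The sizes N=1000 and M=4 are assumed only for illustration, and the module footprint with its functions replicated_total and distributed_total is hypothetical, not taken from the newsgroup discussion.

```fortran
module footprint
  implicit none
contains
  ! Total elements when each of m images holds a full K(n,n),
  ! as in REAL, DIMENSION(N,N)[*] :: K on M images.
  pure integer function replicated_total(n, m)
    integer, intent(in) :: n, m
    replicated_total = m * n * n
  end function replicated_total

  ! Total elements when each of m images holds K(n, n/m),
  ! as in REAL, DIMENSION(N,I)[M] :: K. n/m is integer division, like I=N/M.
  pure integer function distributed_total(n, m)
    integer, intent(in) :: n, m
    distributed_total = m * n * (n / m)
  end function distributed_total
end module footprint

program compare_footprint
  use footprint
  implicit none
  integer, parameter :: n = 1000, m = 4   ! assumed sizes, for illustration only

  print '(a,i0)', 'K(N,N)[*] on M images: ', replicated_total(n, m)
  print '(a,i0)', 'K(N,I)[M] with I=N/M : ', distributed_total(n, m)
  print '(a,i0)', 'one K(N,N)           : ', n * n
end program compare_footprint
```

With these sizes the replicated form needs 4000000 elements in total and the distributed form 1000000, the same as a single K(N,N). The saving holds only when the ideal decomposition exists. Even the column split requires M to divide N: with N=10 and M=4, I=N/M is 2 and the images together hold only 8 of the 10 columns.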

        The co-array supporter mentioned that co-array is implemented in order to distribute data to images. Distributing data to cooperating tasks is necessary and important in message-passing environments. On a memory-sharing machine, however, all cooperating tasks can access memory directly, so distributing data to tasks is unnecessary. Co-array suggests that we distribute data from one task to another, which is not only unnecessary on memory-sharing machines but also complicates the computation. For example, assume a matrix aw, of dimension (4,4), is declared as:
  real :: aw(4,4)
The matrix aw is referred to as

(Reference A):

/                                           \
|   aw(1,1)   aw(1,2)   aw(1,3)   aw(1,4)   |
|   aw(2,1)   aw(2,2)   aw(2,3)   aw(2,4)   |
|   aw(3,1)   aw(3,2)   aw(3,3)   aw(3,4)   |
|   aw(4,1)   aw(4,2)   aw(4,3)   aw(4,4)   |
\                                           /

The following declaration distributes matrix aw, of dimension (4,4), to 2 images, as suggested by co-array:
   real :: a(4,2)[2]
Each image has 2 columns of matrix aw. Then, matrix aw in co-array becomes:

(Reference B):

/                                                   \
|   a(1,1)[1]   a(1,2)[1]   a(1,1)[2]   a(1,2)[2]   |
|   a(2,1)[1]   a(2,2)[1]   a(2,1)[2]   a(2,2)[2]   |
|   a(3,1)[1]   a(3,2)[1]   a(3,1)[2]   a(3,2)[2]   |
|   a(4,1)[1]   a(4,2)[1]   a(4,1)[2]   a(4,2)[2]   |
\                                                   /

Reference B is in co-array. Now compare Reference A with Reference B. On a memory-sharing machine, all cooperating tasks can access the data in Reference A, while in Reference B, Image 1 directly accesses only the first 2 columns. When Image 1 attempts to access the third and fourth columns, which are in Image 2, Image 1 needs to make a local copy from Image 2. Reference B in co-array is less efficient than Reference A. No rational person would believe Reference B in co-array is better than Reference A. Distribution of data to cooperating tasks, as suggested by co-array, not only shows no advantage but also complicates the matter:
  • Comparing Reference A with Reference B, it can be seen that co-array changes a natural notation, for example, aw(1,3), into a strange and complex notation, e.g., a(1,1)[2].
  • Co-array creates the burden of determining which image and which local index hold a specific element.
  • Since cooperating tasks can directly access memory on a memory-sharing machine, co-array wastes system resources on unnecessary work, i.e., copying data from one task to another.
Co-array has never demonstrated what benefit we would receive from implementing co-array. The above discussion is limited to memory-sharing machines.