Equation Solution  
    High Performance by Design

Cache Coherency: Parallel Computing

[Posted by Jenn-Ching Luo on Aug. 21, 2008 ]

        Instruction and data caching in a processor improves execution speed. This page does not explain how caching is performed. On multiprocessor systems, when one processor writes to main memory, copies of that memory cached in other processors are invalidated. Cache coherency consumes extra CPU time to invalidate cached copies and reload memory, and it therefore degrades parallel performance.

        Windows collects CPU time on all processors (or cores), and the total CPU time can reveal the cost of maintaining cache coherency. This page uses timing results collected on Windows to show the negative effect of cache coherency on parallel processing.


        The example applies the parallel band solver, decompose_CSP_4, of LAIPE to decompose a symmetric band matrix of order (100,000x100,000) with a half bandwidth of 600 on a dual-core computer. The parallel program is compiled with Intel Fortran. A timing result is as follows:

Processor: 1
      Elapsed Time (Seconds): 36.44
      CPU Time in User Mode (Seconds): 36.16
      CPU Time in Kernel Mode (Seconds): 0.08
      Total CPU Time (Seconds): 36.23

Processors: 2
      Elapsed Time (Seconds): 32.84
      CPU Time in User Mode (Seconds): 65.33
      CPU Time in Kernel Mode (Seconds): 0.11
      Total CPU Time (Seconds): 65.44

Based on the above result, when one core was employed it took 36.23 seconds of CPU time to decompose the band matrix. In other words, 36.23 seconds of CPU time is sufficient to decompose the matrix.

        When two cores were employed, the machine had the additional burden of maintaining cache coherency. Under this circumstance, the machine took a total of 65.44 seconds of CPU time to complete the example, which is 29.21 seconds more than decomposing the matrix requires. What accounts for the extra 29.21 seconds? They were spent maintaining cache coherency.

        In this example, cache coherency is very costly: the extra 29.21 seconds are about 29.21/36.23 = 81% of the cost of decomposing the matrix. This is close to a worst case, in which the machine spends almost as much time maintaining cache coherency as it does decomposing the matrix.


        Maintaining cache coherency consumes system resources. Because the machine must spend CPU time on cache coherency, we need to determine the best speedup that can be achieved after the machine pays that cost.

        We define an ultimate speedup, Un, as the best speedup an application is allowed to achieve after the machine pays for maintaining cache coherency. The ideal value of the relative speedup Sn, for n processors, is n; the ultimate speedup Un is less than n. The following shows how to determine Un. Relative speedup is measured as

                        Sn = T1 / Tn


where
            Sn = relative speedup
            T1 = elapsed time with one processor
            Tn = elapsed time with n processors

Elapsed time, Tn, can also be written as

                        Tn = α (T1 + Tc) / n

where Tc is the CPU time spent maintaining cache coherency, and α is a coefficient not less than one. The smaller α is, the better the parallelism. A perfect α of one would mean perfect parallelism; a real problem never achieves it, so α is always greater than one. The speedup can then be rewritten as:

                        Sn = (n/α)(1/(1+Tc/T1))

Since cache coherency cannot be avoided, Tc is always greater than zero, which degrades parallel performance: as Tc goes up, the speedup Sn goes down. The parameter α is not less than one, and equals one for a perfect parallel program. We define Un = Sn when α = 1, so the ultimate speedup Un is written as:

                        Un = n/(1+Tc/T1)

In the above example,

            T1 = 36.44
            Tc = 29.21
            n = 2

We obtain the ultimate speedup Un = 1.11. After the machine pays for maintaining cache coherency, the best speedup it allows this example to achieve is only 1.11, far below the ideal speedup of 2. The poor performance arises because the machine spends too much time maintaining cache coherency, which significantly degrades parallel performance in this example.

        The above reasoning can be verified by the timing result. The actual speedup on two cores is 36.44/32.84 = 1.1096, which is very close to the ultimate value of 1.11, the machine's limit. This poor performance comes from maintaining cache coherency.

        Cache coherency is invisible in the source code; it depends on the actual instruction and data caching in the processors at run time, so we cannot predict when it will hurt parallel performance. How the compiler uses the cache to optimize the code also plays an important role.


        Intense memory usage does not necessarily incur a high cost for maintaining cache coherency. We can use an example to illustrate this. The previous example used 4-byte variables; here we change the variables to 16 bytes (i.e., quad precision) to check whether heavier memory traffic requires a higher cost for maintaining cache coherency. A timing result is as follows:

Processor: 1
      Elapsed Time (Seconds): 169.11
      CPU Time in User Mode (Seconds): 168.25
      CPU Time in Kernel Mode (Seconds): 0.02
      Total CPU Time (Seconds): 168.27

Processors: 2
      Elapsed Time (Seconds): 84.81
      CPU Time in User Mode (Seconds): 168.89
      CPU Time in Kernel Mode (Seconds): 0.00
      Total CPU Time (Seconds): 168.89

        In this example, the matrix is of order 5000x5000 with a half bandwidth of 600. The cost of maintaining cache coherency is only 168.89-168.27 = 0.62 seconds, which is about 0.62/168.27 = 0.37% of the cost of decomposing the matrix. The extra cost of cache coherency is negligible in this example, even with intense memory usage. It can be seen that intense memory usage does not necessarily require a higher cost for maintaining cache coherency.

        From the above timing result, the ultimate speedup the machine allows this example to achieve is 1.99, while the ultimate speedup in the first example was only 1.11. It is surprising that the same machine shows such different performance on the same parallel program when the variable size is the only change. The actual speedup in this example is 169.11/84.81 = 1.99, almost perfect, which reaches the limit the machine allows.

        Cache coherency can kill parallel performance on shared-memory machines, but when the machine does not pay a significant cost to maintain cache coherency, great parallel performance can be expected. Hopefully, machines with a lighter cache-coherency burden will come out soon.