Kernel CG is a benchmark that solves a system of linear equations with the conjugate gradient method. Almost all of the computing time in the CG algorithm is consumed by matrix-vector multiplication, so fast multiplication is essential for good performance on kernel CG.
I evaluated three data distributions (row, column, and block) by the amount of data each one communicates (see Table 1), and chose the block distribution.
Figure 1: The block matrix distribution over the processors.
Table 1: Total communication volume versus number of processors for matrix-vector multiplication in the CG class A problem
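As a rough illustration of why a block distribution can communicate less, the following Python sketch counts the vector elements exchanged per matrix-vector multiplication under a simple dense model. The model, the function names, and the requirement that p be a perfect square are assumptions made for this sketch and are not taken from Table 1; the real class A matrix is sparse, so the measured volumes will differ.

```python
from math import isqrt


def row_volume(n: int, p: int) -> int:
    """Row distribution: each process owns n/p rows and n/p entries of x.

    Every process must fetch the n - n/p entries of x it does not own,
    so the total over p processes is p * (n - n/p) = n * (p - 1).
    """
    return n * (p - 1)


def column_volume(n: int, p: int) -> int:
    """Column distribution: each process produces a full-length partial y.

    Every process ships the n - n/p partial sums it does not own,
    which again totals n * (p - 1).
    """
    return n * (p - 1)


def block_volume(n: int, p: int) -> int:
    """Block distribution on a q x q process grid (assumes p = q * q).

    Gathering the x segment inside each grid column moves n * (q - 1)
    elements, and summing partial results inside each grid row moves
    another n * (q - 1).
    """
    q = isqrt(p)
    if q * q != p:
        raise ValueError("this sketch assumes p is a perfect square")
    return 2 * n * (q - 1)


if __name__ == "__main__":
    n = 14000  # vector dimension of the class A problem
    for p in (4, 16, 64):
        print(p, row_volume(n, p), column_volume(n, p), block_volume(n, p))
```

Under this model the row and column volumes grow linearly with p, while the block volume grows with the square root of p.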
Next I show the execution time of my program on four Sun SparcStation-2 machines and on up to 64 NEC EWS4800/320VX workstations (MIPS R4400 based).
Table 2: The execution time of CG(Class-A) with four Sun SparcStation-2
The SparcStation-2 is slow relative to 10 Mbps Ethernet, and we see a speedup of up to 3.7 times with four of them. But as Table 3 shows, we cannot get good efficiency with faster machines such as the NEC EWS4800/320VX.
We can calculate the multiprocessor efficiency of workstations connected by an Ethernet-type bus with the following equation.
where w is the multiprocessor efficiency; the other quantities are the execution time of the matrix-vector multiplication with p processors, the execution time with a single processor, the CPU time of each processor's share of the multiplication, the dimension n of the vector, and the communication throughput of the transmission medium.
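One plausible form of such a model is sketched below. The symbols T_1, T_p, t_p, V, and B are labels assumed here for the quantities just described (single-processor time, p-processor time, per-processor CPU time, total communication volume, and throughput), and the additive form assumes that traffic on the shared bus is serialized and not overlapped with computation. It is not a reconstruction of the original equation.

```latex
w = \frac{T_1}{p \, T_p},
\qquad
T_p \approx t_p + \frac{V(n, p)}{B}
```

Here V(n, p) stands for the total volume carried by the medium for vectors of dimension n on p processors.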
Table 3 shows the measured and estimated performance of CG (class A) with up to 64 Ethernet-connected workstations. The estimate is based on a network throughput of 350 KB/s, measured on these workstations. For the smaller configurations, the measured result is much greater than the estimate. For the 10Base-T results in Table 4, however, the estimate is almost exact, so something must be causing such low performance here.
Table 3: The performance of "send overlap" with up to 64 NEC EWS4800/320VX workstations, and estimated performance
Table 4 compares the performance of a 100Base-T connection with that of a 10Base-T connection. In this case I did not have a homogeneous platform, so I used heterogeneous workstations (FreeBSD Pentium 90 and 133MHz, OSF1/Alpha 100MHz, Linux/Alpha 21064A-300MHz). Floating-point format conversion may therefore cost some performance. Despite this drawback, the results show good linear scaling: with four workstations, the improvement is about 3.9 times. From Table 1, the traffic is 85 MB, and Ethernet can carry 1.1 MB/s, so the required communication time is 85/1.1 seconds. The net computation time is then 201 - 85/1.1 = 123 seconds, which equals the measured result for 100Base-T.
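The arithmetic in this paragraph can be checked with a short Python sketch. The function and variable names are hypothetical, and the sketch assumes that 201 s is the measured 10Base-T time for the same configuration and that 123 s is the four-workstation 100Base-T time; the 479 s single-processor time comes from the Table 4 caption.

```python
def net_computation_time(measured_s: float, traffic_mb: float,
                         throughput_mb_per_s: float) -> float:
    """Subtract the estimated communication time from a measured run time."""
    return measured_s - traffic_mb / throughput_mb_per_s


# Figures quoted in the text (units: seconds, MB, MB/s).
comm_s = 85 / 1.1                              # time to carry 85 MB at 1.1 MB/s
net_s = net_computation_time(201, 85, 1.1)     # what remains for computation
speedup = 479 / 123                            # single-processor time / 100Base-T time

if __name__ == "__main__":
    print(f"communication: {comm_s:.1f} s")
    print(f"net computation: {net_s:.1f} s")
    print(f"speedup: {speedup:.2f}")
```

The communication time comes to about 77.3 s and the net computation time to about 123.7 s, which the text gives as 123; the ratio 479/123 is about 3.89, matching the quoted improvement of about 3.9 times.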
Table 4: Execution time with a 100Base-T switching network versus a normal 10Base-T network. The ratio to the single-processor execution time (18 sec. for sample, 479 sec. for class A) is shown in parentheses.
The "psend/precv" functions of PVM provide buffer-bypassing communication when the precv is issued before the corresponding psend. With these functions, the overhead of copying memory to and from communication buffers is reduced, and we get efficient communication throughput.
To achieve higher performance, communication must be overlapped with calculation, which requires non-blocking communication. Because PVM did not provide non-blocking send/receive routines, I first tried to use ordinary process switching for the overlap, but the process-switching overhead was too large to gain any performance. To overlap communication with calculation on PVM, I instead divided the calculation phase and called the send function as soon as part of the results was available.
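The idea of dividing the calculation phase and sending each partial result as soon as it is ready can be sketched as follows. This is a minimal Python illustration, not the PVM code: a thread and a `queue.Queue` stand in for the remote process and the network, `channel.put` stands in for a send call, and all function names and the chunk size are hypothetical.

```python
import queue
import threading


def matvec_rows(matrix, x, start, stop):
    """Dot products for rows start..stop-1 of a dense row-major matrix."""
    return [sum(a * b for a, b in zip(matrix[i], x)) for i in range(start, stop)]


def compute_and_send(matrix, x, chunk_rows, channel):
    """Compute the product chunk by chunk and hand each chunk to the
    channel immediately, instead of sending once after the whole phase."""
    n = len(matrix)
    for start in range(0, n, chunk_rows):
        stop = min(start + chunk_rows, n)
        channel.put((start, matvec_rows(matrix, x, start, stop)))  # stand-in for a send
    channel.put(None)  # end-of-data marker


def receive_all(channel, n):
    """Assemble chunks as they arrive, while the sender keeps computing."""
    y = [0.0] * n
    while (message := channel.get()) is not None:
        start, values = message
        y[start:start + len(values)] = values
    return y


def overlapped_matvec(matrix, x, chunk_rows=2):
    """Run the receiver concurrently with the chunked computation."""
    channel = queue.Queue()
    result = {}
    receiver = threading.Thread(
        target=lambda: result.update(y=receive_all(channel, len(matrix))))
    receiver.start()
    compute_and_send(matrix, x, chunk_rows, channel)
    receiver.join()
    return result["y"]


if __name__ == "__main__":
    print(overlapped_matvec([[1, 2], [3, 4], [5, 6]], [1, 1]))
```

Smaller chunks start the transfer earlier but add per-message overhead, so the chunk size is a tuning parameter in any real implementation.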
It would have been better to use MPI or an equivalent API that provides non-blocking communication, but at that time PVM seemed quite stable, and at least with the 100Base-T switching network we see a satisfactory performance improvement.