Kernel FT is the benchmark of a three-dimensional Fourier transform program. As with kernel CG, I should overlap communication with calculation and transfer data directly into user memory. In the three-dimensional Fourier transform, the transformation along each dimension can be executed as a one-way pipeline, which is not the case in kernel CG. I can therefore set up, ahead of time, the processes that wait for data for the later calculation stages, and make full use of the psend/precv functions. Thus overlapping is more efficient and easier to apply in kernel FT than in kernel CG. During its execution, kernel FT calculates 4 to 6 independent inverse FFTs. I assigned these independent parts to different segments of my network, and this assignment gave very good performance.
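The idea of overlapping can be illustrated with a minimal sketch, under stated assumptions: a thread stands in for the communication side, a naive DFT stands in for the per-row FFT, and the names `dft`, `receiver`, and `pipelined_transform` are hypothetical. It does not use psend/precv and is not the benchmark code; it only shows the structure in which a row is transformed while the next one is being delivered.

```python
import cmath
import queue
import threading


def dft(values):
    """Naive 1-D discrete Fourier transform (O(n^2)), for illustration only."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * m / n) for m, v in enumerate(values))
        for k in range(n)
    ]


def receiver(rows, inbox):
    """Stand-in for the communication side: delivers rows one by one.

    In a real run each put() would correspond to a message arriving
    in a user buffer; here the data is simply handed over.
    """
    for index, row in enumerate(rows):
        inbox.put((index, row))
    inbox.put(None)  # sentinel: no more data


def pipelined_transform(rows):
    """Transform rows as they arrive instead of waiting for all of them."""
    inbox = queue.Queue(maxsize=2)  # at most 2 rows are buffered ahead
    worker = threading.Thread(target=receiver, args=(rows, inbox))
    worker.start()
    results = [None] * len(rows)
    while True:
        item = inbox.get()
        if item is None:
            break
        index, row = item
        results[index] = dft(row)  # runs while the next row is being delivered
    worker.join()
    return results


rows = [[1, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, -1]]
overlapped = pipelined_transform(rows)
sequential = [dft(row) for row in rows]
```

Both `overlapped` and `sequential` hold the same transforms; only the order in which data arrival and calculation interleave differs. Python threads do not give true parallel execution, so the sketch shows the control structure rather than the speedup.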
The communication data amount is calculated as follows.
The calculated results are shown in Table 5.
Table 5: Communication data size versus problem size (in megabytes). It depends on the number of processing elements (PE).
Table 6: The execution time versus the number of processors with SUN SparcStation-2s. The problem size is .
For a problem size of , a COMM size of 196 MB, and 8 processors, the execution time breaks down as follows.
Communication idle (receive wait, sync): 79% (69% and 10%, respectively)
I improved the overall performance with:
The algorithm for communication pattern scheduling is as follows.
Table 7: Performance improvement with psend/precv, overlapping, and scheduling of communications. Case 1 is a problem with 4 processors, case 2 is the class A problem with 8 processors, and case 3 is the class A problem with 16 processors. Model 1 is the original (non-tuned) version, model 2 uses psend/precv, models 3 and 4 are overlapping variations, and models 5 and 6 are scheduling variations.
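To illustrate what scheduling a communication pattern can look like for an all-to-all exchange, the following is a minimal sketch of a rotation schedule. It is a generic scheme and not necessarily the one behind models 5 and 6; the function name `rotation_schedule` and the choice of 8 PEs are assumptions made for the example.

```python
def rotation_schedule(num_pes):
    """Return rounds of (sender, receiver) pairs for an all-to-all exchange.

    In round r (1 <= r < P), PE i sends to PE (i + r) mod P, so every PE
    sends exactly once and receives exactly once per round.
    """
    rounds = []
    for shift in range(1, num_pes):
        rounds.append([(pe, (pe + shift) % num_pes) for pe in range(num_pes)])
    return rounds


schedule = rotation_schedule(8)
for number, pairs in enumerate(schedule, start=1):
    print(f"round {number}: {pairs}")
```

With 8 PEs this gives 7 rounds, and no PE is the target of two senders in the same round. Whether such an ordering helps in practice depends on the network: on a switched network the pairs in a round can proceed side by side, while on a shared medium they still compete for the same wire.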
Switching technology is one of the key features for higher performance, so we also evaluated some patterns (see Table 8).
Table 8: The execution time with 100Base-T switching network versus normal 10Base-T network,
Additionally, kernel FT has a special feature with respect to networking. It calculates one FFT and then six inverse FFTs to derive the time-domain response. In the inverse FFT phase, the steps are independent of each other and can be calculated separately. We can use this property to improve performance. Table 9 shows the use of this feature with 8 processors per segment and class A problems. Table 10 shows the use of this feature with 16 processors per segment and class A problems.
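The effect of spreading the independent inverse FFTs over segments can be sketched as a simple round-robin assignment. The helper names `assign_to_segments` and `rounds_needed` and the segment counts tried below are assumptions for illustration; only the figure of six independent inverse FFTs comes from the text.

```python
def assign_to_segments(num_tasks, num_segments):
    """Round-robin assignment of independent tasks to network segments."""
    assignment = [[] for _ in range(num_segments)]
    for task in range(num_tasks):
        assignment[task % num_segments].append(task)
    return assignment


def rounds_needed(assignment):
    """Number of sequential steps if every segment works through its own list."""
    return max(len(tasks) for tasks in assignment)


INVERSE_FFTS = 6  # independent inverse FFTs, as stated in the text

for segments in (1, 2, 3, 6):
    plan = assign_to_segments(INVERSE_FFTS, segments)
    print(segments, plan, rounds_needed(plan))
```

With 2 segments each segment handles three inverse FFTs, with 3 segments two, and with 6 segments one, so in this idealized model the inverse FFT phase shortens in proportion to the number of segments. The forward FFT and any traffic between segments are ignored here.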
Table 9: Execution time with segmented Ethernet networks. The problem size is , so the COMM size is 196 MB. As the table shows, almost linear behavior was achieved with respect to the number of segments.
Table 10: Execution time with segmented Ethernet networks. The problem size is , so the COMM size is 210 MB. As the table shows, almost linear behavior was achieved with respect to the number of segments.
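The two COMM sizes quoted in the captions of Tables 9 and 10 can be checked against each other. Assuming that P is the number of processors per segment (8 and 16) and that each PE keeps 1/P of the data locally during an all-to-all exchange, both figures correspond to the same total of 224 MB. This is only a consistency check on the quoted numbers under that assumption; the symbol P and the (P-1)/P factor are not taken from the original calculation.

```latex
\frac{196\ \mathrm{MB}}{7/8} = 224\ \mathrm{MB},
\qquad
\frac{210\ \mathrm{MB}}{15/16} = 224\ \mathrm{MB}
\quad\Longrightarrow\quad
\mathrm{COMM}(P) \approx 224\ \mathrm{MB} \times \frac{P-1}{P}
```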