Next: MG Up: NAS Parallel Benchmarks Previous: CG

FT

Kernel FT benchmarks a three-dimensional Fourier transform program. As with kernel CG, I should overlap communication with calculation and transfer data directly into user memory. In the three-dimensional Fourier transform, the transformation along each dimension can be executed as a one-way pipeline, which is not the case in kernel CG. I can therefore prepare the waiting processes for the later calculation stages in advance and fully utilize the psend/precv functions. Thus, with kernel FT, overlapping is more efficient and easier to use than with kernel CG. During execution, kernel FT calculates 4 to 6 independent inverse FFTs. I assigned these independent parts to separate segments of my network. This assignment gave very good performance.
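The overlap described above can be sketched in a few lines. This is only an illustration under stated assumptions: a Python thread and a bounded queue stand in for the communication side, `receive_blocks`, `compute` and `pipelined` are hypothetical names, and no psend/precv call is actually made.

```python
import queue
import threading

def receive_blocks(blocks, inbox):
    """Stand-in for the communication side: deliver blocks one by one."""
    for block in blocks:
        inbox.put(block)   # waits here when all buffers are still in use
    inbox.put(None)        # end-of-stream marker

def compute(block):
    """Stand-in for the per-block calculation (for example a 1-D transform)."""
    return sum(x * x for x in block)

def pipelined(blocks, depth=2):
    """Compute on one block while the next one is being received."""
    inbox = queue.Queue(maxsize=depth)   # depth=2 behaves like double buffering
    receiver = threading.Thread(target=receive_blocks, args=(blocks, inbox))
    receiver.start()
    results = []
    while True:
        block = inbox.get()
        if block is None:
            break
        results.append(compute(block))
    receiver.join()
    return results

blocks = [[1, 2, 3], [4, 5], [6]]
overlapped = pipelined(blocks)
sequential = [compute(b) for b in blocks]
```

Because the data flows one way only, the receiver never has to wait for a reply from the calculation, which is the property that makes the pipeline easy to overlap.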

The communication data amount is calculated as follows.

  eqnarray145

The calculated results are shown in Table 5.

   table149
Table 5: The communication data size versus problem size (in megabytes). It depends on the number of processing elements (PE).
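As a rough consistency check on the dependence on the number of PEs, and not as a reconstruction of the formula above, the two COMM sizes quoted in this section for class A (196 MB with 8 PEs, 210 MB with 16 PEs) can be compared against the assumption that each PE keeps 1/P of the data locally during an all-to-all exchange. The function name and the (P - 1)/P model are assumptions introduced for this sketch.

```python
def comm_fraction(num_pe):
    """Fraction of the data that crosses the network in an all-to-all
    exchange, assuming each PE keeps 1/num_pe of the data locally."""
    return (num_pe - 1) / num_pe

# COMM sizes quoted in this section for class A, in MB, keyed by number of PEs.
quoted = {8: 196, 16: 210}

# If COMM = total * (P - 1) / P, both entries should imply the same total.
implied_total = {p: size / comm_fraction(p) for p, size in quoted.items()}

for p, total in implied_total.items():
    print(f"{p} PEs: implied total {total:.1f} MB")
```

Both entries imply the same total under this model, so the quoted figures are at least consistent with a (P - 1)/P dependence.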

   table161
Table 6: The execution time versus the number of processors on Sun SPARCstation 2 workstations. The problem size is tex2html_wrap_inline479.

The breakdown of execution time for a problem size of tex2html_wrap_inline483, a COMM size of 196 MB, and 8 processors is as follows.
Communication idle (receive wait, sync): 79% (69% receive wait and 10% sync)
Calculation: 15%
Send/receive: 6%
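To see why the idle time is the natural target, the shares above can be turned into a simple upper bound on the achievable speedup. This is a back-of-the-envelope sketch that assumes the calculation and send/receive shares stay fixed; `best_case_speedup` is a hypothetical helper, not a measured result.

```python
# Shares of execution time quoted above, in percent.
shares = {"communication idle": 79, "calculation": 15, "send/receive": 6}

total = sum(shares.values())

def best_case_speedup(hidden_fraction):
    """Upper bound on speedup if `hidden_fraction` (0..1) of the
    communication idle time is removed and nothing else changes."""
    remaining = total - shares["communication idle"] * hidden_fraction
    return total / remaining

half_hidden = best_case_speedup(0.5)   # half of the idle time removed
all_hidden = best_case_speedup(1.0)    # idle time removed completely

print(f"half of idle removed: at most {half_hidden:.2f}x")
print(f"all idle removed:     at most {all_hidden:.2f}x")
```

Even in the ideal case the bound is a little under 5x, because calculation and send/receive together still account for 21% of the original time.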

I improved the overall performance by:

  1. using psend/precv to bypass the memory copy operation to and from the communication library (10.3%),
  2. overlapping communication with calculation (14.4%),
  3. scheduling the communication pattern (3.7%).

The algorithm for communication pattern scheduling is as follows.

tabular179
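As a generic illustration of what scheduling a communication pattern can mean, and not necessarily the algorithm used here, the sketch below builds a shifted schedule for an all-to-all exchange. In each round every PE sends exactly one message and receives exactly one message, so no receiver is targeted by two senders at once. The function name and the shift rule are assumptions of this sketch.

```python
def shifted_schedule(num_pe):
    """Return rounds of (sender, receiver) pairs for an all-to-all exchange.

    In round r, PE i sends to PE (i + r) % num_pe, so every PE sends one
    message and receives one message per round.
    """
    rounds = []
    for r in range(1, num_pe):
        rounds.append([(i, (i + r) % num_pe) for i in range(num_pe)])
    return rounds

schedule = shifted_schedule(4)
for r, pairs in enumerate(schedule, start=1):
    print(f"round {r}: {pairs}")
```

With P processing elements this takes P - 1 rounds and covers every ordered pair of distinct PEs exactly once.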

   table191
Table 7: The performance improvement with psend/precv, overlapping, and scheduling of communications. Case 1 is the tex2html_wrap_inline479 problem with 4 processors, case 2 is the class A problem with 8 processors, and case 3 is the class A problem with 16 processors. Model 1 is the original (non-tuned) version, model 2 uses psend/precv, models 3 to 4 are overlapping variations, and models 5 to 6 are scheduling variations.

Switching technology is one of the key features for higher performance, and we also evaluated some patterns (see Table 8).

   table201
Table 8: The execution time with a 100Base-T switching network versus a normal 10Base-T network.

Additionally, kernel FT has a special feature for networking. It calculates one FFT and six inverse FFTs to derive the time domain response. In the inverse FFT phase, the steps are independent and can be calculated separately. We can use this property to improve performance. Table 9 shows the use of this feature with 8 processors per segment and class A problems. Table 10 shows the use of this feature with 16 processors per segment and class A problems.
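A minimal sketch of how the six independent inverse FFTs might be spread over network segments is given below. The round-robin rule and both function names are assumptions made for illustration; the actual assignment used for the measurements is not specified here.

```python
import math

def assign_to_segments(num_tasks, num_segments):
    """Round-robin assignment of independent tasks to network segments."""
    assignment = {s: [] for s in range(num_segments)}
    for task in range(num_tasks):
        assignment[task % num_segments].append(task)
    return assignment

def rounds_needed(num_tasks, num_segments):
    """Sequential rounds required when each segment runs one task at a time."""
    return math.ceil(num_tasks / num_segments)

plan = assign_to_segments(6, 3)
for segment, tasks in plan.items():
    print(f"segment {segment}: inverse FFTs {tasks}")

for k in (1, 2, 3, 4, 6):
    print(f"{k} segment(s): {rounds_needed(6, k)} round(s)")
```

Under this simple model the inverse FFT phase shortens in proportion to the number of segments only when that number divides six evenly. With four segments, for example, two rounds are still needed.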

   table214
Table 9: The execution time with segmented Ethernet networks. The problem size is tex2html_wrap_inline483, so the COMM size is 196 MB. As the table shows, nearly linear scaling was achieved with respect to the number of segments.

   table223
Table 10: The execution time with segmented Ethernet networks. The problem size is tex2html_wrap_inline483, so the COMM size is 210 MB. As the table shows, nearly linear scaling was achieved with respect to the number of segments.



Naohiko Shimizu
Wed Feb 5 10:44:22 JST 1997