Latest version: '98.12.24
Fast DGEMM routine with blocking, but without inner submatrix copy.

A well-designed machine will achieve high performance with this
program. I prefer not to use the inner submatrix copy method, because
it does not address the fundamental problem (it only works around the
limited number of translation buffers).

Copyright 1998 by Naohiko Shimizu 

All rights reserved.

The blocking algorithm and the strategy for the blocking factor are
my original work. If you want to use the same algorithm or the same
strategy for the blocking factor, you must preserve the copyright
notice in all programs and products which are derived from this
program and/or description.

Contact information:

Dr. Naohiko Shimizu
School of Engr. Tokai Univ.
1117 Kitakaname Hiratsuka
Kanagawa 259-1292 Japan
email: nshimizu@keyaki.cc.u-tokai.ac.jp
URL: http://shimizu-lab.dt.u-tokai.ac.jp

You can call this subroutine from C or FORTRAN.
Only the non-transposed routine is implemented.
This program is not Alpha specific. You can compile and run it
on many other architectures.


C:
       void dgemms_nn(int m, int n, int k, 
	     double alpha, double *a, int lda, double *b, int ldb, 
	     double beta, double *c, int ldc);

*CAUTION* C users should know that the matrices are assumed to be
stored in the FORTRAN order (column-major).
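
For illustration, a minimal C caller might look like this (the
matrix sizes, the all-ones initialization, and the use of m and k as
leading dimensions are assumptions of this sketch, not part of the
distribution):

    #include <stdlib.h>

    void dgemms_nn(int m, int n, int k,
                   double alpha, double *a, int lda, double *b, int ldb,
                   double beta, double *c, int ldc);

    int main(void)
    {
        int m = 512, n = 512, k = 512;   /* sizes are an assumption */
        double *a = malloc((size_t)m * k * sizeof(double));
        double *b = malloc((size_t)k * n * sizeof(double));
        double *c = malloc((size_t)m * n * sizeof(double));
        int i;
        /* element (i,j) of A lives at a[i + j*lda]: column-major */
        for (i = 0; i < m * k; i++) a[i] = 1.0;
        for (i = 0; i < k * n; i++) b[i] = 1.0;
        for (i = 0; i < m * n; i++) c[i] = 0.0;
        dgemms_nn(m, n, k, 1.0, a, m, b, k, 0.0, c, m);  /* C = A*B */
        free(a); free(b); free(c);
        return 0;
    }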

FORTRAN:
       call dgemms_nn(m,n,k,alpha,a,lda,b,ldb,beta,c,ldc)

For benchmarking I have included a main C routine. Use the -DMAIN
option to compile for benchmarking.

Compile options (an example build command follows the list):

-DMAIN	: compile for benchmarking
-Dev6	: compile for 21264
-Dev4	: compile for 21064(A)
-DRPCC	: use RPCC instruction to measure the time
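
For example, a benchmark build for a 21264 might be invoked as
follows (the source file name dgemms.c, the compiler name cc, and
the -O2 flag are assumptions; only the -D options come from the
list above):

    cc -O2 -DMAIN -DRPCC -Dev6 dgemms.c -o dgemm_bench
    ./dgemm_bench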

The core routine of DGEMM is matrix multiplication. Blocking
algorithms for this problem are well known, but usually they aim only
at cache optimization. For a large problem it is also necessary to
optimize the usage of the translation buffer. Matrix elements that
are adjacent by one column are a full leading dimension apart in
memory, so each one requires its own entry in the translation buffer,
and a buffer shortage easily occurs on most RISC processors. This
program demonstrates a blocking algorithm that reduces both the
buffer misses and the cache misses on a given processor.

Conventional method:

Now we are going to solve the problem: C = aAB + bC.

A=[A11 A12 A13 ... ; A21  A22 A23 ...; ... ; As1 As2 ...]
B=[B11 B12 B13 ... ; B21  B22 B23 ...; ... ; Bt1 Bt2 ...]
C=[C11 C12 C13 ... ; C21  C22 C23 ...; ... ; Cs1 Cs2 ...]

Cij = bCij + a*sum(Ail*Blj for l=1 to t)

The more square the blocks Ail are, the less memory load traffic
results. Usually the above algorithm is implemented only to reduce
store traffic.
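
As a reference point, here is a minimal C sketch of this conventional
cache blocking (not the author's code: the block size MB is an
assumed tuning constant, matrices are column-major, and C is assumed
to already hold b*C on entry):

    #include <stddef.h>

    #define MB 64   /* square block size: an assumed tuning constant */

    static int imin(int a, int b) { return a < b ? a : b; }

    /* Cij = Cij + a*Ail*Blj over square blocks; C must already
       hold b*C.  Column-major storage, as in FORTRAN. */
    void dgemm_blocked(int m, int n, int k, double alpha,
                       const double *a, int lda,
                       const double *b, int ldb,
                       double *c, int ldc)
    {
        for (int jj = 0; jj < n; jj += MB)
          for (int ll = 0; ll < k; ll += MB)
            for (int ii = 0; ii < m; ii += MB)
              /* multiply one Ail block into one Cij block */
              for (int j = jj; j < imin(jj + MB, n); j++)
                for (int l = ll; l < imin(ll + MB, k); l++) {
                    double t = alpha * b[l + (size_t)j*ldb];
                    for (int i = ii; i < imin(ii + MB, m); i++)
                        c[i + (size_t)j*ldc] += t * a[i + (size_t)l*lda];
                }
    }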

My method:

In Fortran, the row direction is contiguous in memory. So if we
calculate Ail*Blj for i=1 to s (running down all the row blocks of
A), translation misses are reduced. Of course this requires more
temporary results in Cij, which induces more store traffic. In many
RISC implementations the second- or third-level cache is organized as
a store-in (write-back) cache: once a store is issued, a cache block
must first be read from memory, so store traffic is almost twice as
heavy as the same amount of load traffic. Since the number of stores
in this computation depends entirely on the number of columns of Ail,
we can control the traffic. The traffic caused by Blj can almost be
ignored, because once Blj is loaded into the cache, many of the
Ail*Blj multiplications execute from the cached data. But the load
traffic of Ail is the other major traffic for this problem, so we
should keep the portion of Ail held in cache as large as possible. If
both the store traffic and the load traffic are affordable for a
specific machine, we have a chance to reach the maximum performance.

To fit in the translation buffer, the program should use blocking on
Ail. The number of rows in a block times the size of a double should
fit within the page size of the processor. And the total number of
columns of Ail, Blj, and Cij should not exceed the number of
translation buffer entries.

The number of columns of Ail is arbitrary. More columns decrease the
store traffic and increase the load traffic; fewer columns increase
the store traffic but decrease the load traffic. So there is an
optimal point for a specific machine.

Ail generates load traffic proportional to its number of columns on
every (columns of Ail) by (columns of Blj) submatrix multiplication,
while Cij generates store traffic on every one of its rows.

The parameters depend on the processor and memory system
configuration, and you can tune them for your system.
But if the number of translation buffer entries is small and the
load/store throughput is also small, the result will be hopeless;
you should consider replacing your machine.

If Ail is aligned to the page size, it is easy to handle. But in
general we cannot expect that. I therefore assume that the number of
translation buffer entries used by Ail is effectively doubled, so
that all required pages can be held in the translation buffer.
Of course this is not an ideal solution. You may want to align A
to a page boundary and set LDA to a factor of the page size.
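
One possible way to arrange that alignment (a sketch assuming an 8KB
page size; posix_memalign and rounding LDA up to a whole number of
pages are my choices here, not the author's):

    #include <stdlib.h>

    /* Allocate an m x n column-major matrix with A page-aligned and
       LDA padded to a whole number of 8KB pages (1024 doubles), so
       every column starts on a page boundary.  Assumes 8KB pages;
       other systems need a different constant. */
    double *alloc_page_aligned(int m, int n, int *lda)
    {
        void *p = NULL;
        *lda = ((m + 1023) / 1024) * 1024;
        if (posix_memalign(&p, 8192, (size_t)*lda * n * sizeof(double)))
            return NULL;
        return (double *)p;
    }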

The algorithm is as follows (a C sketch follows the pseudocode):

do ii=1 to k, step ail_clmn_size
 do jj=1 to m, step ail_maj_blk_size
  do kk=1 to n, step blj_clmn_size
   do ll=jj to jj+ail_maj_blk_size-1
    ctmp_ll_kk += A_ll_ii*B_ii_kk
   enddo
  enddo
 enddo
enddo
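
Read literally, ctmp_ll_kk += A_ll_ii*B_ii_kk is a small block
operation. Expanded into scalar loops, a minimal C sketch of this
loop order might look like the following (dgemms_nn_sketch is an
illustrative name, and the block-size constants are the AlphaPC164LX
values suggested below, not part of the pseudocode):

    #include <stddef.h>

    #define AIL_CLMN 20   /* columns of Ail per block            */
    #define BLJ_CLMN 12   /* columns of Blj/Cij per block        */
    #define ROW_BLK 512   /* row blocking factor: half 8KB page  */

    static int imin(int a, int b) { return a < b ? a : b; }

    /* C = alpha*A*B + beta*C, column-major, no transpose.
       A sketch of the loop order above, not the author's tuned code. */
    void dgemms_nn_sketch(int m, int n, int k, double alpha,
                          const double *a, int lda,
                          const double *b, int ldb,
                          double beta, double *c, int ldc)
    {
        /* scale C by beta once, before any accumulation */
        for (int j = 0; j < n; j++)
            for (int i = 0; i < m; i++)
                c[i + (size_t)j*ldc] *= beta;

        for (int ii = 0; ii < k; ii += AIL_CLMN)     /* cols of Ail */
          for (int jj = 0; jj < m; jj += ROW_BLK)    /* rows of Ail */
            for (int kk = 0; kk < n; kk += BLJ_CLMN) /* cols of Blj */
              /* block multiply: Cij += alpha * Ail * Blj */
              for (int j = kk; j < imin(kk + BLJ_CLMN, n); j++)
                for (int l = ii; l < imin(ii + AIL_CLMN, k); l++) {
                    double t = alpha * b[l + (size_t)j*ldb];
                    for (int i = jj; i < imin(jj + ROW_BLK, m); i++)
                        c[i + (size_t)j*ldc] += t * a[i + (size_t)l*lda];
                }
    }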

For the AlphaPC164LX the parameters are:
  DTB: 64 entries
  Page size: 8KB
  Load throughput: almost 700 MB/s
  Store throughput: almost 400 MB/s

Then I think the following numbers are affordable (a quick check
follows the list).

1. The number of column of Ail is 20
2. The number of column of Blj,Cij is 12
3. The blocking factor of the Ail row is 512 (just fit into the half page)
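
As a consistency check (my arithmetic, based on the doubling
assumption for unaligned Ail described above), these numbers just
fill the machine:

   2*20 (Ail, doubled) + 12 (Blj) + 12 (Cij) = 64 DTB entries
   512 rows * 8 bytes/double = 4096 bytes = half of the 8KB page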

You may get a good result on the 21264 with these parameters:

1. The number of column of Ail is 32
2. The number of column of Blj,Cij is 16
3. The blocking factor of the Ail row is 512 (just fit into the half page)

ToDo:
1: Implement the inner submatrix copy algorithm as a comparison aid.
2: Develop processor/system specific Ail_max and Blj_max to extract
the best performance.
3: Collect benchmark results for many architectures/systems.