Fast DGEMM routine with blocking but without an inner submatrix copy. A well-designed machine will reach high performance with this program. I prefer not to use the inner submatrix copy method, because copying is not fundamental to the problem (it only works around the limited number of translation-buffer entries).

Copyright 1998 by Naohiko Shimizu. All rights reserved. The blocking algorithm and the strategy for choosing the blocking factor are my original work. If you want to use the same algorithm or the same strategy for the blocking factor, you must preserve this copyright notice in all programs and products derived from this program and/or description.

Contact information:
  Dr. Naohiko Shimizu
  School of Engr. Tokai Univ.
  1117 Kitakaname Hiratsuka Kanagawa 259-1292 Japan
  email: nshimizu @ keyaki . cc .u-tokai.ac.jp
  URL: http://shimizu-lab.dt.u-tokai.ac.jp

You can call this subroutine from C or FORTRAN. Only the non-transposed routine is implemented. This program is not Alpha-specific; you can compile and run it on many other architectures.

C:
  void dgemms_nn(int m, int n, int k, double alpha, double *a, int lda,
                 double *b, int ldb, double beta, double *c, int ldc);

*CAUTION* C users should be aware that the matrices are assumed to be stored in FORTRAN order (column major).

FORTRAN:
  call dgemms_nn(m,n,k,alpha,a,lda,b,ldb,beta,c,ldc)

For benchmarking, a main C routine is included. Use the -DMAIN option to compile for benchmarking.

Compile options:
  -DMAIN : compile for benchmarking
  -Dev6  : compile for 21264
  -Dev4  : compile for 21064(A)
  -DRPCC : use the RPCC instruction to measure time

The core of DGEMM is matrix multiplication. Blocking algorithms for this problem are well known, but usually they aim only at cache optimization. For a large problem it is also necessary to optimize the usage of the translation buffer: every matrix element that is adjacent by one column requires its own translation-buffer entry, so buffer shortage easily occurs on most RISC processors.
This program demonstrates a blocking algorithm that reduces both translation-buffer misses and cache misses on a given processor.

Conventional method: We want to solve C = aAB + bC. Partition the matrices into blocks:

  A = [A11 A12 A13 ... ; A21 A22 A23 ... ; ... ; As1 As2 ...]
  B = [B11 B12 B13 ... ; B21 B22 B23 ... ; ... ; Bt1 Bt2 ...]
  C = [C11 C12 C13 ... ; C21 C22 C23 ... ; ... ; Cs1 Cs2 ...]

  Cij = b*Cij + a*sum(Ail*Blj for l=1 to t)

The more square the blocks Aij are, the less memory load traffic results, and usually the algorithm above is implemented only to reduce store traffic.

My method: In FORTRAN the row direction is contiguous in memory, so if we calculate Ail*Blj for i=1 to s (running down a whole block column of A), translation misses are reduced. Of course, this requires more temporary results in Cij, which causes more store traffic. In many RISC implementations the second- or third-level cache is organized as a store-in (write-back) cache: once a store is issued, the cache must first read a block from memory, so store traffic costs almost twice as much as the same amount of load traffic. Since the number of stores in this computation depends entirely on the number of columns of Ail, we can control that traffic. The traffic caused by Blj can almost be ignored, because once Blj is loaded into the cache, many Ail*Blj products can be executed with the cached data. The load traffic of Ail, however, is the other major traffic in this problem, so we should keep as large a portion of Ail in cache as possible. If both the store traffic and the load traffic are affordable for a specific machine, we have a chance to reach its maximum performance.

To fit in the translation buffer, the program should use blocking on Ail. The number of rows in a block times sizeof(double) should fit in the page size of the processor, and the total number of columns of Ail, Blj and Cij should not exceed the number of translation-buffer entries. The number of columns of Ail is arbitrary.
More columns of Ail decrease the store traffic but increase the load traffic; fewer columns increase the store traffic but decrease the load traffic. So there is an optimal point for a specific machine. Ail causes load traffic proportional to its number of columns on every (columns of Ail) by (columns of Blj) submatrix multiplication, and Cij causes store traffic on every row of Cij. The parameters depend on the processor and memory-system configuration, and you can tune them for your system. But if the number of translation-buffer entries is small and the load/store throughput is also small, the result will be hopeless; you should think about replacing your machine.

If Ail is aligned to the page size, it is easy to handle, but in general we cannot expect that. I therefore assume that the translation-buffer entries used by Ail are effectively doubled, so that all the required pages are held in the translation buffer. Of course this is not an ideal solution; you may want to align A to a page boundary and set LDA to a factor of the page size.

The algorithm is the following:

  do ii=1 to k, ail_clmn_size
    do jj=1 to m, ail_maj_blk_size
      do kk=1 to n, blj_clmn_size
        do ll=jj to jj+ail_maj_blk_size
          ctmp_ll_kk += A_ll_ii*B_ii_kk
        enddo
      enddo
    enddo
  enddo

For the AlphaPC164LX the parameters are:

  DTB:              64 entries
  Page size:        8KB
  Load throughput:  almost 700MB/s
  Store throughput: almost 400MB/s

Then I think the following numbers are affordable:
  1. The number of columns of Ail is 20
  2. The number of columns of Blj, Cij is 12
  3. The blocking factor of the Ail rows is 512 (just fits into half a page)

You may get a good result on the 21264 with these parameters:
  1. The number of columns of Ail is 32
  2. The number of columns of Blj, Cij is 16
  3. The blocking factor of the Ail rows is 512 (just fits into half a page)

ToDo:
  1: Implement the inner submatrix copy algorithm as a comparison aid.
  2: Develop processor/system-specific Ail_max, Blj_max to extract the best performance.
  3: Collect benchmark results for many architectures/systems.