
I am attempting to use the tensor cores efficiently in a custom DL inference kernel, but I get very poor performance. In the hope that someone here can help me understand what I am doing wrong, I will post a small repro case here.

The relevant parts of the kernel:

    __global__ void test_wmma( __half* d_A, __half* d_B, __half* d_C )
    {
        // ... fragment declarations, staging of B into shared memory,
        //     and the b_frag load omitted ...
        wmma::load_matrix_sync( a_frag, d_A, 16 );
        wmma::mma_sync( acc_frag, a_frag, b_frag, acc_frag );
        wmma::store_matrix_sync( d_C, acc_frag, 16, wmma::mem_row_major );
    }
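Filling in the omitted pieces, a complete minimal version might look roughly like the sketch below. The fragment layouts, the shared buffers s_B and s_C, the single-warp block, and the choice to keep the result in shared memory (the snippet above stores through d_C) are my assumptions, not the original code:

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // Hypothetical reconstruction: one warp multiplies a single 16x16x16 tile.
    // A is read straight from global memory (cached in L1), B is staged through
    // shared memory, and the result stays on-chip in shared memory.
    __global__ void test_wmma_sketch( const __half* d_A, const __half* d_B,
                                      __half* d_C )   // d_C unused in this sketch
    {
        __shared__ __half s_B[16 * 16];   // assumed staging buffer for B
        __shared__ __half s_C[16 * 16];   // assumed buffer for the result

        // Cooperative copy of B into shared memory.
        for ( int i = threadIdx.x; i < 16 * 16; i += blockDim.x )
            s_B[i] = d_B[i];
        __syncthreads();

        wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::row_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, __half> acc_frag;
        wmma::fill_fragment( acc_frag, __float2half( 0.0f ) );

        // One fragment load from global, one from shared, one tensor-core MMA.
        wmma::load_matrix_sync( a_frag, d_A, 16 );
        wmma::load_matrix_sync( b_frag, s_B, 16 );
        wmma::mma_sync( acc_frag, a_frag, b_frag, acc_frag );

        // Result kept in shared memory; nothing is written back to global memory.
        wmma::store_matrix_sync( s_C, acc_frag, 16, wmma::mem_row_major );
    }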

The host-side setup just fills A and B with random values and copies everything to the device:

    // A, B, C are host containers of 16 * 16 elements of dtype (= __half); declarations omitted
    for ( auto& a : A ) a = float( rand() ) / RAND_MAX;
    for ( auto& b : B ) b = float( rand() ) / RAND_MAX;

    cudaMalloc( &d_A, 16 * 16 * sizeof( dtype ) );
    cudaMalloc( &d_B, 16 * 16 * sizeof( dtype ) );
    cudaMalloc( &d_C, 16 * 16 * sizeof( dtype ) );

    cudaMemcpy( d_A, A.data(), 16 * 16 * sizeof( dtype ), cudaMemcpyHostToDevice );
    cudaMemcpy( d_B, B.data(), 16 * 16 * sizeof( dtype ), cudaMemcpyHostToDevice );
    cudaMemcpy( d_C, C.data(), 16 * 16 * sizeof( dtype ), cudaMemcpyHostToDevice );

So the kernel simply multiplies two matrices, one from global memory and the other from shared, and stores the result in shared memory. (For simpler timing, no results are written, but I have verified that the SASS code is correct and does not optimize anything away.) I would have expected this to run very quickly, as the global memory access is always cached (which I have verified in Nsight: 99.98% L1 hit rate). However, when profiling the code in Nsight, it takes 145 us (very approximately 2 TFlops, if I didn't mess up my calculations).

Looking at the "Scheduler Statistics" in Nsight, I find that 6 warps are active, but only 0.45 are eligible and 0.18 are issued. The largest stall reason by far is "Stall LG Throttle" (~18 cycles/instruction), and looking at the source view I find that it stalls mainly on the load_matrix_sync(…d_A…) instruction. If I switch so that the A matrix is also in shared memory, it stalls in the same place, but reports "MIO Throttle" as the main offender. By choosing different combinations of row/col major for the matrices, I can move the stall reasons around a bit, but I cannot get issued warps over 0.2.

What is wrong here? How do I come anywhere near the 100 TFlops (or whatever it is) that cuBLAS manages? Any tips or thoughts are very welcome!

(This is on an RTX 2070 (sm 7.5) mobile card, in a Razer Blade 15.)

The construction of your kernel represents a fairly naive usage of the wmma functionality. In particular, your paradigm is load-compute-unload; it does not matter how much of this you do, it will not be efficient. For much of your kernel's execution time, the tensor core units across the device are idle.
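To make that concrete, here is a rough sketch of the direction to take (the kernel name, the tiling, and the launch shape are illustrative assumptions, not code from this thread): give each warp a 16x16 output tile and let it march along the K dimension, so that every pair of fragment loads is followed by an mma_sync and the accumulator stays in registers the whole time.

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // Illustrative only: one warp per 16x16 tile of C = A * B, looping over K.
    // The load/compute ratio improves as K grows, which keeps the tensor cores
    // busier than a single load-compute-unload pass. M, N, K are assumed to be
    // multiples of 16; A, B are row-major half, C is row-major float.
    __global__ void wmma_loop_k( const __half* A, const __half* B, float* C,
                                 int M, int N, int K )
    {
        int tile_row = blockIdx.y;    // which 16-row band of C this warp owns
        int tile_col = blockIdx.x;    // which 16-column band of C this warp owns
        if ( tile_row * 16 >= M || tile_col * 16 >= N ) return;

        wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::row_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;
        wmma::fill_fragment( acc_frag, 0.0f );

        // March along K: every iteration feeds the tensor cores again.
        for ( int k = 0; k < K; k += 16 )
        {
            const __half* a_tile = A + tile_row * 16 * K + k;   // leading dim K
            const __half* b_tile = B + k * N + tile_col * 16;   // leading dim N
            wmma::load_matrix_sync( a_frag, a_tile, K );
            wmma::load_matrix_sync( b_frag, b_tile, N );
            wmma::mma_sync( acc_frag, a_frag, b_frag, acc_frag );
        }

        float* c_tile = C + tile_row * 16 * N + tile_col * 16;
        wmma::store_matrix_sync( c_tile, acc_frag, N, wmma::mem_row_major );
    }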

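A hypothetical launch for that sketch, with one warp per block:

    wmma_loop_k<<< dim3( N / 16, M / 16 ), 32 >>>( d_A, d_B, d_C, M, N, K );

Even this is still a long way from cuBLAS-level throughput. To get close to peak, a kernel additionally needs several warps per block computing a larger output tile, operands staged through shared memory in big blocks, and the global loads overlapped with the mma_sync work (double buffering); that is essentially the structure cuBLAS and CUTLASS implement.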