Tensor cores accelerate matrix multiplication in reduced precision. They're why BF16 and FP16 training is so fast.
A single tensor core operation performs a fused multiply-accumulate on small matrix tiles, multiplying two 4x4 matrices (on Volta-generation hardware) and accumulating the result in one cycle. Standard CUDA cores take many cycles for the same operation.
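As an illustration (plain Python, not GPU code), here is the math a single tensor core operation performs: the fused multiply-accumulate D = A @ B + C on 4x4 tiles. The tile size of 4 is the Volta-era figure; this loop needs 64 scalar multiply-adds to do what the hardware does in one cycle.

```python
def mma_4x4(a, b, c):
    """Fused multiply-accumulate on 4x4 tiles: returns a @ b + c."""
    n = 4
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) + c[i][j] for j in range(n)]
        for i in range(n)
    ]

# Sanity check: A @ I + 0 should return A unchanged.
identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
zeros = [[0] * 4 for _ in range(4)]
a = [[i * 4 + j for j in range(4)] for i in range(4)]
assert mma_4x4(a, identity, zeros) == a
```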
To use tensor cores, matrix dimensions should be multiples of 8 (for FP16/BF16) or 16 (for FP8). Frameworks handle this padding automatically, but it explains some performance quirks, such as a layer slowing down when one dimension misses the alignment by a single element.
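A minimal sketch of the padding frameworks apply under the hood. The helper name `pad_dim` and the alignment table are assumptions for illustration; the alignment values (8 for FP16/BF16, 16 for FP8) follow common tensor core guidance.

```python
# Hypothetical helper: round a dimension up to the multiple tensor cores want.
# Alignment values assumed here: 8 for FP16/BF16, 16 for FP8.
ALIGNMENT = {"fp16": 8, "bf16": 8, "fp8": 16}

def pad_dim(dim, dtype):
    """Smallest multiple of the required alignment that is >= dim."""
    align = ALIGNMENT[dtype]
    return ((dim + align - 1) // align) * align

print(pad_dim(1000, "fp16"))  # already a multiple of 8 -> 1000
print(pad_dim(1001, "fp16"))  # rounded up -> 1008
print(pad_dim(1000, "fp8"))   # FP8 needs multiples of 16 -> 1008
```

This is why a hidden size of 1001 can run noticeably slower than 1000 or 1008: the misaligned dimension either falls off the tensor core path or pays for padding.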