When a model doesn't fit on one GPU, tensor parallelism splits layers across multiple GPUs. Each GPU holds a slice of each layer's weights.
For a model with P parameters split across N GPUs, each GPU holds roughly P/N parameters. The GPUs communicate during each forward pass to combine their partial results.
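The partial-result combination can be illustrated with a small NumPy sketch. This simulates a column-parallel split of one linear layer on a single machine: the weight shards stand in for per-GPU slices, and the concatenation stands in for the all-gather a real implementation would perform. All names here are illustrative, not from any particular framework.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))      # batch of input activations
W = rng.standard_normal((16, 32))     # full weight matrix of one linear layer

n_gpus = 4
# Column-parallel split: each "GPU" holds a 16x8 slice of the 16x32 weights.
shards = np.split(W, n_gpus, axis=1)

# Each device computes its partial output independently...
partials = [x @ w for w in shards]
# ...then the slices are combined (an all-gather on real hardware).
y_parallel = np.concatenate(partials, axis=1)

# The sharded computation matches the unsharded layer exactly.
y_full = x @ W
assert np.allclose(y_parallel, y_full)
```

The concatenation step is the communication that tensor parallelism cannot avoid: every forward pass through every sharded layer pays for it, which is the source of the latency discussed below.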
Tensor parallelism adds latency because of this inter-GPU communication, so use the minimum number of GPUs needed to fit your model. For inference, it is often better to quantize first if possible, since a smaller model may fit on fewer GPUs or avoid tensor parallelism entirely.