vLLM is an optimized inference server for large language models. It implements PagedAttention, which manages the KV cache in fixed-size blocks to reduce memory fragmentation and waste.
Features:
- Continuous batching (serve multiple requests efficiently)
- Significantly higher throughput than naive (unbatched) serving
- OpenAI-compatible API
- Quantization support (e.g., AWQ, GPTQ, FP8)
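The continuous-batching idea above can be illustrated with a toy step simulator: finished sequences leave the batch and queued requests join immediately, instead of waiting for the whole batch to drain. This is a minimal sketch of the scheduling idea only, not vLLM's actual scheduler; the function name and request format are illustrative.

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy simulation of continuous batching.

    `requests` maps request id -> number of tokens to generate.
    Returns (total decode steps, completion order).
    """
    queue = deque(requests.items())
    active = {}   # request id -> tokens remaining
    steps = 0
    order = []    # ids in the order they finish
    while queue or active:
        # Admit waiting requests into any free batch slots right away.
        while queue and len(active) < max_batch:
            rid, n = queue.popleft()
            active[rid] = n
        # One decode step produces one token for every active request.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
                order.append(rid)
        steps += 1
    return steps, order
```

With requests `{"a": 2, "b": 5, "c": 3, "d": 1, "e": 4}` and `max_batch=2`, this finishes in 9 decode steps, whereas static batching (each batch runs until its longest member finishes) would take 12; freed slots are refilled mid-flight instead of sitting idle.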
vLLM is a common default choice for production GPU serving.
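Because the server exposes an OpenAI-compatible API, any HTTP client can talk to it. A minimal sketch using only the standard library, assuming a server started with something like `vllm serve <model>` on the default port 8000; the URL, the placeholder model name, and the helper functions here are illustrative assumptions, not part of vLLM itself.

```python
import json
import urllib.request

# Assumption: a local vLLM server is listening on the default port.
BASE_URL = "http://localhost:8000/v1/completions"

def build_payload(prompt: str, model: str = "my-model", max_tokens: int = 64) -> dict:
    # Standard OpenAI-style completion request body.
    # "my-model" is a placeholder; use whatever model the server loaded.
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

def complete(prompt: str) -> str:
    # POST the request and pull the generated text out of the response.
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

The same endpoint shape means existing OpenAI client libraries can usually be pointed at a vLLM server by overriding the base URL.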