meta-llama/Llama-3.1-8B-Instruct
Meta's Llama 3.1 8B dense instruction-tuned language model with 128K context
Overview
Llama 3.1 Instruct is Meta's instruction-tuned language model family. The 8B dense variant is lightweight and ideal for single-GPU deployment, with 128K context support. A 70B variant is also available (see related recipes).
Prerequisites
- Hardware: 1x GPU with >=16 GB VRAM (e.g. A100, L40S, H100, H200)
- vLLM >= 0.6.0
- CUDA Driver compatible with your vLLM version
- Docker with NVIDIA Container Toolkit (recommended)
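For the containerized route, the official `vllm/vllm-openai` image bundles vLLM with its CUDA dependencies. A typical invocation might look like the following (the port mapping, cache mount, and image tag are illustrative defaults, not requirements):

```shell
# Run the OpenAI-compatible vLLM server in Docker with GPU access.
# The Hugging Face cache is mounted so model weights are not re-downloaded.
docker run --gpus all \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=<your-token>" \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct
```

Llama 3.1 is a gated model, so the Hugging Face token must belong to an account that has accepted the license.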
Install vLLM
```shell
uv venv
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
```
Serve the Model
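The client example below expects an OpenAI-compatible server on `localhost:8000`. A minimal single-GPU launch might look like this (the `--max-model-len` cap is an illustrative value to keep the KV cache within a smaller card, not a tuned recommendation):

```shell
# Start vLLM's OpenAI-compatible server on the default port 8000.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 32768
```

Omitting `--max-model-len` uses the model's full 128K context, which requires substantially more GPU memory for the KV cache.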
Client Usage
```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; an API key is required by the
# client library but is not checked by the server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response.choices[0].message.content)
```
Troubleshooting
Out-of-memory (OOM) errors on small GPUs:
Lower `--max-model-len` to shrink the KV-cache allocation, or reduce `--gpu-memory-utilization` (defaults to 0.9).
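For example, a more conservative launch on a 16 GB card might combine both flags (the values below are illustrative starting points to tune from, not recommendations):

```shell
# Reduce both the context cap and the fraction of VRAM vLLM reserves.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85
```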