meta-llama/Llama-3.1-8B-Instruct
Meta's Llama 3.1 8B dense instruction-tuned language model with 128K context
Overview
Llama 3.1 Instruct is Meta's instruction-tuned language model family. The 8B dense variant is small enough for single-GPU deployment and supports a 128K-token context. A 70B variant is also available (see related recipes).
Prerequisites
- Hardware: 1x GPU with >=16 GB VRAM (e.g. A100, L40S, H100, H200)
- vLLM >= 0.6.0
- CUDA Driver compatible with your vLLM version
- Docker with NVIDIA Container Toolkit (recommended)
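The version requirements in this recipe can be checked from Python before launching a server. The following is a minimal sketch, not part of vLLM itself: `parse_release`, `meets_minimum`, and `check_vllm` are hypothetical helper names, and the comparison only looks at the leading numeric components of the version string, so pre-release and post-release suffixes are ignored.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(v):
    """Return the leading numeric components, padded to three parts.

    Example: "0.6.3.post1" -> (0, 6, 3). Suffixes such as "rc1" or
    "+cu121" are dropped (a simplification, labeled as an assumption).
    """
    parts = []
    for piece in v.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    parts += [0] * (3 - len(parts))  # so "0.9" compares equal to "0.9.0"
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare versions numerically, so that 0.10.x counts as newer than 0.9.x."""
    return parse_release(installed) >= parse_release(minimum)


def check_vllm(minimum="0.6.0"):
    """Return True if an installed vLLM satisfies the given minimum version."""
    try:
        return meets_minimum(version("vllm"), minimum)
    except PackageNotFoundError:
        return False
```

Calling `check_vllm("0.9.0")` instead of the default checks the stricter requirement that applies to the EAGLE3 section of this recipe.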
Install vLLM
uv venv
source .venv/bin/activate
uv pip install vllm --torch-backend auto
TPU Deployment
Client Usage
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response.choices[0].message.content)
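For interactive use it is often nicer to stream tokens as they are generated. The sketch below assumes a server started with `vllm serve meta-llama/Llama-3.1-8B-Instruct` on the default port 8000, as in the example above. `collect_stream` and `main` are hypothetical helper names; `stream=True` is the standard option of the OpenAI client for streamed chat completions.

```python
def collect_stream(chunks):
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            # Some chunks (for example usage-only chunks) carry no choices.
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            # Role-only and final chunks have content set to None.
            parts.append(delta)
    return "".join(parts)


def main():
    # Imported here so collect_stream itself has no third-party dependency.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
    stream = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "Hello, how are you?"}],
        stream=True,
    )
    print(collect_stream(stream))
```

Call `main()` with the server running. To display text as it arrives rather than at the end, print each delta inside the loop instead of collecting it.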
Speculative Decoding (EAGLE3)
An EAGLE3 draft head is available at RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3.
To enable it, pass --speculative-config to vllm serve:
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-config '{"model":"RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3","method":"eagle3","num_speculative_tokens":3}'
EAGLE3 verification requires vLLM >= 0.9.0. A num_speculative_tokens value of 3 is a reasonable
starting point for chat workloads.
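Hand-writing the JSON inside shell quotes is easy to get wrong when experimenting with different values of num_speculative_tokens. One way to avoid quoting mistakes is to build the argument list in Python; this is a minimal sketch in which `build_serve_command` is a hypothetical helper, and the config keys are exactly those used in the command above.

```python
import json
import shlex


def build_serve_command(model, draft_model, num_speculative_tokens=3):
    """Return the argv list for `vllm serve` with an EAGLE3 speculative config."""
    spec = {
        "model": draft_model,
        "method": "eagle3",
        "num_speculative_tokens": num_speculative_tokens,
    }
    # json.dumps guarantees valid JSON; passing argv as a list avoids shell quoting.
    return ["vllm", "serve", model, "--speculative-config", json.dumps(spec)]


cmd = build_serve_command(
    "meta-llama/Llama-3.1-8B-Instruct",
    "RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3",
)
# Print a copy-pasteable shell command; shlex.join adds the quoting.
print(shlex.join(cmd))
```

The list can also be passed directly to `subprocess.run(cmd)` to launch the server from a script.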
Troubleshooting
OOM on small GPUs: Lower --max-model-len or --gpu-memory-utilization.
EAGLE3 draft head not loading: Upgrade vLLM to >= 0.9.0. Earlier releases don't support the EAGLE3 verification path.
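The advice to lower --max-model-len follows from how KV cache size grows with context length. The back-of-envelope sketch below assumes the published Llama 3.1 8B configuration (32 layers, 8 KV heads, head dimension 128), roughly 8.03B parameters, and BF16 weights and cache. It ignores activations, the draft head, and framework overhead, so the numbers are lower bounds rather than measurements.

```python
GIB = 1024 ** 3

# Assumed values, taken from the published Llama 3.1 8B model configuration.
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2      # BF16
APPROX_PARAMS = 8.03e9   # approximate parameter count


def weight_bytes(params=APPROX_PARAMS, bytes_per_param=BYTES_PER_VALUE):
    """Approximate memory needed for the model weights alone."""
    return params * bytes_per_param


def kv_cache_bytes(num_tokens):
    """KV cache needed to hold num_tokens tokens of context."""
    # Factor of 2: one key vector and one value vector per layer and KV head.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return num_tokens * per_token


print(f"weights: {weight_bytes() / GIB:.1f} GiB")
for max_model_len in (8192, 32768, 131072):
    gib = kv_cache_bytes(max_model_len) / GIB
    print(f"{max_model_len:>7} tokens of context: {gib:.1f} GiB KV cache")
```

Under these assumptions each token costs 128 KiB of KV cache, so a single full 128K-token sequence needs about 16 GiB on top of roughly 15 GiB of weights, while an 8K limit needs about 1 GiB. This is why reducing --max-model-len is usually the first thing to try on smaller GPUs.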