meta-llama/Llama-3.1-8B-Instruct

Meta's Llama 3.1 8B dense instruction-tuned language model with 128K context

dense · 8B · 131,072 ctx · vLLM 0.6.0+ · text

Guide

Overview

Llama 3.1 Instruct is Meta's instruction-tuned language model family. The 8B dense variant is small enough for single-GPU deployment and supports a 128K-token context. A 70B variant is also available (see related recipes).

TPU support is provided through vLLM TPU with a recipe for Trillium.

Prerequisites

Hardware: 1x GPU with >=16 GB VRAM (e.g. A100, L40S, H100, H200) or 2x Xeon6/Xeon5 NUMA node
vLLM >= 0.6.0
CUDA driver compatible with your vLLM version (GPU deployments only)
Docker with NVIDIA Container Toolkit (recommended; GPU deployments only)

Install vLLM

uv venv
source .venv/bin/activate
uv pip install vllm --torch-backend auto
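
After installing, it can be useful to confirm that the installed vLLM meets the minimums this guide mentions (0.6.0 for serving, 0.9.0 for EAGLE3). The following is a minimal sketch using only the Python standard library. The helper names are illustrative, and the comparison deliberately ignores pre-release and local suffixes, so treat it as a rough check rather than a full PEP 440 comparison.

```python
import re
from importlib import metadata


def release_tuple(version: str) -> tuple:
    """Leading numeric release segment, padded to three parts.

    Suffixes such as 'rc1' or '+cu128' are ignored (an assumption of this
    sketch), so '0.9.0rc1+cu128' maps to (0, 9, 0).
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = tuple(int(p) for p in match.group().split("."))
    return parts + (0,) * (3 - len(parts))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare numerically, so '0.10.1' correctly counts as newer than '0.9.0'."""
    return release_tuple(installed) >= release_tuple(minimum)


def installed_vllm_version():
    """Return the installed vLLM version string, or None if vLLM is absent."""
    try:
        return metadata.version("vllm")
    except metadata.PackageNotFoundError:
        return None


# Example usage:
#   version = installed_vllm_version()
#   if version is None or not meets_minimum(version, "0.9.0"):
#       print("EAGLE3 speculative decoding needs vLLM >= 0.9.0")
```

Comparing integer tuples rather than raw strings matters here: as strings, "0.10.1" sorts before "0.9.0".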

pip (Intel Xeon 6 CPUs)

For Intel and AMD x86 CPUs, follow the CPU pre-built wheels installation instructions.

Docker (Intel Xeon 6 CPUs)

docker pull vllm/vllm-openai-cpu:latest-x86_64 # For Intel Xeon 6

Docker (Cloud TPU — Trillium)

TPU uses the separate vllm/vllm-tpu image (there is no pip wheel). The command below uses the latest tag as a placeholder; substitute the tag specified by the upstream Trillium recipe:

docker run -itd --name llama31-tpu \
  --privileged --network host --shm-size 16G \
  -v /dev/shm:/dev/shm -e HF_TOKEN=$HF_TOKEN \
  vllm/vllm-tpu:latest \
  vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size 1 \
    --max-model-len 2048 \
    --host 0.0.0.0 --port 8000

The 8B model fits on a single chip (v6e-1, TP=1), unlike the 70B, which needs a v6e-8 slice.

Intel Xeon 6 Deployment via Docker

Launch the x86 CPU vLLM Docker container for meta-llama/Llama-3.1-8B-Instruct:

docker run -itd --name llama3-8b-cpu \
  --network host \
  --shm-size 16g \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai-cpu:latest-x86_64 \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000
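
Because the container above starts detached, the server may not accept requests until the model has finished loading. A rough readiness poll is sketched below. It assumes the server answers GET /v1/models on the port used above once it is up; the function names and default timeouts are illustrative choices, not part of vLLM.

```python
import time
import urllib.error
import urllib.request


def models_endpoint_ok(base_url: str, timeout_s: float = 2.0) -> bool:
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: not ready yet.
        return False


def wait_until_ready(probe, timeout_s: float = 600.0, interval_s: float = 5.0) -> bool:
    """Call probe() repeatedly until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example usage (assumes the server from the docker run command above):
#   ready = wait_until_ready(lambda: models_endpoint_ok("http://localhost:8000"))
#   print("server ready" if ready else "server did not come up in time")
```

Passing the probe as a callable keeps the retry loop independent of the HTTP details, so the same loop can wrap a different check if needed.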

Client Usage

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response.choices[0].message.content)
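
For environments without the openai package, the same request can be sent with the standard library. This is a minimal sketch, assuming the server exposes the OpenAI-compatible chat/completions route under the base URL used above; the helper names and the max_tokens default are illustrative.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 128) -> urllib.request.Request:
    """Build a POST request for an OpenAI-style chat completion."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout_s: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example usage (requires a running server):
#   print(chat("http://localhost:8000/v1",
#              "meta-llama/Llama-3.1-8B-Instruct",
#              "Hello, how are you?"))
```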

Speculative Decoding (EAGLE3)

An EAGLE3 draft head is available at RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3. To enable it, pass --speculative-config to vllm serve:

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-config '{"model":"RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3","method":"eagle3","num_speculative_tokens":3}'

EAGLE3 verification requires vLLM >= 0.9.0. Setting num_speculative_tokens to 3 is a reasonable starting point for chat workloads.
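
The --speculative-config value is a JSON string, which is easy to misquote when edited by hand. The sketch below builds it programmatically from the three keys shown above and shell-quotes the result. The helper names are hypothetical; only the keys and values come from this guide.

```python
import json
import shlex


def speculative_config_arg(draft_model: str, method: str = "eagle3",
                           num_speculative_tokens: int = 3) -> str:
    """Serialize the speculative-decoding settings as compact JSON."""
    config = {
        "model": draft_model,
        "method": method,
        "num_speculative_tokens": num_speculative_tokens,
    }
    return json.dumps(config, separators=(",", ":"))


def serve_command(target_model: str, draft_model: str,
                  num_speculative_tokens: int = 3) -> str:
    """Return a shell-safe vllm serve command line with speculative decoding."""
    args = [
        "vllm", "serve", target_model,
        "--speculative-config",
        speculative_config_arg(draft_model,
                               num_speculative_tokens=num_speculative_tokens),
    ]
    # shlex.join quotes the JSON argument so the shell passes it through intact.
    return shlex.join(args)


# Example usage:
#   print(serve_command("meta-llama/Llama-3.1-8B-Instruct",
#                       "RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3"))
```

Generating the argument this way makes it straightforward to try other num_speculative_tokens values without re-editing the quoted JSON.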

Troubleshooting

OOM on small GPUs: Lower --max-model-len to shrink the KV cache a full-length sequence needs, or lower --gpu-memory-utilization if other processes share the GPU.

EAGLE3 draft head not loading: Upgrade vLLM to >= 0.9.0; earlier releases don't support the EAGLE3 verification path.
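
To see why lowering --max-model-len helps with the OOM case above, a back-of-the-envelope KV cache estimate is sketched below. The architecture values (32 layers, 8 KV heads, head dimension 128) and the 2-byte cache dtype are assumptions taken from the model's published configuration; verify them against the model's config.json. The estimate covers a single sequence only and ignores weights, activations, and runtime overhead.

```python
def kv_cache_bytes_per_token(num_layers: int = 32, num_kv_heads: int = 8,
                             head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers.

    Defaults are assumed values for Llama 3.1 8B with a 16-bit cache.
    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib(num_tokens: int, **arch) -> float:
    """KV cache size in GiB for one sequence of num_tokens tokens."""
    return num_tokens * kv_cache_bytes_per_token(**arch) / 2**30


# With the assumed defaults, each token takes 131,072 bytes (128 KiB):
#   kv_cache_gib(2_048)   is 0.25 GiB
#   kv_cache_gib(8_192)   is 1.0 GiB
#   kv_cache_gib(131_072) is 16.0 GiB
```

Under these assumptions, a single full 131,072-token sequence needs about as much memory for its KV cache as the 16-bit weights themselves, which is why a smaller --max-model-len is usually the first thing to try on a small GPU.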