# Deploying LLM Inference

Targon supports GPU-backed deployments for large language models. This guide walks through the `vllm_example.py` script provided in the SDK and highlights best practices for serving inference workloads.
## Example Overview

```python
# targon-sdk/examples/llm/vllm_example.py
import targon

# CUDA base image with Python 3.12, vLLM, PyTorch, and Hugging Face tooling.
vllm_image = (
    targon.Image.from_registry(
        "nvidia/cuda:12.8.0-devel-ubuntu22.04",
        add_python="3.12",
    )
    .pip_install(
        "vllm==0.10.2",
        "torch==2.8.0",
        "huggingface_hub==0.35.0",
        "fastapi[standard]",
        "hf_transfer",
    )
    # Enable accelerated downloads from the Hugging Face Hub.
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
)

app = targon.App("vllm-inference", image=vllm_image)


@app.function(resource="h200-small", max_replicas=1)
@targon.web_server(port=8080)
def serve():
    import subprocess

    MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

    # Build the `vllm serve` command that exposes an OpenAI-compatible API.
    cmd = [
        "vllm",
        "serve",
        "--uvicorn-log-level=info",
        MODEL_NAME,
        "--served-model-name",
        MODEL_NAME,
        "--host",
        "0.0.0.0",
        "--port",
        "8080",
    ]
    cmd += ["--enforce-eager"]
    cmd += ["--tensor-parallel-size", "1"]

    # Launch the vLLM server as a background process on port 8080.
    subprocess.Popen(" ".join(cmd), shell=True)
```
Key elements:

- Starts from an NVIDIA CUDA base image and adds Python.
- Installs `vllm`, PyTorch, and Hugging Face tooling.
- Sets an environment variable to speed up model downloads.
- Uses `@targon.web_server` to expose the service on port `8080`.
## Choose GPU Resources

The example selects `resource="h200-small"`, which maps to `Compute.H200_SMALL`. Pick a tier based on model size and throughput:

- `H200_SMALL`: small/medium models, single-GPU inference.
- `H200_MEDIUM`: larger models or higher concurrency.
- `H200_LARGE` / `H200_XL`: multi-GPU workloads or fine-tuning.
See the Compute reference for the full list.
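If the model outgrows a single small GPU, the only change to the example is the `resource` argument. The sketch below assumes that `"h200-medium"` follows the same naming pattern as `"h200-small"`; confirm the exact identifier against the Compute reference.

```python
import targon

# Assumes `vllm_image` is built exactly as in the example above.
app = targon.App("vllm-inference", image=vllm_image)

# "h200-medium" is assumed to follow the "h200-small" naming pattern
# (Compute.H200_MEDIUM); check the Compute reference for the exact string.
@app.function(resource="h200-medium", max_replicas=1)
@targon.web_server(port=8080)
def serve():
    ...  # launch `vllm serve` exactly as in the example above
```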
## Deploy the Service

1. Configure credentials (`targon setup`).
2. Deploy the script:

   ```bash
   targon deploy targon-sdk/examples/llm/vllm_example.py --name tinyllama-demo
   ```

3. Use `targon app list` and `targon app functions <APP_ID>` to inspect the deployment, then smoke-test the endpoint as sketched below.
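vLLM's `serve` command exposes the standard OpenAI-compatible HTTP API, so once the deployment reports its public URL you can smoke-test it with a short client script. The sketch below uses a placeholder endpoint URL and assumes the endpoint does not require auth; substitute the URL Targon reports for your deployment.

```python
# Minimal smoke test against the deployed vLLM server's OpenAI-compatible API.
# ENDPOINT is a placeholder -- replace it with the URL of your deployment.
import requests

ENDPOINT = "https://<your-deployment-url>"
MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

resp = requests.post(
    f"{ENDPOINT}/v1/chat/completions",
    json={
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```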
## Model Configuration

- Update `MODEL_NAME` and revision to target different Hugging Face models.
- Adjust tensor parallelism (`--tensor-parallel-size`) for larger GPUs.
- Add extra flags (KV cache, max tokens, etc.) to the command list; see the sketch after this list.
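As a sketch of those adjustments, the command list for a hypothetical larger model might look like the following. The model name and values are illustrative; `--revision`, `--tensor-parallel-size`, `--max-model-len`, and `--gpu-memory-utilization` are standard `vllm serve` flags.

```python
# Illustrative command list for a larger (hypothetical) model choice.
MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.3"  # placeholder model

cmd = [
    "vllm", "serve", MODEL_NAME,
    "--served-model-name", MODEL_NAME,
    "--revision", "main",                 # pin a specific Hugging Face revision
    "--host", "0.0.0.0",
    "--port", "8080",
    "--tensor-parallel-size", "2",        # shard across two GPUs
    "--max-model-len", "8192",            # cap context length / KV-cache size
    "--gpu-memory-utilization", "0.90",   # leave a little GPU memory headroom
]
```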
## Production Tips

- Increase `startup_timeout` on `@targon.web_server` when loading multi-gigabyte models.
- Pin package versions to avoid breaking changes.
- Push pre-built weights to a private container registry to reduce cold starts.
- Enable `requires_auth=True` if exposing the endpoint publicly.
- Combine with auto-scaling by raising `max_replicas` once you validate load patterns (a combined example follows below).
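Putting several of these tips together, the decorators from the example might end up looking like the sketch below. The parameter values are illustrative, and the exact placement of `startup_timeout` and `requires_auth` (shown here on `@targon.web_server`) should be confirmed against the Web Endpoints guide.

```python
# Illustrative production-leaning configuration; assumes `app` and the
# model-serving body from vllm_example.py above. Values are placeholders.
@app.function(resource="h200-small", max_replicas=3)  # scale out after load testing
@targon.web_server(
    port=8080,
    startup_timeout=600,   # allow time to download multi-gigabyte weights
    requires_auth=True,    # require auth on a publicly exposed endpoint
)
def serve():
    ...  # launch `vllm serve` as in the example above
```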
For a broader look at Targon's web options, read the Web Endpoints guide. When you're ready to optimize hardware usage, continue with the Compute Resources guide.