Deploying LLM Inference

Targon supports GPU-backed deployments for large language models. This guide walks through the vllm_example.py provided in the SDK and highlights best practices for serving inference workloads.

Example Overview

# targon-sdk/examples/llm/vllm_example.py
import targon

# Build the container image: CUDA base, Python 3.12, pinned inference stack.
vllm_image = (
    targon.Image.from_registry(
        "nvidia/cuda:12.8.0-devel-ubuntu22.04",
        add_python="3.12"
    )
    .pip_install(
        "vllm==0.10.2",
        "torch==2.8.0",
        "huggingface_hub==0.35.0",
        "fastapi[standard]",
        "hf_transfer"
    )
    # hf_transfer speeds up model downloads from the Hugging Face Hub.
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
)

app = targon.App("vllm-inference", image=vllm_image)

@app.function(resource="h200-small", max_replicas=1)
@targon.web_server(port=8080)
def serve():
    import subprocess

    MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

    # Assemble the vLLM server command; the model ID is a positional argument.
    cmd = [
        "vllm",
        "serve",
        "--uvicorn-log-level=info",
        MODEL_NAME,
        "--served-model-name",
        MODEL_NAME,
        "--host",
        "0.0.0.0",
        "--port",
        "8080",
    ]

    # Skip CUDA graph capture for faster startup; single-GPU inference.
    cmd += ["--enforce-eager"]
    cmd += ["--tensor-parallel-size", "1"]

    # Launch vLLM in the background; the web_server decorator routes
    # traffic to port 8080.
    subprocess.Popen(" ".join(cmd), shell=True)

Key elements:

  • Starts from an NVIDIA CUDA base image and adds Python.
  • Installs vllm, PyTorch, and Hugging Face tooling.
  • Sets an environment variable to speed up model downloads.
  • Uses @targon.web_server to expose the service on port 8080.
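
The same builder chain extends if you need extra dependencies or environment variables. A minimal sketch of that pattern; the added package and variable are illustrative placeholders, not SDK requirements:

# Sketch: extending the image chain from the example above.
import targon

vllm_image = (
    targon.Image.from_registry(
        "nvidia/cuda:12.8.0-devel-ubuntu22.04",
        add_python="3.12"
    )
    .pip_install(
        "vllm==0.10.2",
        "torch==2.8.0",
        "huggingface_hub==0.35.0",
        "fastapi[standard]",
        "hf_transfer",
        "prometheus-client",            # placeholder: any extra dependency
    )
    .env({
        "HF_HUB_ENABLE_HF_TRANSFER": "1",
        "VLLM_LOGGING_LEVEL": "INFO",   # placeholder: any extra env var
    })
)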

Choose GPU Resources

The example selects resource="h200-small", which maps to Compute.H200_SMALL. Pick a tier based on model size and throughput requirements:

  • H200_SMALL: Small/medium models, single-GPU inference.
  • H200_MEDIUM: Larger models or higher concurrency.
  • H200_LARGE / H200_XL: Multi-GPU workloads or fine-tuning.

See the Compute reference for the full list.
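
Switching tiers only requires changing the resource argument on the function decorator. Continuing the example above, a minimal sketch; "h200-medium" is an assumed identifier that follows the "h200-small" pattern, so confirm the exact string in the Compute reference:

# Continuing the example above: same service, larger GPU tier.
# "h200-medium" is an assumed identifier following the "h200-small"
# naming pattern; confirm it in the Compute reference.
@app.function(resource="h200-medium", max_replicas=1)
@targon.web_server(port=8080)
def serve():
    ...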

Deploy the Service

  1. Configure credentials (targon setup).

  2. Deploy the script:

    targon deploy targon-sdk/examples/llm/vllm_example.py --name tinyllama-demo

  3. Use targon app list and targon app functions <APP_ID> to inspect the deployment.
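
Once the function is running, vLLM exposes an OpenAI-compatible HTTP API, so any OpenAI-style client can call it. A minimal sketch using requests; the endpoint URL is a placeholder for whatever URL Targon reports for your deployment:

# Sketch: query the deployed vLLM server via its OpenAI-compatible API.
import requests

# Placeholder: substitute the URL Targon assigns to your web server.
ENDPOINT = "https://<your-deployment-url>"

resp = requests.post(
    f"{ENDPOINT}/v1/chat/completions",
    json={
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # matches --served-model-name
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])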

Model Configuration

  • Update MODEL_NAME (and pin a revision, if needed) to target a different Hugging Face model.
  • Adjust tensor parallelism (--tensor-parallel-size) for larger GPUs.
  • Add extra flags (KV cache, max tokens, etc.) to the command list, as in the sketch below.
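
As a sketch, serving a larger model across two GPUs with a capped context length might look like the following. The model name and flag values are illustrative (a two-way tensor-parallel setup also needs a multi-GPU resource tier); see the vLLM documentation for the full flag list:

# Sketch: command list for a larger model with illustrative vLLM flags.
MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"  # illustrative choice

cmd = [
    "vllm", "serve",
    "--uvicorn-log-level=info",
    MODEL_NAME,
    "--served-model-name", MODEL_NAME,
    "--host", "0.0.0.0",
    "--port", "8080",
    "--tensor-parallel-size", "2",       # split the model across two GPUs
    "--max-model-len", "8192",           # cap the context length / KV cache
    "--gpu-memory-utilization", "0.90",  # fraction of GPU memory vLLM may use
]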

Production Tips

  • Increase startup_timeout on @targon.web_server when loading multi-gigabyte models.
  • Pin package versions to avoid breaking changes.
  • Bake pre-downloaded weights into the image and push it to a private container registry to reduce cold starts.
  • Enable requires_auth=True if exposing the endpoint publicly.
  • Combine with auto-scaling by raising max_replicas once you validate load patterns (see the sketch below).
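
Putting several of these tips together, the decorators might look like the sketch below. startup_timeout and requires_auth are the options named above (requires_auth is assumed to live on @targon.web_server), and the specific values are placeholders to tune for your model and traffic:

# Sketch: production-leaning decorator settings, continuing the example.
# Values are placeholders: tune the timeout to your model's load time and
# the replica count to observed traffic. requires_auth placement on the
# web_server decorator is an assumption.
@app.function(resource="h200-small", max_replicas=3)
@targon.web_server(port=8080, startup_timeout=600, requires_auth=True)
def serve():
    ...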

For a broader look at Targon's web options, read the Web Endpoints guide. When you're ready to optimize hardware usage, continue with the Compute Resources guide.