Deploying LLM Inference

Targon supports GPU-backed deployments for large language models. This guide walks through the vllm_example.py provided in the SDK and highlights best practices for serving inference workloads.

Example Overview

# targon-sdk/examples/llm/vllm_example.py
import targon

# Build the container image: CUDA base, Python 3.12, pinned inference stack.
vllm_image = (
    targon.Image.from_registry(
        "nvidia/cuda:12.8.0-devel-ubuntu22.04",
        add_python="3.12"
    )
    .pip_install(
        "vllm==0.10.2",
        "torch==2.8.0",
        "huggingface_hub==0.35.0",
        "fastapi[standard]",
        "hf_transfer"
    )
    # hf_transfer speeds up model downloads from the Hugging Face Hub.
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
)

app = targon.App("vllm-inference", image=vllm_image)

@app.function(resource="h200-small", max_replicas=1)
@targon.web_server(port=8080)
def serve():
    import subprocess

    MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

    # Assemble the vLLM server command; the model ID is a positional argument.
    cmd = [
        "vllm",
        "serve",
        "--uvicorn-log-level=info",
        MODEL_NAME,
        "--served-model-name",
        MODEL_NAME,
        "--host",
        "0.0.0.0",
        "--port",
        "8080",
    ]

    # Skip CUDA graph capture for faster startup; single-GPU inference.
    cmd += ["--enforce-eager"]
    cmd += ["--tensor-parallel-size", "1"]

    # Launch vLLM in the background; the web_server decorator routes
    # traffic to port 8080.
    subprocess.Popen(" ".join(cmd), shell=True)

Key elements:

  • Starts from an NVIDIA CUDA base image and adds Python.
  • Installs vllm, PyTorch, and Hugging Face tooling.
  • Sets an environment variable to speed up model downloads.
  • Uses @targon.web_server to expose the service on port 8080.
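
The same builder chain extends if you need extra dependencies or environment variables. A minimal sketch of that pattern; the added package and variable are illustrative placeholders, not SDK requirements:

# Sketch: extending the image chain from the example above.
import targon

vllm_image = (
    targon.Image.from_registry(
        "nvidia/cuda:12.8.0-devel-ubuntu22.04",
        add_python="3.12"
    )
    .pip_install(
        "vllm==0.10.2",
        "torch==2.8.0",
        "huggingface_hub==0.35.0",
        "fastapi[standard]",
        "hf_transfer",
        "prometheus-client",            # placeholder: any extra dependency
    )
    .env({
        "HF_HUB_ENABLE_HF_TRANSFER": "1",
        "VLLM_LOGGING_LEVEL": "INFO",   # placeholder: any extra env var
    })
)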

Choose GPU Resources

The example selects resource="h200-small", which maps to Compute.H200_SMALL. Pick a tier based on model size and throughput requirements:

  • H200_SMALL: Small/medium models, single-GPU inference.
  • H200_MEDIUM: Larger models or higher concurrency.
  • H200_LARGE / H200_XL: Multi-GPU workloads or fine-tuning.

See the Compute reference for the full list.
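
Switching tiers only requires changing the resource argument on the function decorator. Continuing the example above, a minimal sketch; "h200-medium" is an assumed identifier that follows the "h200-small" pattern, so confirm the exact string in the Compute reference:

# Continuing the example above: same service, larger GPU tier.
# "h200-medium" is an assumed identifier following the "h200-small"
# naming pattern; confirm it in the Compute reference.
@app.function(resource="h200-medium", max_replicas=1)
@targon.web_server(port=8080)
def serve():
    ...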

Deploy the Service

  1. Configure credentials (targon setup).

  2. Deploy the script:

    targon deploy targon-sdk/examples/llm/vllm_example.py --name tinyllama-demo

  3. Use targon app list and targon app functions <APP_ID> to inspect the deployment.
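
Once the function is running, vLLM exposes an OpenAI-compatible HTTP API, so any OpenAI-style client can call it. A minimal sketch using requests; the endpoint URL is a placeholder for whatever URL Targon reports for your deployment:

# Sketch: query the deployed vLLM server via its OpenAI-compatible API.
import requests

# Placeholder: substitute the URL Targon assigns to your web server.
ENDPOINT = "https://<your-deployment-url>"

resp = requests.post(
    f"{ENDPOINT}/v1/chat/completions",
    json={
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # matches --served-model-name
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])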

Model Configuration

  • Update MODEL_NAME (and pin a revision, if needed) to target a different Hugging Face model.
  • Adjust tensor parallelism (--tensor-parallel-size) for larger GPUs.
  • Add extra flags (KV cache, max tokens, etc.) to the command list, as in the sketch below.
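
As a sketch, serving a larger model across two GPUs with a capped context length might look like the following. The model name and flag values are illustrative (a two-way tensor-parallel setup also needs a multi-GPU resource tier); see the vLLM documentation for the full flag list:

# Sketch: command list for a larger model with illustrative vLLM flags.
MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"  # illustrative choice

cmd = [
    "vllm", "serve",
    "--uvicorn-log-level=info",
    MODEL_NAME,
    "--served-model-name", MODEL_NAME,
    "--host", "0.0.0.0",
    "--port", "8080",
    "--tensor-parallel-size", "2",       # split the model across two GPUs
    "--max-model-len", "8192",           # cap the context length / KV cache
    "--gpu-memory-utilization", "0.90",  # fraction of GPU memory vLLM may use
]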

Production Tips

  • Increase startup_timeout on @targon.web_server when loading multi-gigabyte models.
  • Pin package versions to avoid breaking changes.
  • Bake pre-downloaded weights into the image and push it to a private container registry to reduce cold starts.
  • Enable requires_auth=True if exposing the endpoint publicly.
  • Combine with auto-scaling by raising max_replicas once you validate load patterns (see the sketch below).
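
Putting several of these tips together, the decorators might look like the sketch below. startup_timeout and requires_auth are the options named above (requires_auth is assumed to live on @targon.web_server), and the specific values are placeholders to tune for your model and traffic:

# Sketch: production-leaning decorator settings, continuing the example.
# Values are placeholders: tune the timeout to your model's load time and
# the replica count to observed traffic. requires_auth placement on the
# web_server decorator is an assumption.
@app.function(resource="h200-small", max_replicas=3)
@targon.web_server(port=8080, startup_timeout=600, requires_auth=True)
def serve():
    ...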

For a broader look at Targon's web options, read the Web Endpoints guide. When you're ready to optimize hardware usage, continue with the Compute Resources guide.