Deploying DeepSeek-32B with Persistent Volumes
Large language models like DeepSeek-R1-Distill-Qwen-32B require significant computational resources and storage. When deploying these models for inference, you need a way to persist model weights across container restarts and efficiently manage data that can be tens of gigabytes in size.
In this guide, we'll walk through deploying the DeepSeek-32B model on Targon Rentals using SGLang, with model weights stored on persistent volumes. Targon containers are not stateful by default—any data written to the container's filesystem will be lost when the container is recreated. By using volumes, we ensure that your model weights persist independently of the container lifecycle, eliminating the need to re-download them each time and significantly reducing startup time.
Why Rentals for LLM Inference?
Targon Rentals provide the ideal environment for hosting LLM inference models because they offer:
- Full Control: Root access and SSH connectivity for debugging and customization
- GPU Access: Dedicated GPU resources for accelerated inference
- Volume Support: Persistent storage that survives container lifecycle—this is essential since containers are not stateful
- Long-Running Processes: Stable environment for continuous model serving
Unlike serverless deployments that spin down after inactivity, Rentals keep your model loaded and ready to serve requests. The containers themselves, however, are not stateful: any data written to the container filesystem is lost when the container is recreated. This is why volumes are critical—they provide the persistence layer that allows your model weights and other important data to survive across container restarts, making the combination of Rentals and Volumes well suited to production inference workloads.
Prerequisites
Before you begin, make sure you have:
- A Targon account with access to the Dashboard
- An SSH key pair for connecting to your rental
- Basic familiarity with Docker and container concepts
- Understanding of LLM inference concepts (helpful but not required)
Important Note: Targon containers are not stateful. Any data written to the container's filesystem will be lost when the container is recreated. This guide shows you how to use volumes to persist your model weights and other important data.
Step 1: Create a Volume for Model Weights
The first step is to create a persistent volume to store your model weights. This ensures that once the model is downloaded, it will persist even if the container is recreated.
Creating the Volume
- Navigate to the Targon Dashboard
- Go to the Volumes section
- Click Create Volume
- Configure your volume:
  - Name: Give it a descriptive name like deepseek-32b-weights
  - Size: For a 32B model, we recommend at least 100GB to accommodate the model weights and any additional cache files
- Click Create Volume to provision your persistent storage
Note: The volume size can be resized later if needed, but it's better to allocate sufficient space upfront to avoid interruptions.
Why Volumes Matter
Model weights for large language models can be 60GB or more. Since Targon containers are not stateful, without a persistent volume:
- Every container restart or recreation would require re-downloading the entire model
- Any data written to the container filesystem would be lost
- Download times can take hours depending on network speed
- Startup is delayed until the full download completes, stretching restarts from minutes to hours
By using a volume, the model weights are stored independently of the container lifecycle. This means even if the container is deleted and recreated, your model weights remain safe on the volume, allowing for fast restarts and significant cost savings.
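As a rough sanity check on those numbers (assuming the published weights are stored in bf16, i.e. 2 bytes per parameter), you can estimate the download size with a one-liner:
# Back-of-envelope estimate: 32 billion parameters x 2 bytes (bf16) ~= 64 GB of weights
python3 -c "print(f'{32e9 * 2 / 1e9:.0f} GB')"
This is why the 100GB volume recommendation above leaves headroom for tokenizer files, cache metadata, and anything else you store under the mount point.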
Step 2: Create Your Rental
Now let's create a GPU rental configured to run the DeepSeek-32B model with SGLang.
Navigate to Rentals
- In the Dashboard, go to the Rentals section
- Click Create Rental
Select Hardware Configuration
- Choose GPU as your hardware type
- Select an appropriate GPU configuration. For the DeepSeek-32B model, we recommend:
- NVIDIA H200 (Small) configurations for optimal performance
- The specific tier depends on your throughput requirements
Choose Configuration Type
Select Targon Ubuntu. This is a professional, feature-rich Ubuntu 24.04 LTS development environment designed to provide a VM-like experience within a Docker container. This image comes pre-configured with essential Linux utilities, modern CLI tools, Neovim, and a beautifully customized terminal interface.
With Targon Ubuntu, you don't need to configure Docker images, commands, or arguments—you'll set up SGLang manually after connecting via SSH.
Configure Service Port
In the Service Ports section:
- Port: 30000. This is the port where your SGLang instance will run and be exposed for external connections
- You will receive a domain name to access this port on your rental
Mount Your Volume
This is the critical step for persistence. Since containers are not stateful, you must mount a volume to preserve your model weights:
- In the Mount Storage section, select your previously created volume
- Set the Mount Point to: /root
By mounting the volume to /root, the entire root directory (including /root/.cache/huggingface where Hugging Face stores downloaded models) will be stored on the persistent volume. This approach gives you flexibility to store other data in the /root directory as well, all of which will persist across container recreations.
When the SGLang container runs, it will automatically use /root/.cache/huggingface for model storage, which will now be on your persistent volume rather than in the container's ephemeral filesystem. Mounting to /root instead of a more specific path like /root/.cache/huggingface provides more flexibility—you can store additional files, configurations, or data in the /root directory, and everything will persist.
Without this volume mount, any downloaded model weights would be lost when the container is recreated.
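If you want to be explicit about where Hugging Face caches models, you can set the HF_HOME environment variable inside the container or shell; by default it resolves to ~/.cache/huggingface, which for the root user is /root/.cache/huggingface. A minimal example:
# Pin the Hugging Face cache location explicitly (this is already the default for root)
export HF_HOME=/root/.cache/huggingface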
Advanced Options
In the advanced options:
- Environment Variables: Not required for this setup
- Default Shell: Choose your preferred shell (e.g., /bin/bash or /bin/sh) for when you SSH into the rental
Add SSH Key
- Add your SSH public key to enable secure access
- You can use an existing key or add a new one
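If you don't already have a key pair, you can generate one locally with standard OpenSSH tooling (the filename and comment below are just examples):
# Generate a new ed25519 key pair; the .pub file is what you paste into the Dashboard
ssh-keygen -t ed25519 -C "targon-rental" -f ~/.ssh/targon_rental
cat ~/.ssh/targon_rental.pub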
Name Your Rental
Give your rental a descriptive name like deepseek-32b-inference to easily identify it later.
Deploy
Click Create Rental to deploy your configuration. The rental will be available in a few seconds.
Step 3: Connect and Verify
Once your rental is running, let's connect to it and verify everything is working correctly.
SSH into Your Rental
- Go to your rental's detail page in the Dashboard
- Copy the SSH command (it will look like: ssh rentals-abc123@ssh.deployments.targon.com)
- Open your terminal and run the command:
ssh rentals-abc123@ssh.deployments.targon.com
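Optionally, add an alias to your local ~/.ssh/config so you can reconnect with a short command (the user, host, and key path below are the placeholder values from the examples above; substitute your own):
# ~/.ssh/config on your local machine
Host targon-deepseek
    HostName ssh.deployments.targon.com
    User rentals-abc123
    IdentityFile ~/.ssh/targon_rental
After that, ssh targon-deepseek drops you straight into the rental.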
Install and Run SGLang
Since we're using Targon Ubuntu, you'll need to run SGLang using Docker. First, check which GPU devices are available:
nvidia-smi
This will show you the available GPUs and their device numbers. Then run the following command to start the SGLang server (using device 0 as shown; adjust if needed):
docker run -d \
--runtime nvidia \
-e NVIDIA_VISIBLE_DEVICES=0 \
--shm-size 32g \
-p 30000:30000 \
-v /root/.cache/huggingface:/root/.cache/huggingface \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
--host 0.0.0.0 \
--port 30000 \
--trust-remote-code
This command:
- Runs SGLang in a Docker container with NVIDIA GPU support
- Mounts your persistent volume at /root/.cache/huggingface (which is already mounted under /root in the rental)
- Exposes the server on port 30000
- Downloads and loads the DeepSeek-32B model on first run
Note: The volume mount path in the Docker command (/root/.cache/huggingface) maps to the same location where Hugging Face will store the model, which is within your mounted volume at /root.
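Because the container runs detached (-d), you can follow the server's startup and model-loading progress through the container logs:
# Find the running SGLang container and follow its logs
docker ps
docker logs -f $(docker ps -q --filter ancestor=lmsysorg/sglang:latest)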
Understanding the Docker Command
Let's break down each parameter of the docker run command above:
Parameter Breakdown:
| Docker Flag | Purpose |
|---|---|
| --runtime nvidia | Enables the NVIDIA GPU runtime for model inference |
| -e NVIDIA_VISIBLE_DEVICES=0 | Specifies which GPU device to use (device 0). You can find available GPU devices by running nvidia-smi |
| --shm-size 32g | Allocates 32GB of shared memory for model loading |
| -p 30000:30000 | Exposes port 30000 for external access |
| -v /root/.cache/huggingface:/root/.cache/huggingface | Mounts the volume path at /root/.cache/huggingface, where Hugging Face stores models |
| --ipc=host | Enables host IPC for inter-process communication |
| lmsysorg/sglang:latest | Base container image with SGLang |
| python3 -m sglang.launch_server ... | Startup command and parameters for the SGLang server |
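If your rental has more than one GPU, you can shard the model across them with SGLang's tensor parallelism. Below is a sketch of the same command adapted for two GPUs; the flag is --tp in recent SGLang releases, but check python3 -m sglang.launch_server --help for your version:
# Same command as above, but exposing two GPUs and sharding the model across them
docker run -d \
--runtime nvidia \
-e NVIDIA_VISIBLE_DEVICES=0,1 \
--shm-size 32g \
-p 30000:30000 \
-v /root/.cache/huggingface:/root/.cache/huggingface \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
--host 0.0.0.0 \
--port 30000 \
--tp 2 \
--trust-remote-code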
Verify Model Download
On first startup, the model will be downloaded from Hugging Face. You can monitor the download progress:
# Check if the model is being downloaded
ls -lh /root/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-R1-Distill-Qwen-32B/
# Monitor disk usage of the volume
df -h /root/.cache/huggingface
The initial download may take 30-60 minutes depending on network speed, as the model weights are approximately 60GB.
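For a rough progress indicator while the download runs, you can print the cache size periodically (du reports the space used so far; interrupt with Ctrl+C when done):
# Print the cache size once a minute during the download
while true; do du -sh /root/.cache/huggingface 2>/dev/null; sleep 60; done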
Verify Volume Mount
Confirm that your volume is properly mounted:
# Check mount points - the volume should be mounted at /root
mount | grep /root
# Verify the mount point exists and is writable
touch /root/test && rm /root/test && echo "Volume is writable"
# Verify the Hugging Face cache directory exists (will be created on first model download)
ls -la /root/.cache/huggingface
Test the Inference Endpoint
Once the model is loaded (you'll see a message indicating the server is ready), you can test the inference endpoint. You can access your application using the domain name provided for port 30000 on your rental.
From your local machine (using the domain name):
curl -X POST http://your-rental-domain/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
"messages": [{"role": "user", "content": "Hello, how are you?"}]
}'
From the rental terminal (using localhost):
curl -X POST http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
"messages": [{"role": "user", "content": "Hello, how are you?"}]
}'
Replace your-rental-domain with the actual domain name provided for your rental's port 30000. You can find this domain in your rental's detail page in the Targon dashboard.
Note: The exact API endpoint format may vary depending on SGLang version. Consult the SGLang documentation for the correct API format.
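A quick way to confirm the server is up and serving the expected model is to query the OpenAI-compatible model listing (recent SGLang versions expose /v1/models; if yours differs, check its docs):
# From the rental terminal: list the models the server is currently serving
curl http://localhost:30000/v1/models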
Step 4: Benefits of Volume Persistence
Now that your model is deployed with persistent volumes, let's explore the benefits:
Fast Restarts
After the initial model download, subsequent container restarts or recreations are much faster because:
- Model weights are already stored on the volume (not in the container filesystem)
- No need to re-download 60GB+ of data
- Container can start serving requests within minutes instead of hours
- The volume persists independently, so even if you delete and recreate the rental, your weights remain
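A minimal sketch of what a restart looks like in practice, assuming the container was started with the docker run command from Step 3 (the container ID placeholder is illustrative):
# Stop and start the existing container; on start, SGLang loads weights
# straight from /root/.cache/huggingface instead of re-downloading them
docker stop <container-id>
docker start <container-id>
# Recreating the container (docker rm, then the same docker run with the same -v mount)
# also reuses the cached weights on the volume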
Data Safety
Your model weights are safe because:
- Volumes exist independently of containers
- Data persists even if a container crashes
Easy Updates
When you need to update your deployment:
- Keep the same volume with model weights
- Create a new rental with updated configuration
- Mount the existing volume to the new rental
- No need to re-download the model
When to Use Rentals vs Serverless
Rentals are ideal for testing and benchmarking multiple inference engines. They provide a dedicated, persistent environment where you can:
- Experiment with different inference frameworks (SGLang, vLLM, TensorRT-LLM, etc.)
- Compare performance across different configurations
- Fine-tune parameters and optimize your setup
- Test model compatibility before production deployment
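For example, a rough first-pass latency check of whichever engine you're serving can be done with a simple timed loop from the rental terminal (curl's time_total is end-to-end request time; this is a sketch, not a rigorous benchmark):
# Time five sequential chat completions against the local server
for i in 1 2 3 4 5; do
  curl -s -o /dev/null -w "request $i: %{time_total}s\n" \
    -X POST http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B", "messages": [{"role": "user", "content": "Hello"}]}'
done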
Once you've identified the optimal inference engine and configuration, you can efficiently deploy it to production using Targon's serverless service, which provides:
- Automatic scaling based on demand
- Cost optimization through pay-per-use pricing
- Built-in load balancing and high availability
- Simplified deployment and management
Alternatively, if you're looking for a ready-to-use inference service without managing infrastructure, you can use Sybil.com, which is powered by Targon and provides managed LLM inference capabilities.
Conclusion
Deploying large language models on Targon Rentals with persistent volumes provides a robust, cost-effective solution for production inference workloads. By following this guide, you've learned how to:
- Create persistent volumes for model weight storage
- Configure GPU rentals with the Targon Ubuntu environment and a mounted volume
- Set up SGLang for efficient model serving
- Ensure data persistence across container restarts
- Optimize your deployment for performance and cost
The combination of Rentals' dedicated, long-running environment and Volumes' independent persistent storage creates an ideal platform for hosting LLM inference services. While containers themselves are not stateful, volumes provide the persistence layer needed to maintain model weights and other critical data across container lifecycles, enabling fast, reliable responses.
Next Steps
- Explore other Rental use cases for testing and benchmarking inference engines
- Learn more about Volume management
- Deploy to production with Serverless for autoscaling and cost optimization
- Use Sybil.com for managed LLM inference powered by Targon
- Review Compute resources to optimize your GPU selection
Happy deploying!