Deploying DeepSeek-32B with Persistent Volumes
Large language models like DeepSeek-R1-Distill-Qwen-32B require significant computational resources and storage. When deploying these models for inference, you need a way to persist model weights across container restarts and efficiently manage data that can be tens of gigabytes in size.
In this guide, we'll walk through deploying the DeepSeek-32B model on Targon Rentals using SGLang, with model weights stored on persistent volumes. Targon containers are not stateful by default—any data written to the container's filesystem will be lost when the container is recreated. By using volumes, we ensure that your model weights persist independently of the container lifecycle, eliminating the need to re-download them each time and significantly reducing startup time.
Why Rentals for LLM Inference?
Targon Rentals provide the ideal environment for hosting LLM inference models because they offer:
- Full Control: Root access and SSH connectivity for debugging and customization
- GPU Access: Dedicated GPU resources for accelerated inference
- Volume Support: Persistent storage that survives container lifecycle—this is essential since containers are not stateful
- Long-Running Processes: Stable environment for continuous model serving
Unlike serverless deployments that spin down after inactivity, Rentals keep your model loaded and ready to serve requests. The containers themselves, however, are not stateful: any data written to the container filesystem is lost when the container is recreated. This is why volumes are critical—they provide the persistence layer that allows your model weights and other important data to survive across container restarts, making the combination of Rentals and Volumes well suited to production inference workloads.
Prerequisites
Before you begin, make sure you have:
- A Targon account with access to the Dashboard
- An SSH key pair for connecting to your rental
- Basic familiarity with Docker and container concepts
- Understanding of LLM inference concepts (helpful but not required)
Important Note: Targon containers are not stateful. Any data written to the container's filesystem will be lost when the container is recreated. This guide shows you how to use volumes to persist your model weights and other important data.
Step 1: Create a Volume for Model Weights
The first step is to create a persistent volume to store your model weights. This ensures that once the model is downloaded, it will persist even if the container is recreated.
Creating the Volume
- Navigate to the Targon Dashboard
- Go to the Volumes section
- Click Create Volume
- Configure your volume:
  - Name: Give it a descriptive name like deepseek-32b-weights
  - Size: For a 32B model, we recommend at least 100GB to accommodate the model weights and any additional cache files
- Click Create Volume to provision your persistent storage
Note: The volume size can be resized later if needed, but it's better to allocate sufficient space upfront to avoid interruptions.
Why Volumes Matter
Model weights for large language models can be 60GB or more. Since Targon containers are not stateful, without a persistent volume:
- Every container restart or recreation would require re-downloading the entire model
- Any data written to the container filesystem would be lost
- Download times can take hours depending on network speed
- Startup is delayed until the full download completes, stretching restarts from minutes to hours
By using a volume, the model weights are stored independently of the container lifecycle. This means even if the container is deleted and recreated, your model weights remain safe on the volume, allowing for fast restarts and significant cost savings.
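As a rough sanity check on those numbers (assuming the published weights are stored in bf16, i.e. 2 bytes per parameter), you can estimate the download size with a one-liner:
# Back-of-envelope estimate: 32 billion parameters x 2 bytes (bf16) ~= 64 GB of weights
python3 -c "print(f'{32e9 * 2 / 1e9:.0f} GB')"
This is why the 100GB volume recommendation above leaves headroom for tokenizer files, cache metadata, and anything else you store under the mount point.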
Step 2: Create Your Rental
Now let's create a GPU rental configured to run the DeepSeek-32B model with SGLang.
Navigate to Rentals
- In the Dashboard, go to the Rentals section
- Click Create Rental
Select Hardware Configuration
- Choose GPU as your hardware type
- Select an appropriate GPU configuration. For the DeepSeek-32B model, we recommend:
- NVIDIA H200 (Small) configurations for optimal performance
- The specific tier depends on your throughput requirements
Choose Configuration Type
Select Targon Ubuntu. This is a professional, feature-rich Ubuntu 24.04 LTS development environment designed to provide a VM-like experience within a Docker container. This image comes pre-configured with essential Linux utilities, modern CLI tools, Neovim, and a beautifully customized terminal interface.
With Targon Ubuntu, you don't need to configure Docker images, commands, or arguments—you'll set up SGLang manually after connecting via SSH.
Configure Service Port
In the Service Ports section:
- Port: 30000. This is the port where your SGLang instance will run and be exposed for external connections
- You will receive a domain name to access this port on your rental
Mount Your Volume
This is the critical step for persistence. Since containers are not stateful, you must mount a volume to preserve your model weights:
- In the Mount Storage section, select your previously created volume
- Set the Mount Point to: /root
By mounting the volume to /root, the entire root directory (including /root/.cache/huggingface where Hugging Face stores downloaded models) will be stored on the persistent volume. This approach gives you flexibility to store other data in the /root directory as well, all of which will persist across container recreations.
When the SGLang container runs, it will automatically use /root/.cache/huggingface for model storage, which will now be on your persistent volume rather than in the container's ephemeral filesystem. Mounting to /root instead of a more specific path like /root/.cache/huggingface provides more flexibility—you can store additional files, configurations, or data in the /root directory, and everything will persist.
Without this volume mount, any downloaded model weights would be lost when the container is recreated.
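If you want to be explicit about where Hugging Face caches models, you can set the HF_HOME environment variable inside the container or shell; by default it resolves to ~/.cache/huggingface, which for the root user is /root/.cache/huggingface. A minimal example:
# Pin the Hugging Face cache location explicitly (this is already the default for root)
export HF_HOME=/root/.cache/huggingface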
Advanced Options
In the advanced options:
- Environment Variables: Not required for this setup
- Default Shell: Choose your preferred shell (e.g., /bin/bash or /bin/sh) for when you SSH into the rental
Add SSH Key
- Add your SSH public key to enable secure access
- You can use an existing key or add a new one
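If you don't already have a key pair, you can generate one locally with standard OpenSSH tooling (the filename and comment below are just examples):
# Generate a new ed25519 key pair; the .pub file is what you paste into the Dashboard
ssh-keygen -t ed25519 -C "targon-rental" -f ~/.ssh/targon_rental
cat ~/.ssh/targon_rental.pub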
Name Your Rental
Give your rental a descriptive name like deepseek-32b-inference to easily identify it later.
Deploy
Click Create Rental to deploy your configuration. The rental will be available in a few seconds.
Step 3: Connect and Verify
Once your rental is running, let's connect to it and verify everything is working correctly.
SSH into Your Rental
- Go to your rental's detail page in the Dashboard
- Copy the SSH command (it will look like: ssh rentals-abc123@ssh.deployments.targon.com)
- Open your terminal and run the command:
ssh rentals-abc123@ssh.deployments.targon.com
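Optionally, add an alias to your local ~/.ssh/config so you can reconnect with a short command (the user, host, and key path below are the placeholder values from the examples above; substitute your own):
# ~/.ssh/config on your local machine
Host targon-deepseek
    HostName ssh.deployments.targon.com
    User rentals-abc123
    IdentityFile ~/.ssh/targon_rental
After that, ssh targon-deepseek drops you straight into the rental.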
Install and Run SGLang
Since we're using Targon Ubuntu, you'll need to run SGLang using Docker. First, check which GPU devices are available:
nvidia-smi
This will show you the available GPUs and their device numbers. Then run the following command to start the SGLang server (using device 0 as shown; adjust if needed):
docker run -d \
--runtime nvidia \
-e NVIDIA_VISIBLE_DEVICES=0 \
--shm-size 32g \
-p 30000:30000 \
-v /root/.cache/huggingface:/root/.cache/huggingface \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
--host 0.0.0.0 \
--port 30000 \
--trust-remote-code
This command:
- Runs SGLang in a Docker container with NVIDIA GPU support
- Mounts your persistent volume at /root/.cache/huggingface (which is already mounted under /root in the rental)
- Exposes the server on port 30000
- Downloads and loads the DeepSeek-32B model on first run
Note: The volume mount path in the Docker command (/root/.cache/huggingface) maps to the same location where Hugging Face will store the model, which is within your mounted volume at /root.
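Because the container runs detached (-d), you can follow the server's startup and model-loading progress through the container logs:
# Find the running SGLang container and follow its logs
docker ps
docker logs -f $(docker ps -q --filter ancestor=lmsysorg/sglang:latest)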
Understanding the Docker Command
Let's break down each parameter of the docker run command above:
Parameter Breakdown:
| Docker Flag | Purpose |
|---|---|
| --runtime nvidia | Enables the NVIDIA GPU runtime for model inference |
| -e NVIDIA_VISIBLE_DEVICES=0 | Specifies which GPU device to use (device 0). You can find available GPU devices by running nvidia-smi |
| --shm-size 32g | Allocates 32GB of shared memory for model loading |
| -p 30000:30000 | Exposes port 30000 for external access |
| -v /root/.cache/huggingface:/root/.cache/huggingface | Mounts the volume path at /root/.cache/huggingface, where Hugging Face stores models |
| --ipc=host | Enables host IPC for inter-process communication |
| lmsysorg/sglang:latest | Base container image with SGLang |
| python3 -m sglang.launch_server ... | Startup command and parameters for the SGLang server |
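If your rental has more than one GPU, you can shard the model across them with SGLang's tensor parallelism. Below is a sketch of the same command adapted for two GPUs; the flag is --tp in recent SGLang releases, but check python3 -m sglang.launch_server --help for your version:
# Same command as above, but exposing two GPUs and sharding the model across them
docker run -d \
--runtime nvidia \
-e NVIDIA_VISIBLE_DEVICES=0,1 \
--shm-size 32g \
-p 30000:30000 \
-v /root/.cache/huggingface:/root/.cache/huggingface \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
--host 0.0.0.0 \
--port 30000 \
--tp 2 \
--trust-remote-code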
Verify Model Download
On first startup, the model will be downloaded from Hugging Face. You can monitor the download progress:
# Check if the model is being downloaded
ls -lh /root/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-R1-Distill-Qwen-32B/
# Monitor disk usage of the volume
df -h /root/.cache/huggingface
The initial download may take 30-60 minutes depending on network speed, as the model weights are approximately 60GB.
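For a rough progress indicator while the download runs, you can print the cache size periodically (du reports the space used so far; interrupt with Ctrl+C when done):
# Print the cache size once a minute during the download
while true; do du -sh /root/.cache/huggingface 2>/dev/null; sleep 60; done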
Verify Volume Mount
Confirm that your volume is properly mounted:
# Check mount points - the volume should be mounted at /root
mount | grep /root
# Verify the mount point exists and is writable
touch /root/test && rm /root/test && echo "Volume is writable"
# Verify the Hugging Face cache directory exists (will be created on first model download)
ls -la /root/.cache/huggingface
Test the Inference Endpoint
Once the model is loaded (you'll see a message indicating the server is ready), you can test the inference endpoint. You can access your application using the domain name provided for port 30000 on your rental.
From your local machine (using the domain name):
curl -X POST http://your-rental-domain/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
"messages": [{"role": "user", "content": "Hello, how are you?"}]
}'
From the rental terminal (using localhost):
curl -X POST http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
"messages": [{"role": "user", "content": "Hello, how are you?"}]
}'
Replace your-rental-domain with the actual domain name provided for your rental's port 30000. You can find this domain in your rental's detail page in the Targon dashboard.
Note: The exact API endpoint format may vary depending on SGLang version. Consult the SGLang documentation for the correct API format.
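A quick way to confirm the server is up and serving the expected model is to query the OpenAI-compatible model listing (recent SGLang versions expose /v1/models; if yours differs, check its docs):
# From the rental terminal: list the models the server is currently serving
curl http://localhost:30000/v1/models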
Step 4: Benefits of Volume Persistence
Now that your model is deployed with persistent volumes, let's explore the benefits:
Fast Restarts
After the initial model download, subsequent container restarts or recreations are much faster because:
- Model weights are already stored on the volume (not in the container filesystem)
- No need to re-download 60GB+ of data
- Container can start serving requests within minutes instead of hours
- The volume persists independently, so even if you delete and recreate the rental, your weights remain
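A minimal sketch of what a restart looks like in practice, assuming the container was started with the docker run command from Step 3 (the container ID placeholder is illustrative):
# Stop and start the existing container; on start, SGLang loads weights
# straight from /root/.cache/huggingface instead of re-downloading them
docker stop <container-id>
docker start <container-id>
# Recreating the container (docker rm, then the same docker run with the same -v mount)
# also reuses the cached weights on the volume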
Data Safety
Your model weights are safe because:
- Volumes exist independently of containers
- Data persists even if a container crashes
Easy Updates
When you need to update your deployment:
- Keep the same volume with model weights
- Create a new rental with updated configuration
- Mount the existing volume to the new rental
- No need to re-download the model
When to Use Rentals vs Serverless
Rentals are ideal for testing and benchmarking multiple inference engines. They provide a dedicated, persistent environment where you can:
- Experiment with different inference frameworks (SGLang, vLLM, TensorRT-LLM, etc.)
- Compare performance across different configurations
- Fine-tune parameters and optimize your setup
- Test model compatibility before production deployment
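For example, a rough first-pass latency check of whichever engine you're serving can be done with a simple timed loop from the rental terminal (curl's time_total is end-to-end request time; this is a sketch, not a rigorous benchmark):
# Time five sequential chat completions against the local server
for i in 1 2 3 4 5; do
  curl -s -o /dev/null -w "request $i: %{time_total}s\n" \
    -X POST http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B", "messages": [{"role": "user", "content": "Hello"}]}'
done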
Once you've identified the optimal inference engine and configuration, you can efficiently deploy it to production using Targon's serverless service, which provides:
- Automatic scaling based on demand
- Cost optimization through pay-per-use pricing
- Built-in load balancing and high availability
- Simplified deployment and management
Alternatively, if you're looking for a ready-to-use inference service without managing infrastructure, you can use Sybil.com, which is powered by Targon and provides managed LLM inference capabilities.
Conclusion
Deploying large language models on Targon Rentals with persistent volumes provides a robust, cost-effective solution for production inference workloads. By following this guide, you've learned how to:
- Create persistent volumes for model weight storage
- Configure GPU rentals with the Targon Ubuntu environment and a mounted volume
- Set up SGLang for efficient model serving
- Ensure data persistence across container restarts
- Optimize your deployment for performance and cost
The combination of Rentals' dedicated, long-running environment and Volumes' independent persistent storage creates an ideal platform for hosting LLM inference services. While containers themselves are not stateful, volumes provide the persistence layer needed to maintain model weights and other critical data across container lifecycles, enabling fast, reliable responses.
Next Steps
- Explore other Rental use cases for testing and benchmarking inference engines
- Learn more about Volume management
- Deploy to production with Serverless for autoscaling and cost optimization
- Use Sybil.com for managed LLM inference powered by Targon
- Review Compute resources to optimize your GPU selection
Happy deploying!