Your First Deployment

This guide walks you through deploying the Doubleword Inference Stack on a Kubernetes cluster using Helm. Customization and advanced configuration are not covered in this introductory guide.

Installing the Helm Chart

The Inference Stack Helm chart is distributed through GitHub Container Registry as an OCI artifact. First, create a dedicated namespace for your deployment:

kubectl create namespace inference-stack

Install the chart directly from the OCI registry (this requires authentication; contact your account manager if you don't have access):

helm install inference-stack oci://ghcr.io/doublewordai/inference-stack \
  --namespace inference-stack \
  --values values.yaml
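
After the install completes, you can check that the release and its pods came up with the usual Helm and kubectl commands (using the namespace created above):

helm status inference-stack --namespace inference-stack
kubectl get pods --namespace inference-stack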

Configuration with Values

Create a values.yaml file to customize your deployment. This file contains the essential configuration for your Control Layer installation.

Model Groups

Models are added under the modelGroups key in the values.yaml file. You can run any inference container transparently, and it will be managed by the Inference Stack. Each model group is configured with the following fields:

  • image: Container image configuration
  • modelAlias: A list of strings that will be used to route requests to this model.
  • modelName: The name of the model the container expects to receive.
  • command: The command to run the model server.

modelGroups:
  # vLLM deployment serving Llama models
  vllm-example:
    enabled: true

    image:
      repository: vllm/vllm-openai
      tag: latest
      pullPolicy: Always

    modelAlias:
      - "llama"
      - "llama-3.1-8b-instruct"
    modelName: "meta-llama/Meta-Llama-3.1-8B-Instruct"

    command:
      - "vllm"
      - "serve"
      - "--model"
      - "meta-llama/Meta-Llama-3.1-8B-Instruct"

  # SGLang deployment serving Qwen models
  sglang-qwen:
    enabled: true

    image:
      repository: lmsysorg/sglang
      tag: latest
      pullPolicy: Always

    modelAlias:
      - "qwen"
      - "qwen-2.5-7b-instruct"
    modelName: "Qwen/Qwen2.5-7B-Instruct"

    command:
      - "python"
      - "-m"
      - "sglang.launch_server"
      - "--model-path"
      - "Qwen/Qwen2.5-7B-Instruct"

GPU Configuration

For GPU-enabled inference, assign GPU resources to your model groups; you can also pin them to GPU nodes with a node selector and tolerations:

modelGroups:
  vllm-example:
    resources:
      limits:
        nvidia.com/gpu: 2
        memory: 16Gi
      requests:
        nvidia.com/gpu: 2
        memory: 8Gi

    nodeSelector:
      accelerator: nvidia-h100

    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
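
Before applying this, it is worth confirming that your nodes actually expose the GPU resource and carry the label used in the nodeSelector (the accelerator label above is an example; your cluster may use a different key):

kubectl get nodes -L accelerator
kubectl describe node <node-name> | grep nvidia.com/gpu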

Shared Memory

If you are running models with tensor parallelism, you will need to increase the default shared memory size. You can do this by adding a memory-backed emptyDir volume mounted at /dev/shm:

modelGroups:
  vllm-example:
    volumeMounts:
      - name: shm
        mountPath: /dev/shm

    volumes:
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 20Gi
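
Tensor parallelism itself is configured on the inference server. With the vLLM example above, a sketch of the serve command for two-way tensor parallelism looks like the following (--tensor-parallel-size is a vLLM flag; its value should match the number of GPUs requested for the model group):

modelGroups:
  vllm-example:
    command:
      - "vllm"
      - "serve"
      - "--model"
      - "meta-llama/Meta-Llama-3.1-8B-Instruct"
      - "--tensor-parallel-size"
      - "2"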

Gated Model Weights

note

If you are using bit harbor initContainers to speed up model downloads, this step is not necessary.

Some model providers gate access to their models behind API keys, and you need to supply these to your inference container so it can download the weights. You can do this by creating a Kubernetes secret and referencing it in your values.yaml file.

modelGroups:
  vllm-example:
    env:
      - name: HF_TOKEN
        valueFrom:
          secretKeyRef:
            name: hf-secret
            key: HUGGING_FACE_HUB_TOKEN

This pulls your Hugging Face access token from a Kubernetes secret, which can be created with:

kubectl create secret generic hf-secret --from-literal=HUGGING_FACE_HUB_TOKEN=<your_token>
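
The secret must live in the same namespace as the pods that reference it, so add --namespace inference-stack to the command if you followed the namespace used above. You can then confirm the secret exists and contains the expected key with:

kubectl describe secret hf-secret --namespace inference-stack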