
Mixtral 8x7B

This example demonstrates how to deploy Mixtral with dstack's services.

Define the configuration

To deploy Mixtral as a service, define the corresponding configuration file. Below are three variants: TGI (fp16), TGI (GPTQ, int4), and vLLM (fp16).

TGI (fp16):

type: service

image: ghcr.io/huggingface/text-generation-inference:latest
env:
  - MODEL_ID=mistralai/Mixtral-8x7B-Instruct-v0.1
commands:
  - text-generation-launcher 
    --port 80
    --trust-remote-code
    --num-shard 2 # Should match the number of GPUs 
port: 80

resources:
  gpu: 80GB:2
  disk: 200GB

# (Optional) Enable the OpenAI-compatible endpoint
model:
  type: chat
  name: mistralai/Mixtral-8x7B-Instruct-v0.1
  format: tgi

TGI (GPTQ, int4):

type: service

image: ghcr.io/huggingface/text-generation-inference:latest 
env:
  - MODEL_ID=TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ 
commands:
  - text-generation-launcher
    --port 80
    --trust-remote-code
    --quantize gptq
port: 80

resources:
  gpu: 25GB..50GB

# (Optional) Enable the OpenAI-compatible endpoint
model:
  type: chat
  name: TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ
  format: tgi

vLLM (fp16):

type: service

python: "3.11"
commands:
  - pip install vllm
  - python -m vllm.entrypoints.openai.api_server
    --model mistralai/Mixtral-8x7B-Instruct-v0.1
    --host 0.0.0.0
    --tensor-parallel-size 2 # Should match the number of GPUs
port: 8000

resources:
  gpu: 80GB:2
  disk: 200GB

# (Optional) Enable the OpenAI-compatible endpoint
model:
  type: chat
  name: mistralai/Mixtral-8x7B-Instruct-v0.1
  format: openai

NOTE:

Support for quantized Mixtral in vLLM is not yet stable.

Run the configuration

Prerequisites

Before running a service, make sure to set up a gateway. This isn't required with dstack Sky, which sets up a gateway automatically.
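For example, with the open-source server a gateway can be created via the dstack CLI. The domain, region, and backend below are placeholders, and the exact flags may differ across dstack versions, so check the services documentation for your version:

$ dstack gateway create --domain example.com --region eu-west-1 --backend aws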

To run the configuration, pass the working directory and the configuration file to dstack run:

$ dstack run . -f llms/mixtral/tgi.dstack.yml

Access the endpoint

Once the service is up, you'll be able to access it at https://<run name>.<gateway domain>.

Authorization

By default, the service endpoint requires the Authorization header set to "Bearer <dstack token>".
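For example, with the TGI variants above you could call the service's /generate endpoint directly. This is a minimal sketch: the run name, gateway domain, and token are placeholders, and the vLLM variant exposes OpenAI-style paths instead.

import requests

# Placeholders: substitute your run name, gateway domain, and dstack token
url = "https://<run name>.<gateway domain>/generate"
headers = {"Authorization": "Bearer <dstack token>"}
payload = {
    "inputs": "[INST] Explain the concept of recursion in programming. [/INST]",
    "parameters": {"max_new_tokens": 128},
}

# POST the prompt to TGI and print the generated text
response = requests.post(url, json=payload, headers=headers)
response.raise_for_status()
print(response.json()["generated_text"])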

OpenAI interface

If the service has the model mapping configured, you can also access the model at https://gateway.<gateway domain> via the OpenAI-compatible interface.

from openai import OpenAI

client = OpenAI(base_url="https://gateway.<gateway domain>", api_key="<dstack token>")

completion = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[
        {
            "role": "user",
            "content": "Compose a poem that explains the concept of recursion in programming.",
        }
    ],
    stream=True,
)

for chunk in completion:
    print(chunk.choices[0].delta.content or "", end="")
print()

Hugging Face Hub token

To use a model with gated access, make sure to configure the HUGGING_FACE_HUB_TOKEN environment variable (with --env in dstack run or via env in the configuration file).
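For example, passing the token on the command line (the token value is a placeholder; it's only needed if the model you deploy is gated):

$ dstack run . -f llms/mixtral/tgi.dstack.yml --env HUGGING_FACE_HUB_TOKEN=<your token>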

Source code

The complete, ready-to-run code is available in dstackai/dstack-examples.

What's next?

  1. Check the Text Generation Inference and vLLM examples
  2. Read about services
  3. Browse examples
  4. Join the Discord server