This example demonstrates how to use Infinity with dstack's services to deploy any SentenceTransformers-based embedding model.

Define the configuration

To deploy a SentenceTransformers-based embedding model using Infinity, you need to define, at minimum, the following configuration file:

type: service

image: michaelf34/infinity:latest
env:
  - MODEL_ID=BAAI/bge-small-en-v1.5
commands:
  - infinity_emb --model-name-or-path $MODEL_ID --port 80
port: 80

resources:
  gpu: 16GB
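Because the model is selected via the `MODEL_ID` environment variable, serving a different SentenceTransformers model only requires changing that one value (a sketch; `intfloat/e5-small-v2` is just an example model id, not part of the original configuration):

```yaml
env:
  - MODEL_ID=intfloat/e5-small-v2
```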

Run the configuration


Before running a service, ensure that you have configured a gateway. If you're using dstack Sky, the default gateway is configured automatically for you.

$ dstack run . -f infinity/serve.dstack.yml

Access the endpoint

Once the service is up, you can query it at https://<run name>.<gateway domain> (using the domain set up for the gateway):


By default, the service endpoint requires the Authorization header with "Bearer <dstack token>".
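For example, the endpoint can be queried with a plain HTTP POST carrying the `Authorization` header. Below is a minimal sketch using only the standard library; the run name, gateway domain, and token are placeholders you must replace, and the payload shape follows Infinity's OpenAI-compatible Embeddings API:

```python
import json
import urllib.request

# Placeholders -- substitute your actual run name, gateway domain, and token.
ENDPOINT = "https://<run name>.<gateway domain>/embeddings"
TOKEN = "<dstack token>"

# Request body in the OpenAI-compatible shape Infinity expects.
payload = {
    "model": "bge-small-en-v1.5",
    "input": ["A sentence to encode."],
}

request = urllib.request.Request(
    ENDPOINT,
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/json",
    },
)

# Uncomment once the placeholders are filled in:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
print(request.get_method(), request.full_url)
```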

OpenAI interface

Any embedding model served by Infinity automatically comes with an OpenAI-compatible Embeddings API, so you can use the openai package directly to interact with the deployed service.

from openai import OpenAI
from functools import partial

client = OpenAI(base_url="https://<run name>.<gateway domain>", api_key="<dstack token>")

client.embeddings.create = partial(
    client.embeddings.create, model="bge-small-en-v1.5"
)

print(client.embeddings.create(input=["A sentence to encode."]))
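The returned object follows OpenAI's standard Embeddings response shape, so with the openai client the vector is read from `response.data[0].embedding`. A sketch using an illustrative (made-up) payload, assuming that standard shape:

```python
# Illustrative response payload -- the values are made up. A real call via
# the openai client returns an object whose .data[0].embedding holds the vector.
response = {
    "object": "list",
    "model": "bge-small-en-v1.5",
    "data": [
        {"object": "embedding", "index": 0, "embedding": [0.01, -0.02, 0.03]},
    ],
}

# Extract the embedding vector for the first (and only) input sentence.
vector = response["data"][0]["embedding"]
print(len(vector))  # number of dimensions in the (illustrative) embedding
```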

Source code

The complete, ready-to-run code is available in dstackai/dstack-examples.

What's next?

  1. Check the Text Embeddings Inference, TGI, and vLLM examples
  2. Read about services
  3. Browse all examples
  4. Join the Discord server