Text Embeddings Inference

This example demonstrates how to use Text Embeddings Inference (TEI) with dstack's services to deploy a text embeddings model.

Define the configuration

To deploy a text embeddings model as a service using TEI, define the following configuration file:

type: service

image: ghcr.io/huggingface/text-embeddings-inference:latest
env:
  - MODEL_ID=thenlper/gte-base
commands: 
  - text-embeddings-router --port 80
port: 80

resources:
  gpu: 16GB

Run the configuration

Gateway

Before running a service, ensure that you have configured a gateway. If you're using dstack Sky, the default gateway is configured automatically for you.
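If you haven't set one up yet, a gateway can typically be created with the dstack CLI. The command below is only a sketch: the exact subcommand and flags (backend, region, domain) depend on your dstack version, so treat them as assumptions and check dstack gateway --help.

$ dstack gateway create --backend aws --region eu-west-1 --domain example.com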

$ dstack run . -f deployment/tei/serve.dstack.yml

Access the endpoint

Once the service is up, you can query it at https://<run name>.<gateway domain> (using the domain set up for the gateway):

Authorization

By default, the service endpoint requires the Authorization header with "Bearer <dstack token>".

$ curl https://yellow-cat-1.example.com \
    -X POST \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <dstack token>' \
    -d '{"inputs":"What is Deep Learning?"}'

[[0.010704354,-0.033910684,0.004793657,-0.0042832214,0.07551489,0.028702762,0.03985837,0.021956133,...]]
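TEI also accepts a list of strings in inputs, so several texts can be embedded in a single request. The sketch below reuses the same endpoint and run name as above; batching behavior may vary with the TEI version.

$ curl https://yellow-cat-1.example.com \
    -X POST \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <dstack token>' \
    -d '{"inputs":["What is Deep Learning?","What is dstack?"]}'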

Hugging Face Hub token

To use a model with gated access, make sure to configure the HUGGING_FACE_HUB_TOKEN environment variable (either with --env in dstack run or via env in the configuration file, as shown below).
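For example, the token can be added to the env section of the configuration above (the value here is a placeholder, not a real token):

env:
  - MODEL_ID=thenlper/gte-base
  - HUGGING_FACE_HUB_TOKEN=<your Hugging Face token>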

Source code

The complete, ready-to-run code is available in dstackai/dstack-examples.

What's next?

  1. Check the Text Generation Inference and vLLM examples
  2. Read about services
  3. Browse all examples
  4. Join the Discord server