MITLibraries/timdex-embeddings

timdex-embeddings

A CLI application for creating embeddings for TIMDEX.

Development

  • To preview a list of available Makefile commands: make help
  • To install with dev dependencies: make install
  • To update dependencies: make update
  • To run unit tests: make test
  • To lint the repo: make lint
  • To run the app: uv run --env-file .env embeddings --help (the embeddings command name matches the project.scripts entry in pyproject.toml)

Environment Variables

Required

SENTRY_DSN=### If set to a valid Sentry DSN, enables Sentry exception monitoring. This is not needed for local development.
WORKSPACE=### Set to `dev` for local development. Terraform sets this to `stage` and `prod` in those environments.

Optional

TE_MODEL_URI=# HuggingFace model URI
TE_MODEL_PATH=# Path where the model will be downloaded to and loaded from
HF_HUB_DISABLE_PROGRESS_BARS=# boolean to disable progress bars for HuggingFace model downloads; defaults to 'true' in deployed contexts
# inference performance tuning
TE_TORCH_DEVICE=# defaults to 'cpu', but can be set to 'mps' for Apple Silicon, or theoretically 'cuda' for GPUs
TE_BATCH_SIZE=# batch size for each inference worker, defaults to 32
TE_NUM_WORKERS=# number of parallel model inference workers, defaults to 1
TE_CHUNK_SIZE=# number of batches each parallel worker grabs; no effect if TE_NUM_WORKERS=1
OMP_NUM_THREADS=# OpenMP thread count honored by torch during inference; unset by default, in which case torch defaults apply
MKL_NUM_THREADS=# MKL thread count honored by torch during inference; unset by default, in which case torch defaults apply
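
As a rough illustration of how the tuning variables above could be consumed, the following sketch reads them with the documented defaults and splits input texts into batches. The names InferenceSettings and iter_batches are hypothetical and are not taken from this repository; TE_CHUNK_SIZE has no documented default, so it is left unset here.

```python
import os
from dataclasses import dataclass
from typing import Iterator, Mapping, Optional, Sequence


@dataclass(frozen=True)
class InferenceSettings:
    """Hypothetical container for the TE_* tuning variables (not the app's real class)."""

    torch_device: str = "cpu"
    batch_size: int = 32
    num_workers: int = 1
    chunk_size: Optional[int] = None  # no default is documented, so leave unset

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "InferenceSettings":
        """Build settings from a mapping (os.environ by default), applying documented defaults."""
        env = os.environ if env is None else env

        def as_int(name: str, default: Optional[int]) -> Optional[int]:
            # Treat a missing or blank value as "not set".
            raw = env.get(name, "").strip()
            return int(raw) if raw else default

        return cls(
            torch_device=env.get("TE_TORCH_DEVICE", "").strip() or "cpu",
            batch_size=as_int("TE_BATCH_SIZE", 32),
            num_workers=as_int("TE_NUM_WORKERS", 1),
            chunk_size=as_int("TE_CHUNK_SIZE", None),
        )


def iter_batches(texts: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size texts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start : start + batch_size]


if __name__ == "__main__":
    settings = InferenceSettings.from_env()
    for batch in iter_batches(["text one", "text two", "text three"], settings.batch_size):
        print(len(batch), "texts on", settings.torch_device)
```

Passing an explicit mapping to from_env instead of relying on os.environ keeps the sketch easy to exercise without touching the real environment.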

Configuring an Embedding Model

This CLI application is designed to create embeddings for input texts. To do this, a pre-trained model must be identified and configured for use.

To this end, there is a base embedding class BaseEmbeddingModel that is designed to be extended and customized for a particular embedding model.
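
The extension pattern can be pictured with a minimal sketch like the one below. It is not the repository's BaseEmbeddingModel: the class name BaseEmbeddingModelSketch, the MODEL_URI attribute, and the load and create_embedding methods are all assumptions chosen for illustration, and the toy subclass computes a trivial vector rather than running a real model.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import List


class BaseEmbeddingModelSketch(ABC):
    """Illustrative stand-in for a base embedding class; method names are assumptions."""

    MODEL_URI: str = ""  # hypothetical: the HuggingFace URI a subclass handles

    def __init__(self, model_path: str) -> None:
        self.model_path = Path(model_path)

    @abstractmethod
    def load(self) -> None:
        """Load the model from self.model_path."""

    @abstractmethod
    def create_embedding(self, text: str) -> List[float]:
        """Return an embedding vector for a single text."""


class ToyEmbeddingModel(BaseEmbeddingModelSketch):
    """Toy subclass: produces a 3-value vector from simple text statistics."""

    MODEL_URI = "example-org/toy-model"  # placeholder URI, not a real model

    def load(self) -> None:
        # A real subclass would read model weights from self.model_path here.
        self.loaded = True

    def create_embedding(self, text: str) -> List[float]:
        vowels = sum(1 for ch in text.lower() if ch in "aeiou")
        return [float(len(text)), float(len(text.split())), float(vowels)]


if __name__ == "__main__":
    model = ToyEmbeddingModel("/path/to/model")
    model.load()
    print(model.create_embedding("hello world"))
```

The abstract methods force each concrete model class to say how it loads weights and how it turns text into a vector, which is the general idea behind extending a shared base class.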

Once an embedding class has been created, the preferred approach is to set env vars TE_MODEL_URI and TE_MODEL_PATH directly in the Dockerfile to a) download a local snapshot of the model during image build, and b) set this model as the default for the CLI.

This makes the model the default, so the CLI can be invoked without specifying a model URI or local path, e.g.:

uv run --env-file .env embeddings test-model-load
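
One way to picture this fallback is the sketch below: an explicitly passed value is used if present, otherwise the environment variable supplies it, and a missing value is an error. The helper resolve_setting is hypothetical, and the precedence shown (CLI option over env var) is an assumption about how the CLI behaves, not something confirmed by this README.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    cli_value: Optional[str],
    env_name: str,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return cli_value if given, else the env var; raise if neither is set.

    Hypothetical helper; the precedence order is an assumption.
    """
    env = os.environ if env is None else env
    if cli_value:
        return cli_value
    value = env.get(env_name, "").strip()
    if not value:
        raise ValueError(f"{env_name} is not set and no value was passed explicitly")
    return value


if __name__ == "__main__":
    # Simulate a Docker image where both variables were baked in at build time.
    image_env = {"TE_MODEL_URI": "org/model-name", "TE_MODEL_PATH": "/path/to/model"}
    print(resolve_setting(None, "TE_MODEL_URI", image_env))
    print(resolve_setting(None, "TE_MODEL_PATH", image_env))
```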

CLI Commands

For local development, all CLI commands should be invoked with the following format to pick up environment variables from .env:

uv run --env-file .env embeddings <COMMAND> <ARGS>

ping

Usage: embeddings ping [OPTIONS]

  Emit 'pong' to debug logs and stdout.

download-model

Usage: embeddings download-model [OPTIONS]

  Download a model from HuggingFace and save locally.

Options:
  --model-uri TEXT   HuggingFace model URI (e.g., 'org/model-name')
                     [required]
  --model-path PATH  Path where the model will be downloaded to and loaded
                     from, e.g. '/path/to/model'.  [required]
  --help             Show this message and exit.

test-model-load

Usage: embeddings test-model-load [OPTIONS]

  Test loading of embedding class and local model based on env vars.

  In a deployed context, the following env vars are expected:
    - TE_MODEL_URI
    - TE_MODEL_PATH

  With these set, the embedding class should be registered successfully and
  initialized, and the model loaded from a local copy.

  This CLI command is NOT used during normal workflows.  This is used primarily
  during development and after model downloading/loading changes to ensure the
  model loads correctly.

Options:
  --model-uri TEXT   HuggingFace model URI (e.g., 'org/model-name')
                     [required]
  --model-path PATH  Path where the model will be downloaded to and loaded
                     from, e.g. '/path/to/model'.  [required]
  --help             Show this message and exit.

create-embeddings

Usage: embeddings create-embeddings [OPTIONS]

  Create embeddings for TIMDEX records.

Options:
  --model-uri TEXT             HuggingFace model URI (e.g., 'org/model-name')
                               [required]
  --model-path PATH            Path where the model will be downloaded to and
                               loaded from, e.g. '/path/to/model'.  [required]
  --dataset-location PATH      TIMDEX dataset location, e.g.
                               's3://timdex/dataset', to read records from.
  --run-id TEXT                TIMDEX ETL run id.
  --run-record-offset INTEGER  TIMDEX ETL run record offset to start from,
                               default = 0.
  --record-limit INTEGER       Limit number of records after
                               --run-record-offset, default = None (unlimited).
  --input-jsonl TEXT           Optional filepath to JSONLines file containing
                               TIMDEX records to create embeddings from.
  --strategy [full_record]     Pre-embedding record transformation strategy.
                               Repeatable to apply multiple strategies.
                               [required]
  --output-jsonl TEXT          Optionally write embeddings to local JSONLines
                               file (primarily for testing).
  --help                       Show this message and exit.
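
The interaction of --input-jsonl, --run-record-offset, --record-limit, and --strategy can be sketched roughly as follows. This is an illustration under assumptions, not the application's code: the function names are hypothetical, and the full_record strategy is assumed here to serialize the whole record to JSON text, which may differ from what the real strategy does.

```python
import json
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Parse one JSON object per non-blank line."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


def select_records(
    records: Iterable[dict], offset: int = 0, limit: Optional[int] = None
) -> Iterator[dict]:
    """Skip `offset` records, then yield at most `limit` records (None = unlimited)."""
    stop = None if limit is None else offset + limit
    return islice(records, offset, stop)


def full_record(record: dict) -> str:
    """Assumed behavior: turn the entire record into a single text."""
    return json.dumps(record, sort_keys=True)


# Hypothetical registry mapping strategy names to transformation functions.
STRATEGIES: Dict[str, Callable[[dict], str]] = {"full_record": full_record}


def prepare_texts(
    records: Iterable[dict], strategy_names: List[str]
) -> Iterator[Tuple[str, str]]:
    """Apply every requested strategy to every record, yielding (strategy, text)."""
    for record in records:
        for name in strategy_names:
            yield name, STRATEGIES[name](record)


if __name__ == "__main__":
    lines = ['{"id": 1}', '{"id": 2}', '{"id": 3}', '{"id": 4}']
    selected = select_records(read_jsonl(lines), offset=1, limit=2)
    for strategy, text in prepare_texts(selected, ["full_record"]):
        print(strategy, text)
```

Because every step is a generator, records are read and transformed lazily, which matters when the offset and limit select a small window from a large run.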
