Support for different vector stores (#56)
* Outline for a minimal FastAPI interface to a range of models
* start sketching out API based on existing code / model
* fastapi[standard] in env, dump of catalog data locally
* kludge interface, full of TODOs, to the new 3 class model
* Return a class label from the Resnet18 model
* Return both classification and embeddings from the Resnet18 model, roughly
* load models as globals when worker starts, see comments
* add a basic API test and a wee bit of error handling
* pyproject dependencies needed for the pipeline
* limit CI tests to run only for code changes
* should have set this a lot sooner!
* Add an abstract interface to vector stores, breaking the tests
* fix up the existing chromadb tests
* clean up some points of vector store reuse
* extend the test coverage, have the app use the new interface
* remember to commit the new config.py for the app
* fix interface in the scripts (we're not using much)
* remove workflow_call from test CI config
* slowly flesh out the sqlite-vec storage option
* fill out the sqlite implementation
* Revert "remove workflow_call from test CI config" (reverts commit d330422)
* Reapply "remove workflow_call from test CI config" (reverts commit cfe1c25)
* put `workflow_call` back and try to limit the caller's paths
* YAML whitespace glitch?
* test queries in the chroma wrapper, tweak output
* deserialise vectors packed as bytes back to floats, test
* remove the caller pipeline, just complication
* expand the base class, stub interfaces for different stores
* test shared behaviour of different backends, prune print statements
* give the workflow a nudge without paths, regret the change now
* paths need glob expansion, "naturally"
* sqlite3 is in python core!
* whitespace change, nudge the workflow
* limit N syntax is sqlite3 version specific...
* .[lint] still installs everything else - install direct from pypi instead
* (brittle) test checks whether we have weights downloaded
* optional embeddings length on init database; explicit commit()
* generalised config options for different backends
Showing 19 changed files with 459 additions and 125 deletions.
# Vector stores

Investigation of alternative vector stores for image model embeddings.

## ChromaDB

* "Simplest useful thing", the default in the LangChain examples for rapid LLM prototyping
* Idiosyncratic, not standards-oriented
* Evolving quickly (a couple of backwards-incompatible API changes since we started with it)

## SQLite-vec

* Lightweight, with helpful examples; quick to start with?
* Single process
* "_expect breaking changes!_"

https://til.simonwillison.net/sqlite/sqlite-vec

https://github.com/asg017/sqlite-vec

https://github.com/asg017/sqlite-vec/releases

```
pip install sqlite-utils
sqlite-utils install sqlite-utils-sqlite-vec
```

The main use is in the `streamlit` app, which is _really_ tied to the internal logic of `chromadb` :/

The queries we need are:

* get all identifiers (needs a `LIMIT` for large collections) - URLs were used directly as IDs
* get the embeddings vector for one ID
* get the N closest results to one set of embeddings by cosine similarity
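sqlite-vec takes vectors as packed float32 blobs, so the per-ID query returns bytes that need deserialising back to floats on the way out. A round-trip sketch with only the standard library (the `pack_floats` / `unpack_floats` names are ours, not sqlite-vec API; little-endian float32 assumed, which matches typical platforms):

```python
import struct


def pack_floats(vector):
    """Serialise a list of floats to the little-endian float32 blob
    layout used for an `embedding float[N]` virtual table column."""
    return struct.pack(f"<{len(vector)}f", *vector)


def unpack_floats(blob):
    """Deserialise a float32 blob read back from sqlite into floats."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


# These values are exactly representable in float32, so the
# round-trip is lossless:
embedding = [0.5, -1.25, 2.0]
blob = pack_floats(embedding)
assert unpack_floats(blob) == embedding
```

Arbitrary float64 values would lose precision going through float32, so tests comparing round-tripped vectors should use a tolerance rather than exact equality.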
"""Extract and store image embeddings from a collection in s3,
using an API that calls one or more off-the-shelf pre-trained models."""

import logging
import os

import pandas as pd
import requests
import yaml
from dotenv import load_dotenv

from cyto_ml.data.vectorstore import vector_store

logging.basicConfig(level=logging.INFO)
load_dotenv()

ENDPOINT = "http://localhost:8000/resnet18/"
PARAMS = os.path.join(os.path.abspath(os.path.dirname(__file__)), "params.yaml")

if __name__ == "__main__":
    # Limited to the Lancaster FlowCam dataset for now:
    image_bucket = yaml.safe_load(open(PARAMS))["collection"]
    catalog = f"{image_bucket}/catalog.csv"

    file_index = f"{os.environ.get('AWS_URL_ENDPOINT')}/{catalog}"
    df = pd.read_csv(file_index)

    # TODO - optional embedding length param at this point, it's not ideal
    collection = vector_store("sqlite", image_bucket, embedding_len=512)

    def store_embeddings(url):
        response = requests.post(ENDPOINT, data={"url": url}).json()
        if "embeddings" not in response:
            logging.error(response)
            raise ValueError(f"no embeddings in API response for {url}")

        response["url"] = url
        collection.add(**response)

    for _, row in df.iterrows():
        # each row holds a single value, the image URL
        store_embeddings(row.item())
# TODO manage this better elsewhere, once we settle on a storage option
SQLITE_SCHEMA = """
create virtual table embeddings using vec0(
    id integer primary key,
    url text not null,
    classification text not null,
    embedding float[{}]);
"""

# Options passed as keyword arguments when setting up a db connection
OPTIONS = {"sqlite": {"embedding_len": 512, "check_same_thread": False}, "chromadb": {}}
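As a sketch of how these two pieces might combine: the `{}` placeholder in `SQLITE_SCHEMA` takes the per-backend `embedding_len` (512 for Resnet18's embeddings), producing the DDL to run once the sqlite-vec extension is loaded into the connection. `store_kwargs` is a hypothetical helper, not from the repo:

```python
# Redeclared from config.py above so this example is self-contained.
SQLITE_SCHEMA = """
create virtual table embeddings using vec0(
    id integer primary key,
    url text not null,
    classification text not null,
    embedding float[{}]);
"""
OPTIONS = {"sqlite": {"embedding_len": 512, "check_same_thread": False}, "chromadb": {}}


def store_kwargs(backend: str) -> dict:
    """Hypothetical lookup of per-backend connection options;
    backends without an entry get no extra keyword arguments."""
    return OPTIONS.get(backend, {})


# Fill the embedding length into the virtual table definition:
ddl = SQLITE_SCHEMA.format(store_kwargs("sqlite")["embedding_len"])
assert "embedding float[512]" in ddl
```

`check_same_thread=False` is the stdlib `sqlite3.connect` flag that allows the connection to be shared across threads, which matters once the FastAPI worker handles requests concurrently.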