Merge pull request #9 from neonwatty/index_delta
Index delta
neonwatty authored Jul 19, 2024
2 parents 5d95454 + d1b9133 commit e143671
Showing 28 changed files with 1,144 additions and 123 deletions.
4 changes: 3 additions & 1 deletion .github/workflows/python-app.yml
@@ -47,4 +47,6 @@ jobs:
         pip install -r requirements.txt
     - name: Run pytest
       run: |
-        PYTHONPATH=. python3.10 -m pytest tests/test_streamlit.py
+        PYTHONPATH=. python3.10 -m pytest tests/test_app.py &&
+        PYTHONPATH=. python3.10 -m pytest tests/utilities/test_imgs.py &&
+        PYTHONPATH=. python3.10 -m pytest tests/utilities/test_query.py
16 changes: 16 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,16 @@
# Change Log
All notable changes to this project will be documented in this file.


## 2024-07-17

### Added

- Core tests for the query and imgs modules, for re-indexing after images are added, and for re-indexing after images are removed.

- A new "refresh index" button that updates the index after images are added to or removed from the `data/input/` image directory; only the newly added or removed images are processed.


<p align="center">
<img align="center" src="https://github.com/jermwatt/readme_gifs/blob/main/meme_search_refresh_button.gif" height="200">
</p>
61 changes: 61 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,61 @@
# Contributing to Meme Search

Welcome to Meme Search! We're stoked that you're interested in contributing.

Before you get started, please take a moment to read through the guidelines below.


# How Can I Contribute?
## Reporting Bugs
If you encounter a bug or unexpected behavior in Meme Search, please help us by creating an issue in our GitHub repository. Be sure to include as much detail as possible to help us reproduce the issue.

## Suggesting Enhancements
Have an idea to improve Meme Search? Bring it on! You can submit your ideas by creating an issue in our GitHub repository and using the `enhancement` label.

## Contributing Code
If you're ready to contribute code to Meme Search, follow these steps:

Fork the Repository: Start by forking the repository to your GitHub account.

Clone the Repository: Clone the forked repository to your local machine.

```sh
git clone https://github.com/<your-username>/meme_search
```

Create a Branch: Create a new branch for your feature or fix.

```sh
git checkout -b feature-branch
```

Make Changes: Make your changes and ensure they follow the coding style of the project.

Test Your Changes: Test your changes to ensure they work as expected.

Commit Your Changes: Commit your changes with a clear and descriptive commit message.

```sh
git commit -m "Add feature or fix for XYZ"
```

Push Your Changes: Push your branch to your forked repository.

```sh
git push origin feature-branch
```

Create a Pull Request: Create a pull request from your forked repository to the main repository. Be sure to provide a detailed description of your changes.

Review Process: The maintainers will review your pull request and may request changes or provide feedback.

Merge: Once approved, your pull request will be merged into the main repository. Congratulations!

# Code of Conduct

Remember to always be excellent to each other.

# Questions?
If you have any questions that aren't addressed in this guide, feel free to reach out to us by creating an issue in our GitHub repository.

Thank you for contributing to Meme Search!
47 changes: 42 additions & 5 deletions README.md
@@ -9,13 +9,17 @@ Use Python and AI to index your memes by their content and text, making them eas
 <img align="center" src="https://github.com/jermwatt/readme_gifs/blob/main/meme_search.gif" height="325">
 </p>

+
 A table of contents for the remainder of this README:

 - [Introduction](#introduction)
 - [Installation instructions](#installation-instructions)
 - [Pipeline overview](#pipeline-overview)
 - [Start the streamlit server](#start-the-streamlit-server)
 - [Index your own memes](#index-your-own-memes)
+- [Changelog](#changelog)
+- [Contributing](#contributing)
+- [Running tests](#running-tests)

 ## Introduction

@@ -38,7 +42,7 @@ This meme search pipeline is built using the following open source components:

 To create a handy tool for your own memes pull the repo and install the requirements file

-```python
+```sh
 pip install -r requirements.txt
 ```

@@ -55,7 +59,7 @@ docker compose up

 After indexing your memes you can then start the streamlit app, allowing you to semantically search for and retrieve your memes

-```python
+```sh
 python -m streamlit run meme_search/app.py
 ```

@@ -72,13 +76,23 @@ Note: you can drag and drop any recovered meme directly from the streamlit app t

 Place any images / memes you would like indexed for the search app in this repo's subdirectory

-`data/input/`
+```sh
+data/input/
+```

 You can clear out the default test images in this location first, or leave them.

-Next - at your terminal - paste the following command
+Next, click the "refresh index" button to update your index after images are added to or removed from the image directory; only the newly added or removed images are processed.

+
-```python
+<p align="center">
+<img align="center" src="https://github.com/jermwatt/readme_gifs/blob/main/meme_search_refresh_button.gif" height="200">
+</p>
+
+
+Alternatively - at your terminal - paste the following command
+
+```sh
 python meme_search/utilities/create.py
 ```
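The refresh behaviour described above, which touches only added or removed images, amounts to a set difference between the image files on disk and the image paths already in the index. The sketch below shows one way this could look. `compute_index_delta` and its arguments are hypothetical names used for illustration; they are not functions from this repo.

```python
def compute_index_delta(paths_on_disk, paths_in_index):
    """Return (to_add, to_remove) so that only changed images are touched."""
    on_disk = set(paths_on_disk)
    indexed = set(paths_in_index)
    to_add = sorted(on_disk - indexed)      # files on disk that are not indexed yet
    to_remove = sorted(indexed - on_disk)   # indexed files that no longer exist on disk
    return to_add, to_remove


# Hypothetical example: c.jpg was added and b.jpg was removed.
to_add, to_remove = compute_index_delta(
    ["data/input/a.jpg", "data/input/c.jpg"],
    ["data/input/a.jpg", "data/input/b.jpg"],
)
print(to_add)     # ['data/input/c.jpg']
print(to_remove)  # ['data/input/b.jpg']
```

When both lists come back empty there is nothing to re-index, which presumably corresponds to the app's "no refresh needed!" message.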

@@ -98,3 +112,26 @@ You will see printouts at the terminal indicating success of the 3 main stages f

3. **index**: index the embeddings in an open source and local vector base [faiss database](https://github.com/facebookresearch/faiss) and references connecting the embeddings to their images in the greatest little db of all time - [sqlite](https://sqlite.org/)
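
To make the index stage more concrete: a flat L2 index, such as the `faiss.IndexFlatL2` this project creates for its vector store, performs exact nearest-neighbour search by comparing the query against every stored vector and ranking by squared L2 distance. The pure-Python sketch below stands in for faiss and does not use the library's API. `flat_l2_search` and the toy vectors are made up for illustration.

```python
def flat_l2_search(vectors, query, k=2):
    """Exact k-nearest-neighbour search by squared L2 distance."""
    scored = []
    for position, vector in enumerate(vectors):
        # Squared Euclidean distance between the stored vector and the query.
        dist = sum((a - b) ** 2 for a, b in zip(vector, query))
        scored.append((dist, position))
    scored.sort()  # smallest distance first
    return scored[:k]


vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
print(flat_l2_search(vectors, [1.0, 1.0]))
# [(1.0, 1), (2.0, 0)]
```

Each result pairs a distance with the position of the matching vector; in this pipeline that position is what the sqlite table would be used to map back to an image.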


+## Changelog
+
+Meme Search is under active development! See the `CHANGELOG.md` in this repo for a record of the most recent changes.
+
+## Contributing
+
+Contributions are welcome! Please see `CONTRIBUTING.md` for basic instructions!
+
+
+## Running tests
+
+Tests can be run by first installing the test requirements as
+
+```sh
+pip install -r requirements.test
+```
+
+Then the test suite can be run as
+
+```sh
+python -m pytest tests/
+```
Binary file modified data/dbs/memes.db
Binary file not shown.
Binary file modified data/dbs/memes.faiss
Binary file not shown.
1 change: 0 additions & 1 deletion data/dbs/placeholder
@@ -1 +0,0 @@
-A placeholder file to ensure this directory exists on github
51 changes: 37 additions & 14 deletions meme_search/app.py
@@ -1,5 +1,7 @@
-from meme_search import base_dir, sqlite_db_path, vector_db_path
+import time
+from meme_search import base_dir
 from meme_search.utilities.query import complete_query
+from meme_search.utilities.create import process
 import streamlit as st

 st.set_page_config(page_title="Meme Search")
@@ -19,16 +21,37 @@ def remote_css(url):
 remote_css("https://fonts.googleapis.com/icon?family=Material+Icons")

 # icon("search")
-buff, col, buff2 = st.columns([1, 4, 1])
-
-selected = col.text_input(label="search for meme", placeholder="search for a meme")
-if selected:
-    results = complete_query(selected, vector_db_path, sqlite_db_path)
-    img_paths = [v["img_path"] for v in results]
-    for result in results:
-        with col.container(border=True):
-            st.image(
-                result["img_path"],
-                output_format="auto",
-                caption=f'{result["full_description"]} (query distance = {result["distance"]})',
-            )
+with st.container():
+    with st.container(border=True):
+        input_col, button_col = st.columns([6, 2])
+
+        with button_col:
+            st.empty()
+            refresh_index_button = st.button("refresh index", type="primary")
+            if refresh_index_button:
+                process_start = st.warning("refreshing...")
+                val = process()
+                if val:
+                    process_start.empty()
+                    success = st.success("index updated!")
+                    time.sleep(2)
+                    process_start.empty()
+                    success.empty()
+                else:
+                    process_start.empty()
+                    warning = st.warning("no refresh needed!")
+                    time.sleep(2)
+                    warning.empty()
+
+        selected = input_col.text_input(label="meme search", placeholder="search for your meme", label_visibility="collapsed")
+        if selected:
+            results = complete_query(selected)
+            img_paths = [v["img_path"] for v in results]
+            with st.container(border=True):
+                for result in results:
+                    with st.container(border=True):
+                        st.image(
+                            result["img_path"],
+                            output_format="auto",
+                            caption=f'{result["full_description"]} (query distance = {result["distance"]})',
+                        )
4 changes: 2 additions & 2 deletions meme_search/style.css
@@ -3,12 +3,12 @@ body {
   background-color: #4F8BF9;
 }

-.stButton>button {
+/* .stButton>button {
   color: #4F8BF9;
   border-radius: 50%;
   height: 3em;
   width: 3em;
-}
+} */

 .stTextInput>div>div>input {
   color: #4F8BF9;
1 change: 1 addition & 0 deletions meme_search/utilities/__init__.py
@@ -6,5 +6,6 @@
 meme_search_dir = os.path.dirname(utilities_base_dir)
 meme_search_root_dir = os.path.dirname(meme_search_dir)

+img_dir = meme_search_root_dir + "/data/input/"
 vector_db_path = meme_search_root_dir + "/data/dbs/memes.faiss"
 sqlite_db_path = meme_search_root_dir + "/data/dbs/memes.db"
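
The same directory layout can also be expressed with `pathlib`, which avoids hand-written separators. This is a sketch of an alternative, not the code in this commit. `build_paths` and the example root directory are hypothetical.

```python
from pathlib import Path


def build_paths(root):
    """Derive the image directory and both database paths from a root directory."""
    root = Path(root)
    dbs = root / "data" / "dbs"
    return {
        "img_dir": root / "data" / "input",
        "vector_db_path": dbs / "memes.faiss",
        "sqlite_db_path": dbs / "memes.db",
    }


paths = build_paths("/tmp/meme_search")
print(paths["sqlite_db_path"].as_posix())  # /tmp/meme_search/data/dbs/memes.db
```

Some callers may expect plain strings rather than `Path` objects, so converting with `str(...)` at the call site is the safer default.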
67 changes: 67 additions & 0 deletions meme_search/utilities/add.py
@@ -0,0 +1,67 @@
import os
import sqlite3
import faiss
from meme_search.utilities import model
from meme_search.utilities.text_extraction import extract_text_from_imgs
from meme_search.utilities.chunks import create_all_img_chunks


def add_to_chunk_db(img_chunks: list, sqlite_db_path: str) -> None:
    # Create a lookup table for chunks
    conn = sqlite3.connect(sqlite_db_path)
    cursor = conn.cursor()

    # Create the table
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS chunks_reverse_lookup (
            img_path TEXT,
            chunk TEXT
        );
    """)

    # Insert data into the table
    for chunk_index, entry in enumerate(img_chunks):
        img_path = entry["img_path"]
        chunk = entry["chunk"]
        cursor.execute(
            "INSERT INTO chunks_reverse_lookup (img_path, chunk) VALUES (?, ?)",
            (img_path, chunk),
        )

    conn.commit()
    conn.close()


def add_to_vector_db(chunks: list, vector_db_path: str) -> None:
    # embed inputs
    embeddings = model.encode(chunks)

    # dump all_embeddings to faiss index
    if os.path.exists(vector_db_path):
        index = faiss.read_index(vector_db_path)
    else:
        index = faiss.IndexFlatL2(embeddings.shape[1])

    index.add(embeddings)
    faiss.write_index(index, vector_db_path)


def add_to_dbs(img_chunks: list, sqlite_db_path: str, vector_db_path: str) -> None:
    try:
        print("STARTING: add_to_dbs")

        # add to db for img_chunks
        add_to_chunk_db(img_chunks, sqlite_db_path)

        # create vector embedding db for chunks
        chunks = [v["chunk"] for v in img_chunks]
        add_to_vector_db(chunks, vector_db_path)
        print("SUCCESS: add_to_dbs succeeded")
    except Exception as e:
        print(f"FAILURE: add_to_dbs failed with exception {e}")


def add(new_imgs_to_be_indexed: list, sqlite_db_path: str, vector_db_path: str) -> None:
    moondream_answers = extract_text_from_imgs(new_imgs_to_be_indexed)
    img_chunks = create_all_img_chunks(new_imgs_to_be_indexed, moondream_answers)
    add_to_dbs(img_chunks, sqlite_db_path, vector_db_path)
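
Because `add_to_chunk_db` and `add_to_vector_db` append in the same order, position *i* in the vector index should line up with the *i*-th row of `chunks_reverse_lookup`, as long as both stores are only ever appended to together. The sketch below illustrates that invariant with an in-memory sqlite database and a plain list standing in for the faiss index. The helper name `append_chunks` and the one-number fake "embeddings" are assumptions made for illustration only.

```python
import sqlite3


def append_chunks(conn, vector_rows, img_chunks):
    """Append chunks to the lookup table and to a list standing in for the vector index."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks_reverse_lookup (img_path TEXT, chunk TEXT)"
    )
    conn.executemany(
        "INSERT INTO chunks_reverse_lookup (img_path, chunk) VALUES (?, ?)",
        [(c["img_path"], c["chunk"]) for c in img_chunks],
    )
    conn.commit()
    # Stand-in for model.encode + index.add: one fake "embedding" per chunk.
    vector_rows.extend([float(len(c["chunk"]))] for c in img_chunks)


conn = sqlite3.connect(":memory:")
vector_rows = []
append_chunks(conn, vector_rows, [{"img_path": "a.jpg", "chunk": "cat"}])
append_chunks(conn, vector_rows, [{"img_path": "b.jpg", "chunk": "doge meme"}])

# Vector position 1 corresponds to sqlite rowid 2, since rowids start at 1.
row = conn.execute(
    "SELECT img_path FROM chunks_reverse_lookup WHERE rowid = ?", (1 + 1,)
).fetchone()
print(row[0], len(vector_rows))  # b.jpg 2
```

Removing images breaks this simple correspondence unless both stores are rebuilt or re-aligned together, which is outside what this file shows.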