Merge pull request #9 from neonwatty/index_delta

Index delta
neonwatty · Jul 19, 2024 · e143671
2 parents 5d95454 + d1b9133
commit e143671
Showing 28 changed files with 1,144 additions and 123 deletions.
diff --git a/.github/workflows/python-app.yml b/.github/workflows/python-app.yml
@@ -47,4 +47,6 @@ jobs:
         pip install -r requirements.txt
     - name: Run pytest
       run: |
-        PYTHONPATH=. python3.10 -m pytest tests/test_streamlit.py
+        PYTHONPATH=. python3.10 -m pytest tests/test_app.py &&
+        PYTHONPATH=. python3.10 -m pytest tests/utilities/test_imgs.py &&
+        PYTHONPATH=. python3.10 -m pytest tests/utilities/test_query.py
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,16 @@
+# Change Log
+All notable changes to this project will be documented in this file.
+
+
+## 2024-07-17
+
+### Added
+
+- Core tests added for the query and imgs modules, and for re-indexing when images are added or removed
+
+- A new "refresh index" button updates the index when images are added to or removed from the data/input image directory; only the newly added or removed images are re-indexed.
+
+
+<p align="center">
+<img align="center" src="https://github.com/jermwatt/readme_gifs/blob/main/meme_search_refresh_button.gif" height="200">
+</p>
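
Note on the changelog entry above: the "index delta" behaviour it describes can be sketched as a set difference between the image paths already indexed and the image paths currently in the input directory. This is only an illustration under that assumption; the function name `index_delta` and its arguments are hypothetical and are not taken from this PR.

```python
from typing import Iterable, List, Tuple


def index_delta(
    indexed_paths: Iterable[str], current_paths: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (to_add, to_remove) for a refresh.

    indexed_paths: image paths already present in the index (hypothetical input).
    current_paths: image paths currently found in the input directory.
    """
    indexed = set(indexed_paths)
    current = set(current_paths)
    to_add = sorted(current - indexed)      # new images that still need indexing
    to_remove = sorted(indexed - current)   # indexed images that no longer exist
    return to_add, to_remove


# Example: one image was deleted and one was added since the last indexing run.
to_add, to_remove = index_delta(
    ["data/input/a.jpg", "data/input/b.jpg"],
    ["data/input/b.jpg", "data/input/c.jpg"],
)
```

When both lists come back empty there is nothing to do, which would correspond to the "no refresh needed!" message in the app.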
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -0,0 +1,61 @@
+# Contributing to Meme Search
+
+Welcome to Meme Search!  We're stoked that you're interested in contributing. 
+
+Before you get started, please take a moment to read through the guidelines below.
+
+
+# How Can I Contribute?
+## Reporting Bugs
+If you encounter a bug or unexpected behavior in Meme Search, please help us by creating an issue in our GitHub repository. Be sure to include as much detail as possible to help us reproduce the issue.
+
+## Suggesting Enhancements
+Have an idea to improve Meme Search?  Bring it on!  You can submit your ideas by creating an issue in our GitHub repository and using the `enhancement` label.
+
+## Contributing Code
+If you're ready to contribute code to Meme Search, follow these steps:
+
+Fork the Repository: Start by forking the repository to your GitHub account.
+
+Clone the Repository: Clone the forked repository to your local machine.
+
+```sh
+git clone https://github.com/neonwatty/meme_search
+```
+
+Create a Branch: Create a new branch for your feature or fix.
+
+```sh
+git checkout -b feature-branch
+```
+
+Make Changes: Make your changes and ensure they follow the coding style of the project.
+
+Test Your Changes: Test your changes to ensure they work as expected.
+
+Commit Your Changes: Commit your changes with a clear and descriptive commit message.
+
+```sh
+git commit -m "Add feature or fix for XYZ"
+```
+
+Push Your Changes: Push your branch to your forked repository.
+
+```sh
+git push origin feature-branch
+```
+
+Create a Pull Request: Create a pull request from your forked repository to the main repository. Be sure to provide a detailed description of your changes.
+
+Review Process: The maintainers will review your pull request and may request changes or provide feedback.
+
+Merge: Once approved, your pull request will be merged into the main repository. Congratulations!
+
+# Code of Conduct
+
+Remember to always be excellent to each other.
+
+# Questions?
+If you have any questions that aren't addressed in this guide, feel free to reach out to us by creating an issue in our GitHub repository.
+
+Thank you for contributing to Meme Search!
diff --git a/README.md b/README.md
@@ -9,13 +9,17 @@ Use Python and AI to index your memes by their content and text, making them eas
 <img align="center" src="https://github.com/jermwatt/readme_gifs/blob/main/meme_search.gif" height="325">
 </p>
 
+
 A table of contents for the remainder of this README:
 
 - [Introduction](#introduction)
 - [Installation instructions](#installation-instructions)
 - [Pipeline overview](#pipeline-overview)
 - [Start the streamlit server](#start-the-streamlit-server)
 - [Index your own memes](#index-your-own-memes)
+- [Changelog](#changelog)
+- [Contributing](#contributing)
+- [Running tests](#running-tests)
 
 ## Introduction
 
@@ -38,7 +42,7 @@ This meme search pipeline is built using the following open source components:
 
 To create a handy tool for your own memes pull the repo and install the requirements file
 
-```python
+```sh
 pip install -r requirements.txt
 ```
 
@@ -55,7 +59,7 @@ docker compose up
 
 After indexing your memes you can then start the streamlit app, allowing you to semantically search for and retrieve your memes
 
-```python
+```sh
 python -m streamlit run meme_search/app.py
 ```
 
@@ -72,13 +76,23 @@ Note: you can drag and drop any recovered meme directly from the streamlit app t
 
 Place any images / memes you would like indexed for the search app in this repo's subdirectory
 
-`data/input/`
+```sh
+data/input/
+```
 
 You can clear out the default test images in this location first, or leave them.
 
-Next - at your terminal - paste the following command
+Next, click the "refresh index" button to update your index after adding images to or removing them from the image directory; only the newly added or removed images are re-indexed.
+
 
-```python
+<p align="center">
+<img align="center" src="https://github.com/jermwatt/readme_gifs/blob/main/meme_search_refresh_button.gif" height="200">
+</p>
+
+
+Alternatively - at your terminal - paste the following command
+
+```sh
 python meme_search/utilities/create.py
 ```
 
@@ -98,3 +112,26 @@ You will see printouts at the terminal indicating success of the 3 main stages f
 
 3.  **index**: index the embeddings in an open source, local vector database, [faiss](https://github.com/facebookresearch/faiss), and store references connecting the embeddings to their images in the greatest little db of all time - [sqlite](https://sqlite.org/)
 
+
+## Changelog
+
+Meme Search is under active development!  See the `CHANGELOG.md` in this repo for a record of the most recent changes.  
+
+## Contributing
+
+Contributions are welcome!  Please see `CONTRIBUTING.md` for basic instructions!
+
+
+## Running tests
+
+Tests can be run by first installing the test requirements as 
+
+```sh
+pip install -r requirements.test
+```
+
+Then the test suite can be run as
+
+```sh
+python -m pytest tests/
+```
diff --git a/data/dbs/memes.db b/data/dbs/memes.db
diff --git a/data/dbs/memes.faiss b/data/dbs/memes.faiss
diff --git a/data/dbs/placeholder b/data/dbs/placeholder
@@ -1 +0,0 @@
-A placeholder file to ensure this directory exists on github

diff --git a/meme_search/app.py b/meme_search/app.py
@@ -1,5 +1,7 @@
-from meme_search import base_dir, sqlite_db_path, vector_db_path
+import time
+from meme_search import base_dir
 from meme_search.utilities.query import complete_query
+from meme_search.utilities.create import process
 import streamlit as st
 
 st.set_page_config(page_title="Meme Search")
@@ -19,16 +21,37 @@ def remote_css(url):
 remote_css("https://fonts.googleapis.com/icon?family=Material+Icons")
 
 # icon("search")
-buff, col, buff2 = st.columns([1, 4, 1])
-
-selected = col.text_input(label="search for meme", placeholder="search for a meme")
-if selected:
-    results = complete_query(selected, vector_db_path, sqlite_db_path)
-    img_paths = [v["img_path"] for v in results]
-    for result in results:
-        with col.container(border=True):
-            st.image(
-                result["img_path"],
-                output_format="auto",
-                caption=f'{result["full_description"]} (query distance = {result["distance"]})',
-            )
+with st.container():
+    with st.container(border=True):
+        input_col, button_col = st.columns([6, 2])
+
+    with button_col:
+        st.empty()
+        refresh_index_button = st.button("refresh index", type="primary")
+        if refresh_index_button:
+            process_start = st.warning("refreshing...")
+            val = process()
+            if val:
+                process_start.empty()
+                success = st.success("index updated!")
+                time.sleep(2)
+                process_start.empty()
+                success.empty()
+            else:
+                process_start.empty()
+                warning = st.warning("no refresh needed!")
+                time.sleep(2)
+                warning.empty()
+
+    selected = input_col.text_input(label="meme search", placeholder="search for your meme", label_visibility="collapsed")
+    if selected:
+        results = complete_query(selected)
+        img_paths = [v["img_path"] for v in results]
+        with st.container(border=True):
+            for result in results:
+                with st.container(border=True):
+                    st.image(
+                        result["img_path"],
+                        output_format="auto",
+                        caption=f'{result["full_description"]} (query distance = {result["distance"]})',
+                    )
diff --git a/meme_search/style.css b/meme_search/style.css
@@ -3,12 +3,12 @@ body {
     background-color: #4F8BF9;
 }
 
-.stButton>button {
+/* .stButton>button {
     color: #4F8BF9;
     border-radius: 50%;
     height: 3em;
     width: 3em;
-}
+} */
 
 .stTextInput>div>div>input {
     color: #4F8BF9;

diff --git a/meme_search/utilities/__init__.py b/meme_search/utilities/__init__.py
@@ -6,5 +6,6 @@
 meme_search_dir = os.path.dirname(utilities_base_dir)
 meme_search_root_dir = os.path.dirname(meme_search_dir)
 
+img_dir = meme_search_root_dir + "/data/input/"
 vector_db_path = meme_search_root_dir + "/data/dbs/memes.faiss"
 sqlite_db_path = meme_search_root_dir + "/data/dbs/memes.db"
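
Note on the hunk above: the paths are built by concatenating strings with hard-coded `/` separators. A possible alternative, shown only as a sketch, is `os.path.join`, which uses the platform's separator. The helper name `build_paths` and the example root directory are hypothetical and are not part of this PR.

```python
import os


def build_paths(root_dir: str) -> dict:
    """Build the data paths used by the app from a repository root (hypothetical helper)."""
    return {
        # A trailing empty component keeps the trailing separator on the directory path.
        "img_dir": os.path.join(root_dir, "data", "input", ""),
        "vector_db_path": os.path.join(root_dir, "data", "dbs", "memes.faiss"),
        "sqlite_db_path": os.path.join(root_dir, "data", "dbs", "memes.db"),
    }


# Example with an assumed root directory.
paths = build_paths(os.path.join(os.sep, "repo"))
```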
diff --git a/meme_search/utilities/add.py b/meme_search/utilities/add.py
@@ -0,0 +1,67 @@
+import os
+import sqlite3
+import faiss
+from meme_search.utilities import model
+from meme_search.utilities.text_extraction import extract_text_from_imgs
+from meme_search.utilities.chunks import create_all_img_chunks
+
+
+def add_to_chunk_db(img_chunks: list, sqlite_db_path: str) -> None:
+    # Create a lookup table for chunks
+    conn = sqlite3.connect(sqlite_db_path)
+    cursor = conn.cursor()
+
+    # Create the table
+    cursor.execute("""
+        CREATE TABLE IF NOT EXISTS chunks_reverse_lookup (
+            img_path TEXT,
+            chunk TEXT
+        );
+    """)
+
+    # Insert data into the table
+    for chunk_index, entry in enumerate(img_chunks):
+        img_path = entry["img_path"]
+        chunk = entry["chunk"]
+        cursor.execute(
+            "INSERT INTO chunks_reverse_lookup (img_path, chunk) VALUES (?, ?)",
+            (img_path, chunk),
+        )
+
+    conn.commit()
+    conn.close()
+
+
+def add_to_vector_db(chunks: list, vector_db_path: str) -> None:
+    # embed inputs
+    embeddings = model.encode(chunks)
+
+    # dump all_embeddings to faiss index
+    if os.path.exists(vector_db_path):
+        index = faiss.read_index(vector_db_path)
+    else:
+        index = faiss.IndexFlatL2(embeddings.shape[1])
+
+    index.add(embeddings)
+    faiss.write_index(index, vector_db_path)
+
+
+def add_to_dbs(img_chunks: list, sqlite_db_path: str, vector_db_path: str) -> None:
+    try:
+        print("STARTING: add_to_dbs")
+
+        # add to db for img_chunks
+        add_to_chunk_db(img_chunks, sqlite_db_path)
+
+        # create vector embedding db for chunks
+        chunks = [v["chunk"] for v in img_chunks]
+        add_to_vector_db(chunks, vector_db_path)
+        print("SUCCESS: add_to_dbs succeeded")
+    except Exception as e:
+        print(f"FAILURE: add_to_dbs failed with exception {e}")
+
+
+def add(new_imgs_to_be_indexed: list, sqlite_db_path: str, vector_db_path: str) -> None:
+    moondream_answers = extract_text_from_imgs(new_imgs_to_be_indexed)
+    img_chunks = create_all_img_chunks(new_imgs_to_be_indexed, moondream_answers)
+    add_to_dbs(img_chunks, sqlite_db_path, vector_db_path)
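
Note on `add_to_chunk_db` above: the row-by-row insert could also be written with `executemany` inside a `with conn:` block, which commits on success and rolls back if an insert raises. The sketch below is an illustration, not the PR's implementation. It reuses the table and column names from the diff, but the function name `add_chunks`, the in-memory database, and the sample rows are assumptions made for the example.

```python
import sqlite3


def add_chunks(conn: sqlite3.Connection, img_chunks: list) -> None:
    """Insert (img_path, chunk) rows into the reverse lookup table."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS chunks_reverse_lookup (
            img_path TEXT,
            chunk TEXT
        );
        """
    )
    rows = [(entry["img_path"], entry["chunk"]) for entry in img_chunks]
    # The connection as a context manager commits on success and rolls back on error.
    with conn:
        conn.executemany(
            "INSERT INTO chunks_reverse_lookup (img_path, chunk) VALUES (?, ?)",
            rows,
        )


# Example against an in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
add_chunks(
    conn,
    [
        {"img_path": "data/input/a.jpg", "chunk": "first chunk"},
        {"img_path": "data/input/a.jpg", "chunk": "second chunk"},
    ],
)
stored = conn.execute(
    "SELECT img_path, chunk FROM chunks_reverse_lookup ORDER BY rowid"
).fetchall()
conn.close()
```

The rows come back in insertion order, the same order in which the diff's `add_to_dbs` passes the chunks on to `add_to_vector_db`.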