diff --git a/README.md b/README.md
index d2249717..d8d1edc6 100644
--- a/README.md
+++ b/README.md
@@ -14,6 +14,8 @@ Slash Your LLM API Costs by 10x 💰, Boost Speed by 100x ⚡
 
 📔 This project is undergoing swift development, and as such, the API may be subject to change at any time. For the most up-to-date information, please refer to the latest [documentation]( https://gptcache.readthedocs.io/en/latest/) and [release note](https://github.com/zilliztech/GPTCache/blob/main/docs/release_note.md).
 
+**NOTE:** As the number of large models grows explosively and their API shapes keep evolving, we no longer add support for new APIs or models. Instead, we encourage using the get and set API in GPTCache; see the demo code: https://github.com/zilliztech/GPTCache/blob/main/examples/adapter/api.py
+
 ## Quick Install
 
 `pip install gptcache`
 
@@ -279,7 +281,7 @@ GPTCache offers the following primary benefits:
 
 - **Decreased expenses**: Most LLM services charge fees based on a combination of number of requests and [token count](https://openai.com/pricing). GPTCache effectively minimizes your expenses by caching query results, which in turn reduces the number of requests and tokens sent to the LLM service. As a result, you can enjoy a more cost-efficient experience when using the service.
 - **Enhanced performance**: LLMs employ generative AI algorithms to generate responses in real-time, a process that can sometimes be time-consuming. However, when a similar query is cached, the response time significantly improves, as the result is fetched directly from the cache, eliminating the need to interact with the LLM service. In most situations, GPTCache can also provide superior query throughput compared to standard LLM services.
 - **Adaptable development and testing environment**: As a developer working on LLM applications, you're aware that connecting to LLM APIs is generally necessary, and comprehensive testing of your application is crucial before moving it to a production environment. GPTCache provides an interface that mirrors LLM APIs and accommodates storage of both LLM-generated and mocked data. This feature enables you to effortlessly develop and test your application, eliminating the need to connect to the LLM service.
-- **Improved scalability and availability**: LLM services frequently enforce [rate limits](https://platform.openai.com/docs/guides/rate-limits), which are constraints that APIs place on the number of times a user or client can access the server within a given timeframe. Hitting a rate limit means that additional requests will be blocked until a certain period has elapsed, leading to a service outage. With GPTCache, you can easily scale to accommodate an increasing volume of of queries, ensuring consistent performance as your application's user base expands.
+- **Improved scalability and availability**: LLM services frequently enforce [rate limits](https://platform.openai.com/docs/guides/rate-limits), which are constraints that APIs place on the number of times a user or client can access the server within a given timeframe. Hitting a rate limit means that additional requests will be blocked until a certain period has elapsed, leading to a service outage. With GPTCache, you can easily scale to accommodate an increasing volume of queries, ensuring consistent performance as your application's user base expands.
 
 ## 🤔 How does it work?
 
@@ -348,7 +350,7 @@ This module is created to extract embeddings from requests for similarity search
   - [ ] Support other storages.
 - **Vector Store**: The **Vector Store** module helps find the K most similar requests from the input request's extracted embedding. The results can help assess similarity. GPTCache provides a user-friendly interface that supports various vector stores, including Milvus, Zilliz Cloud, and FAISS. More options will be available in the future.
-  - [x] Support [Milvus](https://milvus.io/), an open-source vector database for production-ready AI/LLM applicaionts.
+  - [x] Support [Milvus](https://milvus.io/), an open-source vector database for production-ready AI/LLM applications.
   - [x] Support [Zilliz Cloud](https://cloud.zilliz.com/), a fully-managed cloud vector database based on Milvus.
   - [x] Support [Milvus Lite](https://github.com/milvus-io/milvus-lite), a lightweight version of Milvus that can be embedded into your Python application.
   - [x] Support [FAISS](https://faiss.ai/), a library for efficient similarity search and clustering of dense vectors.
diff --git a/docs/contributing.md b/docs/contributing.md
index d35b09ac..8abe4302 100644
--- a/docs/contributing.md
+++ b/docs/contributing.md
@@ -102,7 +102,7 @@ refer to the implementation of [milvus](https://github.com/zilliztech/GPTCache/b
 
 ## Add a new data manager
 
-refer to the implementation of [MapDataManager, SSDataManager](https://github.com/zilliztech/GPTCache/blob/main/gptcache/cache/data_manager.py).
+refer to the implementation of [MapDataManager, SSDataManager](https://github.com/zilliztech/GPTCache/blob/main/gptcache/manager/data_manager.py).
 
 1. Implement the [DataManager](https://github.com/zilliztech/GPTCache/blob/main/gptcache/manager/data_manager.py) interface
 2. Add the new store to the [get_data_manager](https://github.com/zilliztech/GPTCache/blob/main/gptcache/manager/data_manager.py) method
diff --git a/examples/README.md b/examples/README.md
index 5cad5c05..91dc2187 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -1,13 +1,21 @@
 # Example
 
-- [How to run Visual Question Answering with MiniGPT-4](#How-to-run-Visual-Question-Answering-with-MiniGPT-4)
-- [How to set the **embedding** function](#How-to-set-the-embedding-function)
-- [How to set the **data manager** class](#How-to-set-the-data-manager-class)
-- [How to set the **similarity evaluation** interface](#How-to-set-the-similarity-evaluation-interface)
-- [Other cache init params](#Other-cache-init-params)
-- [How to run with session](#How-to-run-with-session)
-- [How to use GPTCache server](#How-to-use-GPTCache-server)
-- [Benchmark](#Benchmark)
+- [Example](#example)
+  - [How to run Visual Question Answering with MiniGPT-4](#how-to-run-visual-question-answering-with-minigpt-4)
+  - [How to set the `embedding` function](#how-to-set-the-embedding-function)
+    - [Default embedding function](#default-embedding-function)
+    - [Suitable for embedding methods consisting of a cached storage and vector store](#suitable-for-embedding-methods-consisting-of-a-cached-storage-and-vector-store)
+    - [Custom embedding](#custom-embedding)
+  - [How to set the `data manager` class](#how-to-set-the-data-manager-class)
+  - [How to set the `similarity evaluation` interface](#how-to-set-the-similarity-evaluation-interface)
+  - [Request cache parameter customization](#request-cache-parameter-customization)
+  - [How to run with session](#how-to-run-with-session)
+    - [Run in `with` method](#run-in-with-method)
+    - [Custom Session](#custom-session)
+  - [How to use GPTCache server](#how-to-use-gptcache-server)
+    - [Start server](#start-server)
+  - [Benchmark](#benchmark)
+  - [How to use post-process function](#how-to-use-post-process-function)
 
 ## How to run Visual Question Answering with MiniGPT-4
 
@@ -686,3 +694,24 @@ similarity evaluation func: pair_evaluation (search distance)
 | 0.95      | 0.12s             | 425     | 25    | 549   |
 | 0.9       | 0.23s             | 804     | 77    | 118   |
 | 0.8       | 0.26s             | 904     | 92    | 3     |
+## How to use post-process function
+
+You can use the `LlmVerifier` post-processor to process the cached answer list after recall. It is similar to `first` or `random_one`, but it calls an LLM to verify whether the recalled question is truly similar to the user's question. You can define your own system prompt to decide when the LLM should reject a recalled answer, and you can choose a small model for the verification step, so it adds only a small extra cost.
+Example usage:
+
+```python
+from gptcache.processor.post import LlmVerifier
+
+# ... (init cache, embedding, data_manager, etc.)
+
+cache.init(
+    embedding_func=onnx.to_embeddings,
+    data_manager=data_manager,
+    similarity_evaluation=SearchDistanceEvaluation(),
+    post_process_messages_func=LlmVerifier(client=None,
+        system_prompt=custom_prompt,
+        model="gpt-3.5-turbo")
+)
+```
+
+See [processor/llm_verifier_example.py](./processor/llm_verifier_example.py) for a runnable example.
diff --git a/examples/processor/llm_verifier_example.py b/examples/processor/llm_verifier_example.py
new file mode 100644
index 00000000..c7f7b118
--- /dev/null
+++ b/examples/processor/llm_verifier_example.py
@@ -0,0 +1,47 @@
+import time
+import os
+
+from gptcache import cache
+from gptcache.adapter import openai
+from gptcache.embedding import Onnx
+from gptcache.manager import manager_factory
+from gptcache.processor.post import LlmVerifier
+from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation
+
+print("This example demonstrates how to use LLM verification with OpenAI's GPT-3.5 Turbo model.")
+cache.set_openai_key()
+
+onnx = Onnx()
+data_manager = manager_factory("sqlite,faiss", vector_params={"dimension": onnx.dimension})
+
+
+
+
+custom_prompt = """You are a helpful assistant. Your task is to verify whether the answer is semantically consistent with the question.
+If the answer is consistent, respond with "yes". If it is not consistent, respond with "no".
+You must only respond in "yes" or "no". """
""" + +verifier = LlmVerifier(client=None, + system_prompt=custom_prompt, + model="gpt-3.5-turbo") + +cache.init( + embedding_func=onnx.to_embeddings, + data_manager=data_manager, + similarity_evaluation=SearchDistanceEvaluation(), + post_process_messages_func=verifier +) + +question = 'what is github' + +for _ in range(3): + start = time.time() + response = openai.ChatCompletion.create( + model='gpt-3.5-turbo', + messages=[{ + 'role': 'user', + 'content': question + }], + ) + print(f"Response: {response['choices'][0]['message']['content']}") + print(f"Time: {round(time.time() - start, 2)}s\n") diff --git a/gptcache/__init__.py b/gptcache/__init__.py index 39df80f7..adbd3527 100644 --- a/gptcache/__init__.py +++ b/gptcache/__init__.py @@ -1,5 +1,5 @@ """gptcache version""" -__version__ = "0.1.43" +__version__ = "0.1.44" from gptcache.config import Config from gptcache.core import Cache diff --git a/gptcache/adapter/adapter.py b/gptcache/adapter/adapter.py index aaa279ea..67cf64cf 100644 --- a/gptcache/adapter/adapter.py +++ b/gptcache/adapter/adapter.py @@ -3,7 +3,7 @@ import numpy as np from gptcache import cache -from gptcache.processor.post import temperature_softmax +from gptcache.processor.post import temperature_softmax, LlmVerifier from gptcache.utils.error import NotInitError from gptcache.utils.log import gptcache_log from gptcache.utils.time import time_cal @@ -189,6 +189,12 @@ def post_process(): scores=[t[0] for t in cache_answers], temperature=temperature, ) + elif chat_cache.post_process_messages_func is LlmVerifier: + return_message = chat_cache.post_process_messages_func( + messages=[t[1] for t in cache_answers], + scores=[t[0] for t in cache_answers], + original_question=pre_embedding_data + ) else: return_message = chat_cache.post_process_messages_func( [t[1] for t in cache_answers] @@ -200,29 +206,30 @@ def post_process(): func_name="post_process", report_func=chat_cache.report.post, )() - chat_cache.report.hint_cache() - cache_whole_data = answers_dict.get(str(return_message)) - if session and cache_whole_data: - chat_cache.data_manager.add_session( - cache_whole_data[2], session.name, pre_embedding_data - ) - if cache_whole_data and not chat_cache.config.disable_report: - # user_question / cache_question / cache_question_id / cache_answer / similarity / consume time/ time - report_cache_data = cache_whole_data[3] - report_search_data = cache_whole_data[2] - chat_cache.data_manager.report_cache( - pre_store_data if isinstance(pre_store_data, str) else "", - report_cache_data.question - if isinstance(report_cache_data.question, str) - else "", - report_search_data[1], - report_cache_data.answers[0].answer - if isinstance(report_cache_data.answers[0].answer, str) - else "", - cache_whole_data[0], - round(time.time() - start_time, 6), - ) - return cache_data_convert(return_message) + if return_message is not None: + chat_cache.report.hint_cache() + cache_whole_data = answers_dict.get(str(return_message)) + if session and cache_whole_data: + chat_cache.data_manager.add_session( + cache_whole_data[2], session.name, pre_embedding_data + ) + if cache_whole_data and not chat_cache.config.disable_report: + # user_question / cache_question / cache_question_id / cache_answer / similarity / consume time/ time + report_cache_data = cache_whole_data[3] + report_search_data = cache_whole_data[2] + chat_cache.data_manager.report_cache( + pre_store_data if isinstance(pre_store_data, str) else "", + report_cache_data.question + if isinstance(report_cache_data.question, str) + else "", + 
+                    report_search_data[1],
+                    report_cache_data.answers[0].answer
+                    if isinstance(report_cache_data.answers[0].answer, str)
+                    else "",
+                    cache_whole_data[0],
+                    round(time.time() - start_time, 6),
+                )
+            return cache_data_convert(return_message)
 
     next_cache = chat_cache.next_cache
     if next_cache:
@@ -444,6 +451,13 @@ def post_process():
                     scores=[t[0] for t in cache_answers],
                     temperature=temperature,
                 )
+            elif isinstance(chat_cache.post_process_messages_func, LlmVerifier):
+                return_message = chat_cache.post_process_messages_func(
+                    messages=[t[1] for t in cache_answers],
+                    scores=[t[0] for t in cache_answers],
+                    original_question=pre_embedding_data,
+                    temperature=temperature,
+                )
             else:
                 return_message = chat_cache.post_process_messages_func(
                     [t[1] for t in cache_answers]
@@ -455,29 +469,30 @@ def post_process():
             func_name="post_process",
             report_func=chat_cache.report.post,
         )()
-        chat_cache.report.hint_cache()
-        cache_whole_data = answers_dict.get(str(return_message))
-        if session and cache_whole_data:
-            chat_cache.data_manager.add_session(
-                cache_whole_data[2], session.name, pre_embedding_data
-            )
-        if cache_whole_data:
-            # user_question / cache_question / cache_question_id / cache_answer / similarity / consume time/ time
-            report_cache_data = cache_whole_data[3]
-            report_search_data = cache_whole_data[2]
-            chat_cache.data_manager.report_cache(
-                pre_store_data if isinstance(pre_store_data, str) else "",
-                report_cache_data.question
-                if isinstance(report_cache_data.question, str)
-                else "",
-                report_search_data[1],
-                report_cache_data.answers[0].answer
-                if isinstance(report_cache_data.answers[0].answer, str)
-                else "",
-                cache_whole_data[0],
-                round(time.time() - start_time, 6),
-            )
-        return cache_data_convert(return_message)
+        if return_message is not None:
+            chat_cache.report.hint_cache()
+            cache_whole_data = answers_dict.get(str(return_message))
+            if session and cache_whole_data:
+                chat_cache.data_manager.add_session(
+                    cache_whole_data[2], session.name, pre_embedding_data
+                )
+            if cache_whole_data:
+                # user_question / cache_question / cache_question_id / cache_answer / similarity / consume time/ time
+                report_cache_data = cache_whole_data[3]
+                report_search_data = cache_whole_data[2]
+                chat_cache.data_manager.report_cache(
+                    pre_store_data if isinstance(pre_store_data, str) else "",
+                    report_cache_data.question
+                    if isinstance(report_cache_data.question, str)
+                    else "",
+                    report_search_data[1],
+                    report_cache_data.answers[0].answer
+                    if isinstance(report_cache_data.answers[0].answer, str)
+                    else "",
+                    cache_whole_data[0],
+                    round(time.time() - start_time, 6),
+                )
+            return cache_data_convert(return_message)
 
     next_cache = chat_cache.next_cache
     if next_cache:
@@ -485,6 +500,7 @@ def post_process():
         kwargs["cache_context"] = context
         kwargs["cache_skip"] = cache_skip
         kwargs["cache_factor"] = cache_factor
+        kwargs["search_only"] = search_only_flag
         llm_data = adapt(
             llm_handler, cache_data_convert, update_cache_callback, *args, **kwargs
         )
diff --git a/gptcache/manager/factory.py b/gptcache/manager/factory.py
index 65d4bb2e..12a0468a 100644
--- a/gptcache/manager/factory.py
+++ b/gptcache/manager/factory.py
@@ -118,6 +118,12 @@ def manager_factory(manager="map",
             maxmemory_samples=eviction_params.get("maxmemory_samples", scalar_params.get("maxmemory_samples")),
         )
 
+    if eviction_manager == "memory":
+        return get_data_manager(s, v, o, None,
+                                eviction_params.get("max_size", 1000),
+                                eviction_params.get("clean_size", None),
+                                eviction_params.get("eviction", "LRU"),)
+
     e = EvictionBase(
         name=eviction_manager,
         **eviction_params
@@ -194,7 +200,7 @@ def get_data_manager(
         vector_base = VectorBase(name=vector_base)
     if isinstance(object_base, str):
         object_base = ObjectBase(name=object_base)
-    if isinstance(eviction_base, str):
+    if isinstance(eviction_base, str) and eviction_base != "memory":
         eviction_base = EvictionBase(name=eviction_base)
     assert cache_base and vector_base
     return SSDataManager(cache_base, vector_base, object_base, eviction_base, max_size, clean_size, eviction)
diff --git a/gptcache/processor/post.py b/gptcache/processor/post.py
index 9a1c3a6e..66c24b5d 100644
--- a/gptcache/processor/post.py
+++ b/gptcache/processor/post.py
@@ -87,3 +87,119 @@ def temperature_softmax(messages: List[Any], scores: List[float], temperature: f
     else:
         m_s = list(zip(messages, scores))
         return sorted(m_s, key=lambda x: x[1], reverse=True)[0][0]
+
+
+def llm_semantic_verification(
+    messages: List[Any],
+    scores: List[float] = None,
+    original_question: str = None,
+    *,
+    client=None,
+    system_prompt: str = None,
+    model: str = "gpt-3.5-turbo",
+    **kwargs
+) -> Any:
+    """
+    Use an LLM to verify whether the cached answer is semantically consistent with the question.
+    If the answer passes verification, return it; otherwise, return None (to trigger a real LLM call).
+
+    :param messages: A list of candidate outputs.
+    :type messages: List[Any]
+    :param scores: A list of evaluation scores corresponding to messages.
+    :type scores: List[float], optional
+    :param original_question: The original question string.
+    :type original_question: str, optional
+    :param client: LLM client object, defaults to None (the installed ``openai`` package is used directly).
+    :type client: Any, optional
+    :param system_prompt: System prompt, defaults to None.
+    :type system_prompt: str, optional
+    :param model: LLM model name, defaults to "gpt-3.5-turbo".
+    :type model: str, optional
+    :param kwargs: Other keyword arguments, accepted for compatibility and ignored.
+    :return: The answer if it passes semantic verification, otherwise None.
+    :rtype: Any
+
+    Example:
+        .. code-block:: python
+
+            from gptcache.processor.post import llm_semantic_verification
+
+            messages = ["answer1", "answer2"]
+            scores = [0.9, 0.5]
+            question = "original question"
+            answer = llm_semantic_verification(messages, scores, original_question=question)
+    """
+    if not messages or not original_question:
+        return None
+
+    # Select the answer with the highest score
+    best_answer = messages[0] if not scores else messages[scores.index(max(scores))]
+
+    if client is None:
+        import openai
+        client = openai
+    # Resolve the chat-completions endpoint: an OpenAI() client or the openai
+    # module (>=1.0) exposes ``chat.completions``; older openai versions expose
+    # the legacy ``ChatCompletion`` API instead.
+    if hasattr(client, "chat"):
+        create_chat_completion = client.chat.completions.create
+    else:
+        create_chat_completion = client.ChatCompletion.create
+
+    if system_prompt is None:
+        system_prompt = ("You are a strict semantic verification assistant. "
+                         "… Only answer 'yes' or 'no'. If unsure, answer 'no'.")
+
+    try:
+        resp = create_chat_completion(
+            model=model,
+            messages=[
+                {"role": "system", "content": system_prompt},
+                {"role": "user",
+                 "content": f"Question: {original_question}\n"
+                            f"Answer: {best_answer}\n"
+                            f"Does this answer fully match the question? yes/no"}
+            ],
+            temperature=0,
+            max_tokens=10
+        )
+        verdict = resp.choices[0].message.content.strip().lower()
+        # Tolerate answers such as "Yes." with trailing punctuation
+        if verdict.startswith("yes"):
+            return best_answer
+    except Exception as e:  # pylint: disable=broad-except
+        print("LLM verification failed:", e)
+
+    return None
+
+
+class LlmVerifier:
+    """
+    LlmVerifier is a callable class that wraps the llm_semantic_verification function.
+    It stores the LLM client, system prompt, and model name for repeated semantic verification tasks.
+
+    :param client: LLM client object.
+    :type client: Any
+    :param system_prompt: System prompt for the LLM.
+    :type system_prompt: str
+    :param model: LLM model name, defaults to "gpt-3.5-turbo".
+    :type model: str, optional
+    """
+    def __init__(self, client=None, system_prompt=None, model="gpt-3.5-turbo"):
+        self.client = client
+        self.system_prompt = system_prompt
+        self.model = model
+
+    def __call__(self, messages, scores=None, original_question=None, **kwargs):
+        """
+        Call the verifier to perform semantic verification using the stored client, prompt, and model.
+
+        :param messages: A list of candidate outputs.
+        :param scores: A list of evaluation scores corresponding to messages.
+        :param original_question: The original question string.
+        :param kwargs: Other keyword arguments.
+        :return: The answer if it passes semantic verification, otherwise None.
+        """
+        return llm_semantic_verification(
+            messages, scores=scores, original_question=original_question,
+            client=self.client, system_prompt=self.system_prompt,
+            model=self.model, **kwargs
+        )
diff --git a/gptcache/utils/__init__.py b/gptcache/utils/__init__.py
index 093fd354..c0a27a80 100644
--- a/gptcache/utils/__init__.py
+++ b/gptcache/utils/__init__.py
@@ -105,7 +105,7 @@ def import_huggingface_hub():
 
 
 def import_onnxruntime():
-    _check_library("onnxruntime", package="onnxruntime==1.14.1")
+    _check_library("onnxruntime", package="onnxruntime==1.21.1")
 
 
 def import_faiss():
diff --git a/tests/unit_tests/processor/test_post.py b/tests/unit_tests/processor/test_post.py
index 72bd837d..648dab40 100644
--- a/tests/unit_tests/processor/test_post.py
+++ b/tests/unit_tests/processor/test_post.py
@@ -1,4 +1,5 @@
 from gptcache.processor.post import random_one, first, nop, temperature_softmax
+from unittest.mock import Mock
 
 
 def test_random_one():
@@ -28,8 +29,33 @@ def test_temperature_softmax():
     assert message == "foo2"
 
 
+def test_llm_verifier():
+    # mock client that always returns 'yes'
+    mock_client = Mock()
+    mock_resp = Mock()
+    mock_choice = Mock()
+    mock_choice.message.content = 'yes'
+    mock_resp.choices = [mock_choice]
+    mock_client.chat.completions.create.return_value = mock_resp
+
+    from gptcache.processor.post import LlmVerifier
+    verifier = LlmVerifier(client=mock_client, system_prompt="test prompt", model="fake-model")
+    messages = ["foo", "bar"]
+    scores = [0.1, 0.9]
+    result = verifier(messages, scores=scores, original_question="test question")
+    assert result == "bar"
+
+    # mock client that returns 'no'
+    mock_choice.message.content = 'no'
+    result = verifier(messages, scores=scores, original_question="test question")
+    assert result is None
+
+
 if __name__ == "__main__":
     test_first()
     test_nop()
     test_random_one()
-    test_temperature_softmax()
\ No newline at end of file
+    test_temperature_softmax()
+    test_llm_verifier()
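---

For reference, the sketch below shows how an application might wire the new `LlmVerifier` post-processor with an explicit client instead of the `client=None` fallback used in the examples above. It is a minimal sketch, not part of this change: `OpenAI()` assumes an openai>=1.0 installation, and `gpt-4o-mini` is only a stand-in for whatever small verification model you prefer.

```python
from openai import OpenAI  # assumes openai>=1.0 is installed

from gptcache import cache
from gptcache.embedding import Onnx
from gptcache.manager import manager_factory
from gptcache.processor.post import LlmVerifier
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

onnx = Onnx()
data_manager = manager_factory("sqlite,faiss", vector_params={"dimension": onnx.dimension})

# A dedicated client and a small model keep the verification step cheap;
# the model name is illustrative, not prescribed by GPTCache.
verifier = LlmVerifier(
    client=OpenAI(),
    system_prompt="Verify that the answer matches the question. Reply only 'yes' or 'no'.",
    model="gpt-4o-mini",
)

cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
    post_process_messages_func=verifier,
)
```

Because the adapter now checks `if return_message is not None:` before reporting a hit, a rejection from the verifier (a `None` return) simply falls through to a real LLM request instead of serving the mismatched cached answer.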