File Embedding Failed #117

Open
jimmyland22 opened this issue Feb 5, 2025 · 0 comments

I'm trying to get RAG set up with LibreChat, running under Docker. Below are the relevant settings in .env:

RAG_API_URL=http://host.docker.internal:8000
RAG_AZURE_OPENAI_API_KEY=xxx
RAG_AZURE_OPENAI_ENDPOINT=https://oai-ocioempsent-dev.openai.azure.com
EMBEDDINGS_PROVIDER=azure
EMBEDDINGS_MODEL=text-embedding-3-small

The first time I upload a file, I get this error:
2025-02-05 15:27:44 rag_api | [nltk_data] Error loading averaged_perceptron_tagger_eng: <urlopen
2025-02-05 15:27:44 rag_api | [nltk_data] error [SSL: CERTIFICATE_VERIFY_FAILED] certificate
2025-02-05 15:27:44 rag_api | [nltk_data] verify failed: unable to get local issuer certificate
2025-02-05 15:27:44 rag_api | [nltk_data] (_ssl.c:1007)>
2025-02-05 15:27:44 rag_api | [nltk_data] Error loading punkt_tab: <urlopen error [SSL:
2025-02-05 15:27:44 rag_api | [nltk_data] CERTIFICATE_VERIFY_FAILED] certificate verify failed:
2025-02-05 15:27:44 rag_api | [nltk_data] unable to get local issuer certificate (_ssl.c:1007)>
2025-02-05 15:27:44 rag_api | 2025-02-05 23:27:44,556 - root - ERROR - Error during file processing:
2025-02-05 15:27:44 rag_api | **********************************************************************
2025-02-05 15:27:44 rag_api | Resource averaged_perceptron_tagger_eng not found.
2025-02-05 15:27:44 rag_api | Please use the NLTK Downloader to obtain the resource:
2025-02-05 15:27:44 rag_api |
2025-02-05 15:27:44 rag_api | >>> import nltk
2025-02-05 15:27:44 rag_api | >>> nltk.download('averaged_perceptron_tagger_eng')
2025-02-05 15:27:44 rag_api |
2025-02-05 15:27:44 rag_api | For more information see: https://www.nltk.org/data.html
2025-02-05 15:27:44 rag_api |
2025-02-05 15:27:44 rag_api | Attempted to load taggers/averaged_perceptron_tagger_eng/
2025-02-05 15:27:44 rag_api |
2025-02-05 15:27:44 rag_api | Searched in:
2025-02-05 15:27:44 rag_api | - '/app/nltk_data'
2025-02-05 15:27:44 rag_api | - '/root/nltk_data'
2025-02-05 15:27:44 rag_api | - '/usr/local/nltk_data'
2025-02-05 15:27:44 rag_api | - '/usr/local/share/nltk_data'
2025-02-05 15:27:44 rag_api | - '/usr/local/lib/nltk_data'
2025-02-05 15:27:44 rag_api | - '/usr/share/nltk_data'
2025-02-05 15:27:44 rag_api | - '/usr/local/share/nltk_data'
2025-02-05 15:27:44 rag_api | - '/usr/lib/nltk_data'
2025-02-05 15:27:44 rag_api | - '/usr/local/lib/nltk_data'
2025-02-05 15:27:44 rag_api | **********************************************************************
2025-02-05 15:27:44 rag_api |
2025-02-05 15:27:44 rag_api | Traceback: Traceback (most recent call last):
2025-02-05 15:27:44 rag_api | File "/app/main.py", line 476, in embed_file
2025-02-05 15:27:44 rag_api | data = loader.load()
2025-02-05 15:27:44 rag_api | File "/usr/local/lib/python3.10/site-packages/langchain_core/document_loaders/base.py", line 31, in load
2025-02-05 15:27:44 rag_api | return list(self.lazy_load())
2025-02-05 15:27:44 rag_api | File "/usr/local/lib/python3.10/site-packages/langchain_community/document_loaders/unstructured.py", line 107, in lazy_load
2025-02-05 15:27:44 rag_api | elements = self._get_elements()
2025-02-05 15:27:44 rag_api | File "/usr/local/lib/python3.10/site-packages/langchain_community/document_loaders/powerpoint.py", line 64, in _get_elements
2025-02-05 15:27:44 rag_api | return partition_pptx(filename=self.file_path, **self.unstructured_kwargs) # type: ignore[arg-type]
2025-02-05 15:27:44 rag_api | File "/usr/local/lib/python3.10/site-packages/unstructured/partition/common/metadata.py", line 162, in wrapper
2025-02-05 15:27:44 rag_api | elements = func(*args, **kwargs)
2025-02-05 15:27:44 rag_api | File "/usr/local/lib/python3.10/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
2025-02-05 15:27:44 rag_api | elements = func(*args, **kwargs)
2025-02-05 15:27:44 rag_api | File "/usr/local/lib/python3.10/site-packages/unstructured/partition/pptx.py", line 126, in partition_pptx
2025-02-05 15:27:44 rag_api | return list(_PptxPartitioner.iter_presentation_elements(opts))
2025-02-05 15:27:44 rag_api | File "/usr/local/lib/python3.10/site-packages/unstructured/partition/pptx.py", line 169, in _iter_presentation_elements
2025-02-05 15:27:44 rag_api | yield from self._iter_shape_elements(shape)
2025-02-05 15:27:44 rag_api | File "/usr/local/lib/python3.10/site-packages/unstructured/partition/pptx.py", line 233, in _iter_shape_elements
2025-02-05 15:27:44 rag_api | elif is_possible_narrative_text(text):
2025-02-05 15:27:44 rag_api | File "/usr/local/lib/python3.10/site-packages/unstructured/partition/text_type.py", line 84, in is_possible_narrative_text
2025-02-05 15:27:44 rag_api | if "eng" in languages and (sentence_count(text, 3) < 2) and (not contains_verb(text)):
2025-02-05 15:27:44 rag_api | File "/usr/local/lib/python3.10/site-packages/unstructured/partition/text_type.py", line 186, in contains_verb
2025-02-05 15:27:44 rag_api | pos_tags = pos_tag(text)
2025-02-05 15:27:44 rag_api | File "/usr/local/lib/python3.10/site-packages/unstructured/nlp/tokenize.py", line 78, in pos_tag
2025-02-05 15:27:44 rag_api | parts_of_speech.extend(_pos_tag(tokens))
2025-02-05 15:27:44 rag_api | File "/usr/local/lib/python3.10/site-packages/nltk/tag/__init__.py", line 168, in pos_tag
2025-02-05 15:27:44 rag_api | tagger = _get_tagger(lang)
2025-02-05 15:27:44 rag_api | File "/usr/local/lib/python3.10/site-packages/nltk/tag/__init__.py", line 110, in _get_tagger
2025-02-05 15:27:44 rag_api | tagger = PerceptronTagger()
2025-02-05 15:27:44 rag_api | File "/usr/local/lib/python3.10/site-packages/nltk/tag/perceptron.py", line 183, in __init__
2025-02-05 15:27:44 rag_api | self.load_from_json(lang)
2025-02-05 15:27:44 rag_api | File "/usr/local/lib/python3.10/site-packages/nltk/tag/perceptron.py", line 273, in load_from_json
2025-02-05 15:27:44 rag_api | loc = find(f"taggers/averaged_perceptron_tagger_{lang}/")
2025-02-05 15:27:44 rag_api | File "/usr/local/lib/python3.10/site-packages/nltk/data.py", line 579, in find
2025-02-05 15:27:44 rag_api | raise LookupError(resource_not_found)
2025-02-05 15:27:44 rag_api | LookupError:
2025-02-05 15:27:44 rag_api | **********************************************************************
2025-02-05 15:27:44 rag_api | Resource averaged_perceptron_tagger_eng not found.
2025-02-05 15:27:44 rag_api | Please use the NLTK Downloader to obtain the resource:
2025-02-05 15:27:44 rag_api |
2025-02-05 15:27:44 rag_api | >>> import nltk
2025-02-05 15:27:44 rag_api | >>> nltk.download('averaged_perceptron_tagger_eng')
2025-02-05 15:27:44 rag_api |
2025-02-05 15:27:44 rag_api | For more information see: https://www.nltk.org/data.html
2025-02-05 15:27:44 rag_api |
2025-02-05 15:27:44 rag_api | Attempted to load taggers/averaged_perceptron_tagger_eng/
2025-02-05 15:27:44 rag_api |
2025-02-05 15:27:44 rag_api | Searched in:
2025-02-05 15:27:44 rag_api | - '/app/nltk_data'
2025-02-05 15:27:44 rag_api | - '/root/nltk_data'
2025-02-05 15:27:44 rag_api | - '/usr/local/nltk_data'
2025-02-05 15:27:44 rag_api | - '/usr/local/share/nltk_data'
2025-02-05 15:27:44 rag_api | - '/usr/local/lib/nltk_data'
2025-02-05 15:27:44 rag_api | - '/usr/share/nltk_data'
2025-02-05 15:27:44 rag_api | - '/usr/local/share/nltk_data'
2025-02-05 15:27:44 rag_api | - '/usr/lib/nltk_data'
2025-02-05 15:27:44 rag_api | - '/usr/local/lib/nltk_data'
2025-02-05 15:27:44 rag_api | **********************************************************************
2025-02-05 15:27:44 rag_api |
2025-02-05 15:27:44 rag_api |
2025-02-05 15:27:44 rag_api | 2025-02-05 23:27:44,558 - root - INFO - Request POST http://rag_api:8000/embed - 400
2025-02-05 15:27:44 LibreChat | 2025-02-05 23:27:44 error: Error uploading vectors The request was made and the server responded with a status code that falls out of the range of 2xx: Request failed with status code 400. Error response data:
2025-02-05 15:27:44 LibreChat |
2025-02-05 15:27:44 LibreChat | 2025-02-05 23:27:44 error: [/files] Error processing file: Request failed with status code 400
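For context, this first failure happens because the rag_api container cannot verify TLS certificates when NLTK tries to fetch its tagger data, which typically points to a corporate proxy re-signing outbound traffic. A minimal workaround sketch, assuming the corporate root CA bundle can be mounted into the container (the bundle path below is hypothetical):

```python
import os

# Assumption: the corporate root CA bundle has been mounted into the
# rag_api container at this (hypothetical) path.
CA_BUNDLE = "/etc/ssl/certs/corp-ca-bundle.crt"

# Python's ssl module and the requests library both honor these variables,
# so downloads will trust certificates signed by the corporate CA.
os.environ["SSL_CERT_FILE"] = CA_BUNDLE
os.environ["REQUESTS_CA_BUNDLE"] = CA_BUNDLE

# With the bundle trusted, the missing resources can be fetched once into a
# directory NLTK already searches (see the "Searched in" list in the log):
#   import nltk
#   for resource in ("averaged_perceptron_tagger_eng", "punkt_tab"):
#       nltk.download(resource, download_dir="/app/nltk_data")
```

Setting the environment variables in the container's environment (e.g. via docker-compose) rather than in code achieves the same thing without patching the image.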


Subsequent file upload attempts get this:
2025-02-05 15:37:23 rag_api | 2025-02-05 23:37:23,143 - root - ERROR - Failed to store data in vector DB | File ID: 2f2f487e-2353-4e7e-8a2f-794822351cef | User ID: 67a3a7196f3d4d63c458eb7a | Error: HTTPSConnectionPool(host='openaipublic.blob.core.windows.net', port=443): Max retries exceeded with url: /encodings/cl100k_base.tiktoken (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)'))) | Traceback: Traceback (most recent call last):
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 466, in _make_request
2025-02-05 15:37:23 rag_api | self._validate_conn(conn)
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1095, in _validate_conn
2025-02-05 15:37:23 rag_api | conn.connect()
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 730, in connect
2025-02-05 15:37:23 rag_api | sock_and_verified = _ssl_wrap_socket_and_match_hostname(
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 909, in _ssl_wrap_socket_and_match_hostname
2025-02-05 15:37:23 rag_api | ssl_sock = ssl_wrap_socket(
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 469, in ssl_wrap_socket
2025-02-05 15:37:23 rag_api | ssl_sock = _ssl_wrap_socket_impl(sock, context, tls_in_tls, server_hostname)
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 513, in _ssl_wrap_socket_impl
2025-02-05 15:37:23 rag_api | return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/ssl.py", line 513, in wrap_socket
2025-02-05 15:37:23 rag_api | return self.sslsocket_class._create(
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/ssl.py", line 1104, in _create
2025-02-05 15:37:23 rag_api | self.do_handshake()
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/ssl.py", line 1375, in do_handshake
2025-02-05 15:37:23 rag_api | self._sslobj.do_handshake()
2025-02-05 15:37:23 rag_api | ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)
2025-02-05 15:37:23 rag_api |
2025-02-05 15:37:23 rag_api | During handling of the above exception, another exception occurred:
2025-02-05 15:37:23 rag_api |
2025-02-05 15:37:23 rag_api | Traceback (most recent call last):
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 789, in urlopen
2025-02-05 15:37:23 rag_api | response = self._make_request(
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 490, in _make_request
2025-02-05 15:37:23 rag_api | raise new_e
2025-02-05 15:37:23 rag_api | urllib3.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)
2025-02-05 15:37:23 rag_api |
2025-02-05 15:37:23 rag_api | The above exception was the direct cause of the following exception:
2025-02-05 15:37:23 rag_api |
2025-02-05 15:37:23 rag_api | Traceback (most recent call last):
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 667, in send
2025-02-05 15:37:23 rag_api | resp = conn.urlopen(
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 843, in urlopen
2025-02-05 15:37:23 rag_api | retries = retries.increment(
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/urllib3/util/retry.py", line 519, in increment
2025-02-05 15:37:23 rag_api | raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
2025-02-05 15:37:23 rag_api | urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='openaipublic.blob.core.windows.net', port=443): Max retries exceeded with url: /encodings/cl100k_base.tiktoken (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)')))
2025-02-05 15:37:23 rag_api |
2025-02-05 15:37:23 rag_api | During handling of the above exception, another exception occurred:
2025-02-05 15:37:23 rag_api |
2025-02-05 15:37:23 rag_api | Traceback (most recent call last):
2025-02-05 15:37:23 rag_api | File "/app/main.py", line 326, in store_data_in_vector_db
2025-02-05 15:37:23 rag_api | ids = await vector_store.aadd_documents(
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/langchain_core/vectorstores/base.py", line 324, in aadd_documents
2025-02-05 15:37:23 rag_api | return await run_in_executor(None, self.add_documents, documents, **kwargs)
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/langchain_core/runnables/config.py", line 588, in run_in_executor
2025-02-05 15:37:23 rag_api | return await asyncio.get_running_loop().run_in_executor(
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
2025-02-05 15:37:23 rag_api | result = self.fn(*self.args, **self.kwargs)
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/langchain_core/runnables/config.py", line 579, in wrapper
2025-02-05 15:37:23 rag_api | return func(*args, **kwargs)
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/langchain_core/vectorstores/base.py", line 287, in add_documents
2025-02-05 15:37:23 rag_api | return self.add_texts(texts, metadatas, **kwargs)
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/langchain_community/vectorstores/pgvector.py", line 561, in add_texts
2025-02-05 15:37:23 rag_api | embeddings = self.embedding_function.embed_documents(list(texts))
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/langchain_openai/embeddings/base.py", line 588, in embed_documents
2025-02-05 15:37:23 rag_api | return self._get_len_safe_embeddings(texts, engine=engine)
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/langchain_openai/embeddings/base.py", line 480, in _get_len_safe_embeddings
2025-02-05 15:37:23 rag_api | _iter, tokens, indices = self._tokenize(texts, _chunk_size)
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/langchain_openai/embeddings/base.py", line 420, in _tokenize
2025-02-05 15:37:23 rag_api | encoding = tiktoken.encoding_for_model(model_name)
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/tiktoken/model.py", line 105, in encoding_for_model
2025-02-05 15:37:23 rag_api | return get_encoding(encoding_name_for_model(model_name))
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/tiktoken/registry.py", line 86, in get_encoding
2025-02-05 15:37:23 rag_api | enc = Encoding(**constructor())
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/tiktoken_ext/openai_public.py", line 76, in cl100k_base
2025-02-05 15:37:23 rag_api | mergeable_ranks = load_tiktoken_bpe(
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/tiktoken/load.py", line 144, in load_tiktoken_bpe
2025-02-05 15:37:23 rag_api | contents = read_file_cached(tiktoken_bpe_file, expected_hash)
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/tiktoken/load.py", line 63, in read_file_cached
2025-02-05 15:37:23 rag_api | contents = read_file(blobpath)
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/tiktoken/load.py", line 24, in read_file
2025-02-05 15:37:23 rag_api | resp = requests.get(blobpath)
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/requests/api.py", line 73, in get
2025-02-05 15:37:23 rag_api | return request("get", url, params=params, **kwargs)
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/requests/api.py", line 59, in request
2025-02-05 15:37:23 rag_api | return session.request(method=method, url=url, **kwargs)
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
2025-02-05 15:37:23 rag_api | resp = self.send(prep, **send_kwargs)
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
2025-02-05 15:37:23 rag_api | r = adapter.send(request, **kwargs)
2025-02-05 15:37:23 rag_api | File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 698, in send
2025-02-05 15:37:23 rag_api | raise SSLError(e, request=request)
2025-02-05 15:37:23 LibreChat | 2025-02-05 23:37:23 error: Error uploading vectors
2025-02-05 15:37:23 LibreChat | Something happened in setting up the request. Error message:
2025-02-05 15:37:23 LibreChat | File embedding failed.
2025-02-05 15:37:23 rag_api | requests.exceptions.SSLError: HTTPSConnectionPool(host='openaipublic.blob.core.windows.net', port=443): Max retries exceeded with url: /encodings/cl100k_base.tiktoken (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)')))
2025-02-05 15:37:23 rag_api |
2025-02-05 15:37:23 rag_api | 2025-02-05 23:37:23,143 - root - INFO - Request POST http://rag_api:8000/embed - 200
2025-02-05 15:37:23 LibreChat | 2025-02-05 23:37:23 error: [/files] Error processing file: File embedding failed.
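This second failure is the same certificate problem surfacing in a different place: tiktoken fetches the cl100k_base encoding from openaipublic.blob.core.windows.net at runtime. Besides fixing CA trust, tiktoken also supports an on-disk cache, so the encoding file can be downloaded once on a machine with working TLS and mounted into the container. A sketch, with a hypothetical cache path:

```python
import hashlib
import os

# Assumption: /app/tiktoken_cache is a hypothetical directory mounted into
# the container and pre-populated on a host with working TLS. tiktoken
# checks TIKTOKEN_CACHE_DIR before making any network request.
os.environ["TIKTOKEN_CACHE_DIR"] = "/app/tiktoken_cache"

# tiktoken names cache entries by the SHA-1 hex digest of the source URL,
# so the downloaded encoding file must be stored under that name:
url = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
cache_name = hashlib.sha1(url.encode()).hexdigest()
print(cache_name)  # the filename to give the downloaded file in the cache dir
```

With the cache populated, `tiktoken.encoding_for_model(...)` in the traceback above should read the local file instead of going over the network.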
