@dlqqq
Member

@dlqqq dlqqq commented Oct 16, 2024

As stated in the title. Follow-up to #1024.

@dlqqq dlqqq added the enhancement New feature or request label Oct 16, 2024
Collaborator

@srdas srdas left a comment

The recursive splitter throws the following error:
[image of the error]
This is presumably because the recursive splitter can parse the JSON file without a chunk size requirement. If so, chunk overlap is not needed either.

Same error with different LLMs.

@dlqqq
Member Author

dlqqq commented Oct 16, 2024

Even after dropping the arguments, the JSON splitter still raises an exception:

2024-10-16 15:38:09,520 - distributed.worker - ERROR - Compute Failed
Key:       split_document-081f610f-c434-4159-b621-79c44a8909bb
State:     executing
Function:  split_document
args:      (Document(metadata={'path': '/Volumes/workplace/jupyter-ai/package.json', 'sha256': b']\xdb\xa9Y(\x15`\xd5\x89t\xd6\xae"+&\xe1\xfe\xe0\x11\xa3G\x934\n\\y\xc3\x85U\x01\xb65', 'extension': '.json'}, page_content='{\n  "name": "@jupyter-ai/monorepo",\n  "version": "2.25.0",\n  "description": "A generative AI extension for JupyterLab",\n  "private": true,\n  "keywords": [\n    "jupyter",\n    "jupyterlab",\n    "jupyterlab-extension"\n  ],\n  "homepage": "https://github.com/jupyterlab/jupyter-ai",\n  "bugs": {\n    "url": "https://github.com/jupyterlab/jupyter-ai/issues",\n    "email": "[email protected]"\n  },\n  "license": "BSD-3-Clause",\n  "author": {\n    "name": "Project Jupyter",\n    "email": "[email protected]"\n  },\n  "workspaces": [\n    ".",\n    "packages/*"\n  ],\n  "scripts": {\n    "build": "lerna run build --stream",\n    "build:core": "lerna run build --stream --scope \\"@jupyter-ai/core\\"",\n    "build:prod": "lerna run build:prod --stream",\n    "clean":
kwargs:    {}
Exception: "IndexError('list index out of range')"
Traceback: '  File "/Volumes/workplace/jupyter-ai/packages/jupyter-ai/jupyter_ai/document_loaders/directory.py", line 107, in split_document\n    return splitter.split_documents([document])\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/dlq/micromamba/envs/jai-test/lib/python3.11/site-packages/langchain_text_splitters/base.py", line 96, in split_documents\n    return self.create_documents(texts, metadatas=metadatas)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Volumes/workplace/jupyter-ai/packages/jupyter-ai/jupyter_ai/document_loaders/splitter.py", line 31, in create_documents\n    for chunk in self.split_text(text, metadata):\n                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Volumes/workplace/jupyter-ai/packages/jupyter-ai/jupyter_ai/document_loaders/splitter.py", line 22, in split_text\n    return splitter.split_text(text)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/dlq/micromamba/envs/jai-test/lib/python3.11/site-packages/langchain_text_splitters/json.py", line 106, in split_text\n    chunks = self.split_json(json_data=json_data, convert_lists=convert_lists)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/dlq/micromamba/envs/jai-test/lib/python3.11/site-packages/langchain_text_splitters/json.py", line 91, in split_json\n    chunks = self._json_split(json_data)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/dlq/micromamba/envs/jai-test/lib/python3.11/site-packages/langchain_text_splitters/json.py", line 78, in _json_split\n    self._set_nested_dict(chunks[-1], current_path, data)\n  File "/Users/dlq/micromamba/envs/jai-test/lib/python3.11/site-packages/langchain_text_splitters/json.py", line 32, in _set_nested_dict\n    d[path[-1]] = value\n      ~~~~^^^^\n'

RecursiveJsonSplitter doesn't seem to be well-supported, since it seems to have a different interface from all the other splitters we use from LangChain. I'm putting this in draft status as there doesn't seem to be a clear path forward; I may close this next week, or mark it as ready if I figure something out.
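
One possible reading of the traceback above, sketched with the standard library only: the `args` entry shows `page_content` as a plain string, and the last frame fails on `d[path[-1]] = value`, which raises `IndexError` when `path` is empty. The functions `set_nested_dict` and `json_split` below are hypothetical stand-ins modeled on the frame names in the traceback, not LangChain's actual code, and the idea that the splitter expects parsed JSON rather than a raw string is an assumption.

```python
import json
from typing import Any, List, Optional


def set_nested_dict(d: dict, path: List[str], value: Any) -> None:
    """Hypothetical stand-in for the helper named in the last traceback frame."""
    for key in path[:-1]:
        d = d.setdefault(key, {})
    d[path[-1]] = value  # raises IndexError when path is empty


def json_split(
    data: Any,
    path: Optional[List[str]] = None,
    chunks: Optional[List[dict]] = None,
) -> List[dict]:
    """Hypothetical size-unaware walk: descend into dicts, store leaves at their path."""
    path = path or []
    chunks = chunks if chunks is not None else [{}]
    if isinstance(data, dict):
        for key, value in data.items():
            json_split(value, path + [key], chunks)
    else:
        # A top-level non-dict (such as a raw JSON string) arrives with an empty path.
        set_nested_dict(chunks[-1], path, data)
    return chunks


# Shortened sample built from two fields visible in the logged page_content.
raw = (
    '{"name": "@jupyter-ai/monorepo", '
    '"bugs": {"url": "https://github.com/jupyterlab/jupyter-ai/issues"}}'
)

# Passing the raw string: the walk treats it as a leaf at the root and fails.
try:
    json_split(raw)
    raw_error = None
except IndexError as exc:
    raw_error = exc

# Parsing first: the walk sees a dict and rebuilds the nested structure.
parsed_chunks = json_split(json.loads(raw))
```

If this reading is right, parsing `page_content` with `json.loads` before handing it to the JSON splitter may avoid the exception. This has not been verified against `langchain_text_splitters` itself.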

@dlqqq dlqqq marked this pull request as draft October 16, 2024 22:49
@ellisonbg
Collaborator

In Jupyter AI v3, we are moving away from manual indexing, splitting and embedding. Closing for now.

@ellisonbg ellisonbg closed this Jul 7, 2025
