Fix chunk_size to be 1000 tokens, not characters #1
Hello, my friend. Thank you for the video and for the repository.
When we use just `RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)`, the chunks contain up to 1000 characters, not tokens. In the video and in the code it is said that one chunk has 1000 tokens. To get that behavior, we can use `RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=1000, chunk_overlap=100)` with the tiktoken library installed. Then each chunk will contain ~1000 tokens.
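For reference, here is a minimal sketch contrasting the two splitters, assuming the `langchain` and `tiktoken` packages are installed; the sample text and variable names are just illustrative:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = "..."  # illustrative placeholder; load your own document text here

# Counts *characters*: each chunk holds up to 1000 characters.
char_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

# Counts *tokens* via tiktoken: each chunk holds roughly 1000 tokens.
token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000, chunk_overlap=100
)

char_chunks = char_splitter.split_text(text)
token_chunks = token_splitter.split_text(text)
```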