Slight token number mismatches from hybrid chunker #723

Closed
vagenas opened this issue Jan 10, 2025 · 0 comments · Fixed by DS4SD/docling-core#131
Labels: bug, chunker


vagenas commented Jan 10, 2025

Bug

The HybridChunker sometimes produces chunks that slightly exceed the configured token limit (129 and 130 tokens against a limit of 128 in the reproduction below).

Steps to reproduce

from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter
from transformers import AutoTokenizer

DOC_SOURCE = "https://github.com/DS4SD/docling/raw/main/tests/data/md/wiki.md"
EMBED_MODEL_ID = "ibm-granite/granite-embedding-30m-english"
MAX_TOKENS = 128

doc = DocumentConverter().convert(source=DOC_SOURCE).document
tokenizer = AutoTokenizer.from_pretrained(EMBED_MODEL_ID)

chunker = HybridChunker(
    tokenizer=tokenizer,
    max_tokens=MAX_TOKENS,
    merge_peers=True,
)
chunk_iter = chunker.chunk(dl_doc=doc)
chunks = list(chunk_iter)

for i, chunk in enumerate(chunks):
    ser_txt = chunker.serialize(chunk=chunk)
    ser_tokens = len(tokenizer.tokenize(ser_txt, max_length=None))
    if ser_tokens > MAX_TOKENS:
        print(f"=== {i} ===")
        print(f"chunker.serialize(chunk) ({ser_tokens} tokens):\n{repr(ser_txt)}")
        print()

The output should be empty, but the script instead prints these chunks, each above MAX_TOKENS:

=== 4 ===
chunker.serialize(chunk) (129 tokens):
"IBM\n1910s–1950s\nIBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine (1889);[19] and Willard Bundy invented a time clock to record workers' arrival and departure times on a paper tape (1889).[20] On June 16, 1911, their four companies were amalgamated in New York State by Charles Ranlett Flint forming a fifth company, the"

=== 6 ===
chunker.serialize(chunk) (130 tokens):
"IBM\n1910s–1950s\nCollectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John Henry Patterson, called on Flint and, in 1914, was offered a position at CTR.[23] Watson joined CTR as general manager and then, 11 months later, was made President when antitrust cases relating to his time at NCR were resolved.[24] Having learned Patterson's pioneering business practices, Watson proceeded to put the stamp"

Docling version

Docling version: 2.15.1
Docling Core version: 2.14.0
Docling IBM Models version: 3.1.0
Docling Parse version: 3.0.0

Python version

Python 3.12.8
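
A minimal sketch of the per-chunk check from the reproduction script, factored into a reusable helper in case it is useful for a regression test. `find_oversized` and `whitespace_count` are hypothetical names, not part of docling; the whitespace counter is only a stand-in (an assumption) so the snippet runs without downloading a model.

```python
from typing import Callable, Iterable, List, Tuple


def find_oversized(
    texts: Iterable[str],
    count_tokens: Callable[[str], int],
    max_tokens: int,
) -> List[Tuple[int, int]]:
    """Return (index, token_count) for every text whose count exceeds max_tokens."""
    oversized = []
    for i, text in enumerate(texts):
        n = count_tokens(text)
        if n > max_tokens:
            oversized.append((i, n))
    return oversized


def whitespace_count(text: str) -> int:
    # Stand-in token counter (assumption): splits on whitespace only,
    # unlike the subword tokenizer used in the reproduction script.
    return len(text.split())


if __name__ == "__main__":
    sample = ["a b c", "a b c d e", "a"]
    print(find_oversized(sample, whitespace_count, max_tokens=3))
```

With the objects from the reproduction script, the equivalent call would be `find_oversized((chunker.serialize(chunk=c) for c in chunks), lambda t: len(tokenizer.tokenize(t, max_length=None)), MAX_TOKENS)`; an empty list means every serialized chunk is within the limit.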