Hi, I am fiddling with the ClusterSemanticChunker, thank you for this.
When I ran the code below, every chunk after the first started with the separator instead of ending with it:
CNN — Tensions between India and Pakistan....
. Tarar’s comments come just one week after m ...
. Kashmir, one of the world’s most dangerous flashpoints, is ....
Code to replicate this:
from chromadb.utils import embedding_functions
from chunking_evaluation.chunking import ClusterSemanticChunker
from chunking_evaluation.utils import openai_token_count

# Placeholder: replace with your own OpenAI API key
OPEN_API_KEY = 'add-your-key'


def cluster_semantic_chunks(text):
    # Embed with OpenAI and cluster into chunks of at most 200 tokens
    embedding_function = embedding_functions.OpenAIEmbeddingFunction(
        api_key=OPEN_API_KEY,
        model_name="text-embedding-3-large"
    )
    cluster_chunker = ClusterSemanticChunker(
        embedding_function=embedding_function,
        max_chunk_size=200,
        length_function=openai_token_count
    )
    cluster_chunker_chunks = cluster_chunker.split_text(text)
    return cluster_chunker_chunks


if __name__ == '__main__':
    text = """CNN — Tensions between India and Pakistan have escalated further after a top Pakistani official claimed early Wednesday to have “credible intelligence” that New Delhi will carry out a military action against Islamabad within the next two days. The claim came as both the United States and China urged restraint. “Pakistan has credible intelligence that India intends carrying out military action against Pakistan in the next 24-36 hours,” Pakistan’s Information Minister Attaullah Tarar said in an unusual middle of the night post on X. He did not elaborate on what evidence Pakistan had used to make the claim. Tarar’s comments come just one week after militants massacred 26 tourists in the mountainous town of Pahalgam in Indian-administered Kashmir, a rampage that has sparked widespread outrage. India has accused Pakistan of being involved in the attack — a claim Islamabad denies. Pakistan has offered a neutral investigation into the incident. CNN has contacted India’s defense ministry for response to Tarar’s claims. The Director Generals of Military Operations of both countries spoke over a hotline on Tuesday, India’s state broadcaster and Pakistan’s military confirmed on Wednesday - the first conversation between the military officials since the Pahalgam attack. Kashmir, one of the world’s most dangerous flashpoints, is controlled in part by India and Pakistan but both countries claim it in its entirety. The two nuclear-armed rivals have fought three wars over the mountainous territory that is now divided by a de-facto border called the Line of Control since their independence from Britain nearly 80 years ago."""
    chunks = cluster_semantic_chunks(text)
    print('\n\n'.join(chunks))
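Fix: attach each separator to the end of the preceding piece instead of the start of the following one.
# before
splits = [_splits[i] + _splits[i + 1] for i in range(1, len(_splits), 2)]
# after
splits = [_splits[i-1] + _splits[i] for i in range(1, len(_splits), 2)]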
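To make the difference between the two comprehensions concrete, here is a minimal standalone sketch. It assumes `_splits` comes from `re.split` with a capturing group, so that pieces and separators alternate; I have not checked that this is exactly how the library builds it. The function name `split_keep_separator` and the `attach` parameter are made up for illustration. One thing the sketch suggests: with the "after" version the final piece (the text following the last separator) has to be appended explicitly, otherwise it would be dropped.

```python
import re


def split_keep_separator(text, separator, attach="end"):
    """Hypothetical helper: split `text` and keep each separator on one side."""
    # A capturing group makes re.split return the separators too:
    # [piece, sep, piece, sep, ..., piece] (always an odd number of items).
    _splits = re.split(f"({re.escape(separator)})", text)

    if attach == "start":
        # "before" behaviour: separator is glued to the front of the next piece
        splits = [_splits[i] + _splits[i + 1] for i in range(1, len(_splits), 2)]
        splits = [_splits[0]] + splits
    else:
        # "after" behaviour: separator is glued to the end of the previous piece
        splits = [_splits[i - 1] + _splits[i] for i in range(1, len(_splits), 2)]
        # The trailing piece is not covered by the comprehension, so add it back
        splits.append(_splits[-1])

    # Drop empty pieces (e.g. when the text ends with the separator)
    return [s for s in splits if s != ""]


sample = "A b. C d. E f"
print(split_keep_separator(sample, ".", attach="start"))
print(split_keep_separator(sample, ".", attach="end"))
```

With `attach="start"` the second and third pieces begin with `.`, which matches the output reported above; with `attach="end"` each sentence keeps its own full stop, and joining the pieces reproduces the input.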