
Evaluate over specifically provided datasets #5

Open

panalexeu wants to merge 8 commits into brandonstarxel:dev from panalexeu:feature/dataset-evaluation

Conversation

@panalexeu

overview

While evaluating some of my chunking algorithm ideas, I ran into an issue: the GeneralEvaluation class, which includes all of the question and corpus datasets, takes too long to evaluate (in my case, hours).

To address this, I developed the DatasetEvaluation class, which inherits from GeneralEvaluation and accepts a list of datasets to include in the evaluation. It filters the questions_df and corpus_list properties accordingly.
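
A minimal sketch of the idea (the constructor signature follows the usage example below; the filtering internals, including the corpus_id column name, are my assumptions rather than the PR's exact implementation):

from chunking_evaluation import GeneralEvaluation

class DatasetEvaluation(GeneralEvaluation):
    def __init__(self, datasets):
        super().__init__()
        names = {d.value for d in datasets}  # assumes Dataset is an Enum of dataset names
        # Keep only questions belonging to the requested datasets
        # (original indices are preserved on purpose; see the follow-up comment below).
        self.questions_df = self.questions_df[
            self.questions_df['corpus_id'].isin(names)
        ]
        # ... and only the matching corpora.
        self.corpus_list = [c for c in self.corpus_list if c in names]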

unit tests

I also included unit tests for the DatasetEvaluation class in tests/test_dataset_evaluation.py. Running them requires the pytest package.

To run the tests, navigate to the root project directory:

cd ./chunking_evaluation 

Then, run:

pytest
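
For reference, a test in that file might look roughly like this (a sketch with a hypothetical test name and assertions; the actual checks in tests/test_dataset_evaluation.py may differ):

from chunking_evaluation import DatasetEvaluation, Dataset

def test_filters_to_requested_datasets():
    evaluation = DatasetEvaluation(datasets=[Dataset.PUBMED])
    # Only the requested dataset's corpus and questions should remain.
    assert len(evaluation.corpus_list) == 1
    assert not evaluation.questions_df.empty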

usage example

import time

from chunking_evaluation import DatasetEvaluation, Dataset
from chunking_evaluation.chunking import FixedTokenChunker
from chromadb.utils import embedding_functions
from rich import print

# Embedding function used for both chunks and queries.
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
chunker = FixedTokenChunker()

# Evaluate only over the listed datasets instead of the full benchmark.
evaluation = DatasetEvaluation(
    datasets=[
        Dataset.PUBMED,
        Dataset.WIKITEXTS,
        Dataset.CHATLOGS,
        Dataset.FINANCE,
        Dataset.STATE_OF_THE_UNION,
    ]
)

if __name__ == '__main__':
    start = time.time()
    results = evaluation.run(chunker, ef)
    end = time.time()

    print(results)
    print(f'TIME: {end - start:.2f}s')

@panalexeu panalexeu changed the title Evaluate over specificly provided datasets Evaluate over specifically provided datasets Jan 4, 2025
@brandonstarxel brandonstarxel self-assigned this Jan 4, 2025
@panalexeu
Author

Hey, I realized I missed something yesterday. In BaseEvaluation, specifically in the run method, there is a loop that iterates over self.questions_df.iter_rows() and retrieves the calculated brute_iou_scores, iou_scores, recall_scores, and precision_scores by index; those index values correspond to the indices in questions_df.

However, the _full_precision_score and _scores_from_dataset_and_retrievals methods appended these values to plain lists, so retrieving them by the questions_df index failed with an "Index out of range" error whenever the indices were non-contiguous.

To resolve this, I reworked _full_precision_score and _scores_from_dataset_and_retrievals to return dictionaries keyed by index, with the score as the value. This keeps the scores consistent with questions_df and works correctly with an arbitrary number of datasets.
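
To make the failure mode concrete, here is a small self-contained illustration (pandas stands in for questions_df, and compute_score is a hypothetical placeholder for the real scoring logic):

import pandas as pd

# A filtered questions_df keeps its original, non-contiguous indices.
questions_df = pd.DataFrame({'question': ['a', 'bb', 'c']}, index=[0, 5, 9])

def compute_score(row):
    return len(row['question'])  # hypothetical stand-in for the real scores

# Before the fix: scores were collected into a list at positions 0..2.
scores_list = [compute_score(row) for _, row in questions_df.iterrows()]
# scores_list[5]  # -> IndexError: list index out of range

# After the fix: scores are keyed by the DataFrame index.
scores_dict = {idx: compute_score(row) for idx, row in questions_df.iterrows()}
assert scores_dict[5] == 2  # lookup by the questions_df index now works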

Latest commit resolves this issue.

I would like to ask you to carefully review my changes, as I made edits directly to the BaseEvaluation class and may have overlooked some logic.

@panalexeu panalexeu force-pushed the feature/dataset-evaluation branch from 0f86fb4 to 27682fc on January 8, 2025
@panalexeu
Author

I accidentally committed the test_change, so I reset the branch HEAD to the commit before it.
