### Overview

While evaluating some of my chunking algorithm ideas, I ran into a problem: the `GeneralEvaluation` class, which includes every question and every corpus dataset, takes too long to evaluate (hours, in my case).

To address this, I developed a `DatasetEvaluation` class, which inherits from `GeneralEvaluation` and accepts a list of dataset names to include in the evaluation. It filters the `questions_df` and `corpus_list` properties accordingly.
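As a rough sketch of the idea (the class and attribute names come from this PR, but the data and filtering logic below are simplified stand-ins, illustrated with pandas, not the actual implementation):

```python
import pandas as pd

class GeneralEvaluation:
    """Simplified stand-in for the real base class."""
    def __init__(self):
        # Toy data; the real class loads every question and corpus.
        self.questions_df = pd.DataFrame({
            "question": ["q1", "q2", "q3"],
            "corpus_id": ["corpus_a", "corpus_b", "corpus_a"],
        })
        self.corpus_list = ["corpus_a", "corpus_b"]

class DatasetEvaluation(GeneralEvaluation):
    """Evaluate only the named datasets instead of all of them."""
    def __init__(self, dataset_names):
        super().__init__()
        # Keep only the requested corpora and their questions.
        self.corpus_list = [c for c in self.corpus_list if c in dataset_names]
        self.questions_df = self.questions_df[
            self.questions_df["corpus_id"].isin(dataset_names)
        ]

evaluation = DatasetEvaluation(["corpus_a"])
print(evaluation.corpus_list)        # ['corpus_a']
print(len(evaluation.questions_df))  # 2
```

Note that filtering a DataFrame this way keeps the original row indices, so the surviving rows can have non-contiguous index values.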
### Unit tests
I also included unit tests for the `DatasetEvaluation` class in `tests/test_dataset_evaluation.py`; running them requires the `pytest` package.

To run the tests, navigate to the root project directory (`cd ./chunking_evaluation`) and invoke pytest.
Hey, I realized I missed something yesterday. In `BaseEvaluation`, the `run` method loops over `self.questions_df.iter_rows()` and looks up the computed `brute_iou_scores`, `iou_scores`, `recall_scores`, and `precision_scores` by index, where the index values correspond to the indices in `questions_df`.

However, the `_full_precision_score` and `_scores_from_dataset_and_retrievals` methods were appending these values to plain lists, so looking them up by `questions_df` index failed with an "index out of range" error.

To resolve this, I reworked `_full_precision_score` and `_scores_from_dataset_and_retrievals` to return dictionaries with the question index as the key and the score as the value. This keeps the scores consistent with `questions_df` and works correctly with an arbitrary number of datasets.
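A hypothetical, simplified illustration of why this matters (placeholder scores, not the real metric computations): after dataset filtering, the remaining question indices can be non-contiguous, so a positional lookup into an appended list overruns, while a dictionary keyed by index does not:

```python
def scores_as_list(indices):
    # Old behaviour: append in iteration order, losing the original indices.
    scores = []
    for _ in indices:
        scores.append(0.5)  # placeholder score
    return scores

def scores_as_dict(indices):
    # New behaviour: key each score by its questions_df index.
    return {idx: 0.5 for idx in indices}

# After dataset filtering, the surviving question indices are non-contiguous.
filtered_indices = [0, 2, 5]

by_list = scores_as_list(filtered_indices)
by_dict = scores_as_dict(filtered_indices)

print(by_dict[5])  # 0.5 -- lookup by original index works
# by_list[5] would raise IndexError: the list only has 3 entries
```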
The latest commit resolves this issue.
I would like to ask you to review my changes carefully, since I edited the `BaseEvaluation` class directly and may have overlooked some logic.
I accidentally committed the test_change; I have reset the branch HEAD to the commit before it.