fix: test infrastructure broken + added CI #175
suyashkumar102 wants to merge 3 commits into KathiraveluLab:dev from
Conversation
Code Review
This pull request establishes a robust testing framework by introducing development dependencies, comprehensive documentation, and shared pytest fixtures. It also refines the keyword clustering logic with improved logging and cluster analysis. Feedback highlights opportunities to optimize test performance by mocking heavy AI models and database connections during setup, improving computational efficiency with NumPy vector operations, and ensuring cross-platform compatibility in the test documentation.
tests/conftest.py
Outdated
```python
app = create_app(test_config=test_config)

# Mock MongoDB to avoid requiring a running MongoDB instance
# Individual tests can override this if they need real DB access
mock_mongo = MagicMock()
app.mongo = mock_mongo

yield app
```
The create_app function instantiates DreamsPipeline, which likely loads heavy AI models. Doing this for every test will significantly slow down the test suite. Additionally, MongoClient should be mocked before create_app is called to avoid any attempts to connect to a real database during initialization. Wrapping the setup in a patch context ensures these heavy dependencies are replaced with mocks during testing.
Suggested change:

```diff
-app = create_app(test_config=test_config)
-# Mock MongoDB to avoid requiring a running MongoDB instance
-# Individual tests can override this if they need real DB access
-mock_mongo = MagicMock()
-app.mongo = mock_mongo
-yield app
+with patch('dreamsApp.app.DreamsPipeline'), \
+     patch('dreamsApp.app.MongoClient'):
+    app = create_app(test_config=test_config)
+    # Mock MongoDB to avoid requiring a running MongoDB instance
+    # Individual tests can override this if they need real DB access
+    mock_mongo = MagicMock()
+    app.mongo = mock_mongo
+    yield app
```
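The effect of patching before the factory runs can be seen in a self-contained toy (all names below are stand-ins mirroring the shape of the real code, not the actual `dreamsApp` implementation): because the expensive class is replaced first, its constructor never executes.

```python
import sys
from unittest.mock import MagicMock, patch

class DreamsPipeline:
    """Toy stand-in: the real class reportedly loads heavy AI models."""
    def __init__(self):
        raise RuntimeError("expensive model loading")

def create_app(test_config=None):
    # Toy factory mirroring the shape of the real create_app
    pipeline = DreamsPipeline()
    app = MagicMock()
    app.pipeline = pipeline
    app.config = dict(test_config or {})
    return app

# Patching the name *before* the factory runs means the heavy
# __init__ never executes; only a MagicMock is constructed.
with patch.object(sys.modules[__name__], "DreamsPipeline"):
    app = create_app(test_config={"TESTING": True})

print(isinstance(app.pipeline, MagicMock))  # True
```

In a conftest.py fixture, the same `with patch(...)` wrapper would surround the `create_app` call and the `yield`, so every test sees the lightweight mocks.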
dreamsApp/core/extra/clustering.py
Outdated
```python
# Debug: Log the cluster labels to see how the data is being clustered
logger.debug(f"Cluster labels: {cluster_labels}")
unique_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
noise_count = sum(1 for label in cluster_labels if label == -1)
```
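The review summary also mentions NumPy vector operations; assuming `cluster_labels` is (or can be converted to) an array, the Python-level loop and `set()` pass above could be computed without per-element iteration. A sketch with illustrative label values:

```python
import numpy as np

# Example labels as HDBSCAN might return them: -1 marks noise points
cluster_labels = np.array([0, 0, 1, -1, 1, -1, 2])

# Vectorized replacements for the generator expression and set() pass
noise_count = int((cluster_labels == -1).sum())
unique_clusters = np.unique(cluster_labels[cluster_labels != -1]).size

print(unique_clusters, noise_count)  # 3 2
```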
dreamsApp/core/extra/clustering.py
Outdated
```python
logger.debug(f"Cluster labels: {cluster_labels}")
unique_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
noise_count = sum(1 for label in cluster_labels if label == -1)
logger.info(f"HDBSCAN produced {unique_clusters} clusters for user {user_id} ({noise_count} noise points)")
```
Logging per-user clustering results at INFO level can lead to extremely verbose logs in production if the number of users is large. It is generally better to use DEBUG for per-item processing details and reserve INFO for high-level process summaries (like the one at line 60).
Suggested change:

```diff
-logger.info(f"HDBSCAN produced {unique_clusters} clusters for user {user_id} ({noise_count} noise points)")
+logger.debug(f"HDBSCAN produced {unique_clusters} clusters for user {user_id} ({noise_count} noise points)")
```
tests/conftest.py
Outdated
```python
import pytest
import os
import tempfile
from unittest.mock import MagicMock
```
tests/README.md
Outdated
|
|
```shell
# HTML report (opens in browser)
pytest tests/ -v --cov=dreamsApp --cov-report=html
open htmlcov/index.html
```
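As the review summary notes on cross-platform compatibility, `open` is macOS-only (Linux typically has `xdg-open`, Windows uses `start`). One portable option for the README (a sketch, assuming the report lands in `htmlcov/`) is to print the report's URI with Python so it can be pasted into any browser:

```shell
# Print the coverage report's file:// URI; works the same on
# macOS, Linux, and Windows (no platform-specific open command)
python3 -c "import pathlib; print(pathlib.Path('htmlcov/index.html').resolve().as_uri())"
```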
Hi @suyashkumar102,
Hi @ayusrjn, thank you for taking the time to review this; I really appreciate the feedback!
Regarding CI, I saw @pradeeban mentioned the codebase is largely a toy implementation right now, which makes sense for a research project. My thinking was that even early-stage projects benefit from automated testing to prevent regressions as more contributions come in. I've retargeted this to the dev branch per your workflow. Also, @pradeeban, I would love to hear your thoughts on priorities; happy to focus wherever it's most valuable.

Hi @suyashkumar102, your changes are valid for the current state of the repo. We are still in the research stage, and the current architecture is most likely to change; for example, we might move away from Flask completely. Introducing the test setup right now for the current architecture might become an additional overhead for the GSoC'26 contributor. I do agree with your CI point, and I think this can still be merged, as it doesn't introduce any issues and the CI is useful. It might be worth focusing on making the codebase more flexible and modular first.
@ayusrjn
I cloned the repository to run the test suite and noticed that several test files were failing immediately due to import errors. After investigating, it turned out that some dependencies (e.g., Flask, bson, networkx) were not properly included in the test environment.
To address this, I:

- Added `requirements-dev.txt` to clearly define development/test dependencies
- Added `conftest.py` to centralize shared fixtures used across tests

While making these changes, I also noticed that `clustering.py` was using `print()` statements instead of structured logging, which isn't ideal for production code. I've replaced those with logging calls.

With these updates, tests now run successfully using:

I also added a short README in the `tests/` directory to document how to run the test suite, since that was previously missing.

Please let me know if this causes any issues on your end or if you'd prefer a different approach.