
fix: test infrastructure broken + added CI#175

Open
suyashkumar102 wants to merge 3 commits into KathiraveluLab:dev from suyashkumar102:fix/test-infrastructure-and-logging

Conversation

@suyashkumar102

I cloned the repository to run the test suite and noticed that several test files were failing immediately due to import errors. After investigating, it turned out that some dependencies (e.g., Flask, bson, networkx) were not properly included in the test environment.

To address this, I:

  • Split out a requirements-dev.txt to clearly define development/test dependencies
  • Added a conftest.py to centralize shared fixtures used across tests
  • Introduced a basic GitHub Actions workflow to run tests in CI
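As a rough sketch of the shared-fixture approach described above (the `create_app` stand-in and fixture names here are illustrative, not copied from the PR; the real factory lives in the application package):

```python
import pytest
from unittest.mock import MagicMock

def create_app(test_config=None):
    """Stand-in for the project's Flask app factory, so this sketch
    is self-contained; the real one lives in the application code."""
    class App:
        pass
    app = App()
    app.config = dict(test_config or {})
    return app

def make_test_app():
    """Build an app configured for testing, with MongoDB mocked out."""
    app = create_app(test_config={"TESTING": True})
    # Mock MongoDB so tests don't need a running database;
    # individual tests can override this if they need real DB access.
    app.mongo = MagicMock()
    return app

@pytest.fixture
def app():
    yield make_test_app()
```

Placing a fixture like this in `conftest.py` makes it available to every test module in the directory without an explicit import.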

While making these changes, I also noticed that clustering.py was using print() statements instead of structured logging, which isn’t ideal for production code. I’ve replaced those with logging calls.

With these updates, tests now run successfully using:

pip install -r requirements-dev.txt
pytest tests/ -v

I also added a short README in the tests/ directory to document how to run the test suite, since that was previously missing.

Please let me know if this causes any issues on your end or if you’d prefer a different approach.

Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request establishes a robust testing framework by introducing development dependencies, comprehensive documentation, and shared pytest fixtures. It also refines the keyword clustering logic with improved logging and cluster analysis. Feedback highlights opportunities to optimize test performance by mocking heavy AI models and database connections during setup, improving computational efficiency with NumPy vector operations, and ensuring cross-platform compatibility in the test documentation.

Comment on lines +42 to +49
app = create_app(test_config=test_config)

# Mock MongoDB to avoid requiring a running MongoDB instance
# Individual tests can override this if they need real DB access
mock_mongo = MagicMock()
app.mongo = mock_mongo

yield app
Contributor

high

The create_app function instantiates DreamsPipeline, which likely loads heavy AI models. Doing this for every test will significantly slow down the test suite. Additionally, MongoClient should be mocked before create_app is called to avoid any attempts to connect to a real database during initialization. Wrapping the setup in a patch context ensures these heavy dependencies are replaced with mocks during testing.

Suggested change

app = create_app(test_config=test_config)

# Mock MongoDB to avoid requiring a running MongoDB instance
# Individual tests can override this if they need real DB access
mock_mongo = MagicMock()
app.mongo = mock_mongo

yield app

with patch('dreamsApp.app.DreamsPipeline'), \
     patch('dreamsApp.app.MongoClient'):
    app = create_app(test_config=test_config)

    # Mock MongoDB to avoid requiring a running MongoDB instance
    # Individual tests can override this if they need real DB access
    mock_mongo = MagicMock()
    app.mongo = mock_mongo

    yield app

# Debug: Log the cluster labels to see how the data is being clustered
logger.debug(f"Cluster labels: {cluster_labels}")
unique_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
noise_count = sum(1 for label in cluster_labels if label == -1)
Contributor

medium

Since cluster_labels is a numpy array (returned by HDBSCAN), using a generator expression with sum() is less efficient than using numpy's built-in vector operations.

Suggested change
noise_count = sum(1 for label in cluster_labels if label == -1)
noise_count = np.count_nonzero(cluster_labels == -1)
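The difference can be illustrated with a toy label array (using HDBSCAN's convention of labelling noise points as -1; the array values here are made up):

```python
import numpy as np

# Toy cluster labels in HDBSCAN's convention: -1 marks noise points
cluster_labels = np.array([0, 0, 1, -1, 2, -1, 1])

# Generator-based count: iterates element by element in Python
noise_slow = sum(1 for label in cluster_labels if label == -1)

# Vectorised count: one comparison over the whole array at C speed
noise_fast = np.count_nonzero(cluster_labels == -1)

# Cluster count excluding the noise label, as in the snippet above
unique_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
```

Both counts agree; the vectorised form simply avoids a Python-level loop, which matters as the number of labels grows.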

logger.debug(f"Cluster labels: {cluster_labels}")
unique_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
noise_count = sum(1 for label in cluster_labels if label == -1)
logger.info(f"HDBSCAN produced {unique_clusters} clusters for user {user_id} ({noise_count} noise points)")
Contributor

medium

Logging per-user clustering results at INFO level can lead to extremely verbose logs in production if the number of users is large. It is generally better to use DEBUG for per-item processing details and reserve INFO for high-level process summaries (like the one at line 60).

Suggested change
logger.info(f"HDBSCAN produced {unique_clusters} clusters for user {user_id} ({noise_count} noise points)")
logger.debug(f"HDBSCAN produced {unique_clusters} clusters for user {user_id} ({noise_count} noise points)")
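The effect of the level choice can be seen with a minimal logging setup (the logger name and messages here are illustrative, not taken from the codebase):

```python
import logging

# A typical production configuration: INFO and above are emitted
logging.basicConfig(level=logging.INFO, force=True)
logger = logging.getLogger("clustering")

# Per-user detail: suppressed under INFO, visible only when DEBUG is enabled
logger.debug("HDBSCAN produced %d clusters for user %s (%d noise points)",
             3, "user-1", 2)

# High-level summary: still emitted at the production level
logger.info("Clustering pass complete: %d users processed", 120)
```

Dropping the per-user line to DEBUG keeps production logs to one summary line per run, while a developer can still re-enable the detail with `level=logging.DEBUG`.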

import pytest
import os
import tempfile
from unittest.mock import MagicMock
Contributor

medium

Add patch to the imports to support mocking heavy dependencies during application setup in the app fixture.

Suggested change
from unittest.mock import MagicMock
from unittest.mock import MagicMock, patch

tests/README.md Outdated

# HTML report (opens in browser)
pytest tests/ -v --cov=dreamsApp --cov-report=html
open htmlcov/index.html
Contributor

medium

The open command is specific to macOS. To make the instructions cross-platform, consider using python -m webbrowser or simply instructing the user to open the file in their browser.

Suggested change
open htmlcov/index.html
python -m webbrowser htmlcov/index.html

@ayusrjn
Member

ayusrjn commented Apr 7, 2026

Hi @suyashkumar102,
Thank you for your contributions. However, the import errors you mentioned have already been fixed by #148, which addresses all the test issues.
We are still in the early stages of the project, and the current tests are just toy implementations, so we don't need a CI pipeline just yet.
Your logger changes are correct, though.
One more thing: please keep your PR directed at the dev branch.

@suyashkumar102 suyashkumar102 changed the base branch from main to dev April 8, 2026 10:55
@suyashkumar102
Author

Hi @ayusrjn,

Thank you for taking the time to review this, I really appreciate the feedback!
My PR focuses on a different layer - the test infrastructure itself:

  • requirements-dev.txt: Separating test dependencies from production
  • conftest.py: Shared pytest fixtures for consistent test setup
  • .github/workflows/test.yml: Automated CI to catch issues early
  • tests/README.md: Documentation for running tests
  • clustering.py: Production-grade logging

Regarding CI, I saw @pradeeban mentioned the codebase is largely a toy implementation right now, which makes sense for a research project. My thinking is that even early-stage projects benefit from automated testing to prevent regressions as contributions grow.

I've retargeted this to the dev branch per your workflow.

Also, @pradeeban - would love to hear your thoughts on priorities. Happy to focus wherever it's most valuable.

@ayusrjn
Member

ayusrjn commented Apr 8, 2026

Hi @suyashkumar102,
The comment was more of a suggestion. The import error fixes I mentioned were based on your PR description, which noted that several tests were failing due to import errors.

Your changes are valid for the current state of the repo. We are still in the research stage, and the current architecture is likely to change; for example, we might move away from Flask completely. Introducing a test setup for the current architecture right now might become additional overhead for the GSoC'26 contributor.

I do agree with your CI point.

I think this can still be merged, as it doesn't introduce any issues, and the CI is useful. It might be worth focusing on making the codebase more flexible and modular first.

@suyashkumar102
Author

@ayusrjn
Thanks so much for the context and support! Really appreciate the insight about focusing on making things more modular first, that's a great point and definitely the right approach for long term flexibility. Will keep that in mind for future contributions!
