Thank you for your interest in contributing to the AI Data Collection Toolkit! This document provides guidelines and information for contributors.
- Code of Conduct
- Getting Started
- Development Setup
- How to Contribute
- Development Workflow
- Coding Standards
- Testing Guidelines
- Documentation
- Security
- Performance
- Release Process
This project adheres to a code of conduct that promotes a welcoming and inclusive environment. By participating, you are expected to uphold this code.
We pledge to make participation in our project a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, gender identity and expression, level of experience, nationality, personal appearance, race, religion, or sexual identity and orientation.
Positive behaviors include:
- Using welcoming and inclusive language
- Being respectful of differing viewpoints and experiences
- Gracefully accepting constructive criticism
- Focusing on what is best for the community
- Showing empathy towards other community members
Unacceptable behaviors include:
- Harassment, trolling, or discriminatory comments
- Public or private harassment
- Publishing others' private information without permission
- Other conduct which could reasonably be considered inappropriate in a professional setting
- Python 3.8+ (Python 3.11 recommended)
- Git for version control
- Docker (optional, for containerized development)
- Node.js (optional, for documentation tools)
- Fork the repository on GitHub
- Clone your fork locally:
git clone https://github.com/YOUR_USERNAME/Training-Data-Collection.git
cd Training-Data-Collection
- Run the setup script:
chmod +x scripts/setup_environment.sh
./scripts/setup_environment.sh
- Activate the virtual environment:
source venv/bin/activate  # Linux/macOS
# or venv\Scripts\activate  # Windows
If you prefer manual setup or the script doesn't work for your system:
# Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/macOS
# or venv\Scripts\activate # Windows
# Upgrade pip and install build tools
pip install --upgrade pip setuptools wheel
# Install dependencies
pip install -r requirements-dev.txt
# Install package in editable mode
pip install -e .
# Install pre-commit hooks
pre-commit install
# Verify installation
ai-data-collector --version
pytest --version
# Build development container
docker-compose -f docker/docker-compose.yml build ai-data-collector-dev
# Start development environment
docker-compose -f docker/docker-compose.yml run --rm ai-data-collector-dev bash
# Run tests in container
docker-compose -f docker/docker-compose.yml run --rm ai-data-collector-test
Copy .env.example to .env and configure as needed:
cp .env.example .env
# Edit .env with your preferred settings
We welcome various types of contributions:
- Bug Reports: Help us identify and fix issues
- Feature Requests: Suggest new functionality
- Feature Implementation: Build new features
- Documentation: Improve docs, examples, and guides
- Testing: Add tests, improve coverage
- Infrastructure: CI/CD, Docker, deployment improvements
- Design: UI/UX improvements for CLI and documentation
- Security: Security improvements and vulnerability fixes
- Check existing issues to avoid duplicate work
- Create an issue for significant changes to discuss the approach
- Join discussions on relevant issues to coordinate efforts
- Review the roadmap to understand project direction
We use Git Flow with the following branches:
- main: Production-ready code
- develop: Integration branch for features
- feature/*: New features
- bugfix/*: Bug fixes
- hotfix/*: Emergency fixes for production
- release/*: Release preparation
- Create a feature branch:

  git checkout develop
  git pull origin develop
  git checkout -b feature/your-feature-name

- Make your changes:
  - Write code following our coding standards
  - Add tests for new functionality
  - Update documentation as needed
  - Ensure all tests pass

- Commit your changes:

  # Stage your changes
  git add .

  # Commit with descriptive message
  git commit -m "feat: add support for custom data processors

  - Implement abstract base class for custom processors
  - Add processor registration system
  - Include comprehensive tests and documentation

  Fixes #123"

- Push and create PR:

  git push origin feature/your-feature-name
  # Create pull request on GitHub
We follow Conventional Commits:
<type>[optional scope]: <description>
[optional body]
[optional footer(s)]
Types:
- feat: New features
- fix: Bug fixes
- docs: Documentation changes
- test: Adding or updating tests
- refactor: Code refactoring
- perf: Performance improvements
- ci: CI/CD changes
- chore: Maintenance tasks
Examples:
feat(scrapers): add Playwright scraper engine
fix(processors): handle encoding issues in text cleaner
docs: update configuration guide with new options
test: add integration tests for data export
We follow PEP 8 with these tools:
- Black: Code formatting
- isort: Import sorting
- flake8: Linting
- mypy: Type checking
- pylint: Advanced linting
# Format code
black ai_data_collector/ tests/
isort ai_data_collector/ tests/
# Check linting
flake8 ai_data_collector/ tests/
mypy ai_data_collector/
pylint ai_data_collector/
# Security scan
bandit -r ai_data_collector/
safety check
# Run all checks
./scripts/run_quality_checks.sh
- Clarity over cleverness: Write code that's easy to understand
- Consistency: Follow established patterns in the codebase
- Documentation: Use docstrings and comments effectively
- Error handling: Handle errors gracefully with proper logging
- Type hints: Use type hints for better code documentation
from typing import Dict, List, Optional, Any
from loguru import logger


class DataProcessor:
    """
    Process scraped data for AI training.

    This class provides methods for cleaning, transforming,
    and validating scraped web data.

    Args:
        config: Processing configuration object

    Example:
        >>> processor = DataProcessor(config)
        >>> cleaned_data = processor.process(raw_data)
    """

    def __init__(self, config: ProcessingConfig) -> None:
        self.config = config
        self._setup_processors()

    def process(self, data: Dict[str, Any]) -> Dict[str, Any]:
        """
        Process a single data item.

        Args:
            data: Raw data dictionary to process

        Returns:
            Processed data dictionary

        Raises:
            ProcessingError: If processing fails
        """
        try:
            return self._apply_processing_pipeline(data)
        except Exception as e:
            logger.error(f"Processing failed: {e}")
            raise ProcessingError(f"Failed to process data: {e}") from e
# Good: Specific exception handling with logging
try:
    result = risky_operation()
except SpecificError as e:
    logger.warning(f"Operation failed: {e}")
    return default_value
except Exception as e:
    logger.error(f"Unexpected error: {e}")
    raise ProcessingError(f"Operation failed: {e}") from e

# Good: Input validation
def process_url(url: str) -> Dict[str, Any]:
    if not url or not isinstance(url, str):
        raise ValueError("URL must be a non-empty string")
    if not validate_url(url):
        raise ValueError(f"Invalid URL format: {url}")
# Use structured logging
logger.info("Starting data collection", extra={
    "urls_count": len(urls),
    "engine": config.engine,
    "session_id": session.id
})

# Use configuration objects instead of magic numbers
class ScrapingConfig:
    DEFAULT_DELAY = 1.0
    MAX_RETRIES = 3
    TIMEOUT = 30
tests/
├── unit/                # Unit tests
│   ├── test_scrapers/
│   ├── test_processors/
│   └── test_utils/
├── integration/         # Integration tests
│   ├── test_end_to_end/
│   └── test_pipelines/
├── performance/         # Performance tests
└── conftest.py          # Shared fixtures
import pytest
from unittest.mock import Mock, patch

from ai_data_collector import DataCollector
from ai_data_collector.exceptions import ScrapingError


class TestDataCollector:
    """Test cases for DataCollector class."""

    def test_successful_scraping(self, mock_scraping_config, sample_urls):
        """Test successful data collection from URLs."""
        collector = DataCollector(mock_scraping_config)

        with patch.object(collector.scraper, 'scrape') as mock_scrape:
            mock_scrape.return_value = {"url": "https://example.com", "title": "Test"}
            results = collector.collect_from_urls(sample_urls)

            assert len(results) == len(sample_urls)
            assert all("url" in result for result in results)
            mock_scrape.assert_called()

    def test_scraping_failure_handling(self, mock_scraping_config):
        """Test proper handling of scraping failures."""
        collector = DataCollector(mock_scraping_config)

        with patch.object(collector.scraper, 'scrape') as mock_scrape:
            mock_scrape.side_effect = ScrapingError("Network error")
            results = collector.collect_from_urls(["https://example.com"])

            assert len(results) == 0
            assert len(collector.failed_urls) == 1

    @pytest.mark.slow
    def test_large_dataset_processing(self, mock_scraping_config):
        """Test processing of large datasets."""
        # Performance test for large datasets
        pass
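The tests above rely on shared fixtures such as mock_scraping_config and sample_urls defined in tests/conftest.py. A minimal sketch of how such fixtures might look, assuming ScrapingConfig is importable from the package root (adjust the import and constructor arguments to the real configuration object):

import pytest

from ai_data_collector import ScrapingConfig  # assumed import location


@pytest.fixture
def mock_scraping_config():
    """Lightweight scraping configuration for unit tests."""
    return ScrapingConfig(engine="beautifulsoup")


@pytest.fixture
def sample_urls():
    """Small, stable set of URLs for unit tests."""
    return ["https://example.com", "https://example.org"]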
- Unit Tests: Test individual functions and classes in isolation
- Integration Tests: Test component interactions
- Performance Tests: Benchmark critical paths
- Security Tests: Test security features and vulnerability handling
- End-to-End Tests: Test complete workflows
# Run all tests
pytest
# Run specific test categories
pytest tests/unit/
pytest tests/integration/
pytest -m "not slow" # Skip slow tests
# Run with coverage
pytest --cov=ai_data_collector --cov-report=html
# Run performance tests
pytest tests/performance/ --benchmark-only
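If the slow marker used above is not already registered in the project's pytest configuration, registering it keeps pytest -m "not slow" free of unknown-marker warnings. A minimal sketch for tests/conftest.py (the benchmark marker comes from the pytest-benchmark plugin and needs no registration):

def pytest_configure(config):
    """Register custom markers used by the test suite."""
    config.addinivalue_line("markers", "slow: marks tests as slow to run")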
- API Documentation: Docstrings for all public APIs
- User Guides: Step-by-step tutorials
- Examples: Working code examples
- Configuration Guides: Detailed configuration documentation
- Deployment Guides: Production deployment instructions
We use Google-style docstrings:
def scrape_urls(urls: List[str], config: ScrapingConfig) -> List[Dict[str, Any]]:
    """
    Scrape data from multiple URLs concurrently.

    This function scrapes data from the provided URLs using the specified
    configuration. It handles errors gracefully and returns results for
    successful scrapes.

    Args:
        urls: List of URLs to scrape. Must be valid HTTP/HTTPS URLs.
        config: Scraping configuration including engine, delays, and limits.

    Returns:
        List of dictionaries containing scraped data. Each dictionary
        includes at minimum 'url' and 'scraped_at' fields.

    Raises:
        ValueError: If urls list is empty or contains invalid URLs.
        ScrapingError: If scraping configuration is invalid.

    Example:
        >>> config = ScrapingConfig(engine="beautifulsoup")
        >>> urls = ["https://example.com", "https://test.com"]
        >>> results = scrape_urls(urls, config)
        >>> print(f"Scraped {len(results)} pages")
        Scraped 2 pages

    Note:
        This function respects robots.txt files and implements rate limiting
        to avoid overwhelming target servers.
    """
When adding new features, update the README with:
- Feature description
- Usage examples
- Configuration options
- Any breaking changes
# Install documentation dependencies
pip install -r requirements-dev.txt
# Build documentation
cd docs/
make html
# Serve documentation locally
python -m http.server 8000 -d _build/html
When contributing, consider these security aspects:
- Input Validation: Validate all external inputs (see the sketch after this list)
- SQL Injection: Use parameterized queries
- Path Traversal: Validate file paths
- XSS Prevention: Sanitize any user-controlled output
- Rate Limiting: Implement proper rate limiting
- Authentication: Secure any authentication mechanisms
- Dependency Security: Keep dependencies updated
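To make the first two points concrete, here is a minimal sketch of URL validation and path-traversal protection using only the standard library; the function names are illustrative and not part of the toolkit's public API:

from pathlib import Path
from urllib.parse import urlparse


def validate_public_url(url: str) -> str:
    """Reject anything that is not a well-formed HTTP/HTTPS URL."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Invalid URL: {url!r}")
    return url


def resolve_output_path(base_dir: str, filename: str) -> Path:
    """Resolve filename inside base_dir and refuse paths that escape it."""
    base = Path(base_dir).resolve()
    candidate = (base / filename).resolve()
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"Unsafe output path: {filename!r}")
    return candidate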
# Run security scans
bandit -r ai_data_collector/
safety check
semgrep --config=auto ai_data_collector/
# Check for vulnerabilities in dependencies
pip-audit
Do not report security vulnerabilities through public GitHub issues.
Instead, please send an email to [[email protected]] with:
- Description of the vulnerability
- Steps to reproduce
- Potential impact
- Suggested fix (if any)
We will respond within 48 hours and provide a timeline for the fix.
- Profile before optimizing: Use profiling tools to identify bottlenecks (see the sketch after this list)
- Memory efficiency: Consider memory usage for large datasets
- Concurrent processing: Use appropriate concurrency levels
- Caching: Implement caching where appropriate
- Database queries: Optimize database interactions
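As a starting point for the first guideline, the standard-library cProfile can locate bottlenecks before any optimization work begins. A minimal sketch in which expensive_step stands in for whatever toolkit code path you are measuring:

import cProfile
import pstats


def expensive_step(items):
    # Placeholder for the code path under investigation
    return [item.upper() for item in items]


profiler = cProfile.Profile()
profiler.enable()
expensive_step(["example"] * 100_000)
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)  # 10 most expensive calls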
@pytest.mark.benchmark
def test_data_processing_performance(benchmark, sample_data):
    """Benchmark data processing performance."""
    processor = DataProcessor(config)
    result = benchmark(processor.process_batch, sample_data)

    # Performance assertions
    assert len(result) == len(sample_data)

    # Processing should complete within reasonable time
    assert benchmark.stats['mean'] < 0.1  # 100ms per item
We follow Semantic Versioning:
- MAJOR: Breaking changes
- MINOR: New features (backward compatible)
- PATCH: Bug fixes (backward compatible)
- All tests pass
- Documentation updated
- CHANGELOG.md updated
- Version numbers updated
- Security scan completed
- Performance benchmarks stable
- Migration guide created (if needed)
- Prepare release branch:

  git checkout develop
  git pull origin develop
  git checkout -b release/v1.1.0

- Update version numbers:
  - ai_data_collector/__init__.py
  - pyproject.toml
  - docker/Dockerfile (if applicable)

- Update CHANGELOG.md:
  - Move items from "Unreleased" to new version section
  - Add release date
  - Ensure all changes are documented

- Create pull request:
  - From release branch to main
  - Include comprehensive testing
  - Require maintainer review

- Tag and publish:

  git tag v1.1.0
  git push origin v1.1.0

- Post-release:
  - Merge back to develop
  - Update Docker images
  - Announce on relevant channels
- GitHub Issues: Bug reports and feature requests
- GitHub Discussions: General questions and community discussions
- Email: Direct contact for sensitive issues
- Documentation: Comprehensive guides and API reference
- Security issues: Within 48 hours
- Bug reports: Within 1 week
- Feature requests: Within 2 weeks
- Pull requests: Within 1 week
- Automated checks: All CI checks must pass
- Security review: Automated security scanning
- Code review: At least one maintainer approval required
- Testing: Comprehensive test coverage required
- Documentation: Documentation updates required for new features
Your contributions make this project better for everyone. Whether you're fixing a typo, adding a feature, or improving documentation, every contribution is valuable and appreciated.
Happy coding!