Dataset Search Engine is a multi-source dataset discovery platform built in the Web Search Engine repository. It combines a Python ingestion and indexing pipeline with a Node.js API and a React frontend so users can search, filter, and explore datasets from several public sources through one interface.
This project is designed to:
- aggregate dataset metadata from multiple public providers
- normalize records into one consistent schema
- validate and clean the merged catalog
- build an inverted index for fast local search
- expose ranked search and catalog APIs
- provide a polished browser UI for exploration
- optionally run live "Deep Search" queries against source systems for fresh results
The current implementation includes:
- indexed catalog search over generated local JSON data
- filter support for source, format, and task type
- weighted ranking tuned for dataset discovery intent
- live deep-search requests for broader source coverage
- data quality scripts for validation and link checking
- a legacy Python prototype kept in the repo for reference
- Frontend: React
- API: Node.js + Express
- Data pipeline: Python
- Storage: generated JSON artifacts in
data/
- Python source adapters fetch dataset metadata from Hugging Face, UCI, Kaggle, Data.gov, and GitHub.
- The pipeline validates, deduplicates, and normalizes records into
data/merged_datasets.json. - The index builder creates
data/inverted_index.jsonfor search-time retrieval. - The Node API reads those generated files directly and serves ranked search responses.
- The React frontend calls the API and renders search, filters, and result cards.
- Optional Deep Search bypasses the local index and queries live external sources at request time.
More detail lives in docs/ARCHITECTURE.md.
Web Search Engine/
|-- backend/ Node API and live deep-search integrations
|-- crawler/ Legacy crawling utilities
|-- data/ Seeded and generated catalog/index artifacts
|-- docs/ Repository documentation
|-- frontend/ React application
|-- scripts/ Pipeline orchestration and maintenance scripts
|-- search/ Legacy Python search prototype
|-- search_engine/ Legacy Flask/Node experiment code
|-- sources/ Dataset source adapters and validation helpers
|-- .env.example
|-- CONTRIBUTING.md
`-- README.md
- Hugging Face
- UCI Machine Learning Repository
- Kaggle
- Data.gov
- GitHub repositories with dataset signals
Some adapters work with graceful fallbacks or partial coverage when credentials or network access are limited. Deep Search coverage is intentionally broader but less controlled than the curated local index.
The indexed search flow prioritizes:
- title matches
- tags
- task types
- descriptions
- formats
The API also applies:
- phrase boosts
- source trust weighting
- intent expansions for specific query themes
- penalties for weak or mismatched results
This makes the local search experience stronger than a plain keyword lookup while staying easy to operate from flat files.
The primary API lives in backend/server.js.
GET /api/healthGET /api/search?q=...GET /api/datasets?page=1&limit=12GET /api/sourcesGET /api/deep-search?q=...POST /api/reload
cd frontend
npm installcd backend
npm installpip install nltk huggingface_hub ucimlrepoOptional environment variables:
PORTKAGGLE_USERNAMEKAGGLE_KEYGITHUB_TOKEN
Start from .env.example.
cd backend
npm startcd frontend
npm startThe React app uses a development proxy to http://localhost:3001.
Fetch, clean, and merge data:
python scripts/fetch_all.pyValidate merged datasets:
python scripts/validate_datasets.pyCheck dataset links:
python scripts/check_links.pyBuild the inverted index:
python scripts/build_index.pyRun the full rebuild flow:
python scripts/rebuild_pipeline.pyCommon generated files in data/:
merged_datasets.jsoninverted_index.jsonfetch_report.jsonvalidation_report.jsonlink_check_report.jsonindex_report.json
These artifacts are environment-dependent. Counts may differ between runs depending on network access, third-party API behavior, credentials, and validation outcomes.
The repository contains older prototype code in search/, search_engine/, and parts of crawler/.
For active development, treat these as reference or historical implementation paths unless you are explicitly reviving them. The current product path is:
frontend/backend/scripts/sources/data/
This repository is best presented on GitHub with:
- this README as the entry point
- CONTRIBUTING.md for contributor guidance
- docs/ARCHITECTURE.md for engineering context
- docs/OPERATIONS.md for maintenance and runbook notes
- no formal CI workflow is defined in this repository yet
- backend automated tests are not present
- frontend still includes CRA default test scaffolding
- Kaggle live search is currently query-mapped rather than fully API-driven
- legacy directories should eventually be archived, documented further, or removed
- add a
LICENSEfile once the intended license is confirmed - add issue and pull request templates
- add screenshots or a short demo GIF for the UI
- document deployment expectations if the app will be hosted