Dataset Search Engine

Dataset Search Engine is a multi-source dataset discovery platform built in the Web Search Engine repository. It combines a Python ingestion and indexing pipeline with a Node.js API and a React frontend so users can search, filter, and explore datasets from several public sources through one interface.

Repository Purpose

This project is designed to:

aggregate dataset metadata from multiple public providers
normalize records into one consistent schema
validate and clean the merged catalog
build an inverted index for fast local search
expose ranked search and catalog APIs
provide a polished browser UI for exploration
optionally run live "Deep Search" queries against source systems for fresh results

Current Product Scope

The current implementation includes:

indexed catalog search over generated local JSON data
filter support for source, format, and task type
weighted ranking tuned for dataset discovery intent
live deep-search requests for broader source coverage
data quality scripts for validation and link checking
a legacy Python prototype kept in the repo for reference

Tech Stack

Frontend: React
API: Node.js + Express
Data pipeline: Python
Storage: generated JSON artifacts in data/

High-Level Architecture

Python source adapters fetch dataset metadata from Hugging Face, UCI, Kaggle, Data.gov, and GitHub.
The pipeline validates, deduplicates, and normalizes records into data/merged_datasets.json.
The index builder creates data/inverted_index.json for search-time retrieval.
The Node API reads those generated files directly and serves ranked search responses.
The React frontend calls the API and renders search, filters, and result cards.
Optional Deep Search bypasses the local index and queries live external sources at request time.

More detail lives in docs/ARCHITECTURE.md.

Repository Layout

Web Search Engine/
|-- backend/               Node API and live deep-search integrations
|-- crawler/               Legacy crawling utilities
|-- data/                  Seeded and generated catalog/index artifacts
|-- docs/                  Repository documentation
|-- frontend/              React application
|-- scripts/               Pipeline orchestration and maintenance scripts
|-- search/                Legacy Python search prototype
|-- search_engine/         Legacy Flask/Node experiment code
|-- sources/               Dataset source adapters and validation helpers
|-- .env.example
|-- CONTRIBUTING.md
`-- README.md

Supported Sources

Hugging Face
UCI Machine Learning Repository
Kaggle
Data.gov
GitHub repositories with dataset signals

Some adapters work with graceful fallbacks or partial coverage when credentials or network access are limited. Deep Search coverage is intentionally broader but less controlled than the curated local index.

Search Experience

The indexed search flow prioritizes:

title matches
tags
task types
descriptions
formats

The API also applies:

phrase boosts
source trust weighting
intent expansions for specific query themes
penalties for weak or mismatched results

This makes the local search experience stronger than a plain keyword lookup while staying easy to operate from flat files.

API Summary

The primary API lives in backend/server.js.

GET /api/health
GET /api/search?q=...
GET /api/datasets?page=1&limit=12
GET /api/sources
GET /api/deep-search?q=...
POST /api/reload

Quick Start

1. Install frontend dependencies

cd frontend
npm install

2. Install backend dependencies

cd backend
npm install

3. Install Python dependencies

pip install nltk huggingface_hub ucimlrepo

Optional environment variables:

PORT
KAGGLE_USERNAME
KAGGLE_KEY
GITHUB_TOKEN

Start from .env.example.

4. Run the backend

cd backend
npm start

5. Run the frontend

cd frontend
npm start

The React app uses a development proxy to http://localhost:3001.

Data Pipeline Commands

Fetch, clean, and merge data:

python scripts/fetch_all.py

Validate merged datasets:

python scripts/validate_datasets.py

Check dataset links:

python scripts/check_links.py

Build the inverted index:

python scripts/build_index.py

Run the full rebuild flow:

python scripts/rebuild_pipeline.py

Generated Artifacts

Common generated files in data/:

merged_datasets.json
inverted_index.json
fetch_report.json
validation_report.json
link_check_report.json
index_report.json

These artifacts are environment-dependent. Counts may differ between runs depending on network access, third-party API behavior, credentials, and validation outcomes.

Legacy Areas

The repository contains older prototype code in search/, search_engine/, and parts of crawler/.

For active development, treat these as reference or historical implementation paths unless you are explicitly reviving them. The current product path is:

frontend/
backend/
scripts/
sources/
data/

Recommended GitHub Usage

This repository is best presented on GitHub with:

this README as the entry point
CONTRIBUTING.md for contributor guidance
docs/ARCHITECTURE.md for engineering context
docs/OPERATIONS.md for maintenance and runbook notes

Known Gaps

no formal CI workflow is defined in this repository yet
backend automated tests are not present
frontend still includes CRA default test scaffolding
Kaggle live search is currently query-mapped rather than fully API-driven
legacy directories should eventually be archived, documented further, or removed

Next Documentation Improvements

add a LICENSE file once the intended license is confirmed
add issue and pull request templates
add screenshots or a short demo GIF for the UI
document deployment expectations if the app will be hosted

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Dataset Search Engine

Repository Purpose

Current Product Scope

Tech Stack

High-Level Architecture

Repository Layout

Supported Sources

Search Experience

API Summary

Quick Start

1. Install frontend dependencies

2. Install backend dependencies

3. Install Python dependencies

4. Run the backend

5. Run the frontend

Data Pipeline Commands

Generated Artifacts

Legacy Areas

Recommended GitHub Usage

Known Gaps

Next Documentation Improvements

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
backend		backend
crawler		crawler
data		data
docs		docs
frontend		frontend
indexer		indexer
scripts		scripts
search		search
search_engine		search_engine
sources		sources
.env.example		.env.example
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

Dataset Search Engine

Repository Purpose

Current Product Scope

Tech Stack

High-Level Architecture

Repository Layout

Supported Sources

Search Experience

API Summary

Quick Start

1. Install frontend dependencies

2. Install backend dependencies

3. Install Python dependencies

4. Run the backend

5. Run the frontend

Data Pipeline Commands

Generated Artifacts

Legacy Areas

Recommended GitHub Usage

Known Gaps

Next Documentation Improvements

About

Topics

Resources

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages