Releases: laurentftech/KidSearch-Backend

v2.0.0 — Typesense

20 Feb 23:08
v2.0.0
d0579b3

Breaking changes

  • Replaced Meilisearch with Typesense as the search engine
    • Better performance, simpler configuration and native vector search support
    • ⚠️ Migration required: re-index all content with the crawler after upgrading

New features

  • Federated search: Typesense + Google CSE + Wiki (Vikidia) results combined and reranked
  • Semantic reranking via a local HuggingFace TEI service — no external API required
  • Knowledge panel API endpoint
  • Authentication system: proxy (Caddy + authcrunch), OIDC, Google OAuth, GitHub OAuth, simple password
  • Interactive setup: make setup or a one-line curl | bash installer that asks 2 questions and auto-generates secrets
  • Monitoring dashboard: crawler control, live search testing, API metrics, logs viewer
  • Multi-arch Docker image (linux/amd64, linux/arm64)
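
The federated search and semantic reranking items above can be pictured with a minimal sketch. It is not the project's implementation: `Hit`, `federate`, and the `embed` callable are hypothetical names, and the scoring (cosine similarity between the query and each title) is an assumed stand-in for whatever the TEI-backed reranker actually computes.

```python
import math
from dataclasses import dataclass
from typing import Callable, Iterable, Sequence

Vector = Sequence[float]


@dataclass(frozen=True)
class Hit:
    """One search result (hypothetical shape, for illustration only)."""
    url: str
    title: str
    source: str  # e.g. "typesense", "google_cse", "wiki"


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def federate(
    query: str,
    result_lists: Iterable[Iterable[Hit]],
    embed: Callable[[str], Vector],
) -> list[Hit]:
    """Merge hits from several sources, drop duplicate URLs, and rerank.

    `embed` is an assumption: in a real deployment it would call an
    embedding service; here any text-to-vector function works.
    """
    unique: dict[str, Hit] = {}
    for hits in result_lists:
        for hit in hits:
            # Keep the first hit seen for each URL.
            unique.setdefault(hit.url, hit)

    query_vec = embed(query)
    # Highest similarity to the query first.
    return sorted(
        unique.values(),
        key=lambda h: cosine(query_vec, embed(h.title)),
        reverse=True,
    )
```

Passing `embed` as a parameter keeps the merge and rerank logic independent of any particular embedding backend, so it can be exercised with a toy function in tests.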

Improvements

  • All-in-one Docker image (dashboard + API in a single container)
  • Matrix CI testing across Python 3.10, 3.11 and 3.12
  • Wiki documentation: Authentication, Production Deployment, Environment Variables reference
  • Security: fixed CodeQL alerts (open redirect, clear-text secret logging)

Quick install

curl -fsSL https://raw.githubusercontent.com/laurentftech/KidSearch-Backend/main/scripts/install.sh | bash

Full documentation: Wiki

Version 1.0.0 - Initial Release

09 Oct 20:42
v1.0.0
789bd82

This is the first official release of the KidSearch Crawler, a high-performance, asynchronous web crawler that populates a Meilisearch instance with content from various web sources. It provides a flexible framework for data collection, with features for crawling modern websites efficiently and respectfully.

Key Features

  • Asynchronous Crawling: Built with asyncio and aiohttp for high-speed, concurrent crawling of multiple sites.
  • Flexible Data Sources: Supports both standard HTML websites and structured JSON APIs as content sources.
  • Incremental Indexing: Uses a local cache to re-index only pages that have changed, significantly speeding up subsequent crawls.
  • Crawl Resumption: Automatically saves its state and resumes crawling large sites that were not fully indexed in a previous session due to page limits.
  • Intelligent Content Extraction: Leverages trafilatura for robust main content detection, with fallbacks to custom heuristics and manual CSS selectors for complex layouts.
  • Multi-lingual Support: Automatically detects the language of HTML pages and allows manual setting for JSON sources, enabling language-specific filtering.
  • Good Web Citizenship: Fully respects robots.txt directives, including Crawl-delay, and comes with a built-in list of common URL patterns to exclude (e.g., login pages, shopping carts).
  • Rich Configuration: All crawl targets, rules, and parameters are managed through a single, easy-to-understand sites.yml file.
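
As an illustration of the incremental indexing idea in the list above, here is a minimal sketch of change detection with a local cache. It assumes a JSON file mapping each URL to a SHA-256 hash of its extracted text; the `CrawlCache` class, its methods, and the file format are hypothetical and may differ from what the crawler actually stores.

```python
import hashlib
import json
from pathlib import Path


def content_hash(text: str) -> str:
    """Stable fingerprint of a page's extracted text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class CrawlCache:
    """Hypothetical URL -> content-hash cache persisted as JSON."""

    def __init__(self, path: Path) -> None:
        self.path = path
        # Start empty when no cache file exists yet (first crawl).
        self.hashes: dict[str, str] = (
            json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
        )

    def has_changed(self, url: str, text: str) -> bool:
        """True when the page is new or its text differs from the cached hash."""
        return self.hashes.get(url) != content_hash(text)

    def record(self, url: str, text: str) -> None:
        """Remember the current content so the next crawl can skip it."""
        self.hashes[url] = content_hash(text)

    def save(self) -> None:
        """Write the cache back to disk."""
        self.path.write_text(json.dumps(self.hashes, indent=2), encoding="utf-8")
```

A crawl loop under these assumptions would call `has_changed` after extraction, send only changed pages to the search index, then call `record` and finally `save`.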

This release establishes a solid foundation for the KidSearch project's data indexing pipeline.