Releases: laurentftech/KidSearch-Backend
v2.0.0 — Typesense
Breaking changes
- Replaced MeiliSearch with Typesense as the search engine
- Better performance, simpler configuration and native vector search support
⚠️ Migration required: re-index all content with the crawler after upgrading
New features
- Federated search: Typesense + Google CSE + Wiki (Vikidia) results combined and reranked
- Semantic reranking via a local HuggingFace TEI service — no external API required
- Knowledge panel API endpoint
- Authentication system: proxy (Caddy + authcrunch), OIDC, Google OAuth, GitHub OAuth, simple password
- Interactive setup: make setup or one-liner curl | bash installer (2 questions, secrets auto-generated)
- Monitoring dashboard: crawler control, live search testing, API metrics, logs viewer
- Multi-arch Docker image (linux/amd64, linux/arm64)
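The federated search item above combines results from several sources and reranks them. The following is a minimal sketch of that general idea, not the project's actual implementation: the `Result` type, `merge_results`, `rerank`, and the `word_overlap` scorer are hypothetical names, and the word-overlap score is only a stand-in for the semantic reranker that the release delegates to a TEI service.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Result:
    """Hypothetical result record; the real API schema is not shown in the notes."""
    url: str
    title: str
    source: str  # e.g. "typesense", "google_cse", "wiki"


def merge_results(*result_lists: Iterable[Result]) -> list[Result]:
    """Combine results from several sources, keeping the first hit per URL."""
    seen: set[str] = set()
    merged: list[Result] = []
    for results in result_lists:
        for result in results:
            key = result.url.rstrip("/")  # treat trailing-slash variants as one page
            if key not in seen:
                seen.add(key)
                merged.append(result)
    return merged


def rerank(query: str, results: list[Result],
           score: Callable[[str, str], float]) -> list[Result]:
    """Sort results by a relevance score, highest first."""
    return sorted(results, key=lambda r: score(query, r.title), reverse=True)


def word_overlap(query: str, text: str) -> float:
    """Stand-in scorer: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


local = [Result("https://example.org/volcano", "How a volcano erupts", "typesense")]
web = [
    Result("https://example.org/volcano/", "Volcano", "google_cse"),
    Result("https://example.org/lava", "What is lava made of", "google_cse"),
]
merged = merge_results(local, web)
ranked = rerank("volcano erupts", merged, word_overlap)
```

In a real deployment the `score` callable would presumably wrap a call to the reranking service; keeping it as a parameter lets the merge logic be tested without that dependency.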
Improvements
- All-in-one Docker image (dashboard + API in a single container)
- Matrix CI testing across Python 3.10, 3.11 and 3.12
- Wiki documentation: Authentication, Production Deployment, Environment Variables reference
- Security: fixed CodeQL alerts (open redirect, clear-text secret logging)
Quick install
curl -fsSL https://raw.githubusercontent.com/laurentftech/KidSearch-Backend/main/scripts/install.sh | bash
Full documentation: Wiki
Version 1.0.0 - Initial Release
This is the first official release of the KidSearch Crawler, a high-performance, asynchronous web crawler that populates a Meilisearch instance with content from various web sources. It provides a flexible framework for data collection, with features aimed at crawling modern websites efficiently and respectfully.
Key Features
- Asynchronous Crawling: Built with asyncio and aiohttp for high-speed, concurrent crawling of multiple sites.
- Flexible Data Sources: Supports both standard HTML websites and structured JSON APIs as content sources.
- Incremental Indexing: Utilizes a local cache to intelligently re-index only pages that have changed, significantly speeding up subsequent crawls.
- Crawl Resumption: Automatically saves its state and resumes crawling large sites that were not fully indexed in a previous session due to page limits.
- Intelligent Content Extraction: Leverages trafilatura for robust main content detection, with fallbacks to custom heuristics and manual CSS selectors for complex layouts.
- Multi-lingual Support: Automatically detects the language of HTML pages and allows manual setting for JSON sources, enabling language-specific filtering.
- Good Web Citizenship: Fully respects robots.txt directives, including Crawl-delay, and comes with a built-in list of common URL patterns to exclude (e.g., login pages, shopping carts).
- Rich Configuration: All crawl targets, rules, and parameters are managed through a single, easy-to-understand sites.yml file.
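The incremental indexing feature above relies on a local cache to skip unchanged pages. One common way to do this, shown here as a sketch under assumptions rather than the crawler's actual code, is to store a hash of each page's extracted content and compare it on the next crawl. The `ContentCache` class, the JSON file format, and the choice of SHA-256 are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class ContentCache:
    """Hypothetical URL -> content-hash cache persisted as a JSON file."""

    def __init__(self, path: Path) -> None:
        self.path = path
        # Load hashes from a previous crawl if the cache file exists.
        self.hashes: dict[str, str] = (
            json.loads(path.read_text()) if path.exists() else {}
        )

    def has_changed(self, url: str, content: str) -> bool:
        """Return True if the content is new or differs from the cached
        version, and record the new hash in that case."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if self.hashes.get(url) == digest:
            return False
        self.hashes[url] = digest
        return True

    def save(self) -> None:
        """Write the cache to disk so the next crawl can reuse it."""
        self.path.write_text(json.dumps(self.hashes))


url = "https://example.org/volcano"
with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "cache.json"

    cache = ContentCache(cache_file)
    first_seen = cache.has_changed(url, "Volcanoes erupt when...")  # new page
    unchanged = cache.has_changed(url, "Volcanoes erupt when...")   # same content
    cache.save()

    # A later crawl session reloads the cache from disk.
    next_session = ContentCache(cache_file)
    still_unchanged = next_session.has_changed(url, "Volcanoes erupt when...")
    edited = next_session.has_changed(url, "Volcanoes erupt because...")
```

Only pages for which `has_changed` returns `True` would need to be sent to the search index.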
This release establishes a solid foundation for the KidSearch project's data indexing pipeline.
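As an illustration of the robots.txt handling listed among the key features, the sketch below uses Python's standard `urllib.robotparser` to check whether a URL may be fetched and to read a `Crawl-delay` value. Whether the crawler itself uses this module is an assumption; the robots.txt content and the `ExampleCrawler` user agent are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real crawler would download this per site.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /login
Disallow: /cart
"""

USER_AGENT = "ExampleCrawler"  # hypothetical user agent name

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch() applies the Disallow rules that match this user agent.
article_allowed = parser.can_fetch(USER_AGENT, "https://example.org/articles/volcano")
login_allowed = parser.can_fetch(USER_AGENT, "https://example.org/login")

# crawl_delay() returns the Crawl-delay in seconds, or None if none is set.
delay = parser.crawl_delay(USER_AGENT)
```

A crawler would typically wait `delay` seconds between requests to the same host and skip any URL for which `can_fetch` returns `False`.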