Improving Auto Detect Search #1857

Description

@indersinghkhalis

Summary

The current Auto Detect search in STTM-WEB is not reliably finding expected results when users:

  • type a full/partial panktee
  • type in common roman spellings (e.g. jo mange thakur)
  • type with misspellings / phonetic variations (e.g. jooo menge thakoor)
  • type what they remember “loosely” instead of exact spelling

This creates a poor user experience because users often search from memory, and the search should still find the expected shabad/panktee.

We currently use Meilisearch. Meilisearch already supports typo tolerance and configurable relevancy (ranking rules + searchable attribute order), so we should improve our indexing + query strategy to better support transliteration and fuzzy matching. ([meilisearch.com]1)


Problem Statement

Current issues observed

  1. Full panktee search is weak

    • Searching a full line / near-full line does not reliably return the correct shabad/panktee.
  2. Roman transliteration queries often fail

    • Example:

      • jo mange thakur
      • tu data datar
    • Users expect the known shabad to appear, but results are missing or ranked poorly.

  3. Mistyped / phonetic queries fail

    • Example:

      • jooo menge thakoor
    • Search should still return the intended shabad.

  4. Search seems overly dependent on “first letter of each shabad” style matching

    • This is useful as a fallback, but should not be the primary behavior for most user searches.

Goal

Make Auto Detect search behave more like how humans search:

  • If user types roman transliteration (even imperfectly), return the intended shabad.
  • If user types Gurmukhi, return strong Gurbani matches.
  • If user types acrostic/first-letter style, still support that as fallback.
  • If user types meaning/English recall phrase, support that as lower-priority fallback.

Desired Ranking Priority (Auto Detect)

For Roman queries (Latin script)

  1. Strong transliteration match (exact / prefix / fuzzy) ✅ highest priority
  2. Gurmukhi direct match (if relevant)
  3. First-letter/acrostic Gurbani match
  4. Meaning/translation match (lower priority fallback)

For Gurmukhi queries

  1. Direct Gurbani match ✅ highest priority
  2. First-letter/acrostic Gurbani match
  3. Transliteration match (if useful)
  4. Meaning/translation match

Proposed Solution (Implementation Direction)

1) Expand indexed search fields (Meilisearch documents)

For each searchable unit (shabad / panktee / line), index multiple searchable fields, not just one.

Suggested fields (example names):

  • gurbani (original Gurmukhi text)
  • gurbani_first_letters (existing/derived acrostic)
  • transliteration (current transliteration)
  • transliteration_normalized (normalized roman text for fuzzy matching)
  • meaning / translation (English meaning / gloss)
  • (optional) transliteration_aliases (common spellings if available)

Why: Meilisearch relevancy depends heavily on which attributes are searchable and their order. Earlier searchable attributes are treated as more relevant. ([meilisearch.com]2)
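As a rough illustration of the field list above, one indexed document per panktee might be assembled as follows. The field names are the suggested ones; the `build_document` and `first_letters` helpers, the `id` key, the sample values, and the way first letters are derived (first code point of each whitespace-separated word) are assumptions for this sketch, not STTM's existing pipeline.

```python
def first_letters(gurbani: str) -> str:
    """Assumed derivation: first code point of each whitespace-separated word."""
    return "".join(word[0] for word in gurbani.split())


def build_document(line_id: int, gurbani: str, transliteration: str, meaning: str) -> dict:
    """Assemble one Meilisearch document exposing several searchable fields."""
    return {
        "id": line_id,  # hypothetical primary key
        "gurbani": gurbani,
        "gurbani_first_letters": first_letters(gurbani),
        "transliteration": transliteration,
        # Minimal normalization only here: lowercase and single spaces.
        "transliteration_normalized": " ".join(transliteration.lower().split()),
        "meaning": meaning,
    }


# Sample values for illustration only.
doc = build_document(1, "ਤੂ ਦਾਤਾ ਦਾਤਾਰੁ", "Too Daataa  Daataar", "(sample meaning text)")
```

If a first-letter field already exists in the source data, it would be safer to copy it than to re-derive it.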


2) Add query normalization (app-side) before sending to Meilisearch

Roman-input search quality will improve significantly if we normalize user input before querying.

Examples of normalization:

  • trim extra spaces

  • lowercase

  • collapse repeated characters:

    • jooo → joo / jo (configurable)
  • normalize common phonetic variants:

    • thakoor → thakur
    • menge → mange (if rule-based mapping is safe)
  • remove punctuation/noise

This can be done in a conservative way (don’t over-normalize).
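A conservative version of the normalization steps above could look like the sketch below. The function name `normalize_roman_query`, the `max_repeat` parameter, and the `WORD_ALIASES` table are hypothetical; the alias entries are only the examples from this issue, and any real table would need review before use.

```python
import re

# Hypothetical alias table: extend only with mappings reviewed as safe.
WORD_ALIASES = {"joo": "jo", "thakoor": "thakur", "menge": "mange"}


def normalize_roman_query(query: str, max_repeat: int = 2) -> str:
    """Lowercase, strip punctuation, collapse long letter runs, apply aliases."""
    text = query.lower()
    # Replace punctuation/noise with spaces so words stay separated.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of one letter longer than max_repeat, e.g. "jooo" -> "joo".
    pattern = r"([a-z])\1{%d,}" % max_repeat
    text = re.sub(pattern, lambda m: m.group(1) * max_repeat, text)
    # Word-level aliases are easier to audit than character-level rewrites.
    words = [WORD_ALIASES.get(word, word) for word in text.split()]
    # Joining on single spaces also trims and collapses whitespace.
    return " ".join(words)


normalized = normalize_roman_query("  Jooo  menge, Thakoor! ")
```

Sending both the raw and the normalized query (for example, in separate passes) would limit the damage if a rule turns out to be too aggressive.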

Note: Meilisearch typo tolerance helps, but by default it scales with word length: words under 5 characters (e.g., jo, tu) must match exactly, words of 5 to 8 characters allow one typo, and longer words allow two. App-side normalization is therefore important for roman transliteration use cases. ([meilisearch.com]1)
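To make the word-length limit concrete, here is a simplified model of the default thresholds (`oneTypo` = 5, `twoTypos` = 9). The `allowed_typos` helper is hypothetical and ignores other engine details, such as how a typo on the first character is weighted.

```python
def allowed_typos(word: str, one_typo: int = 5, two_typos: int = 9) -> int:
    """Simplified model: typos tolerated for a query word of this length."""
    if len(word) >= two_typos:
        return 2
    if len(word) >= one_typo:
        return 1
    return 0


# Per-word tolerance for one of the problem queries from this issue.
tolerance = {word: allowed_typos(word) for word in "jooo menge thakoor".split()}
```

Under this model "jooo" gets no tolerance at all, which is why the short words in a panktee are the ones most in need of app-side normalization.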


3) Detect script type (Roman vs Gurmukhi) and run ranked search strategy

Add lightweight query classification:

  • Gurmukhi query
  • Roman query
  • Mixed query (edge case)
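A lightweight classifier can rely on the Gurmukhi Unicode block (U+0A00 to U+0A7F). This is a minimal sketch; the `detect_script` name, the returned labels, and the choice to treat queries with no letters as "roman" are assumptions.

```python
def detect_script(query: str) -> str:
    """Classify a query as 'gurmukhi', 'roman', or 'mixed'."""
    has_gurmukhi = any("\u0a00" <= ch <= "\u0a7f" for ch in query)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in query)
    if has_gurmukhi and has_latin:
        return "mixed"
    if has_gurmukhi:
        return "gurmukhi"
    # Assumed default: digits-only or empty input falls through to 'roman'.
    return "roman"


script = detect_script("jo mange thakur")
```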

Then run search in priority order (multi-pass or weighted merge):

Option A (recommended): Multi-pass search + merge in app layer

Run multiple queries (or searches against different fields) and merge results with explicit priority buckets.

Example for Roman query:

  • Pass 1: transliteration + transliteration_normalized
  • Pass 2: gurbani_first_letters
  • Pass 3: meaning
  • Merge + dedupe + preserve priority

This gives us deterministic behavior and avoids fighting global index settings.
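The merge step for Option A can be sketched as a pure function over per-pass results. The `merge_passes` name, the `matched_pass` annotation, and the sample ids are hypothetical; each pass could restrict matching to its fields with Meilisearch's `attributesToSearchOn` search parameter, assuming the deployed version supports it.

```python
from typing import Dict, List, Sequence, Tuple

Hit = Dict[str, object]


def merge_passes(passes: Sequence[Tuple[str, List[Hit]]], key: str = "id", limit: int = 20) -> List[Hit]:
    """Merge hits in pass order; the first pass to return a hit owns it."""
    merged: List[Hit] = []
    seen = set()
    for label, hits in passes:
        for hit in hits:
            if hit[key] in seen:
                continue  # already placed by a higher-priority pass
            seen.add(hit[key])
            merged.append({**hit, "matched_pass": label})
            if len(merged) == limit:
                return merged
    return merged


# Made-up ids standing in for results of three passes on a Roman query.
passes = [
    ("transliteration", [{"id": 7}, {"id": 3}]),
    ("first_letters", [{"id": 3}, {"id": 9}]),
    ("meaning", [{"id": 7}, {"id": 12}]),
]
merged = merge_passes(passes)
```

Passing a shabad-level key instead of `id` would deduplicate per shabad rather than per line, which is what the acceptance criteria ask for.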

Option B: Single-pass search with tuned searchable attribute order

Possible, but harder to make behave differently for Roman vs Gurmukhi queries.


4) Tune Meilisearch typo tolerance and relevancy settings

Meilisearch supports:

  • typo tolerance settings
  • ranking rules
  • searchable attribute ordering ([meilisearch.com]1)

We should evaluate:

  • tuning typo tolerance for transliteration fields (it is configured index-wide; fields that should match exactly can be excluded via disableOnAttributes)
  • adjusting typo thresholds (minWordSizeForTypos) if needed
  • ensuring transliteration fields participate in search and ranking correctly
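As a starting point for evaluation, the settings payload might look like the sketch below. The setting keys (`searchableAttributes`, `typoTolerance`, `minWordSizeForTypos`, `disableOnAttributes`) are Meilisearch's; the `build_index_settings` helper, the attribute order, and the threshold values 4 and 8 are placeholders to tune, not recommendations.

```python
def build_index_settings(one_typo: int = 4, two_typos: int = 8) -> dict:
    """Build a candidate Meilisearch settings payload for the search index."""
    if not 0 <= one_typo <= two_typos <= 255:
        raise ValueError("expected 0 <= one_typo <= two_typos <= 255")
    return {
        # Earlier attributes rank higher; a single global order cannot
        # favor both Roman and Gurmukhi queries, hence the multi-pass option.
        "searchableAttributes": [
            "gurbani",
            "transliteration",
            "transliteration_normalized",
            "gurbani_first_letters",
            "meaning",
        ],
        "typoTolerance": {
            "enabled": True,
            "minWordSizeForTypos": {"oneTypo": one_typo, "twoTypos": two_typos},
            # Assumption: acrostic matching should stay exact.
            "disableOnAttributes": ["gurbani_first_letters"],
        },
    }


settings = build_index_settings()
```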

5) Add observability for search quality (debug mode / logging)

For QA and tuning:

  • log query
  • detected script
  • normalized query
  • search passes run
  • top N results returned
  • which field matched (if available)
  • ranking score (optional during dev/QA)

This will make it easier to iterate quickly and compare improvements.

(Meilisearch can return a ranking score for each hit when showRankingScore is set in the search parameters, which is useful for debugging relevance tuning.) ([meilisearch.com]3)
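One possible shape for a debug log entry is sketched below. The `build_debug_record` helper and its keys are hypothetical; `_rankingScore` and `_matchesPosition` are the fields Meilisearch adds to hits when `showRankingScore` and `showMatchesPosition` are enabled, and `matched_pass` is assumed to be set by the app-side merge.

```python
import json
from typing import Dict, List


def build_debug_record(query: str, script: str, normalized: str,
                       passes_run: List[str], hits: List[Dict], top_n: int = 5) -> Dict:
    """Collect the fields listed above into one JSON-serializable record."""
    return {
        "query": query,
        "detected_script": script,
        "normalized_query": normalized,
        "passes_run": passes_run,
        "top_results": [
            {
                "id": hit.get("id"),
                "matched_pass": hit.get("matched_pass"),
                # Attribute names that contained a match, if reported.
                "matched_fields": sorted(hit.get("_matchesPosition", {})),
                "ranking_score": hit.get("_rankingScore"),
            }
            for hit in hits[:top_n]
        ],
    }


# Made-up hit for illustration.
sample_hits = [{"id": 7, "matched_pass": "transliteration", "_rankingScore": 0.91,
                "_matchesPosition": {"transliteration": [{"start": 0, "length": 2}]}}]
record = build_debug_record("jooo menge thakoor", "roman", "jo mange thakur",
                            ["transliteration", "first_letters", "meaning"], sample_hits)
log_line = json.dumps(record, ensure_ascii=False)
```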


Acceptance Criteria

Functional

  • Searching jo mange thakur returns the expected shabad in top results (ideally top 1–3)
  • Searching tu data datar returns the expected shabad in top results
  • Searching typo-heavy roman input like jooo menge thakoor still returns the intended shabad in results
  • Searching a full or near-full panktee in Gurmukhi reliably returns correct result(s)
  • Existing first-letter/acrostic search continues to work
  • Meaning-based search still works as fallback and does not overpower direct Gurbani/transliteration matches

Ranking behavior

  • Roman queries prioritize transliteration matches over first-letter matches
  • Gurmukhi queries prioritize direct Gurbani matches over transliteration/meaning matches
  • Results are deduplicated when same shabad matches across multiple passes/fields

Quality / Regression

  • No major regression in search response time (define threshold)
  • No major regression in known existing STTM search flows
  • Test coverage added for representative Roman/Gurmukhi/fuzzy cases

Suggested Test Queries (Initial QA Set)

Roman exact/common spellings

  • jo mange thakur
  • tu data datar
  • har har naam nidhan hai

Roman fuzzy/mistyped

  • jooo menge thakoor
  • jo maange thakur
  • too data datar

Gurmukhi

  • (full panktee exact)
  • (partial panktee)
  • (first-letter style query)

Meaning fallback

  • those who ask from You
  • giver of gifts (or other known meaning phrases)

Out of Scope (for this ticket)

  • Full semantic search / embeddings
  • ML-based phonetic transliteration correction
  • Personalized search ranking
  • Cross-language intent understanding beyond current indexed fields

Implementation Notes / Hints

  • Start with small controlled dataset (few known shabads) for tuning.

  • Compare before/after relevance using fixed benchmark queries.

  • Prefer incremental rollout:

    1. Add fields + indexing
    2. Add query normalization
    3. Add script-aware ranking / multi-pass merge
    4. Tune typo tolerance + thresholds
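For the before/after comparison with fixed benchmark queries, a small top-k hit-rate metric may be enough to start. Everything here is a sketch: `top_k_hit_rate`, the `fake_search` stand-in, and the ids are made up, and the real search call would replace `fake_search`.

```python
from typing import Callable, List, Sequence, Tuple


def top_k_hit_rate(search: Callable[[str], List[int]],
                   cases: Sequence[Tuple[str, int]], k: int = 3) -> float:
    """Fraction of benchmark queries whose expected id is in the top k."""
    if not cases:
        return 0.0
    found = sum(1 for query, expected in cases if expected in search(query)[:k])
    return found / len(cases)


# Stand-in for the real search; maps a query to ranked result ids.
FAKE_RESULTS = {
    "jo mange thakur": [101, 55, 9],
    "jooo menge thakoor": [55, 9, 4, 101],
}


def fake_search(query: str) -> List[int]:
    return FAKE_RESULTS.get(query, [])


cases = [("jo mange thakur", 101), ("jooo menge thakoor", 101)]
rate = top_k_hit_rate(fake_search, cases, k=3)
```

Recording this number after each rollout step would show which step actually moved relevance.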

Why this matters

Users often remember:

  • a few words,
  • approximate roman spelling,
  • a sound-alike version,
  • or a meaning snippet.

Auto Detect should feel forgiving and intuitive — especially for Sangat searching from memory.
