Improving Auto Detect Search #1857

## Summary

The current **Auto Detect** search in **STTM-WEB** is not reliably finding expected results when users:

* type a **full/partial panktee**
* type in **common roman spellings** (e.g. `jo mange thakur`)
* type with **misspellings / phonetic variations** (e.g. `jooo menge thakoor`)
* type what they remember “loosely” instead of exact spelling

This creates a poor user experience because users often search from memory, and the search should still find the expected shabad/panktee.

We currently use **Meilisearch**. Meilisearch already supports typo tolerance and configurable relevancy (ranking rules + searchable attribute order), so we should improve our indexing + query strategy to better support transliteration and fuzzy matching. ([Meilisearch docs: Typo tolerance settings][1])

---

## Problem Statement

### Current issues observed

1. **Full panktee search is weak**

   * Searching a full line / near-full line does not reliably return the correct shabad/panktee.

2. **Roman transliteration queries often fail**

   * Example:

     * `jo mange thakur`
     * `tu data datar`
   * Users expect the known shabad to appear, but results are missing or ranked poorly.

3. **Mistyped / phonetic queries fail**

   * Example:

     * `jooo menge thakoor`
   * Search should still return the intended shabad.

4. **Search seems overly dependent on “first letter of each word” style matching**

   * This is useful as a fallback, but should not be the primary behavior for most user searches.

---

## Goal

Make Auto Detect search behave more like how humans search:

* If user types **roman transliteration** (even imperfectly), return the intended shabad.
* If user types **Gurmukhi**, return strong Gurbani matches.
* If user types **acrostic/first-letter style**, still support that as fallback.
* If user types **meaning/English recall phrase**, support that as lower-priority fallback.

---

## Desired Ranking Priority (Auto Detect)

### For Roman queries (Latin script)

1. **Strong transliteration match** (exact / prefix / fuzzy) ✅ highest priority
2. **Gurmukhi direct match** (if relevant)
3. **First-letter/acrostic Gurbani match**
4. **Meaning/translation match** (lower priority fallback)

### For Gurmukhi queries

1. **Direct Gurbani match** ✅ highest priority
2. **First-letter/acrostic Gurbani match**
3. **Transliteration match** (if useful)
4. **Meaning/translation match**

---

## Proposed Solution (Implementation Direction)

### 1) Expand indexed search fields (Meilisearch documents)

For each searchable unit (shabad / panktee / line), index multiple searchable fields, not just one.

Suggested fields (example names):

* `gurbani` (original Gurmukhi text)
* `gurbani_first_letters` (existing/derived acrostic)
* `transliteration` (current transliteration)
* `transliteration_normalized` (normalized roman text for fuzzy matching)
* `meaning` / `translation` (English meaning / gloss)
* (optional) `transliteration_aliases` (common spellings if available)

**Why:** Meilisearch relevancy depends heavily on which attributes are searchable and their order. Earlier searchable attributes are treated as more relevant. ([Meilisearch docs: Attribute ranking order][2])
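
A minimal sketch of what one indexed document could look like, using the example field names above. The `id` / `shabad_id` keys, the helper names, and the way `gurbani_first_letters` is derived (first character of each whitespace-separated word) are assumptions for illustration, not the current STTM indexing logic.

```python
from typing import TypedDict


class PankteeDocument(TypedDict):
    id: str
    shabad_id: str
    gurbani: str
    gurbani_first_letters: str
    transliteration: str
    transliteration_normalized: str
    meaning: str


def first_letters(gurbani: str) -> str:
    """Assumed derivation: first character of each whitespace-separated word."""
    return "".join(word[0] for word in gurbani.split())


def build_document(line_id: str, shabad_id: str, gurbani: str,
                   transliteration: str, meaning: str) -> PankteeDocument:
    """Build one searchable unit with every field populated at index time."""
    return {
        "id": line_id,
        "shabad_id": shabad_id,
        "gurbani": gurbani,
        "gurbani_first_letters": first_letters(gurbani),
        "transliteration": transliteration,
        # Placeholder normalization: lowercase and collapse whitespace only.
        "transliteration_normalized": " ".join(transliteration.lower().split()),
        "meaning": meaning,
    }


# One possible global attribute order; earlier entries rank as more relevant.
SEARCHABLE_ATTRIBUTES = [
    "gurbani",
    "transliteration",
    "transliteration_normalized",
    "gurbani_first_letters",
    "meaning",
]
```

Whatever normalization is applied to user queries should also be applied when filling `transliteration_normalized`, otherwise the two sides will not line up.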

---

### 2) Add query normalization (app-side) before sending to Meilisearch

Roman-input search quality will improve significantly if we normalize user input before querying.

Examples of normalization:

* trim extra spaces
* lowercase
* collapse repeated characters:

  * `jooo` → `joo` / `jo` (configurable)
* normalize common phonetic variants:

  * `thakoor` → `thakur`
  * `menge` → `mange` (if rule-based mapping is safe)
* remove punctuation/noise

This can be done in a conservative way (don’t over-normalize).

> Note: Meilisearch typo tolerance helps, but it is gated by word length: by default a word must be at least 5 characters long to allow one typo and at least 9 to allow two, so short words like `jo` and `tu` get no typo tolerance. App-side normalization is therefore important for roman transliteration use cases. ([Meilisearch docs: Typo tolerance settings][1])
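
The normalization steps listed above could be sketched as follows. This is an illustrative, conservative version for Roman (Latin-script) queries only: the `PHONETIC_VARIANTS` table is a hypothetical two-entry stand-in for a curated mapping, and `max_repeat` is the assumed name for the configurable repeat limit.

```python
import re

# Hypothetical variant table; a real one would be curated and reviewed.
PHONETIC_VARIANTS = {
    "thakoor": "thakur",
    "menge": "mange",
}


def collapse_repeats(text: str, max_repeat: int) -> str:
    """Shorten any run of the same letter to at most `max_repeat` letters."""
    pattern = rf"([a-z])\1{{{max_repeat},}}"
    return re.sub(pattern, lambda m: m.group(1) * max_repeat, text)


def normalize_roman_query(raw: str, max_repeat: int = 1) -> str:
    """Normalize a Roman query; do not use this on Gurmukhi input."""
    text = raw.strip().lower()
    # Replace punctuation/noise with spaces (ASCII letters and digits survive).
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Runs of 3+ letters are almost always typos, so cap them at 2 first.
    tokens = collapse_repeats(text, 2).split()
    # Map known phonetic variants token by token.
    tokens = [PHONETIC_VARIANTS.get(token, token) for token in tokens]
    # Finally apply the configurable repeat limit and rejoin with single spaces.
    return collapse_repeats(" ".join(tokens), max_repeat)
```

With `max_repeat=1`, `jooo menge thakoor` becomes `jo mange thakur`; with `max_repeat=2` it becomes `joo mange thakur`. The character filter deliberately keeps only ASCII letters and digits, so Gurmukhi queries should bypass this function.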

---

### 3) Detect script type (Roman vs Gurmukhi) and run ranked search strategy

Add lightweight query classification:

* **Gurmukhi query**
* **Roman query**
* **Mixed query** (edge case)
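
A lightweight classifier for these three cases could look like the sketch below. It relies on the Gurmukhi Unicode block (U+0A00 to U+0A7F); the `Script` enum, the extra `UNKNOWN` value for queries with no letters, and the function name are assumptions for illustration.

```python
from enum import Enum


class Script(Enum):
    GURMUKHI = "gurmukhi"
    ROMAN = "roman"
    MIXED = "mixed"
    UNKNOWN = "unknown"  # e.g. digits or punctuation only


def detect_script(query: str) -> Script:
    """Classify a query by which scripts its characters belong to."""
    has_gurmukhi = any("\u0a00" <= ch <= "\u0a7f" for ch in query)
    has_roman = any(ch.isascii() and ch.isalpha() for ch in query)
    if has_gurmukhi and has_roman:
        return Script.MIXED
    if has_gurmukhi:
        return Script.GURMUKHI
    if has_roman:
        return Script.ROMAN
    return Script.UNKNOWN
```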

Then run search in priority order (multi-pass or weighted merge):

#### Option A (recommended): Multi-pass search + merge in app layer

Run multiple queries (or searches against different fields) and merge results with explicit priority buckets.

Example for Roman query:

* Pass 1: transliteration + transliteration_normalized
* Pass 2: gurbani_first_letters
* Pass 3: meaning
* Merge + dedupe + preserve priority

This gives us deterministic behavior and avoids fighting global index settings.
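
A sketch of the merge step, assuming each pass is a separate search restricted to a subset of fields (Meilisearch exposes an `attributesToSearchOn` search parameter that could serve this purpose in recent versions). The `search` callable, the `id` key on hits, and the `bucket` annotation are hypothetical; the search backend is injected so the merge logic can be tested without a running index.

```python
from typing import Callable

# (query, fields to search on) -> ranked list of hits, each with an "id" key.
SearchFn = Callable[[str, list[str]], list[dict]]

# Pass order for Roman queries, taken from the example above.
ROMAN_PASSES = [
    ["transliteration", "transliteration_normalized"],
    ["gurbani_first_letters"],
    ["meaning"],
]


def multi_pass_search(query: str, passes: list[list[str]],
                      search: SearchFn, limit: int = 20) -> list[dict]:
    """Run each pass in priority order, dedupe by id, keep bucket order."""
    seen: set = set()
    merged: list[dict] = []
    for bucket, fields in enumerate(passes):
        for hit in search(query, fields):
            if hit["id"] in seen:
                continue  # already placed in a higher-priority bucket
            seen.add(hit["id"])
            merged.append({**hit, "bucket": bucket})
    return merged[:limit]
```

Because a hit keeps the bucket of the first pass that found it, a meaning match can never outrank a transliteration match for the same Roman query, which is the deterministic behavior Option A is after.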

#### Option B: Single-pass search with tuned searchable attribute order

Possible, but harder to make behave differently for Roman vs Gurmukhi queries.

---

### 4) Tune Meilisearch typo tolerance and relevancy settings

Meilisearch supports:

* typo tolerance settings
* ranking rules
* searchable attribute ordering ([Meilisearch docs: Typo tolerance settings][1])

We should evaluate:

* enabling/tuning typo tolerance on transliteration fields
* adjusting typo thresholds (`minWordSizeForTypos`) if needed
* ensuring transliteration fields participate in search and ranking correctly
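
As a starting point for that evaluation, a settings payload might look like the sketch below. The keys are standard Meilisearch index settings, but the threshold values (4 and 8 instead of the defaults 5 and 9) and the choice to disable typos on the first-letter field are untested assumptions to be validated against the benchmark queries.

```python
# Candidate index settings; values are illustrative, not tuned.
CANDIDATE_SETTINGS = {
    "searchableAttributes": [
        "gurbani",
        "transliteration",
        "transliteration_normalized",
        "gurbani_first_letters",
        "meaning",
    ],
    "typoTolerance": {
        "enabled": True,
        # Defaults are oneTypo=5, twoTypos=9; lowering them makes short
        # transliterated words more forgiving at the cost of more noise.
        "minWordSizeForTypos": {"oneTypo": 4, "twoTypos": 8},
        # Acrostic matching is letter-exact, so typos there likely hurt.
        "disableOnAttributes": ["gurbani_first_letters"],
    },
}


def valid_typo_sizes(settings: dict) -> bool:
    """Check the documented constraint 0 <= oneTypo <= twoTypos <= 255."""
    sizes = settings["typoTolerance"]["minWordSizeForTypos"]
    return 0 <= sizes["oneTypo"] <= sizes["twoTypos"] <= 255
```

A payload like this would be sent to the index settings endpoint (or through whichever client library the project already uses) and compared before/after on the fixed query set.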

---

### 5) Add observability for search quality (debug mode / logging)

For QA and tuning:

* log query
* detected script
* normalized query
* search passes run
* top N results returned
* which field matched (if available)
* ranking score (optional during dev/QA)

This will make it easier to iterate quickly and compare improvements.

Meilisearch returns a `_rankingScore` for each hit when the search request sets `showRankingScore: true`, which is useful for debugging relevance tuning. ([Meilisearch docs: Ranking score][3])
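
One way to shape the debug output is sketched below. `showRankingScore` and `showMatchesPosition` are Meilisearch search parameters, and `_rankingScore` / `_matchesPosition` are the corresponding fields on each hit; the function names and the log record layout are assumptions for illustration.

```python
def debug_search_params(debug: bool) -> dict:
    """Extra search parameters to request only in dev/QA."""
    if not debug:
        return {}
    return {"showRankingScore": True, "showMatchesPosition": True}


def build_log_record(raw_query: str, script: str, normalized_query: str,
                     passes: list[list[str]], hits: list[dict],
                     top_n: int = 5) -> dict:
    """Collect the items listed above into one structured log entry."""
    return {
        "query": raw_query,
        "script": script,
        "normalized_query": normalized_query,
        "passes": passes,
        "top_results": [
            {
                "id": hit.get("id"),
                "score": hit.get("_rankingScore"),
                # _matchesPosition is keyed by the attribute that matched.
                "matched_fields": sorted(hit.get("_matchesPosition", {})),
            }
            for hit in hits[:top_n]
        ],
    }
```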

---

## Acceptance Criteria

### Functional

* [ ] Searching **`jo mange thakur`** returns the expected shabad in top results (ideally top 1–3)
* [ ] Searching **`tu data datar`** returns the expected shabad in top results
* [ ] Searching typo-heavy roman input like **`jooo menge thakoor`** still returns the intended shabad in results
* [ ] Searching a **full or near-full panktee in Gurmukhi** reliably returns correct result(s)
* [ ] Existing **first-letter/acrostic search** continues to work
* [ ] Meaning-based search still works as fallback and does not overpower direct Gurbani/transliteration matches

### Ranking behavior

* [ ] Roman queries prioritize transliteration matches over first-letter matches
* [ ] Gurmukhi queries prioritize direct Gurbani matches over transliteration/meaning matches
* [ ] Results are deduplicated when same shabad matches across multiple passes/fields

### Quality / Regression

* [ ] No major regression in search response time (define threshold)
* [ ] No major regression in known existing STTM search flows
* [ ] Test coverage added for representative Roman/Gurmukhi/fuzzy cases

---

## Suggested Test Queries (Initial QA Set)

### Roman exact/common spellings

* `jo mange thakur`
* `tu data datar`
* `har har naam nidhan hai`

### Roman fuzzy/mistyped

* `jooo menge thakoor`
* `jo maange thakur`
* `too data datar`

### Gurmukhi

* (full panktee exact)
* (partial panktee)
* (first-letter style query)

### Meaning fallback

* `those who ask from You`
* `giver of gifts` (or other known meaning phrases)
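
To compare before/after relevance on this QA set, a small harness could report how often the expected shabad lands in the top results. The sketch below assumes a `search` callable that returns ranked shabad IDs; the IDs in any benchmark list would be placeholders until the real expected shabads are filled in.

```python
from typing import Callable


def hit_rate_at_k(cases: list[tuple[str, str]],
                  search: Callable[[str], list[str]],
                  k: int = 3) -> float:
    """Fraction of (query, expected_id) cases found within the top k results."""
    if not cases:
        return 0.0
    found = sum(1 for query, expected in cases if expected in search(query)[:k])
    return found / len(cases)


def failing_queries(cases: list[tuple[str, str]],
                    search: Callable[[str], list[str]],
                    k: int = 3) -> list[str]:
    """Queries whose expected result is missing from the top k, for triage."""
    return [query for query, expected in cases if expected not in search(query)[:k]]
```

Running the same cases against the current index and against each rollout step gives a simple number to track, plus a list of queries that still need attention.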

---

## Out of Scope (for this ticket)

* Full semantic search / embeddings
* ML-based phonetic transliteration correction
* Personalized search ranking
* Cross-language intent understanding beyond current indexed fields

---

## Implementation Notes / Hints

* Start with **small controlled dataset** (few known shabads) for tuning.
* Compare before/after relevance using fixed benchmark queries.
* Prefer **incremental rollout**:

  1. Add fields + indexing
  2. Add query normalization
  3. Add script-aware ranking / multi-pass merge
  4. Tune typo tolerance + thresholds

---

## Why this matters

Users often remember:

* a few words,
* approximate roman spelling,
* a sound-alike version,
* or a meaning snippet.

Auto Detect should feel forgiving and intuitive — especially for Sangat searching from memory.

[1]: https://meilisearch.com/docs/learn/relevancy/typo_tolerance_settings "Typo tolerance settings - Meilisearch Documentation"
[2]: https://meilisearch.com/docs/learn/relevancy/attribute_ranking_order "Attribute ranking order - Meilisearch Documentation"
[3]: https://meilisearch.com/docs/learn/relevancy/ranking_score "Ranking score - Meilisearch Documentation"
