[Bug]: find_text returns false on low-contrast UI text that read_text_in_region transcribes cleanly — confidence threshold vs source profile #404

Description

@julienmerconsulting

Summary

Region.findText() returns null on low-contrast UI text (e.g. grey-on-white inactive tab labels) that Region.text() transcribes correctly. Both paths share the same TextRecognizer.optimize() pre-processing, but they diverge at the Tess4J call: getWords() filters by confidence to avoid false positives — and the legitimately-low confidence on faint glyphs gets rejected — while doOCR() keeps everything in the transcript. It is not, strictly, a bug. It is a precision-vs-recall tradeoff that calls for an architectural refinement, not a patch.

Steps to reproduce

1. Open https://coinmarketcap.com/ (or any modern UI with grey-on-white inactive tabs)
2. From a Jython script, run:
   region = Region(960, 0, 960, 600)
   m  = region.findText("Trending")    # → null
   tx = region.text()                  # → contains "Trending" in the transcript
3. Observe: the tab is clearly visible to a human and to read_text_in_region's transcription, but find_text refuses to localise it.

Expected behavior

Region.findText("Trending") should return a Match with bounding box coordinates for any text a human can read in the region — including low-contrast glyphs that Region.text() already transcribes correctly.

Actual behavior

findText returns null. The mismatch lives strictly between two Tess4J entry points called downstream of the identical optimize():

                    Region.text()                Region.findText()
Inner path          OCR.readText()               Finder.doFindText() → OCR.readLines()
Tess4J call         doOCR(): full transcript     getWords(): bbox + per-word confidence
Confidence filter   none, keeps everything       rejects below internal threshold
PSM                 SINGLE_BLOCK (6) forced      AUTO (3) default

The grey-on-white "Trending" glyph clears Tesseract LSTM's character recognition (it appears in the transcript), but its computed per-word confidence is in the 40–50 % range. getWords() discards it as a precaution against returning bounding boxes you would not want to click on — which is the right default for visual automation, where a false positive at the wrong coordinates costs more than a false negative.
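The divergence described above can be sketched in a few lines of standalone Python 3 (not a SikuliX script). Everything here is illustrative: the `Word` type, the confidence values, and `MIN_CONFIDENCE` are assumptions, since the actual internal threshold is not stated in this issue. The only point is that the same recognition result feeds a transcript that keeps the faint word and a locatable-word list that drops it.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    confidence: float  # percent, 0-100


# Hypothetical recognition result for the tab strip; confidences are made up,
# with "Trending" placed in the 40-50 % range reported above.
WORDS = [
    Word("Top", 91.0),
    Word("Trending", 45.0),  # faint grey-on-white glyph
    Word("Watchlist", 88.0),
]

# Assumed value for illustration only; the real threshold is internal.
MIN_CONFIDENCE = 60.0


def transcript(words):
    """doOCR-like path: join every recognised word, no filtering."""
    return " ".join(w.text for w in words)


def locatable(words, min_confidence=MIN_CONFIDENCE):
    """getWords-like path: keep only words at or above the threshold."""
    return [w.text for w in words if w.confidence >= min_confidence]


print(transcript(WORDS))
print(locatable(WORDS))
```

With these numbers the transcript contains "Trending" while the filtered list does not, which is the symptom reported here.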

Minimal reproducer (script)

from sikuli import *

# Open any web page with low-contrast tab labels (CoinMarketCap, Yahoo Finance, …).
# Pick a region containing such a tab. "Trending" on coinmarketcap.com is a perfect specimen.
region = Region(960, 0, 960, 600)

# Path 1 — findText (Finder.doFindText → OCR.readLines → Tess4J getWords)
match = region.findText("Trending")
print "findText:", match  # → null

# Path 2 — text (OCR.readText → Tess4J doOCR)
full = region.text()
print "text():", full     # → "Top  Trending  Watchlist  Stocks  Most Visited  ..."

# The word IS read by Tesseract. The word is NOT returned by findText.
# Both paths run through the same TextRecognizer.optimize() preprocessing.
# The split is at the Tess4J entry: getWords (confidence-filtered) vs doOCR (transcript).

Operating system

Windows 10 (reproduced on my side — but the bug lives in Finder.doFindText() / getWords() confidence filter, which is platform-independent Java + Tesseract code, so reproduction on SUSE, Ubuntu, macOS is expected; field confirmation welcome)

Java version

openjdk 25.0.2 2026-04-15

OculiX version / artifact

oculixide-3.0.4-complete-win.jar (built from claude/i18n-phase3)

Where does the bug happen?

API (Screen / Region / Pattern / Match), PaddleOCR / Tesseract

Logs / console output

# MCP find_text on the same region
> oculix_find_text("Trending", region={x:960,y:0,width:960,height:600})
{"found":false,"engine":"sikulix-region"}

# MCP read_text_in_region on a tighter scope
> oculix_read_text_in_region(region={x:960,y:240,width:960,height:60})
{"engine":"sikulix-region","text":"Top  Trending Watchlist Stocks Most Visited New  Gainers @ Rehypo >","lines":["Top  Trending Watchlist Stocks Most Visited New  Gainers @ Rehypo >"]}

# Cross-checked with Opus 4.8 on the same screen, same region, same workspace:
# identical "found:false" from find_text, identical successful transcript from read_text_in_region.
# Reproducibility is independent of the model wrapping the calls.

Additional context

The steppe analogy first, the code after

Asking find_text to localise "Trending" rendered in light grey on a white background is like asking the shepherd of the OculiX Pastoral Computing Suite™ to send his dog Aikash after Yak #42 in the steppe fog, three days of horseback from the WiFi router. The YAK MOOD PREDICTOR is 47 % confident "something bovine is grazing over there". The CHAMOIS DETECTION DRONE returns "biological presence detected, ungulate profile likely". But the shepherd refuses to release Aikash until SUSPICIOUS YAK ACTIVITY confirms the identity above 95 %. And he is right: sending Aikash to bite a parasite yak — possibly Yak #41 from neighbour Erlan, or worse, a dairy cow that wandered into the pasture by mistake — is a casus belli on the steppe, not a Ctrl+Z away. read_text_in_region, in this analogy, is only the observation drone: it produces the raw transcript "three ruminants down there, one likely yak, maybe two". Priceless for writing the daily report. Unusable for engaging.

Why this is not strictly a bug — and why an architectural refinement is the right answer anyway

The current behaviour is the correct default for the original SikuliX design assumption: a single OCR engine serving the visual-automation use case where the cost of a false positive (clicking at the wrong coordinates of a phantom word) is higher than the cost of a false negative (returning null and letting the caller retry or fall back). getWords()'s confidence filter is the implementation of that principle.

The opportunity surfaces when you observe that OculiX already knows the source of every image at the entry point (ScreenCapture, ADBScreen, VNCScreen are first-class types) and that the right preprocessing chain depends entirely on the source profile, not on a single one-size-fits-all optimize().

A draft chain dispatched by source profile

ScreenCapture                  (lossless, uniform illumination by construction)
  → Core.normalize(NORM_MINMAX)       (global level stretch → "Trending" grey 150 becomes a dark enough grey)
  → light unsharp
  → Tesseract
  // No CLAHE. No denoise. CLAHE corrects spatial illumination variation,
  // which a rendered UI does not have — it would solve a problem that
  // doesn't exist and pay in tile artefacts.

ADBScreen / VNCScreen          (lossy compression by construction)
  → Imgproc.medianBlur(3)               (kills JPEG ringing/blocking cheaply, edge-preserving)
  → CLAHE(clipLimit, tilesX/tilesY derived from region.h and expectedTextHeight)
  → unsharp
  → Tesseract
  // CLAHE earns its keep here: real local variance from compression artefacts.
  // The denoise comes FIRST, before any amplification step — you never
  // amplify then clean.

Future camera / mobile capture (real spatial illumination variation)
  → bilateralFilter (or fastNlMeansDenoising)
  → CLAHE
  → unsharp
  → Tesseract

Raw Mat with no provenance     (marginal case: Image.load(file))
  → blockiness metric on 8-pixel DCT grid  (cheaper than a full FFT,
                                             targeted at JPEG signature)
  → fallback on the ADB chain or the Screen chain depending on signature
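One way to picture the blockiness probe for the no-provenance case, as a rough standalone Python 3 sketch: it is a spatial-domain proxy (gradient steps on the 8-pixel grid versus steps elsewhere), not a DCT-based metric, it only looks at horizontal steps, and the function name and test images are hypothetical. Where to put the decision threshold is left open.

```python
def blockiness(gray, block=8):
    """Ratio of the mean |horizontal step| across block boundaries
    to the mean |horizontal step| inside blocks.

    gray is a list of rows of grey levels. A ratio near 1 suggests no
    grid-aligned discontinuities; a large ratio suggests block artefacts.
    """
    on_grid, off_grid = [], []
    for row in gray:
        for x in range(len(row) - 1):
            step = abs(row[x + 1] - row[x])
            # The step from column x to x+1 crosses a boundary when x+1
            # is a multiple of the block size.
            if (x + 1) % block == 0:
                on_grid.append(step)
            else:
                off_grid.append(step)
    mean_on = sum(on_grid) / len(on_grid)
    mean_off = sum(off_grid) / len(off_grid)
    return mean_on / (mean_off + 1e-9)  # epsilon avoids division by zero


# Smooth ramp: every step is 1, on and off the grid.
smooth = [[x for x in range(32)] for _ in range(4)]

# Synthetic "blocky" image: small texture inside each 8-pixel block,
# large jump at each block boundary.
blocky = [[(x // 8) * 16 + (x % 2) for x in range(32)] for _ in range(4)]

print(blockiness(smooth))
print(blockiness(blocky))
```

The ramp scores close to 1 and the synthetic blocky image scores far higher, which is the kind of separation the fallback decision would need.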

The point that goes beyond a per-filter improvement: source-profile dispatch picks the entire chain, not a filter within a single chain. On the most common case — ScreenCapture of a rendered UI — CLAHE drops out of the path entirely, all its tuning evaporates with it, and the "Trending" issue is resolved by a one-line Core.normalize() with no tile artefacts and no parameters.
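For readers who want to see what the global level stretch does to the "Trending" case, here is a minimal pure-Python 3 sketch of min-max normalisation. It mirrors the idea behind Core.normalize() with NORM_MINMAX but is not that call; `normalize_minmax` and the 3x4 patch are illustrative assumptions.

```python
def normalize_minmax(gray, lo=0, hi=255):
    """Linearly map the darkest pixel to lo and the brightest to hi."""
    flat = [v for row in gray for v in row]
    vmin, vmax = min(flat), max(flat)
    if vmax == vmin:
        # Flat image: nothing to stretch.
        return [[lo for _ in row] for row in gray]
    scale = (hi - lo) / (vmax - vmin)
    return [[round(lo + (v - vmin) * scale) for v in row] for row in gray]


# Hypothetical patch: grey-150 glyph pixels on a white (255) background.
patch = [
    [255, 255, 255, 255],
    [255, 150, 150, 255],
    [255, 255, 255, 255],
]

stretched = normalize_minmax(patch)
print(stretched[1])
```

In this patch the glyph goes from 105 levels below the background to the full 255. One caveat worth checking during the benchmark: the stretch is global, so if the same region also contains near-black pixels, the minimum is already close to 0 and the faint glyph gains little.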

Public API: zero change

Region.findText(), Region.text(), OCR.readText(), OCR.readLines() keep their signatures. The dispatch happens in the TextRecognizer constructor, deciding the chain based on the concrete type of the source passed in. No migration for existing user scripts.
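A minimal sketch of what type-based dispatch could look like, written as standalone Python 3 for brevity rather than as OculiX Java. The classes are empty stand-ins for the source types named in this issue, the chain step names are labels rather than real calls, and the optional `override` argument is only there to illustrate the explicit-profile question raised in the open points.

```python
from enum import Enum


class SourceProfile(Enum):
    SCREEN = "screen"
    LOSSY_REMOTE = "lossy_remote"
    UNKNOWN = "unknown"


# Stand-ins for the source types named in this issue; bodies are hypothetical.
class ScreenCapture: pass
class ADBScreen: pass
class VNCScreen: pass
class RawImage: pass


# Step names are labels for the draft chains, not real function calls.
CHAINS = {
    SourceProfile.SCREEN: ["normalize_minmax", "light_unsharp", "tesseract"],
    SourceProfile.LOSSY_REMOTE: ["median_blur_3", "clahe", "unsharp", "tesseract"],
    SourceProfile.UNKNOWN: ["blockiness_probe", "screen_or_lossy_chain"],
}


def profile_of(source):
    """Pick a profile from the concrete type of the source."""
    if isinstance(source, ScreenCapture):
        return SourceProfile.SCREEN
    if isinstance(source, (ADBScreen, VNCScreen)):
        return SourceProfile.LOSSY_REMOTE
    return SourceProfile.UNKNOWN


def chain_for(source, override=None):
    """Return the whole chain for a source; an explicit profile wins."""
    profile = override if override is not None else profile_of(source)
    return CHAINS[profile]


print(chain_for(ScreenCapture()))
print(chain_for(RawImage(), override=SourceProfile.LOSSY_REMOTE))
```

The caller never names a filter: the source type selects the entire chain, and CLAHE simply does not appear on the ScreenCapture path.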

Three points open to discussion before I start

  1. Where does SourceProfile live? An internal enum dispatched implicitly from the source's concrete type is the cleanest. But it makes it impossible for a user to override the chain on a stubborn case (someone who loads an Android screenshot as a generic Image and would benefit from the ADB chain). An optional, explicit OCR.Options.sourceProfile(SourceProfile) override preserves that ability without poisoning the default. Worth the extra API surface?

  2. Does ADBScreen really need its own chain? Argument for: the ADB pipeline is lossy by default (JPEG mode for performance). Argument against: a user can configure ADB in lossless PNG mode, in which case the ADB chain over-processes. Per-frame detection is expensive (see point 5 of the perf discussion below). Explicit flag on the ADBScreen constructor, or autodetection from the first frame?

  3. Is medianBlur(3) always the right pre-CLAHE step on a lossy source? It excels on JPEG ringing/blocking. But certain VNC encodings (Tight, ZRLE, Hextile, RRE) produce different artefacts. I'd rather default to a fast median than a costly bilateral, but a cross-encoder mini-bench feels warranted before locking the chain.
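As background for point 3, a small pure-Python 3 sketch of a 3x3 median with replicated borders, showing the property the draft chain relies on: an isolated outlier disappears while a clean step edge is untouched. This is an illustration of the filter's behaviour on a synthetic image, not a measurement on real JPEG or VNC artefacts; `median_blur3` and the test images are hypothetical.

```python
from statistics import median


def median_blur3(gray):
    """3x3 median filter; pixels outside the image repeat the nearest edge pixel."""
    h, w = len(gray), len(gray[0])

    def at(y, x):
        # Clamp coordinates so the border is replicated.
        return gray[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    return [
        [
            int(median(at(y + dy, x + dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
            for x in range(w)
        ]
        for y in range(h)
    ]


# Clean vertical step edge: three dark columns, three bright columns.
clean = [[50, 50, 50, 200, 200, 200] for _ in range(5)]

# Same image with one impulse-like outlier next to the edge.
noisy = [row[:] for row in clean]
noisy[2][1] = 255

filtered = median_blur3(noisy)
print(filtered[2])
```

On this input the filtered image equals the clean one. Whether the same holds for the artefacts of each VNC encoding is exactly what the mini-bench would have to show.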

A few opinions I've already cycled through (and discarded), for context

  • A global confidence threshold knob exposed on OCR.Options — solves the symptom by lowering the bar globally, lets false positives back in. Patches the threshold, not the cause.
  • Auto-retry with stepped confidence (70 → 50 → 30) — three Tesseract calls instead of one, masks the cause in latency. Wrong layer for the fix.
  • Injecting the searched word into Tesseract's user_words dictionary — elegant in theory ("boost the confidence on the word I know I'm looking for"), but rebuilds the Tesseract config per call and is incompatible with findAllText use cases where the caller doesn't have a specific target.
  • A CLAHE pre-processing added unconditionally in optimize() — my first proposal. Solves a problem that doesn't exist on the most frequent case, pays in tile artefacts, over-processes screenshots. Wrong tool by default.

The source-profile dispatched chain is the one shape that survives every counter-argument I've thrown at it.

Pinging the people whose hand on this matters most

@RaiMan — this is the kind of refinement that lives in direct lineage of the optimize() you originally designed. The "single engine, one preprocessing path" was the right call for the design pressures of the time. The question of whether to dispatch by source profile touches on your original architectural call about OCR being source-agnostic. I will not push anything without your read on it.

@adriancostin6 — if you want to benchmark in real conditions, I can rough out the four chains in parallel inside a measurement harness (Trending on web, ADB device capture, VNC remote desktop) and you supply the numbers. Cross-OS measurement with confidence intervals on Trending recall + clickable-coordinate accuracy is exactly your terrain, and would settle the open questions above more reliably than any opinion I have.

The gecko likes it when a bug, on careful look, turns out to be a refactor opportunity — which is rare, and which is what makes the craft interesting.

🦎

Metadata

Assignees: No one assigned
Labels: bug
Status: 📋 Backlog