Summary
Region.findText() returns null on low-contrast UI text (e.g. grey-on-white inactive tab labels) that Region.text() transcribes correctly. Both paths share the same TextRecognizer.optimize() pre-processing, but they diverge at the Tess4J call: getWords() filters by confidence to avoid false positives — and the legitimately-low confidence on faint glyphs gets rejected — while doOCR() keeps everything in the transcript. It is not, strictly, a bug. It is a precision-vs-recall tradeoff that calls for an architectural refinement, not a patch.
Steps to reproduce
1. Open https://coinmarketcap.com/ (or any modern UI with grey-on-white inactive tabs)
2. From a Jython script, run:
region = Region(960, 0, 960, 600)
m = region.findText("Trending") # → null
tx = region.text() # → contains "Trending" in the transcript
3. Observe: the tab is clearly visible to a human and to read_text_in_region's transcription, but find_text refuses to localise it.
Expected behavior
Region.findText("Trending") should return a Match with bounding box coordinates for any text a human can read in the region — including low-contrast glyphs that Region.text() already transcribes correctly.
Actual behavior
findText returns null. The mismatch lives strictly between two Tess4J entry points called downstream of the identical optimize():
| | Region.text() | Region.findText() |
|---|---|---|
| Inner path | OCR.readText() | Finder.doFindText() → OCR.readLines() |
| Tess4J call | doOCR(): full transcript | getWords(): bbox + per-word confidence |
| Confidence filter | none, keeps everything | rejects below internal threshold |
| PSM | SINGLE_BLOCK (6) forced | AUTO (3) default |
The grey-on-white "Trending" glyph clears Tesseract LSTM's character recognition (it appears in the transcript), but its computed per-word confidence is in the 40–50 % range. getWords() discards it as a precaution against returning bounding boxes you would not want to click on — which is the right default for visual automation, where a false positive at the wrong coordinates costs more than a false negative.
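To make the divergence concrete, here is a minimal sketch in plain Python that does not call Tess4J. The `Word` record, the sample confidences and the `MIN_CONFIDENCE` value are all assumptions made for illustration; the real internal threshold is not stated anywhere in this issue.

```python
from collections import namedtuple

# Hypothetical stand-in for one OCR word result: text, confidence in percent, bounding box.
Word = namedtuple("Word", "text confidence bbox")

# Illustrative values only; 45.0 is picked from the 40-50 % range reported above.
WORDS = [
    Word("Top", 93.0, (970, 250, 40, 20)),
    Word("Trending", 45.0, (1020, 250, 90, 20)),
    Word("Watchlist", 91.0, (1120, 250, 95, 20)),
]

MIN_CONFIDENCE = 60.0  # assumed value, NOT the real internal threshold


def transcript(words):
    """doOCR-like path: join every recognised word and ignore confidence."""
    return " ".join(w.text for w in words)


def find_word(words, target, min_confidence=MIN_CONFIDENCE):
    """getWords-like path: only words at or above the threshold can become a match."""
    for w in words:
        if w.text == target and w.confidence >= min_confidence:
            return w.bbox
    return None


print(transcript(WORDS))               # the transcript still contains "Trending"
print(find_word(WORDS, "Trending"))    # filtered out by the confidence check
print(find_word(WORDS, "Watchlist"))   # high-confidence word keeps its bounding box
```

The same word list feeds both functions, which mirrors the shared optimize() step: the only difference is whether confidence is consulted before a result is returned.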
Minimal reproducer (script)
from sikuli import *
# Open any web page with low-contrast tab labels (CoinMarketCap, Yahoo Finance, …).
# Pick a region containing such a tab. "Trending" on coinmarketcap.com is a perfect specimen.
region = Region(960, 0, 960, 600)
# Path 1 — findText (Finder.doFindText → OCR.readLines → Tess4J getWords)
match = region.findText("Trending")
print "findText:", match # → null
# Path 2 — text (OCR.readText → Tess4J doOCR)
full = region.text()
print "text():", full # → "Top Trending Watchlist Stocks Most Visited ..."
# The word IS read by Tesseract. The word is NOT returned by findText.
# Both paths run through the same TextRecognizer.optimize() preprocessing.
# The split is at the Tess4J entry: getWords (confidence-filtered) vs doOCR (transcript).
Operating system
Windows 10 (reproduced on my side — but the bug lives in Finder.doFindText() / getWords() confidence filter, which is platform-independent Java + Tesseract code, so reproduction on SUSE, Ubuntu, macOS is expected; field confirmation welcome)
Java version
openjdk 25.0.2 2026-04-15
OculiX version / artifact
oculixide-3.0.4-complete-win.jar (built from claude/i18n-phase3)
Where does the bug happen?
API (Screen / Region / Pattern / Match), PaddleOCR / Tesseract
Logs / console output
# MCP find_text on the same region
> oculix_find_text("Trending", region={x:960,y:0,width:960,height:600})
{"found":false,"engine":"sikulix-region"}
# MCP read_text_in_region on a tighter scope
> oculix_read_text_in_region(region={x:960,y:240,width:960,height:60})
{"engine":"sikulix-region","text":"Top Trending Watchlist Stocks Most Visited New Gainers @ Rehypo >","lines":["Top Trending Watchlist Stocks Most Visited New Gainers @ Rehypo >"]}
# Cross-checked with Opus 4.8 on the same screen, same region, same workspace:
# identical "found:false" from find_text, identical successful transcript from read_text_in_region.
# Reproducibility is independent of the model wrapping the calls.
Additional context
The steppe analogy first, the code after
Asking find_text to localise "Trending" rendered in light grey on a white background is like asking the shepherd of the OculiX Pastoral Computing Suite™ to send his dog Aikash after Yak #42 in the steppe fog, three days of horseback from the WiFi router. The YAK MOOD PREDICTOR is 47 % confident "something bovine is grazing over there". The CHAMOIS DETECTION DRONE returns "biological presence detected, ungulate profile likely". But the shepherd refuses to release Aikash until SUSPICIOUS YAK ACTIVITY confirms the identity above 95 %. And he is right: sending Aikash to bite a parasite yak — possibly Yak #41 from neighbour Erlan, or worse, a dairy cow that wandered into the pasture by mistake — is a casus belli on the steppe, not a Ctrl+Z away. read_text_in_region, in this analogy, is only the observation drone: it produces the raw transcript "three ruminants down there, one likely yak, maybe two". Priceless for writing the daily report. Unusable for engaging.
Why this is not strictly a bug — and why an architectural refinement is the right answer anyway
The current behaviour is the correct default for the original SikuliX design assumption: a single OCR engine serving the visual-automation use case where the cost of a false positive (clicking at the wrong coordinates of a phantom word) is higher than the cost of a false negative (returning null and letting the caller retry or fall back). getWords()'s confidence filter is the implementation of that principle.
The opportunity surfaces when you observe that OculiX already knows the source of every image at the entry point — ScreenCapture, ADBScreen, VNCScreen are first-class types — and that the right preprocessing chain depends entirely on the source profile, not on a single one-size-fits-all optimize().
A draft chain dispatched by source profile
ScreenCapture (lossless, uniform illumination by construction)
→ Core.normalize(NORM_MINMAX) (global level stretch → "Trending" grey 150 becomes a dark enough grey)
→ light unsharp
→ Tesseract
// No CLAHE. No denoise. CLAHE corrects spatial illumination variation,
// which a rendered UI does not have — it would solve a problem that
// doesn't exist and pay in tile artefacts.
ADBScreen / VNCScreen (lossy compression by construction)
→ Imgproc.medianBlur(3) (kills JPEG ringing/blocking cheaply, edge-preserving)
→ CLAHE(clipLimit, tilesX/tilesY derived from region.h and expectedTextHeight)
→ unsharp
→ Tesseract
// CLAHE earns its keep here: real local variance from compression artefacts.
// The denoise comes FIRST, before any amplification step — you never
// amplify then clean.
Future camera / mobile capture (real spatial illumination variation)
→ bilateralFilter (or fastNlMeansDenoising)
→ CLAHE
→ unsharp
→ Tesseract
Raw Mat with no provenance (marginal case: Image.load(file))
→ blockiness metric on 8-pixel DCT grid (cheaper than a full FFT,
targeted at JPEG signature)
→ fallback on the ADB chain or the Screen chain depending on signature
The point that goes beyond a per-filter improvement: source-profile dispatch picks the entire chain, not a filter within a single chain. On the most common case — ScreenCapture of a rendered UI — CLAHE drops out of the path entirely, all its tuning evaporates with it, and the "Trending" issue is resolved by a one-line Core.normalize() with no tile artefacts and no parameters.
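The level stretch can be sketched in plain Python to show what it does to a grey-150 glyph. This is only an illustration of min-max normalisation on a list of grey levels, not the OpenCV call itself; `normalize_minmax` and the sample scanline are hypothetical, and the sketch assumes the faint glyph is the darkest thing in the region.

```python
def normalize_minmax(pixels, new_min=0, new_max=255):
    """Linearly stretch grey levels: darkest maps to new_min, brightest to new_max."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return list(pixels)  # flat region, nothing to stretch
    scale = (new_max - new_min) / float(hi - lo)
    return [int(round(new_min + (p - lo) * scale)) for p in pixels]


# One scanline: white background (255), grey strokes (150), anti-aliased edges (200).
faint_row = [255, 255, 200, 150, 150, 200, 255, 255]
print(normalize_minmax(faint_row))

# Same strokes, but the region also contains a pure black pixel (0).
mixed_row = [0, 255, 200, 150, 150, 200, 255, 255]
print(normalize_minmax(mixed_row))
```

In the first scanline the strokes move from 150 to 0. In the second the values come back unchanged, because the range already spans 0 to 255. A global stretch therefore helps most when the region is cropped tightly enough that the faint text defines the dark end of the histogram.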
Public API: zero change
Region.findText(), Region.text(), OCR.readText(), OCR.readLines() keep their signatures. The dispatch happens in the TextRecognizer constructor, deciding the chain based on the concrete type of the source passed in. No migration for existing user scripts.
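A rough sketch of what "dispatch on the concrete type" could look like, in plain Python so it runs anywhere. The classes below are empty stand-ins that only borrow the names of the real types; `RawImage`, the profile strings, the step names and the `override` and `looks_blocky` parameters are hypothetical and exist only for this illustration.

```python
class ScreenCapture(object):
    """Stand-in for a lossless desktop capture."""


class ADBScreen(object):
    """Stand-in for an Android capture (assumed lossy here)."""


class VNCScreen(object):
    """Stand-in for a remote desktop capture (assumed lossy here)."""


class RawImage(object):
    """Hypothetical stand-in for an image loaded with no provenance."""


# Step names mirror the draft chains above; they are labels, not real function calls.
CHAINS = {
    "SCREEN": ["normalize_minmax", "unsharp_light", "tesseract"],
    "LOSSY": ["median_blur_3", "clahe", "unsharp", "tesseract"],
    "CAMERA": ["bilateral_filter", "clahe", "unsharp", "tesseract"],
}


def profile_for(source, override=None):
    """Pick a profile from the source's concrete type unless the caller overrides it."""
    if override is not None:
        return override
    if isinstance(source, ScreenCapture):
        return "SCREEN"
    if isinstance(source, (ADBScreen, VNCScreen)):
        return "LOSSY"
    return "UNKNOWN"


def chain_for(source, override=None, looks_blocky=False):
    """Return the whole preprocessing chain, not a single filter."""
    profile = profile_for(source, override)
    if profile == "UNKNOWN":
        # No provenance: fall back on a compression-signature hint.
        profile = "LOSSY" if looks_blocky else "SCREEN"
    return CHAINS[profile]


print(chain_for(ScreenCapture()))
print(chain_for(ADBScreen()))
print(chain_for(RawImage(), override="LOSSY"))
```

Caller code never names a chain unless it passes the override, which is the property that keeps existing scripts untouched.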
Three points open to discussion before I start
- Where does SourceProfile live? Internal enum dispatched implicitly from the source's concrete type is the cleanest. But it makes it impossible for a user to override the chain on a stubborn case (someone who loads an Android screenshot as a generic Image and would benefit from the ADB chain). An optional explicit OCR.Options.sourceProfile(SourceProfile) opt-out preserves the override ability without poisoning the default. Worth the surface?
- Does ADBScreen really need its own chain? Argument for: the ADB pipeline is lossy by default (JPEG mode for performance). Argument against: a user can configure ADB in lossless PNG mode, in which case the ADB chain over-processes. Detection per-frame is expensive (see point 5 of the perf discussion below). Explicit flag on ADBScreen constructor, or autodetection from the first frame?
- Is medianBlur(3) always the right pre-CLAHE step on a lossy source? It excels on JPEG ringing/blocking. But certain VNC encoders (Tight, ZRLE, Hextile, RRE) produce different artefacts. I'd rather a fast median than a costly bilateral by default, but a cross-encoder mini-bench feels warranted before locking the chain.
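For the first-frame autodetection option, and for the raw-Mat fallback in the draft chain, the blockiness check could be as small as the sketch below. It is one possible heuristic written in plain Python under my own assumptions: the function name, the ratio definition and the test images are hypothetical, it only scans horizontal steps (a fuller version would scan columns too), and it assumes the 8-pixel grid is aligned with the left edge of the image.

```python
def blockiness(gray, block=8):
    """Mean grey-level step across block boundaries divided by the mean step elsewhere.

    `gray` is a list of rows of grey levels. A value near 1.0 means the block
    boundaries look like any other column; larger values hint at JPEG-style blocking.
    """
    boundary, interior = [], []
    for row in gray:
        for x in range(1, len(row)):
            step = abs(row[x] - row[x - 1])
            if x % block == 0:
                boundary.append(step)   # step that crosses a block edge
            else:
                interior.append(step)   # step inside a block
    if not boundary or not interior:
        return 1.0  # image too narrow to judge, treat as neutral
    mean_boundary = sum(boundary) / float(len(boundary))
    mean_interior = sum(interior) / float(len(interior))
    if mean_interior == 0:
        return float("inf") if mean_boundary > 0 else 1.0
    return mean_boundary / mean_interior


# Smooth ramp: every step is identical, so boundaries do not stand out.
smooth = [[x for x in range(32)] for _ in range(4)]

# Synthetic blocking: small texture inside each 8-pixel block, a jump at each edge.
blocky = [[(x // 8) * 20 + (x % 2) for x in range(32)] for _ in range(4)]

print(blockiness(smooth))
print(blockiness(blocky))
```

Where to put the decision threshold on that ratio is exactly the kind of number that should come out of a benchmark rather than out of this sketch.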
A few opinions I've already cycled through (and discarded), for context
- A global confidence threshold knob exposed on OCR.Options: solves the symptom by lowering the bar globally, lets false positives back in. Patches the threshold, not the cause.
- Auto-retry with stepped confidence (70 → 50 → 30): three Tesseract calls instead of one, masks the cause in latency. Wrong layer for the fix.
- Injecting the searched word into Tesseract's user_words dictionary: elegant in theory ("boost the confidence on the word I know I'm looking for"), but rebuilds the Tesseract config per call and is incompatible with findAllText use cases where the caller doesn't have a specific target.
- A CLAHE pre-processing added unconditionally in optimize(): my first proposal. Solves a problem that doesn't exist on the most frequent case, pays in tile artefacts, over-processes screenshots. Wrong tool by default.
The source-profile dispatched chain is the one shape that survives every counter-argument I've thrown at it.
Pinging the people whose hand on this matters most
@RaiMan — this is the kind of refinement that lives in direct lineage of the optimize() you originally designed. The "single engine, one preprocessing path" was the right call for the design pressures of the time. The question of whether to dispatch by source profile touches on your original architectural call about OCR being source-agnostic. I will not push anything without your read on it.
@adriancostin6 — if you want to benchmark in real conditions, I can rough out the four chains in parallel inside a measurement harness (Trending on web, ADB device capture, VNC remote desktop) and you supply the numbers. Cross-OS measurement with confidence intervals on Trending recall + clickable-coordinate accuracy is exactly your terrain, and would settle the open questions above more reliably than any opinion I have.
The gecko likes it when a bug, on careful look, turns out to be a refactor opportunity — which is rare, and which is what makes the craft interesting.
🦎