Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions .jules/bolt.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,7 @@
## 2026-02-20 - [Optimized Telemetry Redaction and Sanitization]
**Learning:** Sequential `re.sub` calls are faster than combined regex callbacks for small pattern sets, but the biggest performance win comes from early-exit fast-paths (e.g., checking for `\x1b` or secret keywords) and proper ordering of truncation vs. redaction for large strings.
**Action:** Always implement fast-path guards for expensive string processing and ensure that heavy operations (like regex) are performed on the smallest possible data subset (e.g., after truncation).

## 2024-05-12 - [Optimized Validation Pipeline]
**Learning:** Keyword-based fast-path checks for secret detection yield ~6.5x speedup for clean text by skipping regex engine overhead. Additionally, "".join(text.split()) is consistently faster than re.sub(r"\s+", "", text) for whitespace removal in Python.
**Action:** Always implement string-based early-exit guards for heavy regex operations in data-intensive loops.
4 changes: 0 additions & 4 deletions heidi_engine/telemetry.py
Original file line number Diff line number Diff line change
Expand Up @@ -732,10 +732,6 @@ def get_state(run_id: Optional[str] = None) -> Dict[str, Any]:
"usage": get_default_usage(),
}

# BOLT OPTIMIZATION: Check thread-safe state cache
cached = _state_cache.get(target_run_id, state_file)
if cached:
return cached

try:
with open(state_file) as f:
Expand Down
22 changes: 18 additions & 4 deletions scripts/02_validate_clean.py
Original file line number Diff line number Diff line change
Expand Up @@ -85,6 +85,15 @@
(r'(?i)pwd\s*[:=]\s*["\'][^"\']{8,}["\']', "password"),
]

# BOLT OPTIMIZATION: Pre-compile regex patterns for performance
_COMPILED_SECRET_PATTERNS = [(re.compile(p), t) for p, t in SECRET_PATTERNS]

# BOLT OPTIMIZATION: Fast-path indicators to skip expensive regex on clean text
_SECRET_INDICATORS = [
"api", "key", "token", "secret", "bearer", "akia", "private key", "openssh",
"mongodb", "postgres", "mysql", "redis", "ghp_", "glpat-", "sk-", "password", "pwd"
]
Comment on lines +91 to +95
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

The current fast-path indicators list is missing a check for the high_entropy pattern (defined at line 82), which does not use keywords. This creates a security regression where long unlabeled secrets (like raw base64 tokens) will be skipped by the validation pipeline if they don't happen to be near a keyword like 'api' or 'key'. Additionally, using a list of strings for any() checks in a loop is less efficient than a single pre-compiled regex.

Suggested change
# BOLT OPTIMIZATION: Fast-path indicators to skip expensive regex on clean text
_SECRET_INDICATORS = [
"api", "key", "token", "secret", "bearer", "akia", "private key", "openssh",
"mongodb", "postgres", "mysql", "redis", "ghp_", "glpat-", "sk-", "password", "pwd"
]
# BOLT OPTIMIZATION: Fast-path regex to skip expensive checks on clean text
_SECRET_INDICATORS_RE = re.compile(
r"api|key|token|secret|bearer|akia|private\s+key|openssh|mongodb|postgres|mysql|redis|ghp_|glpat-|sk-|password|pwd|[a-zA-Z0-9_+/]{40,}",
re.IGNORECASE
)


# Fields to check for secrets
# TUNABLE: Add/remove fields based on your data structure
SECRET_CHECK_FIELDS = ["instruction", "input", "output", "response", "completion"]
Expand Down Expand Up @@ -207,9 +216,14 @@ def detect_secrets(sample: Dict[str, Any]) -> Tuple[bool, List[str]]:
continue

text = str(sample[field])
lower_text = text.lower()

# BOLT OPTIMIZATION: Fast-path check to skip regex if no indicators found
if not any(indicator in lower_text for indicator in _SECRET_INDICATORS):
continue
Comment on lines +219 to +223
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using text.lower() and any() inside a loop over dataset fields is suboptimal as it creates string copies and performs multiple substring searches in Python. Replacing this with a single search() call on a pre-compiled case-insensitive regex is significantly faster and more memory-efficient.

Suggested change
lower_text = text.lower()
# BOLT OPTIMIZATION: Fast-path check to skip regex if no indicators found
if not any(indicator in lower_text for indicator in _SECRET_INDICATORS):
continue
# BOLT OPTIMIZATION: Fast-path check to skip regex if no indicators found
if not _SECRET_INDICATORS_RE.search(text):
continue


for pattern, secret_type in SECRET_PATTERNS:
if re.search(pattern, text):
for pattern, secret_type in _COMPILED_SECRET_PATTERNS:
if pattern.search(text):
found_secrets.append(f"{field}:{secret_type}")

return len(found_secrets) > 0, found_secrets
Expand Down Expand Up @@ -275,8 +289,8 @@ def fuzzy_hash(sample: Dict[str, Any], n: int = 5) -> str:
- n=5 is a good balance for code data
"""
text = (sample.get("instruction", "") + sample.get("output", "")).lower()
# Remove whitespace for more robust matching
text = re.sub(r"\s+", "", text)
# BOLT OPTIMIZATION: Faster whitespace removal using split/join
text = "".join(text.split())

if len(text) < n:
return text
Expand Down