feat: extending cache activation runner options for tokenization, batching, and shuffling #612
PR Review: Harmonize cache activation runner options with LLM runner

Summary

This PR adds three new configuration options (`disable_concat_sequences`, `sequence_separator_token`, and `shuffle_across_sequences`) to the cache activations runner.
Code Quality & Best Practices ✅

Strengths:
Minor Suggestions:
Potential Bugs & Issues
Performance Considerations 📊

Shuffle Performance:
Memory Efficiency:
Security Concerns 🔒

Low Risk Overall:
Test Coverage 🧪

Excellent coverage for the shuffle fix.

Missing coverage: ❌ No error handling tests for the shuffle operation (e.g., what happens if disk is full during shuffle?)

Recommendation: Add a simple integration test to verify the new parameters work end-to-end.
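One way to exercise that failure path without touching real datasets: the `swap_in_shuffled` helper below is hypothetical (a stand-in for the runner's shuffle-to-disk step, not sae_lens code), but it shows the shape such a test could take: inject a failure mid-swap and assert the original data survives.

```python
import shutil
from pathlib import Path
from tempfile import TemporaryDirectory

def swap_in_shuffled(final_path: Path, shuffled_path: Path) -> None:
    """Hypothetical stand-in for the runner's shuffle-to-disk swap:
    replace final_path with shuffled_path, restoring the original on failure."""
    backup = final_path.with_name(final_path.name + ".backup")
    final_path.rename(backup)
    try:
        shuffled_path.rename(final_path)
    except Exception:
        backup.rename(final_path)  # rollback: put the original data back
        raise
    shutil.rmtree(backup)

with TemporaryDirectory() as tmp:
    final = Path(tmp) / "activations"
    final.mkdir()
    (final / "data.arrow").write_text("original")
    missing = Path(tmp) / "does_not_exist"  # simulate a failed shuffle write
    try:
        swap_in_shuffled(final, missing)
    except FileNotFoundError:
        pass
    # The failed swap must leave the original dataset intact
    survived = (final / "data.arrow").read_text()
```

A real test would call the runner's actual shuffle path instead of the toy helper, but the assertion is the same: after a simulated failure, the on-disk dataset is unchanged.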
Overall Assessment

Score: 8.5/10

This is a well-implemented PR that addresses real issues:
Recommendations before merge:
The bug fix alone makes this PR valuable, and the new parameters are a nice addition for consistency. With the error handling improvement for the shuffle operations, this would be ready to merge!
Code Review

Thank you for this PR! The changes look solid overall. Here's my detailed feedback:

✅ Positive Aspects
🐛 Potential Issues

Critical: Missing
Codecov Report

❌ Patch coverage is
Code Review for PR #612

Summary

This PR adds two new configuration options (`disable_concat_sequences` and `sequence_separator_token`) to harmonize the cache activations runner with the LLM training runner, and fixes a bug where shuffled datasets were not being saved to disk. Overall, the implementation is solid with good test coverage.

Positive Aspects
Issues and Concerns

1. File System Race Condition Risk (Medium Priority)

Location: cache_activations_runner.py:294-309

The shuffle-to-disk implementation has a potential issue with atomicity. If the process crashes between deleting the old files and moving the new files, the dataset could be left in a corrupted state.

Recommendation: Use atomic rename operations to ensure the original data remains intact until the new version is ready. Consider renaming the old directory to a backup, then renaming the new directory to the final location, then removing the backup.

2. Missing Test for New Config Parameters (Medium Priority)

While the shuffle fix is well-tested, there are no tests that verify `disable_concat_sequences` and `sequence_separator_token` actually work correctly when passed to the cache runner.

Recommendation: Add a test that verifies these parameters are properly passed to the ActivationsStore.

3. Performance Implications

The shuffle-to-disk operation now involves saving the entire dataset to a temp location, deleting the original files, moving all files from temp to the final location, and reloading the dataset. For large datasets (multi-GB), this could be slow and disk-intensive.

Recommendation: Add a log message about the shuffle-to-disk operation so users understand why it might take time for large datasets.

Security Concerns

No significant security issues identified. The file operations use proper path handling via Path objects.

Best Practices and Style

Good:
Minor:
Recommendations Summary

Must Address:
Should Consider:
Overall Assessment

This is a well-implemented PR that addresses a real bug and improves API consistency. The main concern is the file operation safety in the shuffle-to-disk logic, which should be addressed before merging to prevent potential data corruption in edge cases.

Estimated Risk: Medium (due to file system operations)
Code Review - PR #612

Summary

This PR adds

Positive Highlights

1. Excellent Test Coverage

The PR includes comprehensive tests covering:
The tests are well-structured and use appropriate assertions to verify behavior.

2. Bug Fix - Shuffled Datasets Not Saved

The fix for the shuffling bug (lines 294-311 in cache_activations_runner.py)

3. Good Documentation
Issues and Concerns

1. Critical: Potential Data Loss in Shuffle Logic
Code Review - PR #612

Thanks for this PR! Overall this is a solid implementation that adds useful features. Here's my detailed review:

✅ Strengths
🐛 Potential Issues

1. File Handling Race Condition Risk (sae_lens/cache_activations_runner.py:296-311)

The shuffling save-to-disk implementation has a potential issue:

```python
shuffled_path = final_cached_activation_path / ".shuffled"
dataset.save_to_disk(str(shuffled_path))
# Remove old unshuffled data and replace with shuffled
for item in final_cached_activation_path.iterdir():
    ...  # delete files
for item in shuffled_path.iterdir():
    shutil.move(str(item), str(final_cached_activation_path / item.name))
shuffled_path.rmdir()
```

Issues:
Recommendation: Use a more robust approach:

```python
import shutil
import tempfile
from pathlib import Path

# Create temp directory with unique name
with tempfile.TemporaryDirectory(dir=final_cached_activation_path.parent) as tmp_dir:
    shuffled_path = Path(tmp_dir) / "shuffled"
    dataset.save_to_disk(str(shuffled_path))

    # Atomic rename operation
    backup_path = final_cached_activation_path.parent / f"{final_cached_activation_path.name}.backup"
    final_cached_activation_path.rename(backup_path)
    try:
        shuffled_path.rename(final_cached_activation_path)
        shutil.rmtree(backup_path)
    except Exception:
        backup_path.rename(final_cached_activation_path)  # rollback: restore the original
        raise
```

2. Generator State Issue (sae_lens/cache_activations_runner.py:92, 354)

The generator is created once and reused:

```python
self._shuffle_generator = torch.Generator().manual_seed(self.cfg.seed)
```

Issue: Each call to
Recommendation: Either:
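The state concern above can be illustrated with Python's stdlib `random` (the same principle applies to a reused `torch.Generator`): a single seeded generator advances its internal state on every shuffle, so successive calls produce different orders, while re-seeding before each call restores determinism.

```python
import random

def shuffled_copy(gen: random.Random, items) -> list:
    """Shuffle a copy of items using the given generator."""
    out = list(items)
    gen.shuffle(out)
    return out

# Reusing one seeded generator: state advances between calls
gen = random.Random(42)
first = shuffled_copy(gen, range(10))
second = shuffled_copy(gen, range(10))

# Re-seeding before each call: fully reproducible
a = shuffled_copy(random.Random(42), range(10))
b = shuffled_copy(random.Random(42), range(10))
```

Whether the reused-generator behavior is a bug depends on intent: advancing state gives different (but seed-reproducible overall) shuffles per call, while re-seeding gives identical shuffles per call.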
3. Memory Usage Consideration (sae_lens/cache_activations_runner.py:293-311)

The shuffle implementation loads the entire dataset, shuffles it, saves it, deletes the old files, and reloads:

```python
dataset = dataset.shuffle(seed=self.cfg.seed)  # loads to memory
dataset.save_to_disk(str(shuffled_path))
# ... file operations
dataset = Dataset.load_from_disk(str(final_cached_activation_path))  # loads again
```

Issue: This could be memory-intensive for large cached activation datasets.

Recommendation: Consider using

4. Missing Import (sae_lens/cache_activations_runner.py:3)
🔧 Code Quality Suggestions

1. Magic String (sae_lens/cache_activations_runner.py:296)

```python
SHUFFLE_TEMP_DIR = ".shuffled"
```

2. Typo Fix Already Applied ✅

Good catch fixing "includeing" → "including" in the comment at line 229!

3. Shuffle Across Sequences Semantics

The current implementation in
Question: Is this the intended behavior? Should

Recommendation: Clarify in the config docstring exactly when/where shuffling occurs with `shuffle_across_sequences`.

4. Missing Edge Case Test

Consider adding a test for what happens when

🔒 Security

No security concerns identified. The code doesn't introduce any injection vulnerabilities or unsafe file operations beyond the atomicity concern mentioned above.

⚡ Performance
📝 Documentation Suggestions
Summary

Recommended Actions Before Merge:
The feature itself is valuable and the tests are comprehensive. With the above fixes, this will be a solid addition to the codebase!
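To make the shuffle-across-sequences distinction discussed above concrete, here is a toy sketch (names illustrative, not the PR's implementation): per-sequence shuffling only reorders activations inside their own sequence, while across-sequence shuffling pools every activation before shuffling.

```python
import random

# Toy activations: 3 sequences of 4 "activation" ids each
seqs = [[f"s{i}a{j}" for j in range(4)] for i in range(3)]

def shuffle_within(sequences, seed=0):
    """Shuffle each sequence independently; activations never cross sequences."""
    rng = random.Random(seed)
    out = []
    for seq in sequences:
        s = list(seq)
        rng.shuffle(s)
        out.append(s)
    return out

def shuffle_across(sequences, seed=0):
    """Flatten all sequences into one pool, then shuffle the whole pool."""
    rng = random.Random(seed)
    flat = [a for seq in sequences for a in seq]
    rng.shuffle(flat)
    return flat

within = shuffle_within(seqs)
across = shuffle_across(seqs)
```

Within-sequence shuffling preserves each sequence's own set of activations; across-sequence shuffling only preserves the global multiset, which is the stronger mixing the new option is meant to provide.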
Code Review for PR #612

Summary

This PR adds important tokenization and shuffling options to the cache activations runner, bringing it in line with the standard LLM training runner. The implementation is generally solid with good test coverage, but there are a few areas that could be improved.

✅ Strengths

Code Quality
Test Coverage
Bug Fix
Code Review - PR #612

Thank you for this comprehensive PR adding tokenization options and shuffling improvements to the cache activations runner! Overall, the implementation is solid with excellent test coverage. Here are my findings:

✅ Strengths
🐛 Potential Bugs & Issues

Critical: Memory Usage in
This PR adds `disable_concat_sequences` and `sequence_separator_token` options to the cache activations runner, so it's possible to cache activations with the same options that can be used for the standard LLM training runner. In addition, this PR adds a `shuffle_across_sequences` option to allow the runner to fully shuffle activations regardless of which sequence they come from.

This PR also fixes a bug where the dataset saved to disk would not be shuffled even if `shuffle=True` is set; only the dataset that is uploaded to Hugging Face and returned by the runner would be shuffled.
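As a rough illustration of what the two tokenization options control (the function below is a toy sketch, not the sae_lens implementation): with concatenation enabled, token sequences are joined into one stream, optionally with `sequence_separator_token` inserted between them; with `disable_concat_sequences`, each sequence is kept separate.

```python
def build_token_stream(sequences, disable_concat_sequences=False,
                       sequence_separator_token=None):
    """Toy illustration of the two options (not the sae_lens implementation)."""
    if disable_concat_sequences:
        return sequences  # each tokenized sequence kept separate
    stream = []
    for i, seq in enumerate(sequences):
        if i > 0 and sequence_separator_token is not None:
            stream.append(sequence_separator_token)  # separator between sequences
        stream.extend(seq)
    return stream

toks = [[1, 2, 3], [4, 5]]
concat = build_token_stream(toks)
with_sep = build_token_stream(toks, sequence_separator_token=0)
separate = build_token_stream(toks, disable_concat_sequences=True)
```

The point of harmonizing these options is that activations can now be cached under the same tokenization regime the LLM training runner uses, so cached and live activations line up.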