Description
User Story
As a developer using mdextractor, I want code block extraction to preserve original leading/trailing whitespace (except newlines) so that valid code structures with intentional indentation or alignment aren’t altered.
Background
The current implementation of extract_md_blocks in mdextractor/__init__.py calls block.strip(), which removes all leading and trailing whitespace, including spaces and tabs. This is a problem for code blocks where whitespace is syntactically significant (e.g., Python indentation, YAML formatting). For example, a block whose first line is an indented def example(): loses that line's leading spaces while the following lines keep theirs, which makes the code invalid. The regex pattern r"```(?:\w+\s+)?(.*?)```" already captures the content correctly, but the aggressive stripping erases meaningful data.
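The difference between the two calls can be seen on a single string. This is a minimal sketch: the block value is a hypothetical captured fence body, not taken from the project's tests.

```python
# Hypothetical content captured from a fenced block (illustration only).
block = "\n    def example():\n        return 1\n"

# Current behavior: strip() removes the outer newlines AND the leading indentation.
stripped_all = block.strip()

# Proposed behavior: strip("\n") removes only the outer newline characters.
stripped_newlines = block.strip("\n")

print(repr(stripped_all))
print(repr(stripped_newlines))
```

With strip(), the first line loses its four-space indent while the second line keeps its eight, so the snippet no longer parses as Python. With strip("\n"), the indentation of both lines is left intact.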
Acceptance Criteria
- Modify extract_md_blocks in mdextractor/__init__.py to use block.strip('\n') instead of block.strip().
- Update the function’s docstring to clarify that only leading/trailing newlines are removed, not other whitespace.
- Add test cases to tests/test_mdextractor.py verifying:
  - Code blocks retain leading/trailing spaces/tabs (e.g., "\n  code  \n" becomes "  code  ").
  - Leading/trailing newlines are stripped (e.g., "\ncode\n" becomes "code").
- Ensure existing tests (e.g., language specifiers, malformed fences) still pass after the change.
- Validate edge cases like blocks containing only whitespace (e.g., "\n  \n" → "  ").