Fix extract_md_blocks to preserve original code block whitespace #6

@chigwell

Description

User Story
As a developer using mdextractor, I want code block extraction to preserve original leading/trailing whitespace (except newlines) so that valid code structures with intentional indentation or alignment aren’t altered.

Background
The current implementation of extract_md_blocks in mdextractor/__init__.py calls block.strip(), which removes all leading and trailing whitespace, including spaces and tabs. This is a problem for code blocks where whitespace is syntactically significant (e.g., Python indentation, YAML formatting). For example, a code block whose first line is indented (such as an indented def example():) loses its leading spaces, which can make the extracted code invalid. The regex pattern r"```(?:\w+\s+)?(.*?)```" already captures the content correctly, but the aggressive stripping erases meaningful data.
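To make the difference concrete, here is a minimal sketch comparing the two calls on a made-up string. The value of captured is hypothetical and only stands in for what the regex might capture between two fences.

```python
# Hypothetical block contents, as they might be captured between two fences.
captured = "\n    return x\n"

# Current behaviour: all surrounding whitespace goes, including the indentation.
stripped_all = captured.strip()

# Proposed behaviour: only the surrounding newlines go, the indentation stays.
stripped_newlines = captured.strip("\n")

print(repr(stripped_all))       # 'return x'
print(repr(stripped_newlines))  # '    return x'
```

Note that strip('\n') only removes newline characters at the very start or end of the string, so a newline followed by trailing spaces is left in place.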

Acceptance Criteria

  • Modify extract_md_blocks in mdextractor/__init__.py to use block.strip('\n') instead of block.strip().
  • Update the function’s docstring to clarify that only leading/trailing newlines are removed, not other whitespace.
  • Add test cases to tests/test_mdextractor.py verifying:
    • Code blocks retain leading/trailing spaces/tabs (e.g., "\n code \n" becomes " code ").
    • Leading/trailing newlines are stripped (e.g., "\ncode\n" becomes "code").
  • Ensure existing tests (e.g., language specifiers, malformed fences) still pass after the change.
  • Validate edge cases like blocks containing only whitespace (e.g., " \n ", which strip('\n') leaves unchanged because the newline is not at either end).
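The criteria above could be sketched roughly as follows. This is not the project's actual code: the use of re.findall with re.DOTALL, the FENCE helper, and the sample inputs are all assumptions made for illustration. Only the pattern and the strip('\n') call come from this issue.

```python
import re

# Three backticks, built programmatically so the example is easy to embed.
FENCE = "`" * 3

# Pattern quoted in this issue. re.DOTALL is an assumption here, so that "."
# can span the newlines inside a block.
PATTERN = re.compile(FENCE + r"(?:\w+\s+)?(.*?)" + FENCE, re.DOTALL)


def extract_md_blocks(text):
    """Return the contents of fenced code blocks.

    Only leading/trailing newlines are removed; spaces and tabs are kept.
    """
    return [block.strip("\n") for block in PATTERN.findall(text)]


# Hypothetical inputs covering the acceptance criteria.
indented = extract_md_blocks(FENCE + "\n    x = 1\n" + FENCE)
trailing = extract_md_blocks(FENCE + "\ncode  \n" + FENCE)
whitespace_only = extract_md_blocks(FENCE + " \n " + FENCE)
with_language = extract_md_blocks(FENCE + "python\n    x = 1\n" + FENCE)

print(indented)         # ['    x = 1']
print(trailing)         # ['code  ']
print(whitespace_only)  # [' \n ']
print(with_language)    # ['x = 1']
```

One thing this sketch surfaces: when a language specifier is present, the \s+ in the pattern also consumes the indentation of the first line, so in that case the first line's leading spaces are lost before strip is ever called. It may be worth covering that case in the new tests.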
