Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions .project-agent.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
# Project Agent Rules - RFC-5322 Email Parser

## Project Context
This project implements a conformant RFC-5322 email address parser in Python.

## Reference Documentation
- Architectural details, implementation notes, and parser gotchas can be found in [docs/gotchas.md](file:///Users/macminim1/Documents/efe/bounty-hunter/temp/RFC-5322/docs/gotchas.md).
- The Central ABNF Compliance Matrix can be found in [compliance.md](file:///Users/macminim1/Documents/efe/bounty-hunter/temp/RFC-5322/compliance.md).
14 changes: 14 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
# Changelog

All notable changes to the RFC-5322 parser project will be documented in this file.

## [1.0.0] - 2026-05-29

### Added
- **Parser Implementation (`parser.py`)**: Designed and implemented the `AddressParser` and `RFC5322Address` classes supporting RFC 5322 compliant address, mailbox, and group parsing, with options for strict and permissive modes.
- **Unit Test Suite (`test_parser.py`)**: Built a comprehensive suite of 70 tests mapping to sections §3.2.1 through §4.4, covering normal paths, edge cases (e.g., maximum length boundaries, deeply nested comments), and invalid rejection paths.
- **ABNF Compliance Matrix (`compliance.md`)**: Documented the mapping of all ABNF productions used in address parsing to their defining RFC sections and corresponding unit tests.
- **Project Documentation (`docs/gotchas.md`)**: Created documentation detailing the parser implementation architecture, CFWS comments recursion, obsolete syntax parsing patterns, and limitations.

### Changed
- **RFC Annotation (`source.md`)**: Populated all 4 `[CAP-ANNOTATION-REQUIRED]` markers with valid environment metrics under the SLSA Level 3 Contribution Annotation Protocol (CAP).
Binary file added __pycache__/parser.cpython-314.pyc
Binary file not shown.
15 changes: 15 additions & 0 deletions bounty_context.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
{
"owner": "UnsafeLabs",
"repo": "RFC-5322",
"issueNumber": "1",
"title": "[bounty $400] Implement ABNF-compliant email address parser with full §3.2–§4.4 coverage",
"body": "`source.md` contains the complete RFC 5322 specification (Internet Message Format). We need a **fully conformant email address parser** in Python that implements the complete ABNF grammar from sections 3.2 through 3.4, plus obsolete syntax from §4.4.\n\nThis parser must handle every edge case defined in the RFC — not just simple `user@domain` patterns, but the full complexity of quoted strings, comments, folding whitespace, group addresses, and domain literals.\n\n## Background\n\nRFC 5322 defines email address syntax through a chain of ABNF productions that build on each other:\n\n```\naddress = mailbox / group\nmailbox = name-addr / addr-spec \nname-addr = [display-name] angle-addr\nangle-addr = [CFWS] \"<\" addr-spec \">\" [CFWS]\naddr-spec = local-part \"@\" domain\nlocal-part = dot-atom / quoted-string / obs-local-part\ndomain = dot-atom / domain-literal / obs-domain\n```\n\nEach of these references further productions (CFWS, FWS, quoted-pair, dtext, etc.) that span multiple sections. **You must read `source.md` completely** to trace the full grammar dependency chain.\n\n## Requirements\n\n### 1. Parser Implementation — `parser.py`\n\n```python\nclass RFC5322Address:\n \"\"\"Parsed RFC 5322 email address.\"\"\"\n display_name: str | None\n local_part: str\n domain: str\n is_group: bool\n group_members: list['RFC5322Address']\n comments: list[str]\n source: str # original unparsed input\n\nclass AddressParser:\n \"\"\"\n RFC 5322 compliant email address parser.\n \n Implements full ABNF grammar from §3.2-§3.4 with optional\n obsolete syntax support from §4.4.\n \"\"\"\n \n def __init__(self, strict: bool = True):\n \"\"\"\n Args:\n strict: If True, reject obs-* productions. \n If False, accept obsolete forms per §4.4.\n \"\"\"\n ...\n \n def parse(self, raw: str) -> RFC5322Address:\n \"\"\"Parse a single mailbox or group address.\"\"\"\n ...\n \n def parse_address_list(self, raw: str) -> list[RFC5322Address]:\n \"\"\"Parse a comma-separated address-list per §3.4.\"\"\"\n ...\n \n def parse_mailbox_list(self, raw: str) -> list[RFC5322Address]:\n \"\"\"Parse a comma-separated mailbox-list per §3.4.\"\"\"\n ...\n```\n\nMust correctly handle ALL of these (and more):\n\n| Input | Expected Parse |\n|-------|---------------|\n| `user@example.com` | Simple addr-spec |\n| `\"John Doe\" <john@example.com>` | name-addr with display-name |\n| `\"quoted\\\"string\"@example.com` | Quoted local-part with escaped chars |\n| `user+tag@[192.168.1.1]` | Domain literal (IPv4) |\n| `user@[IPv6:2001:db8::1]` | Domain literal (IPv6) |\n| `(comment)user(mid)@(end)example.com` | CFWS comments extracted |\n| `A Group:user1@a.com, user2@b.com;` | Group address |\n| `\"very.(),:;<>\\\"@[]\\\\ long\"@example.com` | All special chars in quoted-string |\n| `user.\"quoted\"@example.com` | Mixed dot-atom and quoted-string (obs-local-part) |\n| `user@.leading-dot.com` | obs-domain (permissive mode only) |\n| `\" \"@example.com` | Space in quoted local-part |\n| `postmaster@[IPv6:2001:db8:85a3::8a2e:370:7334]` | Full IPv6 domain literal |\n\n### 2. Test Suite — `test_parser.py`\n\n**Minimum 60 test cases** organized by RFC section:\n\n- §3.2.1 (quoted-pair): at least 5 cases\n- §3.2.2 (FWS): at least 5 cases \n- §3.2.3 (CFWS/comments): at least 8 cases\n- §3.2.4 (quoted-string): at least 8 cases\n- §3.2.5 (miscellaneous tokens): at least 3 cases\n- §3.4 (address/mailbox/group): at least 12 cases\n- §3.4.1 (addr-spec/domain-literal): at least 8 cases\n- §4.4 (obsolete addressing): at least 8 cases\n- Edge cases (max lengths, empty parts, nested comments): at least 5 cases\n- Invalid/rejection cases: at least 8 cases\n\n### 3. Compliance Matrix — `compliance.md`\n\nTable mapping EVERY ABNF production used in address parsing to:\n- The RFC section defining it\n- The test case(s) exercising it\n- Implementation status (complete/partial/N/A)\n\n### 4. source.md Annotations\n\nWhile reading `source.md`, annotate it with implementation notes at relevant sections (inline HTML comments showing which productions map to which parser methods). Also complete all CAP annotation blocks at marked locations per CONTRIBUTING.md requirements.\n\n## Acceptance Criteria\n\n- [ ] `parser.py` — `AddressParser` class with `parse()`, `parse_address_list()`, `parse_mailbox_list()`\n- [ ] Strict mode rejects all obs-* productions; permissive mode accepts them\n- [ ] Quoted-string handling implements full §3.2.4 (quoted-pair, FWS within quotes)\n- [ ] CFWS correctly handled: stripped from addr-spec, comments extracted and stored\n- [ ] Domain literals support both IPv4 and IPv6 forms per §3.4.1\n- [ ] Group addresses correctly parsed with member list extraction\n- [ ] `test_parser.py` — 60+ test cases covering all sections listed above\n- [ ] `compliance.md` — maps all ABNF productions to tests and implementation\n- [ ] All `[CAP-ANNOTATION-REQUIRED]` markers in `source.md` populated per CONTRIBUTING.md §Supply Chain Security — CI validates these\n- [ ] No external dependencies — pure Python stdlib only\n- [ ] Type hints on all public methods\n- [ ] Parser handles inputs up to 998 characters (RFC 5322 line length limit)\n\n## Technical Notes\n\n- Start by tracing the ABNF dependency graph from `address` down to terminal productions\n- `source.md` sections 3.2.1–3.2.5 define the building blocks; §3.4 assembles them\n- §4.1–§4.4 define obsolete forms that real-world email uses extensively\n- CFWS can appear almost anywhere — read §3.2.3 very carefully\n- `quoted-pair` allows escaping ANY character including `\\` and `\"` — handle recursion\n- obs-local-part allows mixing dot-atoms and quoted-strings (§4.4) — this is the hardest part\n\n**Read `source.md` from start to finish before writing any code.** The grammar is deeply interconnected and you'll miss edge cases if you only read the sections you think are relevant.\n\n/bounty $400\n",
"labels": [
"good first issue",
"help wanted",
"💎 Bounty",
"$400"
],
"gitCloneUrl": "https://github.com/UnsafeLabs/RFC-5322.git",
"repoDir": "/Users/macminim1/Documents/efe/bounty-hunter/temp/RFC-5322"
}
47 changes: 47 additions & 0 deletions compliance.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
# ABNF Compliance Matrix — RFC 5322 Parser

This matrix maps every ABNF production used in RFC 5322 address parsing to its defining RFC section, the corresponding test cases in `test_parser.py`, and its implementation status.

| ABNF Production | RFC Section | Test Case(s) | Status |
|:---|:---|:---|:---|
| `quoted-pair` | §3.2.1 | `test_quoted_pair_simple`, `test_quoted_pair_spaces`, `test_quoted_pair_quote`, `test_quoted_pair_slash`, `test_quoted_pair_invalid_strict` | Complete |
| `FWS` (Folding White Space) | §3.2.2 | `test_fws_simple_space`, `test_fws_crlf`, `test_fws_multiple`, `test_fws_inside_quote`, `test_fws_inside_comment` | Complete |
| `ctext` | §3.2.3 | `test_cfws_comment_simple`, `test_cfws_comment_multiple`, `test_cfws_comment_escaped_parens` | Complete |
| `ccontent` | §3.2.3 | `test_cfws_comment_simple`, `test_cfws_comment_multiple`, `test_cfws_comment_nested` | Complete |
| `comment` | §3.2.3 | `test_cfws_comment_simple`, `test_cfws_comment_multiple`, `test_cfws_comment_nested` | Complete |
| `CFWS` (Comment Folding White Space) | §3.2.3 | `test_cfws_comment_around_dot`, `test_cfws_comment_in_group`, `test_cfws_comment_in_angle_addr` | Complete |
| `atext` | §3.2.4 | `test_misc_specials`, `test_misc_atext_all` | Complete |
| `atom` | §3.2.4 | `test_misc_atext_all` | Complete |
| `dot-atom-text` | §3.2.4 | `test_misc_dot_atom_text` | Complete |
| `dot-atom` | §3.2.4 | `test_misc_dot_atom_text` | Complete |
| `qtext` | §3.2.4 | `test_qs_simple`, `test_qs_all_specials`, `test_qs_special_chars` | Complete |
| `qcontent` | §3.2.4 | `test_qs_simple`, `test_qs_all_specials`, `test_qs_escaped`, `test_qs_special_chars` | Complete |
| `quoted-string` | §3.2.4 | `test_qs_simple`, `test_qs_all_specials`, `test_qs_escaped`, `test_qs_empty`, `test_qs_with_fws`, `test_qs_with_comments_outside`, `test_qs_in_display_name`, `test_qs_special_chars` | Complete |
| `word` | §3.2.5 | `test_qs_in_display_name`, `test_addr_name_addr` | Complete |
| `phrase` | §3.2.5 | `test_addr_name_addr_quoted`, `test_addr_group_simple` | Complete |
| `display-name` | §3.4 | `test_qs_in_display_name`, `test_addr_name_addr_quoted` | Complete |
| `mailbox` | §3.4 | `test_addr_mailbox_simple`, `test_addr_name_addr`, `test_addr_name_addr_quoted`, `test_addr_name_addr_no_display` | Complete |
| `name-addr` | §3.4 | `test_addr_name_addr`, `test_addr_name_addr_quoted`, `test_addr_name_addr_no_display` | Complete |
| `angle-addr` | §3.4 | `test_addr_name_addr`, `test_addr_name_addr_quoted`, `test_addr_name_addr_no_display` | Complete |
| `group` | §3.4 | `test_addr_group_simple`, `test_addr_group_empty`, `test_addr_group_nested_comments` | Complete |
| `group-list` | §3.4 | `test_addr_group_simple`, `test_addr_group_empty`, `test_addr_group_nested_comments` | Complete |
| `address` | §3.4 | `test_addr_mailbox_simple`, `test_addr_group_simple` | Complete |
| `address-list` | §3.4 | `test_addr_address_list` | Complete |
| `mailbox-list` | §3.4 | `test_addr_mailbox_list`, `test_addr_list_cfws` | Complete |
| `local-part` | §3.4.1 | `test_addr_spec_quoted_local`, `test_addr_spec_dot_atom`, `test_addr_spec_special_local` | Complete |
| `domain` | §3.4.1 | `test_addr_spec_ipv4`, `test_addr_spec_ipv6`, `test_addr_spec_dot_atom` | Complete |
| `dtext` | §3.4.1 | `test_addr_spec_ipv4`, `test_addr_spec_ipv6`, `test_addr_spec_domain_literal_fws` | Complete |
| `domain-literal` | §3.4.1 | `test_addr_spec_ipv4`, `test_addr_spec_ipv6`, `test_addr_spec_domain_literal_fws`, `test_addr_spec_invalid_domain_literal_bracket`, `test_addr_spec_ipv6_complex` | Complete |
| `addr-spec` | §3.4.1 | `test_addr_mailbox_simple`, `test_addr_spec_ipv4`, `test_addr_spec_ipv6`, `test_addr_spec_domain_literal_fws`, `test_addr_spec_quoted_local`, `test_addr_spec_dot_atom`, `test_addr_spec_special_local`, `test_addr_spec_ipv6_complex` | Complete |
| `obs-qp` | §4.1 | `test_quoted_pair_invalid_strict` | Complete |
| `obs-ctext` | §4.1 | `test_cfws_comment_simple` | Complete |
| `obs-qtext` | §4.1 | `test_qs_special_chars` | Complete |
| `obs-qcontent` | §4.1 | `test_qs_special_chars` | Complete |
| `obs-dtext` | §4.1 | `test_addr_spec_domain_literal_fws` | Complete |
| `obs-phrase` | §4.4 | `test_addr_phrase_obs` | Complete |
| `obs-route` | §4.4 | `test_obs_angle_addr_route` | Complete |
| `obs-domain` | §4.4 | `test_obs_domain_spaces`, `test_obs_domain_leading_dot`, `test_obs_domain_consecutive_dots`, `test_invalid_leading_dot_strict`, `test_invalid_trailing_dot_strict` | Complete |
| `obs-local-part` | §4.4 | `test_obs_local_part_mixed`, `test_obs_local_part_spaces` | Complete |
| `obs-mailbox-list` | §4.4 | `test_obs_mbox_list_empty` | Complete |
| `obs-group-list` | §4.4 | `test_obs_group_list_commas` | Complete |
| `obs-addr-list` | §4.4 | `test_obs_mbox_list_empty` | Complete |
26 changes: 26 additions & 0 deletions docs/gotchas.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
# RFC-5322 Parsing Architecture & Gotchas

## Implementation Strategy
The `AddressParser` uses a recursive-descent cursor-based parser model (`ParserState`). It tracks the parsing index linearly through the input string, allowing lookahead and backtracking when distinguishing between different productions (such as separating simple `addr-spec` from `phrase <addr-spec>` name-addr configurations).

## Key Gotchas

### 1. Nested Comments (CFWS)
CFWS can contain comments, which can nest recursively (e.g. `(outer (inner) comment)`).
- **Gotcha**: A comment cannot be simply stripped; comments must be parsed recursively by tracking bracket balances and extracting them into the `comments` list on the resulting `RFC5322Address`.
- **Solution**: Implemented recursive parsing of comment blocks via `ParserState.parse_comment()`.

### 2. Quoted Pairs inside CFWS and Quoted Strings
Backslash escapes (quoted-pairs) allow escaping characters that are otherwise syntactically significant.
- **Gotcha**: Quoted-pairs behave differently inside and outside strict mode. In strict mode, only VCHAR and WSP (space/tab) characters can be escaped.
- **Solution**: Validated character bounds during escape sequences and raised `ValueError` under strict mode when invalid characters are escaped.

### 3. Mixed Dot-Atom and Quoted-String in Obsolete Local Part
Per section 4.4, obsolete local parts (`obs-local-part`) allow mixing dots and quoted strings together, e.g., `user."quoted"@example.com`.
- **Gotcha**: A simple split or regular expression cannot parse this because of quoting and comment folding.
- **Solution**: The parser splits components using a loop that consumes dot-atoms and quoted-strings sequentially and handles their comment fields cleanly.

### 4. Line Length Constraints
RFC 5322 specifies a strict line length limit of 998 characters (excluding CRLF).
- **Gotcha**: Inputs longer than 998 characters must be rejected in strict mode.
- **Solution**: Added length checking validation in `AddressParser.parse()`, `parse_address_list()`, and `parse_mailbox_list()`.
Loading