@Crozzers
Contributor

@Crozzers Crozzers commented Oct 5, 2025

This PR fixes #641, fixes #642 and fixes #643.

I put all these fixes in the same PR because I wanted to make sure they were compatible with each other, but that does mean it's a bit messy.

Middle word em breaking (*em*) (#641)

The middle-word em regex checks for * or _ chars that aren't preceded by another em char or by whitespace. The text (*em*) matches that, because the leading em char is preceded by "(" rather than a space. The regex would then prevent this text from being processed as a valid em.
I've updated the regex to look for em chars that aren't preceded by non-word chars (instead of only whitespace), which fixes this issue.
The result is that we process this as expected:

(<em>em</em>)
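
To illustrate the change in the "preceded by" rule, here is a minimal sketch written as plain character checks rather than as the actual regex. The function names `old_is_middle_word` and `new_is_middle_word` are hypothetical and are not part of the library; they only model the condition described above, and `str.isalnum()` is used as an approximation of "word char that is not an em char".

```python
def old_is_middle_word(text: str, i: int) -> bool:
    """Old rule (simplified): the em char at index i is not preceded
    by another em char or by whitespace."""
    prev = text[i - 1] if i > 0 else " "  # treat start of text like whitespace
    return prev not in "*_" and not prev.isspace()


def new_is_middle_word(text: str, i: int) -> bool:
    """New rule (simplified): the em char at index i must be preceded
    by a word character, so punctuation such as "(" no longer counts."""
    prev = text[i - 1] if i > 0 else " "
    return prev not in "*_" and prev.isalnum()


text = "(*em*)"
print(old_is_middle_word(text, 1))  # True: "(" is neither an em char nor whitespace
print(new_is_middle_word(text, 1))  # False: "(" is a non-word char
```

Under these assumptions, the opening `*` in `(*em*)` is no longer classified as a middle-word em, while an em char inside a word such as `mid*word` still is.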

Improve handling for leading underscores (#642)

In this issue we had what looked like an <em> span, but it was straddling two other <strong> spans:

**_confusing** ident is **_confusing**

This is not a valid em. Spans can be nested but they shouldn't stay open after the parent span closes.

I added logic to the italics and bold stage that checks whether the matched strong/em contains any nested spans and, if so, whether those spans are balanced and closed. If not, the strong/em is deemed invalid.

The result is that we process the strongs here, but not the em:

<p><strong>_confusing</strong> ident is <strong>_confusing</strong></p>
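
The balance check described above can be sketched roughly as follows. This is not the code from the PR: the helper name `spans_are_balanced` is hypothetical, and the sketch assumes the strong spans have already been rendered to `<strong>` tags by the time the em candidate is examined.

```python
import re

# Only the two tags relevant here; a simplification for illustration.
TAG_RE = re.compile(r"<(/?)(strong|em)>")


def spans_are_balanced(inner: str) -> bool:
    """Return True if every <strong>/<em> opened inside `inner` is closed
    inside it, in order, and nothing is closed that was not opened there."""
    stack = []
    for match in TAG_RE.finditer(inner):
        closing, name = match.group(1), match.group(2)
        if not closing:
            stack.append(name)  # opening tag: remember it
        elif not stack or stack.pop() != name:
            return False  # closes a span that was opened outside `inner`
    return not stack  # anything left on the stack is still open


# The would-be em contents between the two underscores in the example:
print(spans_are_balanced("confusing</strong> ident is <strong>"))  # False
print(spans_are_balanced("a <strong>b</strong> c"))  # True
```

In this sketch the candidate em is rejected because its contents close a `<strong>` that was opened before the em started.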

Consecutive strong/em can overlap (#643)

The strong/em regexes were starting their matches as early as possible and including as much text in the span as possible. This led to the following text being processed like so:

  1. **strong***em***strong**
  2. <strong>strong*</strong>em<strong>*strong</strong>
  3. <strong>strong<em></strong>em<strong></em>strong</strong>

This renders fine in most browsers, but it is invalid HTML.

To fix this, I modified the strong regex to skip as many leading *_ chars as possible, so that the opening <strong> tag sits as close to the actual contents as possible, and to close the </strong> as soon as possible.

Previously the strong regex would process ***abc*** as <strong>*abc*</strong>, but now it will produce *<strong>abc</strong>*.

The effect is that consecutive strongs and ems no longer overlap.

The unfortunate side effect is GitHub will process ***abc** as <strong>*abc</strong>, but we will output *<strong>abc</strong> instead, omitting that first em char from the span.
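
The difference between the two matching strategies can be sketched with two simplified regexes. Both patterns below are assumptions made for illustration (`EAGER_STRONG`, `TIGHT_STRONG` and `render_strong` are hypothetical names, not the patterns used in the library), and only the strong stage is modelled.

```python
import re

# Hypothetical "eager" pattern: the span may swallow adjacent */_ chars.
EAGER_STRONG = re.compile(r"(\*\*|__)(?=\S)(.+?[*_]*)(?<=\S)\1")

# Hypothetical "tight" pattern: the opener may not be followed by another
# */_ char, and the span ends at the first possible closer.
TIGHT_STRONG = re.compile(r"(\*\*|__)(?=[^\s*_])(.+?)(?<=\S)\1")


def render_strong(pattern: re.Pattern, text: str) -> str:
    """Replace every strong span matched by `pattern` with <strong> tags."""
    return pattern.sub(r"<strong>\2</strong>", text)


text = "**strong***em***strong**"
print(render_strong(EAGER_STRONG, text))
# <strong>strong*</strong>em<strong>*strong</strong>
print(render_strong(TIGHT_STRONG, text))
# <strong>strong</strong>*em*<strong>strong</strong>
print(render_strong(TIGHT_STRONG, "***abc**"))
# *<strong>abc</strong>
```

With the tight pattern the leftover `*em*` sits entirely between the two strong spans, so a later em pass has nothing to overlap with.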

@nicholasserra
Collaborator

This all looks reasonable, thank you for taking on all these edge cases!

@nicholasserra nicholasserra merged commit 9a88ce1 into trentm:master Oct 6, 2025
15 checks passed
@justanotheranonymoususer

Thanks!

> The unfortunate side effect is GitHub will process ***abc** as <strong>*abc</strong>, but we will output *<strong>abc</strong> instead, omitting that first em char from the span.

Actually, GitHub seems to output *<strong>abc</strong> as well, so looks like a fix too.

@justanotheranonymoususer

Quick report for gaps/regressions, I'll create issues later unless you ninja-fix it:

**one*two***
A_**B **text **c** d
x A_**B** y

Development

Successfully merging this pull request may close these issues.

**strong***em***strong** incorrect result
Improve handling for leading underscores
Disabling middle-word-em breaks (*em*)
