Description
The diff algorithm does not always keep HTML entity references intact, which can be problematic when later loading the resulting diff as HTML (e.g. when diffing a list entry). This is best illustrated by the following test:
<options></options>
<oldText>
<ol><li>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non justo & sapien;</li></ol>
</oldText>
<newText>
<ol><li>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non sapien et justo;</li></ol>
</newText>
<expected>
<ol class="diff-list"><li class="normal">Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non <ins class="diffins">sapien et </ins>justo<del class="diffdel"> & sapien</del>;</li></ol>
</expected>
This test currently crashes with an error message DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 1
from ListDiffLines.php:409
. This is because it tries to load the string
<body>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non <ins class="diffins">sapien et </ins>justo<del class="diffdel"> &</del>;<del class="diffdel"> sapien;</del></body>
You can see that the diff algorithm broke up the &
from its terminating ;
, which leads to invalid HTML.
The solution might be to consider HTML entities a special case in the regex in AbstractDiff.php
line 457 (e.g. adding an extra &[a-zA-Z0-9]+;
case there). That fixes the given test without breaking the others, but I cannot oversee what further possible impact that may have.