[FeedExpander] Add prepareXml() overridable function #4485

ORelio · 2025-03-20T14:24:21Z

What this pull request does

FeedExpander.php

Introduce an overridable prepareXml($xmlString) function and move the existing cleanup code into it
~~Auto-remove trailing content after root xml node~~ (removed from PR, see discussion below)

Use case: remove analytics tags inserted into XML feeds

One of my bridges stopped working with the following error:

Type: Exception
Code: 0
Message: Unable to parse xml: Extra content at the end of the document
File: lib/FeedParser.php
Line: 26
Trace
#0 index.php(49): RssBridge->main()
#1 lib/RssBridge.php(57): DisplayAction->execute()
#2 actions/DisplayAction.php(71): DisplayAction->createResponse()
#3 actions/DisplayAction.php(106): CssSelectorFeedExpanderBridge->collectData()
#4 bridges/CssSelectorFeedExpanderBridge.php(61): FeedParser->parseFeed()
#5 lib/FeedParser.php(26)

Turns out the site's feed had an extra script tag from Cloudflare after the closing </rss> tag:

<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Numerama</title>
	<atom:link href="https://www.numerama.com/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.numerama.com/</link>
	<description>Le média de référence sur la société numérique et l&#039;innovation technologique</description>
	<lastBuildDate>Thu, 20 Mar 2025 10:00:36 +0000</lastBuildDate>
<!-- [...] Feed content [...] -->
	</channel>
</rss>
<script defer src="https://static.cloudflareinsights.com/beacon.min.js/vcd15cbe7772f49c399c6a5babf22c1241717689176015" integrity="sha512-ZpsOmlRQV6y907TI0dKBHq9Md29nnaEIPlkf84rnaERnq6zvWvPUqr2ft8M1aS28oN72PdrCzSjY4U6VaAw1EQ==" data-cf-beacon='{"rayId":"92345edefe2203f1","serverTiming":{"name":{"cfExtPri":true,"cfL4":true,"cfSpeedBrain":true,"cfCacheStatus":true}},"version":"2025.1.0","token":"8eedbc8e52114850a5577af1da359bcd"}' crossorigin="anonymous"></script>
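The failure can be reproduced outside RSS-Bridge with a few lines of PHP. This is only a sketch: the helper name `tryParseXml()` and the tiny feed are made up for illustration, and it assumes the SimpleXML extension is available. libxml should report the same "Extra content at the end of the document" message as in the trace above.

```php
<?php
declare(strict_types=1);

/**
 * Try to parse an XML string and collect libxml's error messages.
 * Returns [parsedOk, messages]. Hypothetical helper, not part of RSS-Bridge.
 */
function tryParseXml(string $xml): array
{
    // Collect parser errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $document = simplexml_load_string($xml);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message);
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$document !== false, $messages];
}

// Made-up feed: well-formed RSS, then a stray element after </rss>.
$feed = '<?xml version="1.0" encoding="UTF-8"?>'
      . '<rss version="2.0"><channel><title>Example</title></channel></rss>';
$injected = $feed . "\n" . '<script defer src="https://example.invalid/beacon.js"></script>';

$result = tryParseXml($injected);
var_dump($result[0]);                 // parsing is expected to fail for $injected
echo implode("\n", $result[1]), "\n"; // libxml's explanation
```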

~~This PR adds auto-cleaning to remove trailing data causing XML parsing to fail.~~
This PR allows overriding prepareXml($xmlString) from a bridge to clean XML before it gets parsed.
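As a rough illustration of what such an override could look like, here is a self-contained sketch. `FeedExpanderSketch` is a stand-in written for this example, not the real `FeedExpander`; the method visibility, the typed signature, the default cleanup and the `parse()` helper are all assumptions.

```php
<?php
declare(strict_types=1);

// Stand-in for FeedExpander, reduced to the hook discussed here (assumed shape).
abstract class FeedExpanderSketch
{
    protected function prepareXml(string $xmlString): string
    {
        // Placeholder for the existing cleanup code; here it only trims whitespace.
        return trim($xmlString);
    }

    /** @return SimpleXMLElement|false */
    public function parse(string $xmlString)
    {
        return simplexml_load_string($this->prepareXml($xmlString));
    }
}

// Hypothetical bridge that strips a trailing Cloudflare beacon <script>.
final class ExampleBridgeSketch extends FeedExpanderSketch
{
    protected function prepareXml(string $xmlString): string
    {
        $cleaned = preg_replace(
            '#<script\b[^>]*cloudflareinsights[^>]*>\s*</script>\s*$#i',
            '',
            $xmlString
        );
        // preg_replace() returns null on failure; fall back to the input.
        return parent::prepareXml($cleaned ?? $xmlString);
    }
}

$sample = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?><rss version="2.0">
<channel>
	<title>Numerama</title>
</channel>
</rss>
<script defer src="https://static.cloudflareinsights.com/beacon.min.js/abc" data-cf-beacon='{"rayId":"0"}' crossorigin="anonymous"></script>
XML;

$rss = (new ExampleBridgeSketch())->parse($sample);
echo $rss === false ? 'parse failed' : (string) $rss->channel->title, "\n";
```

Calling the parent implementation last keeps whatever generic cleanup the base class performs, so the bridge only has to deal with its site-specific quirk.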

Seems like all my bridges still load fine on my instance after the change, and this fixed my broken feed. If you think this could break things, let me know and I'll move that code into a separate bridge on my instance.

- Move preprocessing code into overridable preprocessXml()
- Auto-remove trailing data after root xml node

dvikan · 2025-03-23T20:27:51Z

Looks fine, but it's hard to say whether this introduces bugs (due to the hard-to-read regex).

ORelio · 2025-03-23T21:41:34Z

Okay, I'll try to explain the regex /(?:<\?xml[^>]*\?>[^<]*<)([^ "\'>]+)/i, whose goal is finding the root node tag:

https://regex101.com/r/NmetjG/1

/<REGEX>/i for case-insensitive matching
then two groups:
- (?:<\?xml[^>]*\?>[^<]*<) non-capturing group (?:<REGEX>) for skipping the <?xml .... ?> prolog and any whitespace between ?> and the following <:
  - <\?xml matches <?xml literally (the regex is case insensitive)
  - [^>]* anything until the closing >
  - \?> the closing ?> literally
  - [^<]* anything until the next opening <
  - < the opening < of the root node
- capturing group ([^ "\'>]+) to get everything up to the next space, quote or closing >
  - should match the tag name of the root node
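To see the regex in action, it can be wrapped in a small function. This is a sketch only; `rootTagName()` is a name invented for this example and is not part of the PR.

```php
<?php
declare(strict_types=1);

/**
 * Return the tag name of the root node using the regex explained above,
 * or null when no <?xml ... ?> prolog followed by a tag is found.
 */
function rootTagName(string $xmlString): ?string
{
    $pattern = '/(?:<\?xml[^>]*\?>[^<]*<)([^ "\'>]+)/i';
    if (preg_match($pattern, $xmlString, $matches) === 1) {
        return $matches[1]; // first (and only) capturing group
    }
    return null;
}

echo rootTagName('<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"><channel/></rss>'), "\n"; // prints "rss"
```

One thing the sketch makes easy to check: the character class only stops at a space, a quote or `>`, so a root tag followed directly by a newline or tab would be captured together with that whitespace. Adding `\s` to the class would be one way to cover that case.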

Now, the same code without the regex and without error handling would look like this:

//find `<?xml ... ?>`
$prolog_start = stripos($xmlString, '<?xml');
$prolog_end = strpos($xmlString, '?>', $prolog_start);

//find first `<node attr="data">` after `<?xml ... ?>`
$root_node_start = strpos($xmlString, '<', $prolog_end);
$root_node_end = strpos($xmlString, '>', $root_node_start);
$root_node_tag = substr($xmlString, $root_node_start + 1, $root_node_end - $root_node_start - 1);

//convert `<node attr="data">` into `node`
$root_node_tag = explode(' ', $root_node_tag)[0];
$root_node_tag = explode('"', $root_node_tag)[0];
$root_node_tag = explode("'", $root_node_tag)[0];

//find last occurrence of </node> and delete everything after that
$closing_node_start = strripos($xmlString, '</' . $root_node_tag);
$closing_node_end = strpos($xmlString, '>', $closing_node_start);
$xmlString = substr($xmlString, 0, $closing_node_end + 1);

With error handling (do not touch $xmlString if we are not 100% sure):

//find <?xml ... ?>
$prolog_start = stripos($xmlString, '<?xml');
if ($prolog_start !== false) {
    $prolog_end = strpos($xmlString, '?>', $prolog_start);
    if ($prolog_end !== false) {

        //find first `<node attr="data">` after `<?xml ... ?>`
        $root_node_start = strpos($xmlString, '<', $prolog_end);
        if ($root_node_start !== false) {
            $root_node_end = strpos($xmlString, '>', $root_node_start);
            if ($root_node_end !== false) {
                $root_node_tag = substr($xmlString, $root_node_start + 1, $root_node_end - $root_node_start - 1);

                //convert `<node attr="data">` into `node`
                $root_node_tag = explode(' ', $root_node_tag)[0];
                $root_node_tag = explode('"', $root_node_tag)[0];
                $root_node_tag = explode("'", $root_node_tag)[0];

                //find last occurrence of </node> and delete everything after that
                $closing_node_start = strripos($xmlString, '</' . $root_node_tag);
                if ($closing_node_start !== false) {
                    $closing_node_end = strpos($xmlString, '>', $closing_node_start);
                    if ($closing_node_end !== false) {
                        $xmlString = substr($xmlString, 0, $closing_node_end + 1);
                    }
                }
            }
        }
    }
}
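The nested version above can also be flattened with early returns, which may be easier to review. The following is a sketch under the same assumptions as the code above; the function name `stripAfterRootNode()` is made up, `strcspn()` stands in for the three `explode()` calls, and the empty-tag guard is an addition.

```php
<?php
declare(strict_types=1);

/**
 * Cut everything after the closing tag of the root node.
 * Returns the input unchanged whenever one of the lookups fails.
 */
function stripAfterRootNode(string $xmlString): string
{
    // Find <?xml ... ?>
    $prologStart = stripos($xmlString, '<?xml');
    if ($prologStart === false) {
        return $xmlString;
    }
    $prologEnd = strpos($xmlString, '?>', $prologStart);
    if ($prologEnd === false) {
        return $xmlString;
    }

    // Find the first `<node attr="data">` after the prolog
    $rootStart = strpos($xmlString, '<', $prologEnd);
    if ($rootStart === false) {
        return $xmlString;
    }
    $rootEnd = strpos($xmlString, '>', $rootStart);
    if ($rootEnd === false) {
        return $xmlString;
    }

    // Convert `node attr="data"` into `node`: stop at the first space or quote
    $tag = substr($xmlString, $rootStart + 1, $rootEnd - $rootStart - 1);
    $tag = substr($tag, 0, strcspn($tag, " \"'"));
    if ($tag === '') {
        return $xmlString; // extra guard, not in the original snippet
    }

    // Find the last occurrence of </node> and drop everything after it
    $closeStart = strripos($xmlString, '</' . $tag);
    if ($closeStart === false) {
        return $xmlString;
    }
    $closeEnd = strpos($xmlString, '>', $closeStart);
    if ($closeEnd === false) {
        return $xmlString;
    }

    return substr($xmlString, 0, $closeEnd + 1);
}
```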

Again, if none of these approaches seems satisfactory for code reliability and maintainability, that's okay: I'll remove it from FeedExpander and implement it in my own bridge by overriding prepareXml($xmlString).

dvikan · 2025-03-25T22:56:28Z

I dunno man. You make the call. I'll merge if you want.

ORelio · 2025-03-26T07:58:56Z

OK. Just to be safe, I'll move this code to a separate bridge and will come back with it if I encounter one more site with this kind of feed malformation. I'll change the PR to just include the overridable prepareXml($xmlString) function.

Will add back later if more sites have the same issue

dvikan · 2025-03-30T20:19:27Z

You can type-hint both the function parameter and the return value.

ORelio added 3 commits March 20, 2025 15:11

FeedExpander: Remove tailing content in XML

c7b5a3f

FeedExpander: Add PR reference with use case

a29d70f

FeedExpander: Code linting

359f6c4

ORelio changed the title ~~FeedExpander: Remove tailing content in XML~~ FeedExpander: Remove trailing content in XML Mar 20, 2025

ORelio changed the title ~~FeedExpander: Remove trailing content in XML~~ [FeedExpander] Remove trailing content in XML Mar 20, 2025

ORelio marked this pull request as draft March 26, 2025 07:57

ORelio force-pushed the master branch from c264a7c to 3880326 Compare March 30, 2025 10:56

[FeedExpander] Keep content at end of document for now

3880326

ORelio changed the title ~~[FeedExpander] Remove trailing content in XML~~ [FeedExpander] Add prepareXml() overridable function Mar 30, 2025

ORelio marked this pull request as ready for review March 30, 2025 14:49
