[FeedExpander] Add prepareXml() overridable function #4485

Open
wants to merge 4 commits into
base: master


@ORelio ORelio commented Mar 20, 2025

What this pull request does

FeedExpander.php

  • Introduce overridable prepareXml($xmlString) function and move existing cleanup code inside
  • Auto-remove trailing content after root xml node (removed from PR, see discussion below)

Use case: remove analytic tags inserted in XML feeds

One of my bridges stopped working with the following error:

Type: Exception
Code: 0
Message: Unable to parse xml: Extra content at the end of the document
File: lib/FeedParser.php
Line: 26
Trace
#0 index.php(49): RssBridge->main()
#1 lib/RssBridge.php(57): DisplayAction->execute()
#2 actions/DisplayAction.php(71): DisplayAction->createResponse()
#3 actions/DisplayAction.php(106): CssSelectorFeedExpanderBridge->collectData()
#4 bridges/CssSelectorFeedExpanderBridge.php(61): FeedParser->parseFeed()
#5 lib/FeedParser.php(26)

Turns out the site's feed had an extra script tag from CloudFlare:

<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Numerama</title>
	<atom:link href="https://www.numerama.com/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.numerama.com/</link>
	<description>Le média de référence sur la société numérique et l&#039;innovation technologique</description>
	<lastBuildDate>Thu, 20 Mar 2025 10:00:36 +0000</lastBuildDate>
<!-- [...] Feed content [...] -->
	</channel>
</rss>
<script defer src="https://static.cloudflareinsights.com/beacon.min.js/vcd15cbe7772f49c399c6a5babf22c1241717689176015" integrity="sha512-ZpsOmlRQV6y907TI0dKBHq9Md29nnaEIPlkf84rnaERnq6zvWvPUqr2ft8M1aS28oN72PdrCzSjY4U6VaAw1EQ==" data-cf-beacon='{"rayId":"92345edefe2203f1","serverTiming":{"name":{"cfExtPri":true,"cfL4":true,"cfSpeedBrain":true,"cfCacheStatus":true}},"version":"2025.1.0","token":"8eedbc8e52114850a5577af1da359bcd"}' crossorigin="anonymous"></script>

This PR originally added auto-cleaning to remove trailing data that causes XML parsing to fail; that part was later removed from the PR (see discussion below).
This PR now allows overriding prepareXml($xmlString) from a bridge to clean the XML before it gets parsed.

All my bridges still seem to load fine on my instance after the change, and this fixed my broken feed. If you think this could break things, let me know and I'll move that code into a separate bridge on my instance.
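
For illustration, here is a minimal sketch of how a bridge could use such a hook to drop the trailing beacon script. Everything in it is an assumption made to keep the snippet self-contained: `ExampleFeedExpander`, `parseXml()` and `ExampleBeaconStrippingBridge` are hypothetical stand-ins, not the actual FeedExpander code, and the feed is a shortened version of the one above with a placeholder script URL.

```php
<?php
// Hypothetical stand-in for FeedExpander: only the hook pattern is shown.
abstract class ExampleFeedExpander
{
    // Default hook: returns the XML unchanged in this sketch.
    protected function prepareXml($xmlString)
    {
        return $xmlString;
    }

    // Hypothetical entry point: run the hook first, then parse.
    public function parseXml($rawXml)
    {
        libxml_use_internal_errors(true);
        $xml = simplexml_load_string($this->prepareXml($rawXml));
        if ($xml === false) {
            throw new Exception('Unable to parse xml');
        }
        return $xml;
    }
}

// Hypothetical bridge overriding the hook for one misbehaving site.
class ExampleBeaconStrippingBridge extends ExampleFeedExpander
{
    protected function prepareXml($xmlString)
    {
        // Remove a single <script>...</script> element located at the very end of the document.
        $pattern = '~<script\b[^>]*>(?:(?!</script>).)*</script>\s*$~is';
        $cleaned = preg_replace($pattern, '', $xmlString);

        // preg_replace() returns null on failure: fall back to the untouched input.
        return parent::prepareXml($cleaned ?? $xmlString);
    }
}

// Shortened sample feed with trailing content after the root node.
$feed = '<?xml version="1.0" encoding="UTF-8"?><rss version="2.0">'
    . '<channel><title>Numerama</title></channel></rss>' . "\n"
    . '<script defer src="https://example.invalid/beacon.min.js"></script>' . "\n";

$bridge = new ExampleBeaconStrippingBridge();
$xml = $bridge->parseXml($feed);
echo $xml->channel->title, PHP_EOL; // prints "Numerama"
```

Keeping the cleanup in the bridge means only the affected site is touched, and other feeds go through the default hook unchanged.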

ORelio added 3 commits March 20, 2025 15:11
- Move preprocessing code into overridable preprocessXml()
- Auto-remove trailing data after root xml node
@ORelio ORelio changed the title FeedExpander: Remove tailing content in XML FeedExpander: Remove trailing content in XML Mar 20, 2025
@ORelio ORelio changed the title FeedExpander: Remove trailing content in XML [FeedExpander] Remove trailing content in XML Mar 20, 2025

dvikan commented Mar 23, 2025

looks fine but hard to say whether this introduces bugs (due to the hard-to-read regex)


ORelio commented Mar 23, 2025

Okay, I'll try to explain the regex /(?:<\?xml[^>]*\?>[^<]*<)([^ "\'>]+)/i, whose goal is finding the root node tag:

https://regex101.com/r/NmetjG/1

  • /<REGEX>/i for case insensitive
  • then two groups:
    • (?:<\?xml[^>]*\?>[^<]*<) non-capturing group (?:<REGEX>) for skipping the <?xml .... ?> prolog and possible spaces between ?> and following <:
      • <?xml literally (regex is case insensitive)
      • [^>]* anything until closing >
      • \?> closing tag literally
      • [^<]* anything until next opening <
    • capturing group ([^ "\'>]+) to get everything up to the next space, quote or closing >
      • should match the tag name of the root node
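
To make the breakdown above concrete, here is a small sketch that runs the same regex with `preg_match()`. The helper name `findRootTag()` and the two sample strings are assumptions for illustration only; they are not part of the PR.

```php
<?php
// Hypothetical helper: returns the root tag name, or null if the pattern does not match.
function findRootTag(string $xmlString): ?string
{
    // Regex from the comment above: skip the <?xml ... ?> prolog, capture the root tag name.
    $pattern = '/(?:<\?xml[^>]*\?>[^<]*<)([^ "\'>]+)/i';

    if (preg_match($pattern, $xmlString, $matches) === 1) {
        return $matches[1];
    }
    return null;
}

// Root node with attributes, directly after the prolog.
$rss = '<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"><channel></channel></rss>';

// Root node without attributes, separated from the prolog by a newline.
$atom = '<?xml version="1.0"?>' . "\n" . '<feed></feed>';

echo findRootTag($rss), PHP_EOL;  // prints "rss"
echo findRootTag($atom), PHP_EOL; // prints "feed"
var_dump(findRootTag('no prolog here')); // NULL
```

One thing worth noting: the capturing class stops at a space, a quote or `>`, but not at a tab or newline, so a root tag whose name is followed directly by a line break would capture more than the tag name.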

Now, the same logic without regex, and without error handling, would look like this:

//find `<?xml ... ?>`
$prolog_start = stripos($xmlString, '<?xml');
$prolog_end = strpos($xmlString, '?>', $prolog_start);

//find first `<node attr="data">` after `<?xml ... ?>`
$root_node_start = strpos($xmlString, '<', $prolog_end);
$root_node_end = strpos($xmlString, '>', $root_node_start);
$root_node_tag = substr($xmlString, $root_node_start + 1, $root_node_end - $root_node_start - 1);

//convert `<node attr="data">` into `node`
$root_node_tag = explode(' ', $root_node_tag)[0];
$root_node_tag = explode('"', $root_node_tag)[0];
$root_node_tag = explode("'", $root_node_tag)[0];

//find last occurrence of </node> and delete everything after that
$closing_node_start = strripos($xmlString, '</' . $root_node_tag);
$closing_node_end = strpos($xmlString, '>', $closing_node_start);
$xmlString = substr($xmlString, 0, $closing_node_end + 1);

With error handling (do not touch $xmlString if we are not 100% sure):

//find <?xml ... ?>
$prolog_start = stripos($xmlString, '<?xml');
if ($prolog_start !== false) {
    $prolog_end = strpos($xmlString, '?>', $prolog_start);
    if ($prolog_end !== false) {

        //find first `<node attr="data">` after `<?xml ... ?>`
        $root_node_start = strpos($xmlString, '<', $prolog_end);
        if ($root_node_start !== false) {
            $root_node_end = strpos($xmlString, '>', $root_node_start);
            if ($root_node_end !== false) {
                $root_node_tag = substr($xmlString, $root_node_start + 1, $root_node_end - $root_node_start - 1);

                //convert `<node attr="data">` into `node`
                $root_node_tag = explode(' ', $root_node_tag)[0];
                $root_node_tag = explode('"', $root_node_tag)[0];
                $root_node_tag = explode("'", $root_node_tag)[0];

                //find last occurrence of </node> and delete everything after that
                $closing_node_start = strripos($xmlString, '</' . $root_node_tag);
                if ($closing_node_start !== false) {
                    $closing_node_end = strpos($xmlString, '>', $closing_node_start);
                    if ($closing_node_end !== false) {
                        $xmlString = substr($xmlString, 0, $closing_node_end + 1);
                    }
                }
            }
        }
    }
}
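
As a readability variant, the same checks can be flattened with early returns instead of nested `if` blocks. This is only a sketch: the function name `trimAfterRootNode()` is hypothetical, the empty-tag guard is an extra precaution not present in the code above, and the sample feed is shortened with a placeholder script URL.

```php
<?php
// Hypothetical helper: same steps as above, returning the input untouched whenever a step fails.
function trimAfterRootNode(string $xmlString): string
{
    // Find `<?xml ... ?>`
    $prologStart = stripos($xmlString, '<?xml');
    if ($prologStart === false) {
        return $xmlString;
    }
    $prologEnd = strpos($xmlString, '?>', $prologStart);
    if ($prologEnd === false) {
        return $xmlString;
    }

    // Find first `<node attr="data">` after the prolog
    $rootStart = strpos($xmlString, '<', $prologEnd);
    if ($rootStart === false) {
        return $xmlString;
    }
    $rootEnd = strpos($xmlString, '>', $rootStart);
    if ($rootEnd === false) {
        return $xmlString;
    }

    // Convert `node attr="data"` into `node`
    $rootTag = substr($xmlString, $rootStart + 1, $rootEnd - $rootStart - 1);
    $rootTag = explode(' ', $rootTag)[0];
    $rootTag = explode('"', $rootTag)[0];
    $rootTag = explode("'", $rootTag)[0];
    if ($rootTag === '') {
        return $xmlString; // extra guard, not in the original snippet
    }

    // Find last occurrence of `</node>` and drop everything after it
    $closeStart = strripos($xmlString, '</' . $rootTag);
    if ($closeStart === false) {
        return $xmlString;
    }
    $closeEnd = strpos($xmlString, '>', $closeStart);
    if ($closeEnd === false) {
        return $xmlString;
    }
    return substr($xmlString, 0, $closeEnd + 1);
}

$feed = '<?xml version="1.0" encoding="UTF-8"?><rss version="2.0">'
    . '<channel><title>Numerama</title></channel></rss>' . "\n"
    . '<script defer src="https://example.invalid/beacon.min.js"></script>';

$trimmed = trimAfterRootNode($feed);
echo substr($trimmed, -6), PHP_EOL; // prints "</rss>"
```

The behavior is meant to match the nested version: `$xmlString` is only modified when every lookup succeeds.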

Again, if none of these approaches seems satisfactory for code reliability and maintainability, that's okay, I'll remove it from FeedExpander and implement it on my own bridge overriding prepareXml($xmlString).


dvikan commented Mar 25, 2025

i dunno man. you make the call. ill merge if you want

@ORelio ORelio marked this pull request as draft March 26, 2025 07:57

ORelio commented Mar 26, 2025

OK. Just to be safe, I'll move this code to a separate bridge and will come back with it if I encounter one more site with this kind of feed malformation. I'll change the PR to just include the overridable prepareXml($xmlString) function.

Will add back later if more sites have the same issue
@ORelio ORelio changed the title [FeedExpander] Remove trailing content in XML [FeedExpander] Add prepareXml() overridable function Mar 30, 2025
@ORelio ORelio marked this pull request as ready for review March 30, 2025 14:49

dvikan commented Mar 30, 2025

you can type hint both function param and function return value
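
A sketch of what that suggestion might look like, assuming the function takes and returns a string. The class `ExampleTypedExpander`, its `run()` method and the `trim()` body are placeholders for illustration, not the code in this PR.

```php
<?php
declare(strict_types=1);

// Hypothetical class: only the typed signature is the point here.
class ExampleTypedExpander
{
    // Parameter and return value are both declared as string.
    protected function prepareXml(string $xmlString): string
    {
        return trim($xmlString); // placeholder body
    }

    // Hypothetical untyped caller, used to show what the type hint rejects.
    public function run($input): string
    {
        return $this->prepareXml($input);
    }
}

$expander = new ExampleTypedExpander();
echo $expander->run("  <rss/>\n"), PHP_EOL; // prints "<rss/>"

try {
    $expander->run(null); // null is not a string: PHP throws a TypeError
} catch (TypeError $e) {
    echo 'TypeError caught', PHP_EOL;
}
```

A bridge overriding the function would then need a compatible signature, which makes mistakes in overrides show up early.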
