Parsing [[[[[[… takes quadratic time #798
`markdown.markdown('[' * 13000)` takes about a minute (in 3.0.1 and master), with the time increasing by about a factor of four every time the length is doubled.
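A rough way to see the reported growth (a minimal sketch, not from the original report; absolute numbers will vary by machine):

```python
import time
import markdown

# If the behavior is quadratic, doubling the number of opening brackets
# should roughly quadruple the parse time, as reported above.
for n in (1000, 2000, 4000, 8000):
    start = time.perf_counter()
    markdown.markdown('[' * n)
    print(n, round(time.perf_counter() - start, 2), "seconds")
```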
I expect this is because the link processor allows for nested brackets. On the bright side, it doesn't crash (many such things raised recursion errors back in version <2.0). The fact is, performance is not a high priority for us. If you are providing such an unrealistic input, then I don't really care that it is slow. Although, I'm curious how different this is for version 2.x, as we changed the bracket parsing in 3.0.
Makes sense to me. I'm sure it's running through the entire algorithm for each `[`. Currently, I see no problem with this behavior, as the current complaint is based on an unrealistic situation: a document consisting of a chain of 13,000 `[` characters. Before, we used a regular expression which failed miserably when tasked with nested brackets. If we are just concerned someone could create a document that freezes up the parser, we could add a nested bracket limit just to prevent the parser from getting tied up in ridiculously large nested situations.
Thanks for the quick response. My perspective is that I’m concerned about denial of service attacks on applications parsing user input as Markdown. This isn’t an academic exercise—this very comment is user input that GitHub is parsing as Markdown, and you can imagine that they’d be really sad if I could easily cause their parser to spin for minutes or hours, whether maliciously or by accident. In an even less hypothetical sense, I’m filing this issue in the aftermath of a real denial of service attack that actually took down a large application via a complexity attack on Python-Markdown. Now, obviously, you’re under no obligation to support my use case. You are free to decide that Python-Markdown is not designed for use on untrusted input. Part of the reason I’m filing this is that I’d like to know how these kinds of issues will be handled. Should I file more of them, or should I work on finding a Markdown library that better suits my requirements?
Context is important, and I think pointing this out is important when filing such an issue. I suspected near the end of my reply that a hanging parser might have been the direction you were heading in, which is why I also suggested the possibility of limiting the recursion. In general, you aren't going to get a performance boost unless you cache the nested brackets that you've already evaluated to avoid reprocessing them again later as you move down the list of `[` characters. By limiting the recursion depth, let's say to 10, instead of processing 13,000 nested `[`, the parser would stop after the first 10.
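To make the idea concrete, here is a minimal user-side sketch of that kind of cap (the helper names and the limit of 10 are assumptions; the actual suggestion is a limit inside the parser itself):

```python
import markdown

MAX_BRACKET_DEPTH = 10  # hypothetical cap, echoing the "let's say 10" above

def bracket_depth(text: str) -> int:
    """Maximum nesting depth of square brackets (ignores backslash escapes)."""
    depth = max_depth = 0
    for ch in text:
        if ch == '[':
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == ']':
            depth = max(depth - 1, 0)
    return max_depth

def render_capped(text: str) -> str:
    # Refuse pathologically nested input before it ever reaches the parser.
    if bracket_depth(text) > MAX_BRACKET_DEPTH:
        raise ValueError("bracket nesting too deep")
    return markdown.markdown(text)
```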
This project's goals are documented, and they are worth keeping in mind here.
In general, the way that any Markdown parser informs the document author that they provided bad syntax is by outputting an ill-formed document. In other words, Markdown always returns output. That said, we do no input sanitization. If input causes the parser to crash, we consider that a bug and will modify the parser to not crash. However, that is it. Performance is not a high priority (it is not even mentioned in the "goals"). If you are allowing your users to post large documents, then it is your responsibility to sanitize them properly (perhaps limit their size?). By the way, for a good review of the issues involved in avoiding XSS attacks via Markdown, I suggest reading Markdown and XSS by Michel Fortin (the developer of PHP Markdown). There is no Markdown parser which fully addresses that issue on its own. If you need to address this yourself, then I would think you should expect to address other attack vectors on your own as well. In other words, as a security concern, this is out of scope for any Markdown parser. Regardless, there are some parsers out there which put a higher priority on performance. But they don't have the low barrier to entry for creating extensions. We simply have chosen a different set of compromises. Only you can decide which set of compromises meets your needs.
I suppose it's noteworthy that in previous versions, we supported 6 levels of brackets. And that was because we used a regex which had six levels hardcoded into it. In over a decade, I don't recall ever getting a complaint about that limit. I suppose we could artificially limit the existing implementation to some arbitrary nesting level.
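For reference, the hardcoded six-level pattern looked roughly like this (reconstructed from memory of the pre-3.0 source; treat the exact form as approximate, not verbatim):

```python
import re

# Six levels of bracket nesting unrolled into one pattern, in the spirit of
# the pre-3.0 implementation described above.
NOBRACKET = r'[^\]\[]*'
BRK = (
    r'\[(' + (NOBRACKET + r'(\[') * 6
    + (NOBRACKET + r'\])*') * 6
    + NOBRACKET + r')\]'
)

m = re.compile(BRK).match('[a[b[c]d]e]')
print(m.group(1) if m else None)  # a[b[c]d]e
```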
It's an idea at least. I think the current algorithm is more robust than the regex was, but while it is nice to be able to support any amount of nesting, maybe endless nesting is a bit unrealistic 🤷‍♂️.
If you’re concerned about crashing with recursion errors, what are your thoughts on inputs that still trigger them? (Personally I’m somewhat less concerned by that, since it’s much easier to detect and recover from an explicit crash.)
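For what it's worth, an explicit crash is easy to contain at the call site; a minimal sketch, assuming the application wraps its own rendering call:

```python
import markdown

def render_or_placeholder(text: str) -> str:
    try:
        return markdown.markdown(text)
    except RecursionError:
        # A loud failure is easy to detect and recover from, unlike a
        # parser that silently spins for minutes on pathological input.
        return "<p>This content could not be rendered.</p>"
```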
There are areas where there may still be recursion, and possibly some areas that are hard for us to eliminate currently. And yes, with recursion you can get crashes when the limit is exceeded. Generally, we try to avoid crashes where possible. The point that was being made is that the new algorithm for handling nested `[` could be capped at some arbitrary depth. It would still allow nesting horizontally.
There also seem to be problems in the horizontal direction: |
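The original snippet isn't preserved in this copy of the thread; an input along these lines reproduces the horizontal case under discussion (hypothetical reconstruction, with the 20,000 figure taken from the later comments):

```python
import time
import markdown

text = '[]()' * 20000  # 20,000 side-by-side links with empty text and URL

start = time.perf_counter()
markdown.markdown(text)
print(round(time.perf_counter() - start, 2), "seconds")
```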
When I mentioned horizontal, I was referring to horizontal nesting, which does not perform as badly because it doesn't re-evaluate the same things as often.
Unfortunately, I don't agree that there are problems with this. This is simply evaluating 20,000 links. If you give the parser large enough data to parse, it's going to take a while. You can probably use almost any syntax with a big enough number to cause things to freeze up. At that point, maybe you should implement a timeout on your server, or limit the size of the input coming in.
One would think that careful design would avoid needing 20,000 ⋅ 20,000 operations to parse 20,000 links; there’s no fundamental reason that each link needs to do something with every other link. I’ll grant that perhaps the current design of the parser makes this difficult to avoid, but I think it’s wrong to dismiss the problem as “simply too much data”. The question is whether there’s interest in trying to solve it.
If there's a legitimate problem, then we'll of course fix it. We'll have to double-check what the algorithm is doing. If parsing time is exploding, it is most likely because the algorithm doesn't cut off parsing after the first link and ends up trying to parse the whole chain of links, which might be possible. There might be a case that is causing that.
Can you please file a separate issue for this bug? (FWIW …)
That's what I keep saying. This is not a priority for us. Personally, if you feed the parser such a ridiculous input, you deserve to have it break. And if the problem is that untrusted users are feeding the parser, then it is your responsibility to sanitize their input. Regarding the size limit: the parser holds the entire document in memory, so any plain text could break things just by being too big to fit in memory. So yeah, if you are accepting input from untrusted users, you need to have a size limit.
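A minimal sketch of the kind of size guard being suggested (the limit value is an arbitrary assumption, not something from the thread):

```python
import markdown

MAX_INPUT_CHARS = 50_000  # arbitrary cap; tune for your application

def render_limited(text: str) -> str:
    # Reject oversized untrusted input before it ever reaches the parser.
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError("input too large to render")
    return markdown.markdown(text)
```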
I briefly looked at this, as the claim that we were doing 20,000 × 20,000 work struck me as odd. While I won't rule out that something is getting called that expands the time, I put in a few print statements just to see what was happening. My worry was that we were maybe parsing the entire string every time because the link didn't match due to the brackets being empty; this does not seem to be the case. Here we print every time we perform a match, and we log each character evaluated when looking at the content for the link text.
Now, I know that Python-Markdown will then run the parser on the content of each element returned; I imagine that may be where time expands. The more elements you create, the more times you have to run the full inline processor. That is just how Python-Markdown currently does things. Now, there is no content in these elements, so maybe it is wasting time running patterns on empty strings, or maybe it skips empty content; I don't know, as I didn't bother to dig deeper than this. It doesn't appear we are parsing the string via the link patterns in a way that would cause the ballooning of time, so that at least pacifies my initial concern. If the time increase is only because we are recursively parsing element content, maybe we could do a simple check to test if there is any content to actually evaluate (if we aren't already doing that), but you'd be out of luck if the links did contain content. Again, this is an unrealistic case, generating 20,000 links like this. I guess if something specifically was identified as problematic, I think we'd be open to it, like "we can improve performance by X by changing code Y to Z". And pull requests are always welcome as well. Right now, I think this specific case is a low priority.
I'm willing to entertain any PR as long as it maintains backward compatibility. I'll make an exception to the backward compatibility rule if someone wants to submit a PR limiting the nesting to some arbitrary number. Regarding running patterns on empty content, I suspect the real problem there is the way we handle nesting. Specifically, we can't run a regex pattern across multiple parts of a string when those parts are stored in different elements in the ElementTree object. So we do some "magic" to combine them into a single string without losing the structure. That "magic" complicates the inline parsing by orders of magnitude and is the likely cause of many performance issues. Personally, I never liked the solution, but it is what we have. We discussed removing it for version 3.0 but ultimately decided not to, to maintain backward compatibility with existing extensions. And that is the compromise that we have chosen to make. We are one of the most extensible implementations, but not a performant implementation. If you want both, then you need a two-part implementation (a parser that outputs an abstract syntax tree (AST) and a renderer that renders that AST into HTML). Python-Markdown is not that, and turning it into that would require starting over from scratch. If you want that, go use Mistune.
@facelessuser I want to be clear: Python should be more than capable of doing 20,000 operations on empty or very short strings in a tiny fraction of a second. When I say it takes quadratic time, I mean that I measured the running time growing by roughly a factor of four every time the input length is doubled. The problem isn’t specific to empty text or empty URLs either; you can fill them with real content and observe the same slowdown.

You can observe some of this quadratic work here:

```diff
--- a/markdown/treeprocessors.py
+++ b/markdown/treeprocessors.py
@@ -310,6 +310,7 @@ class InlineProcessor(Treeprocessor):
             placeholder = self.__stashNode(node, pattern.type())
             if new_style:
+                print(len(data[:start]), len(placeholder), len(data[end:]))
                 return "%s%s%s" % (data[:start],
                                    placeholder, data[end:]), True, 0
             else:  # pragma: no cover
```

This gives 20,000 lines of output, summing to a total of about 9.13 · 20,000 · 20,000 bytes concatenated over the entire run.
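The cost pattern here is the classic one for repeated string rebuilding; a standalone sketch of the same effect (illustrative only, not Python-Markdown code):

```python
# Rebuilding an n-character string once per match costs O(n) each time,
# so n matches cost O(n^2) characters copied overall.
n = 20000
data = "x" * n
copied = 0
for i in range(n):
    data = data[:i] + "!" + data[i + 1:]  # full copy on every iteration
    copied += len(data)
print(copied)  # on the order of n * n
```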
@andersk the slowdown is most likely related to the issue in my last comment, not the patterns themselves. I know I've explained this before, but I can't find it, so here goes… Each inline processor returns an ElementTree object. And if a Markdown element spans multiple ElementTree elements, we can't find that with a single pattern match without some tricks. For example, consider this input with some nested elements:
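The original example was lost from this copy of the thread; something along these lines illustrates the situation (hypothetical input, shown with the standard-library ElementTree types that Python-Markdown builds on):

```python
import xml.etree.ElementTree as etree

# Hypothetical input: emphasis that spans a link.
#   *some [link](http://example.com) text*
#
# After the link pattern runs, the paragraph element looks roughly like this,
# with the remaining text split across three separate strings:
p = etree.Element('p')
p.text = '*some '              # string 1: before the <a> element
a = etree.SubElement(p, 'a', href='http://example.com')
a.text = 'link'                # string 2: inside the <a> element
a.tail = ' text*'              # string 3: after the <a> element

print(etree.tostring(p).decode())
# <p>*some <a href="http://example.com">link</a> text*</p>
```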
And now we need to run the emphasis pattern. But how do we do that when the text is split across 3 separate strings? The current solution takes the contents of the child elements, stashes them behind placeholders so that everything can be combined into a single string, runs the patterns over that string, and then swaps the stashed elements back in. You can argue till you're blue in the face that the parser should be faster, but unless you can provide a better (backwards compatible) solution to this problem, we are not going to move forward. And if you find this unacceptable, then I suggest looking at other libraries. That said, unfortunately, there are not many good options in Python-based Markdown parsers if performance is of high importance to you (see lepture/mistune#1).
@andersk, yes, I understood your original statement. The opening post is definitely related to the pattern to some extent, but I wanted to make sure the link pattern itself wasn't rescanning the entire string on every match. Other than that, I think @waylan has highlighted what the real issue most likely is.
I understand. I can accept “backwards compatibility with an API designed without realizing the complexity implications” as a good explanation, while vehemently rejecting “too much data”. Hopefully this issue will be considered in any future API redesigns, and if so, I will consider that a small victory. Meanwhile, there are some complexity issues that probably could be fixed without a major redesign.
Some of these live in individual regexes, which are even more potent because individual regex operations hold the GIL and can’t be interrupted, but they could be fixed with more careful regex design. When looking for delimited syntax, you need to make sure never to match multiple opening delimiters. For example:

```diff
-AUTOMAIL_RE = r'<([^> \!]*@[^> ]*)>'
+AUTOMAIL_RE = r'<([^<> \!]*@[^@<> ]*)>'
```
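A rough way to see the difference on a pathological input (illustrative timing sketch, not part of the original thread):

```python
import re
import time

OLD_AUTOMAIL_RE = r'<([^> \!]*@[^> ]*)>'
NEW_AUTOMAIL_RE = r'<([^<> \!]*@[^@<> ]*)>'

# A long run of opening delimiters with no valid autolink: the old pattern
# rescans toward the end of the string from every '<' (quadratic), while the
# new one refuses to cross another '<' and fails fast (linear).
text = '<' * 20000

for name, pattern in (("old", OLD_AUTOMAIL_RE), ("new", NEW_AUTOMAIL_RE)):
    start = time.perf_counter()
    re.findall(pattern, text)
    print(name, round(time.perf_counter() - start, 3), "seconds")
```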
See also #161, specifically this comment. I suspect that when we closed that we only addressed the first issue listed there and never got to the rest. |
Yeah, better patterns are always good.
In case it's useful to anyone, this regex also takes quadratic time: markdown/blockprocessors.py, lines 579 to 581 (at commit 421f1e8).