Streamed version of ber.decoder.decode #176
Conversation
Commits:
- Rewrite Decoder in terms of BytesIO
- BER Decoders implemented with BytesIO but for the most complex
- BER UniversalConstructedTypeDecoder in terms of BytesIO
- BER Decoder (stream-based) suggestion
- Fixed some of the failing tests
- Fixed several failed tests
- Fix all remaining tests but the non-implemented Any
- Implement untagged Any with back-seek
- Fix cer and der to work with streams
- Simplify unnecessary added complexity
- Make use of IOBase hierarchy (properly?) - in progress. Tests failing
- Fixed most failing tests, 1 remaining
- Several small optimizations
- Fix logging. Note: As we do not want to read the whole stream, explicit output of remaining bytes is not used.
- Rename and document utility functions for BER decoder
- Fixed inverted condition in BitStringDecoder.valueDecoder
- Fixed wrongly acquired fullPosition in AnyDecoder.indefLenValueDecoder
- Fixed logging None length
- endOfStream(BytesIO) working in 2.7
- Microoptimizations for endOfStream (not using it)
- Test for checking binary files as substrate
- Python 2.7 BytesIO wrapper for `file`s
- Refactor: keep API compatibility with original version
Hi @etingof, I added you to the collaborators of my fork, so you can edit directly if interested.
That looks awesome! Let me check out the code in depth.
I think it would make a lot of sense to run your performance tests against real BER blobs (referring to the pyasn1-modules tests). Have you succeeded in getting them to work? BTW, I've stashed a 4MB BER blob exactly for this purpose. I can't recall now what ASN.1 data structures it has inside. If you decide to try it out and it fails to decode, I can dig deeper - I think I managed to decode them, but it was very slow!
Luckily, they work. I benchmarked them just briefly. The total runtime of the tests (my only measure) was perhaps a bit lower, but well inside the std. dev. I can do more serious testing on Monday. The blob sounds perfect, we will see...
I've gone through a small portion of the patch; let me feed my kid and continue afterwards. In the meantime, I'm sending this incomplete review in case you have time to consider my thoughts.
pyasn1/codec/der/decoder.py
Outdated
@@ -4,10 +4,13 @@
# Copyright (c) 2005-2019, Ilya Etingof <[email protected]>
# License: http://snmplabs.com/pyasn1/license.html
#
from io import BytesIO
Note to self: we need to drop Python 2.5 support.
pyasn1/codec/ber/decoder.py
Outdated
elif isinstance(substrate, univ.OctetString):
    return BytesIO(substrate.asOctets())
try:
    if _PY2 and isinstance(substrate, file):
Arguably, an explicit check for the `file` type is not very Pythonic in nature. Perhaps the most usable API would be if we could consume any file-like object (e.g. a stream socket)...
The problem is that we can easily set an attribute on a "Pythonic" object like BufferedReader, but we cannot do so with the C-implemented `file` in Python 2.
As mentioned, there is a solution that passes the backtrack position in the options (and then does not need this), but it proved ~2 % slower in Python for small objects. Depends on your decision.
pyasn1/codec/ber/decoder.py
Outdated
    return BytesIO(substrate.asOctets())
try:
    if _PY2 and isinstance(substrate, file):
        return BytesIO(substrate.read())  # Not optimal for really large files
Now I think it makes perfect sense to use `file` as-is, because it's already a perfectly valid and buffered (!) stream.
This is correct but at the price of speed (as mentioned above).
Yeah... On the other hand, `read()` on a large file might eat up process memory... The good thing about streams is that they keep resource usage low.
I put a BufferedReader there instead. That should make it better.
I have made some more progress in my review. It's not over yet; however, I'd like to know your thoughts before I continue, because my further feedback would depend on them. ;-)
Thank you for the awesome work you've done so far!
pyasn1/codec/ber/decoder.py
Outdated
raise error.PyAsn1Error('Decoder not implemented for %s' % (tagSet,))

"""Decode value with fixed byte length.

If the decoder does not consume a precise byte length,
Is it important to disallow partial decoding here?
What if we have a long `SEQUENCE OF` encoded using the definite length algorithm - wouldn't it be more usable if, when partial substrate is given, we decoded as much as we can and yielded once we can't proceed any more?
I imagine this yielding feature could look like this:
- the decoder returns nothing when yielding
- the decoder seeks input stream back to where it has started reading this fragment
- the decoder accumulates partially decoded objects in its call stack
Python coroutine/generator abstraction seems suitable for such design.
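A minimal sketch of that pattern (hypothetical names, not actual pyasn1 code; assumes a seekable stream):

import io

def decode_item(stream, length):
    """Suspend (yield None) until `length` bytes are available, then yield them."""
    while True:
        chunk = stream.read(length)
        if chunk and len(chunk) == length:
            yield chunk  # complete fragment: hand it back
            return
        if chunk:  # partial read: unread it so we restart from the same offset
            stream.seek(-len(chunk), io.SEEK_CUR)
        yield None  # starved: suspend until the caller feeds more substrate

The caller would drive this generator in a loop, supplying more substrate between iterations.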
WDYT?
raise error.PyAsn1Error('Indefinite length mode decoder not implemented for %s' % (tagSet,))

"""Decode value with undefined length.

The decoder is allowed to consume as many bytes as necessary.
Just a quick note: this separation of value decoding functions was made to support two different encoding schemas - definite length encoding and indefinite length encoding. Originally, it was not so much about decoding behavior as about implementing slightly different encoding algorithms.
However, I am not certain that this separation is actually worth it.
Noting this just for the context.
raise error.PyAsn1Error('Indefinite length mode decoder not implemented for %s' % (tagSet,))

"""Decode value with undefined length.

The decoder is allowed to consume as many bytes as necessary.
What happens in the case when incomplete substrate is given on input? It has never been a requirement for the user to supply well-framed substrate. It is very well possible, especially in the indefinite length context.
An example use-case I can offer would be media streaming over the network - individual scalars (packed into the indefinite length container) are never guaranteed to arrive in full. Here is how BER indefinite length encoding looks for `OCTET STRING`:
<indef octetstring <def octetstring> <def octetstring> ... <def octetstring> <end-of-octets sentinel>>
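As a concrete illustration (hand-assembled here, not taken from the PR), the raw substrate for such a constructed, indefinite-length OCTET STRING could be:

substrate = bytes(bytearray([
    0x24, 0x80,              # OCTET STRING, constructed, indefinite length
    0x04, 0x02, 0xDE, 0xAD,  # definite-length chunk b'\xde\xad'
    0x04, 0x02, 0xBE, 0xEF,  # definite-length chunk b'\xbe\xef'
    0x00, 0x00,              # end-of-octets sentinel
]))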
So the decoder should consume whatever is complete, and hold off the rest till the missing pieces arrive. That's the reason why the definite length decoder returns unprocessed substrate.
Does my explanation make sense?
pyasn1/codec/ber/decoder.py
Outdated
raise TypeError("Cannot convert " + substrate.__class__.__name__ + " to seekable bit stream.") | ||
|
||
|
||
def endOfStream(substrate): |
If we call this function in the hot spots, it may become a drag. What if, instead of checking for the EOF condition, we just went ahead reading from the stream and threw/handled an exception if the EOF condition occurs?
The motivation is to save on function call(s) if we are going to do that too often...
The other potential benefit would come if we ever push unprocessed substrate back to the stream - in that case the EOF condition would become intermittent.
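A minimal sketch of that EAFP-style alternative (illustrative only; `SubstrateUnderrunError` does exist in pyasn1.error, but the surrounding helper is hypothetical):

from pyasn1 import error

def readByte(substrate):
    # Read eagerly and treat a short read as EOF, instead of probing
    # the stream with an endOfStream() helper on every iteration.
    octet = substrate.read(1)
    if not octet:  # b'' signals end of stream
        raise error.SubstrateUnderrunError('Premature end of substrate')
    return octet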
This method is called only in one place. You're right that we can probably remove it.
pyasn1/codec/ber/decoder.py
Outdated
raise error.PyAsn1Error('Decoder not implemented for %s' % (tagSet,))

"""Decode value with fixed byte length.

If the decoder does not consume a precise byte length,
My other thought is this: although it does not make much practical sense, I think it is not prohibited to mix definite and indefinite length encoded pieces. You can get a definite length encoded SEQUENCE with indefinite length encoded components inside. The other way round is typical - definite length items inside an indefinite length "container".
I am noting that to make sure we do not impose conflicting requirements on the `valueDecoder`/`indefValueDecoder` functions, keeping in mind that one can (indirectly) call the other and vice versa.
I hope we are not imposing conflicting requirements. Are we?
pyasn1/codec/ber/decoder.py
Outdated
if trailingBits > 7:
    raise error.PyAsn1Error(
        'Trailing bits overflow %s' % trailingBits
    )

value = self.protoComponent.fromOctetString(
-   head[1:], internalFormat=True, padding=trailingBits)
+   substrate.read(length - 1), internalFormat=True, padding=trailingBits)
Regarding the yielding idea on the substrate underrun condition: maybe we could do something like this here and everywhere a read is concerned:
theBytes = substrate.read(length - 1)
if theBytes is None:  # indicates data starvation in O_NONBLOCK mode
    yield  # yields control back to the caller; the caller can resume from here
    theBytes = substrate.read(length - 1)  # retry once resumed
if len(theBytes) < length - 1:
    substrate.seek(-len(theBytes), SEEK_CUR)  # unread incomplete input
    yield
# otherwise go ahead
The overall behavior would depend on the underlying file descriptor mode (blocking vs non-blocking) which is totally in the hands of the caller. What we could probably do is to support both cases internally e.g.:
Blocking mode:
- blocking on read(count)
- yielding when read count is less than desired
- yielding when end-of-stream is not yet reached
Non-blocking mode:
- yielding on input starvation
- yielding when read count is less than desired
- yielding when end-of-stream is not yet reached
At the high level, the decoding would look like this:
for asn1Object in decoder(stream):
    if asn1Object is None:
        # incomplete input, wait for it (e.g. via select(stream.fileno)) and retry
        continue
    print('received ', asn1Object)
Where `decoder(stream)` would produce an iterable (generator) which would emit ASN.1 structures till the input stream is exhausted.
If we go this way, I hope we could still preserve backward compatibility in the `decode` function, while `decodeStream` would exhibit all these shiny new features...
In the next major release we could move `decodeStream` to `decode`, as you've suggested.
WDYT?
I must admit that I am not so versed in coroutines, generators, etc. I cannot imagine how this would work or be implemented. If you can do this, I will gladly observe and learn.
Are you using …
Unfortunately, it won't parse without the specs :-( Do you think you would be able to find them? Thanks.
While looking up the spec, I ran into another stress testing ground. This link leads you to a collection of up to 43MB blobs of CRLs. They should be decodable with this script:
Codecov Report
@@            Coverage Diff             @@
##           master     #176      +/-   ##
==========================================
- Coverage   85.68%   85.49%   -0.19%
==========================================
  Files          25       26       +1
  Lines        4324     4662     +338
==========================================
+ Hits         3705     3986     +281
- Misses        619      676      +57
Continue to review full report at Codecov.
The streamed version takes ~2 minutes, the non-streamed one 1 hour, which is consistent with my expectation of roughly quadratic behaviour in size (the old decoder re-slices the remaining substrate at every step, so each step copies O(n) bytes).
Force-pushed from 02660d0 to e27f971.
In e27f971, I am adding a wrapper around non-seekable streams. It makes them slower (it added ~10 % to execution time when wrapping regular files that do not need this functionality) but at least working (the only non-seekable stream I can use is a file read directly from a ZIP archive in Python 2.7, and that passes the test). Coupled with wrapping old …
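For context, a rough sketch of what such a seek-back wrapper could look like (illustrative; not necessarily the PR's actual implementation):

import io

class SeekBackWrapper(io.IOBase):
    """Buffer everything read from a non-seekable stream so the
    decoder can seek backwards within the already-consumed region."""

    def __init__(self, raw):
        self._raw = raw
        self._cache = io.BytesIO()

    def read(self, size=-1):
        data = self._cache.read(size)
        if size < 0 or len(data) < size:
            # Cache exhausted: pull the rest from the underlying stream
            # and remember it for future seek-backs.
            extra = self._raw.read(-1 if size < 0 else size - len(data))
            self._cache.write(extra)
            data += extra
        return data

    def seek(self, offset, whence=io.SEEK_SET):
        return self._cache.seek(offset, whence)  # valid within the cached region only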
Hats off!
The goal of this change is to make the decoder stop on input data starvation and resume from where it stopped whenever the caller decides to try again (hopefully making sure that some more input becomes available). This change makes it possible for the decoder to operate on streams of data (meaning that the entire DER blob might not be immediately available on input). On top of that, the decoder yields the partially reconstructed ASN.1 object on input starvation, making it possible for the caller to inspect what has been decoded so far and possibly consume partial ASN.1 data. All these new features are natively available through the `StreamingDecoder` class. The previously published API is implemented as a thin wrapper on top of that, ensuring backward compatibility.
Co-Authored-By: Jan Pipek <[email protected]>
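Based on that description, usage might look roughly like this (a sketch; exact signatures may differ from the merged code):

from pyasn1 import error
from pyasn1.codec.ber.decoder import StreamingDecoder

with open('data.ber', 'rb') as stream:
    for asn1Object in StreamingDecoder(stream):
        if isinstance(asn1Object, error.SubstrateUnderrunError):
            continue  # input starvation: wait for more data, then retry
        print(asn1Object.prettyPrint())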
Try to reuse `SingleItemDecoder` object to leverage its caches.
…asn1 into support-incomplete-substrate
Make it look more uniform and easier to override if needed.
This change should simplify decoder specialization by means of parameterization in addition to subclassing.
Turn BER decoder into a suspendible generator
Thank you!
Nice, thank you. I hope I will get some time to have another look at this on Monday. Is there anything in particular still necessary regarding this MR?
I am going to merge it; otherwise merging other pending changes would be a hassle going forward. You are very welcome to take another look and push another PR if you find something to improve there.
Nice.
(An updated version of #175):

Changes in API:

Thus, no API should break unless the author decides so (which I suggest as another issue).
I tested heavily with Python 2.7 and 3.7; CI should take care of the rest. Tests with `open()` results added. If any other kind of stream is supposed to be used, it can (and should) be added.

Performance: `decode` is ~5 % slower.

Note: As the `file` object in Python 2.7 does not allow direct setting of attributes, it is first read into memory in this version. A trivial patch that slows the overall performance by 2-3 % but avoids pre-reading the file into memory can be provided.

What was considered as well (but did not work / was not performing enough):

Please, let me know what changes are still to be made.
Thanks
Jan