
Streamed version of ber.decoder.decode #176

Merged: 48 commits merged into master on Nov 23, 2019

Conversation

janpipek (Contributor) commented Sep 6, 2019

(An updated version of #175.)

I implemented the functionality so that instead of copying strings, the decoder passes a BytesIO stream wherever possible. This should solve issue #174.

Changes in API:

  • decodeStream in the modules accepts a stream directly
  • decode behaves as it did before, constructing a new BytesIO object and passing it to decodeStream

Thus, no API should break unless the author decides otherwise (which I suggest handling as a separate issue).
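A minimal sketch of that call relationship, with simplified signatures (the real decodeStream accepts more parameters than shown here):

```python
from io import BytesIO

def decode(substrate, asn1Spec=None, **options):
    # Backward-compatible entry point: wrap the bytes in an in-memory
    # stream and delegate to the stream-based implementation.
    return decodeStream(BytesIO(substrate), asn1Spec=asn1Spec, **options)
```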

I tested heavily with Python 2.7 and 3.7; CI should take care of the rest.

  • a test for open() results was added. If any other kind of stream is supposed to be used, it can (and should) be added.

Performance:

  • for the tiniest objects (like "false"), the new decode is ~5% slower.
  • for sample objects of 10-50 bytes, it's ~0-2% faster.
  • for a 2MB file containing many objects, it's ~5 times faster.

Note: As the file object in Python 2.7 does not allow setting attributes directly, it is first read into memory in this version. A trivial patch that avoids pre-reading the file into memory, at the cost of slowing overall performance by 2-3%, can be provided.

What was considered as well (but did not work or did not perform well enough):

  • using memoryview (this could potentially work with bytes if the code were refactored substantially)
  • wrapping the input in a custom stream-like object that keeps the content and position: the cost of creating objects and accessing attributes was too high, and it still did not cover all streams.

Please let me know what changes still need to be made.
Thanks,
Jan

Rewrite Decoder in terms of BytesIO

BER Decoders implemented with BytesIO but for the most complex

BER UniversalConstructedTypeDecoder in terms of BytesIO

BER Decoder (stream-based) suggestion

Fixed some of the failing tests

Fixed several failed tests

Fix all remaining tests but the non-implemented Any

Implement untagged Any with back-seek

Fix cer and der to work with streams

Simplify unnecessary added complexity

Make use of IOBase hierarchy (properly?) - in progress

Tests failing

Fixed most failing tests

1 remaining

Several small optimizations

Fix logging

Note: As we do not want to read the whole stream, explicit output of remaining bytes is not used.

Rename and document utility functions for BER decoder

Fixed inverted condition in BitStringDecoder.valueDecoder

Fixed wrongly acquired fullPosition in AnyDecoder.indefLenValueDecoder

Fixed logging None length

endOfStream(BytesIO) working in 2.7

Microoptimizations for endOfStream (not using it)

Test for checking binary files as substrate

Python 2.7 BytesIO wrapper for `file`s

Refactor to keep API compatibility with the original version
janpipek changed the title from "Streamed version of ber.decoder.decode" to "WIP: Streamed version of ber.decoder.decode" on Sep 6, 2019
janpipek changed the title from "WIP: Streamed version of ber.decoder.decode" to "Streamed version of ber.decoder.decode" on Sep 6, 2019
janpipek (Contributor, Author) commented Sep 6, 2019

Hi @etingof, I added you to the collaborators of my fork, so you can edit directly if interested.

etingof (Owner) commented Sep 6, 2019

> Hi @etingof, I added you to the collaborators of my fork, so you can edit directly if interested.

That looks awesome! Let me check out the code in depth.

etingof (Owner) commented Sep 6, 2019

I think it would make a lot of sense to run your performance tests against real BER blobs (referring to the pyasn1-modules tests). Have you succeeded in getting them to work?

BTW, I've stashed a 4MB BER blob exactly for this purpose. I can't recall now what ASN.1 data structures it has inside. If you decide to try it out and it fails to decode, I can dig deeper; I think I managed to decode it, but it was very slow!

janpipek (Contributor, Author) commented Sep 6, 2019

Luckily, they work. I benchmarked them only briefly. The total runtime of the tests (my only measure) was perhaps a bit lower, but well inside the standard deviation. I can do more serious testing on Monday. The blob sounds perfect; we will see...

etingof (Owner) left a review comment

I've gone through a small portion of the patch; let me feed my kid and continue afterwards. In the meantime, I'm sending this incomplete review in case you have time to consider my thoughts.

> @@ -4,10 +4,13 @@
> # Copyright (c) 2005-2019, Ilya Etingof <[email protected]>
> # License: http://snmplabs.com/pyasn1/license.html
> #
> from io import BytesIO

etingof (Owner): Note to self: we need to drop Python 2.5 support.

> elif isinstance(substrate, univ.OctetString):
>     return BytesIO(substrate.asOctets())
> try:
>     if _PY2 and isinstance(substrate, file):

etingof (Owner): Arguably, an explicit check for the file type is not very Pythonic. Perhaps the most usable API would be one that could consume any file-like object (e.g. a stream socket)...

janpipek (Contributor, Author): The problem is that we can easily set an attribute on a "Pythonic" object like BufferedReader, but we cannot do so with the C-implemented file in Python 2.

As mentioned, there is a solution that passes the backtrack position in the options (and then does not need this), but it proved ~2% slower for small objects. It depends on your decision.

>     return BytesIO(substrate.asOctets())
> try:
>     if _PY2 and isinstance(substrate, file):
>         return BytesIO(substrate.read())  # Not optimal for really large files

etingof (Owner) commented Sep 6, 2019: Now I think it makes perfect sense to use file as-is, because it's already a perfectly valid and buffered (!) stream.

janpipek (Contributor, Author): This is correct, but at the price of speed (as mentioned above).

etingof (Owner): Yeah... On the other hand, read() on a large file might eat up process memory... The good thing about streams is that they keep resource usage low.

janpipek (Contributor, Author): I put BufferedReader there instead. That should make it better.
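For reference, one way that wrapping can be done in Python 2 (a sketch only; the PR's actual approach may differ). io.BufferedReader requires a raw I/O object, so the file's descriptor is re-wrapped through the io module:

```python
import io

# Python 2 `file` objects are C-implemented and reject new attributes,
# so wrap the underlying descriptor in the io hierarchy instead:
raw = io.FileIO(substrate.fileno(), closefd=False)
substrate = io.BufferedReader(raw)
```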

etingof (Owner) left a review comment

I have made some more progress in my review. It's not over yet; however, I'd like to know your thoughts before I continue, because my further feedback will depend on them. ;-)

Thank you for the awesome work you've done so far!

> raise error.PyAsn1Error('Decoder not implemented for %s' % (tagSet,))
> """Decode value with fixed byte length.
>
> If the decoder does not consume a precise byte length,

etingof (Owner): Is it important to disallow partial decoding here?

What if we have a long SEQUENCE OF encoded using the definite length algorithm: would it not be more usable if, when partial substrate is given, we decoded as much as possible and yielded once we can't proceed any further?

I imagine this yielding feature could look like this:

  • the decoder returns nothing when yielding
  • the decoder seeks input stream back to where it has started reading this fragment
  • the decoder accumulates partially decoded objects in its call stack

Python coroutine/generator abstraction seems suitable for such design.

WDYT?
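A minimal sketch of the kind of suspendible decoder being described (entirely illustrative; real BER tag/length handling is far more general than this short-form-only example):

```python
import io

def decode_fragment(stream):
    # Generator-based decoder sketch: on input starvation it seeks the
    # stream back to where it started reading this fragment and yields
    # None, so the caller can supply more data and resume.
    while True:
        start = stream.tell()
        header = stream.read(2)  # tag octet + short-form length octet

        if len(header) < 2:
            stream.seek(start)  # unread the incomplete header
            yield None          # hand control back to the caller
            continue

        tag, length = header[0], header[1]
        value = stream.read(length)

        if len(value) < length:
            stream.seek(start)  # unread everything read for this fragment
            yield None
            continue

        yield (tag, value)      # one complete TLV decoded

buf = io.BytesIO(b"\x01\x01\xff")  # a BER BOOLEAN TRUE
for item in decode_fragment(buf):
    if item is None:
        break  # in real use: wait for more input, then resume the generator
    print(item)
```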

> raise error.PyAsn1Error('Indefinite length mode decoder not implemented for %s' % (tagSet,))
> """Decode value with undefined length.
>
> The decoder is allowed to consume as many bytes as necessary.

etingof (Owner):

Just a quick note: this separation of value decoding functions was made to support the two different encoding schemas, definite length and indefinite length encoding. Originally it was not so much about decoding behavior as about implementing slightly different encoding algorithms.

However, I am not certain that this separation is actually worth it.

Noting this just for context.

> raise error.PyAsn1Error('Indefinite length mode decoder not implemented for %s' % (tagSet,))
> """Decode value with undefined length.
>
> The decoder is allowed to consume as many bytes as necessary.

etingof (Owner):
What happens when an incomplete substrate is given on input? It has never been a requirement that the user supply well-framed substrate, and incomplete input is very possible, especially in the indefinite length context.

An example use case I can offer is media streaming over the network: individual scalars (packed into the indefinite length container) are never guaranteed to arrive in full. Here is what BER indefinite length encoding looks like for OCTET STRING:

<indef octetstring <def octetstring> <def octetstring> ... <def octetstring> <end-of-octets sentinel>>

So the decoder should consume whatever is complete, and hold off on the rest until the missing pieces arrive. That's the reason the definite length decoder returns unprocessed substrate.

Does my explanation make sense?
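For concreteness, a sketch of that framing in raw BER bytes (the tag and length values are standard BER; the payload string is an invented example):

```python
# Indefinite-length (constructed) OCTET STRING carrying "hi!" as two
# definite-length chunks, followed by the end-of-octets sentinel:
indef = bytes([
    0x24, 0x80,              # OCTET STRING, constructed, indefinite length
    0x04, 0x02, 0x68, 0x69,  # definite-length chunk: b"hi"
    0x04, 0x01, 0x21,        # definite-length chunk: b"!"
    0x00, 0x00,              # end-of-octets sentinel
])
```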

> raise TypeError("Cannot convert " + substrate.__class__.__name__ + " to seekable bit stream.")
>
>
> def endOfStream(substrate):

etingof (Owner):
If we call this function in hot spots, it may become a drag. What if, instead of checking for the EOF condition, we just went ahead reading from the stream and threw/handled an exception when EOF occurs?

The motivation is to save on function calls if we are going to do that too often...

The other potential benefit would appear if we ever push unprocessed substrate back to the stream; in that case the EOF condition would become intermittent.
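A small sketch contrasting the two styles (the surrounding calls are illustrative; SubstrateUnderrunError is assumed to be pyasn1's existing error type):

```python
# Look-before-you-leap: an extra function call on every read
if endOfStream(substrate):
    raise error.SubstrateUnderrunError('Short substrate')
octet = substrate.read(1)

# Read-first alternative: pay only on the (rare) short read
octet = substrate.read(1)
if not octet:
    raise error.SubstrateUnderrunError('Short substrate')
```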

janpipek (Contributor, Author): This method is called in only one place. You're right that we can probably remove it.

> raise error.PyAsn1Error('Decoder not implemented for %s' % (tagSet,))
> """Decode value with fixed byte length.
>
> If the decoder does not consume a precise byte length,

etingof (Owner):

My other thought is this: although it does not make much practical sense, I don't think it is prohibited to mix definite and indefinite length encoded pieces. You can get a definite length SEQUENCE with indefinite length components inside. The other way around is typical: definite length items inside an indefinite length "container".

I am noting this to make sure we do not impose conflicting requirements on the valueDecoder / indefLenValueDecoder functions, keeping in mind that one can (indirectly) call the other and vice versa.

janpipek (Contributor, Author): I hope we are not imposing conflicting requirements. Are we?

> if trailingBits > 7:
>     raise error.PyAsn1Error(
>         'Trailing bits overflow %s' % trailingBits
>     )
>
> value = self.protoComponent.fromOctetString(
> -    head[1:], internalFormat=True, padding=trailingBits)
> +    substrate.read(length - 1), internalFormat=True, padding=trailingBits)

etingof (Owner):

Regarding the yielding idea on the substrate underrun condition: maybe we could do something like this here, and everywhere a read is concerned:

```python
from io import SEEK_CUR

theBytes = substrate.read(length - 1)

while theBytes is None or len(theBytes) < length - 1:
    if theBytes is not None:
        substrate.seek(-len(theBytes), SEEK_CUR)  # unread incomplete input
    # None indicates data starvation in O_NONBLOCK mode; either way, yield
    # control back to the caller, which can resume the generator later
    yield
    theBytes = substrate.read(length - 1)  # retry once resumed

# otherwise go ahead
```

The overall behavior would depend on the underlying file descriptor mode (blocking vs. non-blocking), which is entirely in the hands of the caller. What we could probably do is support both cases internally, e.g.:

Blocking mode:

  • blocking on read(count)
  • yielding when read count is less than desired
  • yielding when end-of-stream is not yet reached

Non-blocking mode:

  • yielding on input starvation
  • yielding when read count is less than desired
  • yielding when end-of-stream is not yet reached

At the high level, the decoding would look like this:

```python
for asn1Object in decoder(stream):
    if asn1Object is None:
        # incomplete input: wait for more (e.g. via select(stream.fileno()))
        # and retry
        continue
    print('received', asn1Object)
```

Here decoder(stream) would produce an iterable (generator) emitting ASN.1 structures until the input stream is exhausted.

If we go this way, I hope we can still preserve backward compatibility in the decode function, while decodeStream would exhibit all these shiny new features...

In the next major release we could promote decodeStream to decode, as you've suggested.

WDYT?

janpipek (Contributor, Author): I must admit that I am not so well versed in coroutines, generators, etc., and cannot quite imagine how this would work or be implemented. If you can do it, I will gladly observe and learn.

etingof (Owner) commented Sep 7, 2019

> Luckily, they work. I benchmarked them only briefly. The total runtime of the tests (my only measure) was perhaps a bit lower, but well inside the standard deviation. I can do more serious testing on Monday.

Are you using the decode() interface or decodeStream() in your testing? I am curious how it would perform if you fed decodeStream() a real file (i.e. a native, disk-backed stream). With the BytesIO layer, everything is in memory at all times, which might be faster for smaller files but may become a limiting factor once the input does not fit in available memory.

janpipek (Contributor, Author):

> BTW, I've stashed a 4MB BER blob exactly for this purpose. I can't recall now what ASN.1 data structures it has inside. If you decide to try it out and it fails to decode, I can dig deeper; I think I managed to decode it, but it was very slow!

Unfortunately, it won't parse without the specs :-( Do you think you would be able to find them? Thanks.

etingof (Owner) commented Sep 10, 2019

> Unfortunately, it won't parse without the specs :-( Do you think you would be able to find them?

While looking up the spec, I ran into another stress-testing ground. This link leads to a collection of CRL blobs of up to 43MB. They should be decodable with this script:

```python
import sys

from pyasn1_modules import rfc2459
from pyasn1.codec.der import decoder

# Expect the CRL file (in DER form) as the single command-line argument
if len(sys.argv) != 2:
    print("""Usage:
$ %s crl.der""" % sys.argv[0])
    sys.exit(1)

asn1Spec = rfc2459.CertificateList()

substrate = open(sys.argv[1], 'rb').read()

key, substrate = decoder.decode(substrate, asn1Spec=asn1Spec)

print(key.prettyPrint())
```

codecov-io commented Sep 10, 2019

Codecov Report

Merging #176 into master will decrease coverage by 0.18%. The diff coverage is 86.74%.

```
@@            Coverage Diff             @@
##           master     #176      +/-   ##
==========================================
- Coverage   85.68%   85.49%   -0.19%
==========================================
  Files          25       26       +1
  Lines        4324     4662     +338
==========================================
+ Hits         3705     3986     +281
- Misses        619      676      +57
```
| Impacted Files | Coverage Δ |
| --- | --- |
| pyasn1/error.py | 100% <100%> (ø) ⬆️ |
| pyasn1/codec/cer/encoder.py | 86.01% <100%> (+0.4%) ⬆️ |
| pyasn1/codec/der/encoder.py | 83.87% <100%> (+2.38%) ⬆️ |
| pyasn1/codec/native/encoder.py | 85% <81.81%> (-0.87%) ⬇️ |
| pyasn1/codec/ber/decoder.py | 84.26% <83.52%> (-1.8%) ⬇️ |
| pyasn1/codec/ber/encoder.py | 84.11% <90.9%> (+0.08%) ⬆️ |
| pyasn1/codec/streaming.py | 91.3% <91.3%> (ø) |
| pyasn1/codec/cer/decoder.py | 93.18% <93.1%> (-0.57%) ⬇️ |
| pyasn1/codec/der/decoder.py | 96.42% <95.23%> (+1.69%) ⬆️ |
| pyasn1/codec/native/decoder.py | 86.48% <96.15%> (+1.63%) ⬆️ |
| ... and 4 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f10434e...cda318a.

janpipek (Contributor, Author) commented Sep 10, 2019

> While looking up the spec, I ran into another stress-testing ground. This link leads to a collection of CRL blobs of up to 43MB. They should be decodable with this script:

The streamed version takes ~2 minutes, the non-streamed one ~1 hour, which is consistent with my expectation of roughly quadratic behaviour in input size.

janpipek (Contributor, Author) commented Sep 10, 2019

In e27f971, I am adding a wrapper around non-seekable streams. It makes them slower (adding ~10% to execution time when wrapping regular files that do not need this functionality) but at least working (the only non-seekable stream I could find is a file read directly from a ZIP archive in Python 2.7, and it passes the test).

Coupled with wrapping old files in BufferedReader rather than BytesIO, we should be getting close to the limit of optimization and genericity.
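A minimal sketch of what such a caching wrapper could look like (class name and details are illustrative; the actual implementation lives in the PR):

```python
import io

class CachingStreamWrapper(io.IOBase):
    """Sketch of a seekable wrapper over a non-seekable stream.

    Everything read from the underlying stream is mirrored into a
    BytesIO cache, so the decoder can seek backwards within data it
    has already consumed.
    """

    def __init__(self, raw):
        self._raw = raw
        self._cache = io.BytesIO()

    def tell(self):
        return self._cache.tell()

    def seek(self, offset, whence=io.SEEK_SET):
        # Only positions within already-cached data can be reached.
        return self._cache.seek(offset, whence)

    def read(self, n=-1):
        data = self._cache.read(n)
        missing = -1 if n < 0 else n - len(data)
        if missing:
            # Cache miss: pull the remainder from the raw stream and
            # append it to the cache (position ends just past it).
            extra = self._raw.read(missing)
            if extra:
                self._cache.write(extra)
                data += extra
        return data
```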

janpipek (Contributor, Author): Fun fact: the code base dates back to 2005, not very Pythonic times at all!

Hats off!

janpipek and others added 23 commits November 15, 2019 19:31
The goal of this change is to make the decoder stop on input data starvation and resume from where it stopped whenever the caller decides to try again (hopefully after making sure that some more input has become available).

This change makes it possible for the decoder to operate on streams of data (meaning that the entire DER blob might not be immediately available on input).

On top of that, the decoder yields the partially reconstructed ASN.1 object on input starvation, making it possible for the caller to inspect what has been decoded so far and possibly consume partial ASN.1 data.

All these new features are natively available through the `StreamingDecoder` class (a usage sketch follows the commit list below). The previously published API is implemented as a thin wrapper on top of it, ensuring backward compatibility.
Try to reuse the `SingleItemDecoder` object to leverage its caches.
Make it look more uniform and easier to override if needed.
This change should simplify decoder specialization by means
of parameterization in addition to subclassing.
Turn BER decoder into a suspendible generator
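Based on the commit message above, `StreamingDecoder` usage would look something like the following sketch (exact constructor parameters and starvation signaling may differ from the merged code):

```python
from io import BytesIO

from pyasn1.codec.ber import decoder

stream = BytesIO(b'\x01\x01\xff')  # a BER-encoded BOOLEAN TRUE

for asn1Object in decoder.StreamingDecoder(stream):
    # On input starvation the decoder would yield a marker object
    # instead of blocking; with complete input we get decoded values.
    print(asn1Object)
```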
etingof (Owner) commented Nov 23, 2019

Thank you!

janpipek (Contributor, Author): Nice, thank you. I hope I will get some time to take another look at this on Monday. Is there anything in particular still needed for this MR?

etingof (Owner) commented Nov 23, 2019

I am going to merge it; otherwise, merging other pending changes would become a hassle going forward.

You are very welcome to take another look and push another PR if you find something to improve there.

etingof merged commit 96b0a77 into etingof:master on Nov 23, 2019
janpipek (Contributor, Author): Nice.
