
Streamed version of ber.decoder.decode #176

Merged: 48 commits merged into master on Nov 23, 2019

Conversation

janpipek (Contributor) commented Sep 6, 2019

(An updated version of #175.)

I implemented the functionality so that instead of copying strings, the decoder passes a BytesIO stream wherever possible. This should solve issue #174.

Changes in API:

  • decodeStream in the modules accepts a stream directly
  • decode behaves as it did before, constructing a new BytesIO object and passing it to decodeStream

Thus, no API should break unless the author decides otherwise (which I suggest handling as a separate issue).
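A minimal sketch of that call relationship, with simplified signatures (the real decodeStream accepts more parameters than shown here):

```python
from io import BytesIO

def decode(substrate, asn1Spec=None, **options):
    # Backward-compatible entry point: wrap the bytes in an in-memory
    # stream and delegate to the stream-based implementation.
    return decodeStream(BytesIO(substrate), asn1Spec=asn1Spec, **options)
```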

I tested heavily with Python 2.7 and 3.7; CI should take care of the rest.

  • a test for open() results was added. If any other kind of stream is supposed to be used, it can (and should) be added.

Performance:

  • for the tiniest objects (like "false"), the new decode is ~5% slower.
  • for sample objects of 10-50 bytes, it's ~0-2% faster.
  • for a 2MB file containing many objects, it's ~5 times faster.

Note: As the file object in Python 2.7 does not allow setting attributes directly, it is first read into memory in this version. A trivial patch that avoids pre-reading the file into memory, at the cost of slowing overall performance by 2-3%, can be provided.

What was considered as well (but did not work or did not perform well enough):

  • using memoryview (this could potentially work with bytes if the code were refactored substantially)
  • wrapping the input in a custom stream-like object that keeps the content and position: the cost of creating objects and accessing attributes was too high, and it still did not cover all streams.

Please let me know what changes still need to be made.
Thanks,
Jan

Rewrite Decoder in terms of BytesIO

BER Decoders implemented with BytesIO but for the most complex

BER UniversalConstructedTypeDecoder in terms of BytesIO

BER Decoder (stream-based) suggestion

Fixed some of the failing tests

Fixed several failed tests

Fix all remaining tests but the non-implemented Any

Implement untagged Any with back-seek

Fix cer and der to work with streams

Simplify unnecessary added complexity

Make use of IOBase hierarchy (properly?) - in progress

Tests failing

Fixed most failing tests

1 remaining

Several small optimizations

Fix logging

Note: As we do not want to read the whole stream, explicit output of remaining bytes is not used.

Rename and document utility functions for BER decoder

Fixed inverted condition in BitStringDecoder.valueDecoder

Fixed wrongly acquired fullPosition in AnyDecoder.indefLenValueDecoder

Fixed logging None length

endOfStream(BytesIO) working in 2.7

Microoptimizations for endOfStream (not using it)

Test for checking binary files as substrate

Python 2.7 BytesIO wrapper for `file`s

Refactor to keep API compatibility with the original version
janpipek changed the title from "Streamed version of ber.decoder.decode" to "WIP: Streamed version of ber.decoder.decode" on Sep 6, 2019
janpipek changed the title from "WIP: Streamed version of ber.decoder.decode" to "Streamed version of ber.decoder.decode" on Sep 6, 2019
janpipek (Contributor, Author) commented Sep 6, 2019

Hi @etingof, I added you to the collaborators of my fork, so you can edit directly if interested.

etingof (Owner) commented Sep 6, 2019

> Hi @etingof, I added you to the collaborators of my fork, so you can edit directly if interested.

That looks awesome! Let me check out the code in depth.

etingof (Owner) commented Sep 6, 2019

I think it would make a lot of sense to run your performance tests against real BER blobs (referring to the pyasn1-modules tests). Have you succeeded in getting them to work?

BTW, I've stashed a 4MB BER blob exactly for this purpose. I can't recall now what ASN.1 data structures it has inside. If you decide to try it out and it fails to decode, I can dig deeper; I think I managed to decode it, but it was very slow!

janpipek (Contributor, Author) commented Sep 6, 2019

Luckily, they work. I benchmarked them only briefly. The total runtime of the tests (my only measure) was perhaps a bit lower, but well inside the standard deviation. I can do more serious testing on Monday. The blob sounds perfect; we will see...

etingof (Owner) left a review comment

I've gone through a small portion of the patch; let me feed my kid and continue afterwards. In the meantime, I'm sending this incomplete review in case you have time to consider my thoughts.

> @@ -4,10 +4,13 @@
> # Copyright (c) 2005-2019, Ilya Etingof <[email protected]>
> # License: http://snmplabs.com/pyasn1/license.html
> #
> from io import BytesIO

etingof (Owner): Note to self: we need to drop Python 2.5 support.

> elif isinstance(substrate, univ.OctetString):
>     return BytesIO(substrate.asOctets())
> try:
>     if _PY2 and isinstance(substrate, file):

etingof (Owner): Arguably, an explicit check for the file type is not very Pythonic. Perhaps the most usable API would be one that could consume any file-like object (e.g. a stream socket)...

janpipek (Contributor, Author): The problem is that we can easily set an attribute on a "Pythonic" object like BufferedReader, but we cannot do so with the C-implemented file in Python 2.

As mentioned, there is a solution that passes the backtrack position in the options (and then does not need this), but it proved ~2% slower for small objects. It depends on your decision.

>     return BytesIO(substrate.asOctets())
> try:
>     if _PY2 and isinstance(substrate, file):
>         return BytesIO(substrate.read())  # Not optimal for really large files

etingof (Owner) commented Sep 6, 2019: Now I think it makes perfect sense to use file as-is, because it's already a perfectly valid and buffered (!) stream.

janpipek (Contributor, Author): This is correct, but at the price of speed (as mentioned above).

etingof (Owner): Yeah... On the other hand, read() on a large file might eat up process memory... The good thing about streams is that they keep resource usage low.

janpipek (Contributor, Author): I put BufferedReader there instead. That should make it better.
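For reference, one way that wrapping can be done in Python 2 (a sketch only; the PR's actual approach may differ). io.BufferedReader requires a raw I/O object, so the file's descriptor is re-wrapped through the io module:

```python
import io

# Python 2 `file` objects are C-implemented and reject new attributes,
# so wrap the underlying descriptor in the io hierarchy instead:
raw = io.FileIO(substrate.fileno(), closefd=False)
substrate = io.BufferedReader(raw)
```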

etingof (Owner) left a review comment

I have made some more progress in my review. It's not over yet; however, I'd like to know your thoughts before I continue, because my further feedback will depend on them. ;-)

Thank you for the awesome work you've done so far!

> raise error.PyAsn1Error('Decoder not implemented for %s' % (tagSet,))
> """Decode value with fixed byte length.
>
> If the decoder does not consume a precise byte length,

etingof (Owner): Is it important to disallow partial decoding here?

What if we have a long SEQUENCE OF encoded using the definite length algorithm: would it not be more usable if, when partial substrate is given, we decoded as much as possible and yielded once we can't proceed any further?

I imagine this yielding feature could look like this:

  • the decoder returns nothing when yielding
  • the decoder seeks input stream back to where it has started reading this fragment
  • the decoder accumulates partially decoded objects in its call stack

Python coroutine/generator abstraction seems suitable for such design.

WDYT?
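A minimal sketch of the kind of suspendible decoder being described (entirely illustrative; real BER tag/length handling is far more general than this short-form-only example):

```python
import io

def decode_fragment(stream):
    # Generator-based decoder sketch: on input starvation it seeks the
    # stream back to where it started reading this fragment and yields
    # None, so the caller can supply more data and resume.
    while True:
        start = stream.tell()
        header = stream.read(2)  # tag octet + short-form length octet

        if len(header) < 2:
            stream.seek(start)  # unread the incomplete header
            yield None          # hand control back to the caller
            continue

        tag, length = header[0], header[1]
        value = stream.read(length)

        if len(value) < length:
            stream.seek(start)  # unread everything read for this fragment
            yield None
            continue

        yield (tag, value)      # one complete TLV decoded

buf = io.BytesIO(b"\x01\x01\xff")  # a BER BOOLEAN TRUE
for item in decode_fragment(buf):
    if item is None:
        break  # in real use: wait for more input, then resume the generator
    print(item)
```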

> raise error.PyAsn1Error('Indefinite length mode decoder not implemented for %s' % (tagSet,))
> """Decode value with undefined length.
>
> The decoder is allowed to consume as many bytes as necessary.

etingof (Owner):

Just a quick note: this separation of value decoding functions was made to support the two different encoding schemas, definite length and indefinite length encoding. Originally it was not so much about decoding behavior as about implementing slightly different encoding algorithms.

However, I am not certain that this separation is actually worth it.

Noting this just for context.

> raise error.PyAsn1Error('Indefinite length mode decoder not implemented for %s' % (tagSet,))
> """Decode value with undefined length.
>
> The decoder is allowed to consume as many bytes as necessary.

etingof (Owner):
What happens when an incomplete substrate is given on input? It has never been a requirement that the user supply well-framed substrate, and incomplete input is very possible, especially in the indefinite length context.

An example use case I can offer is media streaming over the network: individual scalars (packed into the indefinite length container) are never guaranteed to arrive in full. Here is what BER indefinite length encoding looks like for OCTET STRING:

<indef octetstring <def octetstring> <def octetstring> ... <def octetstring> <end-of-octets sentinel>>

So the decoder should consume whatever is complete, and hold off on the rest until the missing pieces arrive. That's the reason the definite length decoder returns unprocessed substrate.

Does my explanation make sense?
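For concreteness, a sketch of that framing in raw BER bytes (the tag and length values are standard BER; the payload string is an invented example):

```python
# Indefinite-length (constructed) OCTET STRING carrying "hi!" as two
# definite-length chunks, followed by the end-of-octets sentinel:
indef = bytes([
    0x24, 0x80,              # OCTET STRING, constructed, indefinite length
    0x04, 0x02, 0x68, 0x69,  # definite-length chunk: b"hi"
    0x04, 0x01, 0x21,        # definite-length chunk: b"!"
    0x00, 0x00,              # end-of-octets sentinel
])
```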

> raise TypeError("Cannot convert " + substrate.__class__.__name__ + " to seekable bit stream.")
>
>
> def endOfStream(substrate):

etingof (Owner):
If we call this function in hot spots, it may become a drag. What if, instead of checking for the EOF condition, we just went ahead reading from the stream and threw/handled an exception when EOF occurs?

The motivation is to save on function calls if we are going to do that too often...

The other potential benefit would appear if we ever push unprocessed substrate back to the stream; in that case the EOF condition would become intermittent.
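A small sketch contrasting the two styles (the surrounding calls are illustrative; SubstrateUnderrunError is assumed to be pyasn1's existing error type):

```python
# Look-before-you-leap: an extra function call on every read
if endOfStream(substrate):
    raise error.SubstrateUnderrunError('Short substrate')
octet = substrate.read(1)

# Read-first alternative: pay only on the (rare) short read
octet = substrate.read(1)
if not octet:
    raise error.SubstrateUnderrunError('Short substrate')
```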

janpipek (Contributor, Author): This method is called in only one place. You're right that we can probably remove it.

> raise error.PyAsn1Error('Decoder not implemented for %s' % (tagSet,))
> """Decode value with fixed byte length.
>
> If the decoder does not consume a precise byte length,

etingof (Owner):

My other thought is this: although it does not make much practical sense, I don't think it is prohibited to mix definite and indefinite length encoded pieces. You can get a definite length SEQUENCE with indefinite length components inside. The other way around is typical: definite length items inside an indefinite length "container".

I am noting this to make sure we do not impose conflicting requirements on the valueDecoder / indefLenValueDecoder functions, keeping in mind that one can (indirectly) call the other and vice versa.

janpipek (Contributor, Author): I hope we are not imposing conflicting requirements. Are we?

> if trailingBits > 7:
>     raise error.PyAsn1Error(
>         'Trailing bits overflow %s' % trailingBits
>     )
>
> value = self.protoComponent.fromOctetString(
> -    head[1:], internalFormat=True, padding=trailingBits)
> +    substrate.read(length - 1), internalFormat=True, padding=trailingBits)

etingof (Owner):

Regarding the yielding idea on the substrate underrun condition: maybe we could do something like this here, and everywhere a read is concerned:

```python
from io import SEEK_CUR

theBytes = substrate.read(length - 1)

while theBytes is None or len(theBytes) < length - 1:
    if theBytes is not None:
        substrate.seek(-len(theBytes), SEEK_CUR)  # unread incomplete input
    # None indicates data starvation in O_NONBLOCK mode; either way, yield
    # control back to the caller, which can resume the generator later
    yield
    theBytes = substrate.read(length - 1)  # retry once resumed

# otherwise go ahead
```

The overall behavior would depend on the underlying file descriptor mode (blocking vs. non-blocking), which is entirely in the hands of the caller. What we could probably do is support both cases internally, e.g.:

Blocking mode:

  • blocking on read(count)
  • yielding when read count is less than desired
  • yielding when end-of-stream is not yet reached

Non-blocking mode:

  • yielding on input starvation
  • yielding when read count is less than desired
  • yielding when end-of-stream is not yet reached

At the high level, the decoding would look like this:

```python
for asn1Object in decoder(stream):
    if asn1Object is None:
        # incomplete input: wait for more (e.g. via select(stream.fileno()))
        # and retry
        continue
    print('received', asn1Object)
```

Here decoder(stream) would produce an iterable (generator) emitting ASN.1 structures until the input stream is exhausted.

If we go this way, I hope we can still preserve backward compatibility in the decode function, while decodeStream would exhibit all these shiny new features...

In the next major release we could promote decodeStream to decode, as you've suggested.

WDYT?

janpipek (Contributor, Author): I must admit that I am not so well versed in coroutines, generators, etc., and cannot quite imagine how this would work or be implemented. If you can do it, I will gladly observe and learn.

etingof (Owner) commented Sep 7, 2019

> Luckily, they work. I benchmarked them only briefly. The total runtime of the tests (my only measure) was perhaps a bit lower, but well inside the standard deviation. I can do more serious testing on Monday.

Are you using the decode() interface or decodeStream() in your testing? I am curious how it would perform if you fed decodeStream() a real file (i.e. a native, disk-backed stream). With the BytesIO layer, everything is in memory at all times, which might be faster for smaller files but may become a limiting factor once the input does not fit in available memory.

janpipek (Contributor, Author):

> BTW, I've stashed a 4MB BER blob exactly for this purpose. I can't recall now what ASN.1 data structures it has inside. If you decide to try it out and it fails to decode, I can dig deeper; I think I managed to decode it, but it was very slow!

Unfortunately, it won't parse without the specs :-( Do you think you would be able to find them? Thanks.

etingof (Owner) commented Sep 10, 2019

> Unfortunately, it won't parse without the specs :-( Do you think you would be able to find them?

While looking up the spec, I ran into another stress-testing ground. This link leads to a collection of CRL blobs of up to 43MB. They should be decodable with this script:

```python
import sys

from pyasn1_modules import rfc2459
from pyasn1.codec.der import decoder

# Expect the CRL file (in DER form) as the single command-line argument
if len(sys.argv) != 2:
    print("""Usage:
$ %s crl.der""" % sys.argv[0])
    sys.exit(1)

asn1Spec = rfc2459.CertificateList()

substrate = open(sys.argv[1], 'rb').read()

key, substrate = decoder.decode(substrate, asn1Spec=asn1Spec)

print(key.prettyPrint())
```

codecov-io commented Sep 10, 2019

Codecov Report

Merging #176 into master will decrease coverage by 0.18%. The diff coverage is 86.74%.

```
@@            Coverage Diff             @@
##           master     #176      +/-   ##
==========================================
- Coverage   85.68%   85.49%   -0.19%
==========================================
  Files          25       26       +1
  Lines        4324     4662     +338
==========================================
+ Hits         3705     3986     +281
- Misses        619      676      +57
```
| Impacted Files | Coverage Δ |
| --- | --- |
| pyasn1/error.py | 100% <100%> (ø) ⬆️ |
| pyasn1/codec/cer/encoder.py | 86.01% <100%> (+0.4%) ⬆️ |
| pyasn1/codec/der/encoder.py | 83.87% <100%> (+2.38%) ⬆️ |
| pyasn1/codec/native/encoder.py | 85% <81.81%> (-0.87%) ⬇️ |
| pyasn1/codec/ber/decoder.py | 84.26% <83.52%> (-1.8%) ⬇️ |
| pyasn1/codec/ber/encoder.py | 84.11% <90.9%> (+0.08%) ⬆️ |
| pyasn1/codec/streaming.py | 91.3% <91.3%> (ø) |
| pyasn1/codec/cer/decoder.py | 93.18% <93.1%> (-0.57%) ⬇️ |
| pyasn1/codec/der/decoder.py | 96.42% <95.23%> (+1.69%) ⬆️ |
| pyasn1/codec/native/decoder.py | 86.48% <96.15%> (+1.63%) ⬆️ |
| ... and 4 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f10434e...cda318a.

janpipek (Contributor, Author) commented Sep 10, 2019

> While looking up the spec, I ran into another stress-testing ground. This link leads to a collection of CRL blobs of up to 43MB. They should be decodable with this script:

The streamed version takes ~2 minutes, the non-streamed one ~1 hour, which is consistent with my expectation of roughly quadratic behaviour in input size.

janpipek (Contributor, Author) commented Sep 10, 2019

In e27f971, I am adding a wrapper around non-seekable streams. It makes them slower (adding ~10% to execution time when wrapping regular files that do not need this functionality) but at least working (the only non-seekable stream I could find is a file read directly from a ZIP archive in Python 2.7, and it passes the test).

Coupled with wrapping old files in BufferedReader rather than BytesIO, we should be getting close to the limit of optimization and genericity.
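A minimal sketch of what such a caching wrapper could look like (class name and details are illustrative; the actual implementation lives in the PR):

```python
import io

class CachingStreamWrapper(io.IOBase):
    """Sketch of a seekable wrapper over a non-seekable stream.

    Everything read from the underlying stream is mirrored into a
    BytesIO cache, so the decoder can seek backwards within data it
    has already consumed.
    """

    def __init__(self, raw):
        self._raw = raw
        self._cache = io.BytesIO()

    def tell(self):
        return self._cache.tell()

    def seek(self, offset, whence=io.SEEK_SET):
        # Only positions within already-cached data can be reached.
        return self._cache.seek(offset, whence)

    def read(self, n=-1):
        data = self._cache.read(n)
        missing = -1 if n < 0 else n - len(data)
        if missing:
            # Cache miss: pull the remainder from the raw stream and
            # append it to the cache (position ends just past it).
            extra = self._raw.read(missing)
            if extra:
                self._cache.write(extra)
                data += extra
        return data
```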

janpipek (Contributor, Author): Fun fact: the code base dates back to 2005, not very Pythonic times at all!

Hats off!

janpipek and others added 23 commits November 15, 2019 19:31
The goal of this change is to make the decoder stop on input data starvation and resume from where it stopped whenever the caller decides to try again (hopefully after making sure that some more input has become available).

This change makes it possible for the decoder to operate on streams of data (meaning that the entire DER blob might not be immediately available on input).

On top of that, the decoder yields the partially reconstructed ASN.1 object on input starvation, making it possible for the caller to inspect what has been decoded so far and possibly consume partial ASN.1 data.

All these new features are natively available through the `StreamingDecoder` class (a usage sketch follows the commit list below). The previously published API is implemented as a thin wrapper on top of it, ensuring backward compatibility.
Try to reuse the `SingleItemDecoder` object to leverage its caches.
Make it look more uniform and easier to override if needed.
This change should simplify decoder specialization by means
of parameterization in addition to subclassing.
Turn BER decoder into a suspendible generator
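Based on the commit message above, `StreamingDecoder` usage would look something like the following sketch (exact constructor parameters and starvation signaling may differ from the merged code):

```python
from io import BytesIO

from pyasn1.codec.ber import decoder

stream = BytesIO(b'\x01\x01\xff')  # a BER-encoded BOOLEAN TRUE

for asn1Object in decoder.StreamingDecoder(stream):
    # On input starvation the decoder would yield a marker object
    # instead of blocking; with complete input we get decoded values.
    print(asn1Object)
```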
etingof (Owner) commented Nov 23, 2019

Thank you!

janpipek (Contributor, Author): Nice, thank you. I hope I will get some time to take another look at this on Monday. Is there anything in particular still needed for this MR?

etingof (Owner) commented Nov 23, 2019

I am going to merge it; otherwise, merging other pending changes would become a hassle going forward.

You are very welcome to take another look and push another PR if you find something to improve there.

etingof merged commit 96b0a77 into etingof:master on Nov 23, 2019
janpipek (Contributor, Author): Nice.
