Compilation feature¶
Warning
This feature is fully implemented but may not be fully mature.
Overall¶
Construct 2.9 adds an experimental feature: compiling user-made constructs into much faster (but less feature-rich) code. If you are familiar with Kaitai Struct, an alternative framework to Construct: Kaitai compiles YAML-based schemas into pure Python modules, while Construct defines schemas in pure Python and compiles them into pure Python modules. Once you define a construct, you can use it to parse and build blobs without compilation. Compilation has only one purpose: performance.
It should be made clear that the compiler currently supports only parsing and building. Sizeof is deferred to the original construct from which the compiled instance was made.
Requirements¶
The compilation feature requires Construct 2.9 for compiled parsing and Construct 2.10 for compiled building, preferably the newest version to date. More importantly, you should have a test suite of your own. Construct aims to be reliable, but the compiler makes a lot of undocumented assumptions and generates code that “takes shortcuts” a lot. Since some checks are omitted by the generated code, you should not use it to parse corrupted or untrusted data.
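To illustrate the kind of check such a test suite should contain, here is a minimal sketch. The `reference_parse` and `fast_parse` functions are hypothetical stand-ins for a construct and its compiled counterpart (in real code you would compare `st.parse` against `st.compile().parse`); the point is only the pattern of asserting that both agree on every sample blob:

```python
import struct

# Hypothetical stand-in for the original construct's parse method.
def reference_parse(blob):
    num, = struct.unpack(">B", blob[:1])
    return {"num": num, "data": blob[1:1 + num]}

# Hypothetical stand-in for the compiled parse method,
# which "takes shortcuts" (no validation of the data length).
def fast_parse(blob):
    return {"num": blob[0], "data": blob[1:1 + blob[0]]}

# A minimal regression suite: both parsers must agree on every sample.
samples = [b"\x00", b"\x03abc", b"\x05hello"]
for blob in samples:
    assert reference_parse(blob) == fast_parse(blob), blob
```

Running such a comparison over representative blobs from your own domain is what catches cases where the compiler's shortcuts diverge from the core classes.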
Restrictions¶
Compiled classes only parse and build faster; sizeof defers to the core classes
Sizeof is applied during compilation (not during parsing and building)
Lambdas (unlike this expressions) are not supported
Exceptions do not include path information
enum34 is not supported, please use pure Enum constructs
The _index context entry is not supported, and neither is the Index class
Struct, Sequence, FocusedSeq, Union, and LazyStruct do not support the _subcons and _stream context entries
Parsed hooks are not supported, and neither is the discard option; both are ignored
Debugger is not supported, and is ignored
Compiling schemas¶
Every construct (even those that do not compile) has a parameter-less compile method that also returns a construct (an instance of the Compiled class). It may be a good idea to compile something that is used to process megabyte-sized data or millions of blobs. The compiled instance has parse and build methods just like the construct it was compiled from, so in your code you can simply reassign the compiled instance over the original one.
>>> st = Struct("num" / Byte)
>>> st.parse(b"\x01")
Container(num=1)
>>> st = st.compile(filename="copyforinspection.py")
>>> st.parse(b"\x01")
Container(num=1)
The performance boost can be easily measured. This method also happens to test the correctness of the compiled parser, by making sure that both the original and the compiled instance parse into the same results.
>>> print(st.benchmark(sampledata))
Compiled instance performance:
parsing: 0.0001288388 sec/call
parsing compiled: 0.0000452531 sec/call
building: 0.0001240775 sec/call
building compiled: 0.0001062776 sec/call
Motivation¶
The code generated by the compiler and the core classes have essentially the same functionality, but there is a noticeable difference in performance. The first half of the performance boost comes from pre-processing, as shown in this chapter. Pre-processing means inserting constants instead of variable lookups, where constants are simply values that are known at compile time. The second half comes from PyPy. This chapter explains the performance difference by comparing the Struct, FormatField, BytesInteger, and Bytes classes, including use of the context. Example construct:
Struct(
    "num8" / Int8ub,
    "num24" / Int24ub,
    "data" / Bytes(this.num8),
)
Compiled parsing code:
def read_bytes(io, count):
    assert count >= 0
    data = io.read(count)
    assert len(data) == count
    return data

def parse_struct_1(io, this):
    this = Container(_ = this)
    try:
        this['num8'] = unpack('>B', read_bytes(io, 1))[0]
        this['num24'] = int.from_bytes(read_bytes(io, 3), byteorder='big', signed=False)
        this['data'] = read_bytes(io, this.num8)
    except StopIteration:
        pass
    del this['_']
    return this

def parseall(io, this):
    return parse_struct_1(io, this)

compiledschema = Compiled(None, None, parseall)
Non-compiled parsing code:
def _read_stream(stream, length):
    if length < 0:
        raise StreamError("length must be non-negative, found %s" % length)
    try:
        data = stream.read(length)
    except Exception:
        raise StreamError("stream.read() failed, requested %s bytes" % (length,))
    if len(data) != length:
        raise StreamError("could not read enough bytes, expected %d, found %d" % (length, len(data)))
    return data

class FormatField(Construct):
    def _parse(self, stream, context, path):
        data = _read_stream(stream, self.length)
        try:
            return struct.unpack(self.fmtstr, data)[0]
        except Exception:
            raise FormatFieldError("struct %r error during parsing" % self.fmtstr)

class BytesInteger(Construct):
    def _parse(self, stream, context, path):
        length = self.length(context) if callable(self.length) else self.length
        data = _read_stream(stream, length)
        if self.swapped:
            data = data[::-1]
        return bytes2integer(data, self.signed)

class Bytes(Construct):
    def _parse(self, stream, context, path):
        length = self.length(context) if callable(self.length) else self.length
        return _read_stream(stream, length)

class Renamed(Subconstruct):
    def _parse(self, stream, context, path):
        path += " -> %s" % (self.name,)
        return self.subcon._parse(stream, context, path)

class Struct(Construct):
    def _parse(self, stream, context, path):
        obj = Container()
        context = Container(_ = context)
        context._subcons = Container({sc.name: sc for sc in self.subcons if sc.name})
        for sc in self.subcons:
            try:
                subobj = sc._parse(stream, context, path)
                if sc.name:
                    obj[sc.name] = subobj
                    context[sc.name] = subobj
            except StopIteration:
                break
        return obj
There are several “shortcuts” that the compiled code does:
Function calls are relatively expensive, so an inlined expression is faster than a function returning the exact same expression. Therefore FormatField compiles into struct.unpack(…, read_bytes(io, …)) directly.
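A minimal sketch of this shortcut, using only the standard library (the `FormatFieldLike` class is an illustrative stand-in, not the real construct class): both paths produce the same value, but the inlined expression skips the attribute lookups and the method call.

```python
import struct
from timeit import timeit

class FormatFieldLike:
    """Mimics the core class: the format string is stored as an attribute."""
    def __init__(self, fmtstr):
        self.fmtstr = fmtstr
    def parse(self, data):
        return struct.unpack(self.fmtstr, data)[0]

field = FormatFieldLike(">B")
data = b"\x07"

# The generic path: attribute lookups plus a method call.
generic = lambda: field.parse(data)
# What the compiler emits: the call inlined, the format string a literal.
inlined = lambda: struct.unpack(">B", data)[0]

assert generic() == inlined() == 7
t_generic = timeit(generic, number=100_000)
t_inlined = timeit(inlined, number=100_000)
```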
Literals like 1 and '>B' are faster than object field lookups, dictionary lookups, or passed function arguments. Therefore each instance of FormatField compiles into a similar expression, but with a different format string and byte count inlined, usually as literals.
Passing parameters to functions is slower than referring to variables in the same scope. Therefore, for example, a compiled Struct creates a “this” variable that is accessible to all expressions generated by its subcons, since it exists in the same scope, while the core Struct calls subcon._parse and passes the entire context as a parameter, regardless of whether that subcon even uses the context (for example, FormatField and VarInt have no need for one). It is similar, but not identical, with the “restream” function: the lambda in its second parameter rebinds io to a different object (a stream created inside the restream function), whereas this is not rebound, as it exists in the outer scope.
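The difference can be sketched with two hypothetical parsers (neither is real construct code): one passes the context into every per-field function call, the other lets the generated expressions reach the enclosing `this` variable directly.

```python
def parse_with_parameter(subcon_count, context):
    # Core-style: the context is passed as an argument into every subcon call.
    def parse_subcon(ctx):
        return len(ctx)
    return [parse_subcon(context) for _ in range(subcon_count)]

def parse_with_scope(subcon_count, context):
    # Compiled-style: "this" lives in the enclosing scope, so the emitted
    # expressions refer to it directly, with no argument passing at all.
    this = context
    return [len(this) for _ in range(subcon_count)]

assert parse_with_parameter(3, {"a": 1}) == parse_with_scope(3, {"a": 1}) == [1, 1, 1]
```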
An if statement (or conditional ternary operator) with two possible expressions and a condition that can be evaluated at compile time is slower than just one or the other expression. Therefore, for example, BytesInteger does a lookup to check whether the field is swapped, but a compiled BytesInteger simply inlines the 'big' or 'little' literal. Moreover, Struct checks whether each subcon has a name before inserting a value into the context dictionary, but a compiled Struct simply has an assignment or not. This shortcut also applies to most constructs that accept context lambdas as parameters: generated code does not need to check whether a parameter is a constant or a lambda, because what gets emitted is either something like “1”, a literal, or something like “this.field”, an object lookup. Both are valid expressions and evaluate without red tape or checks.
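A toy sketch of how a compile-time condition turns into an inlined literal (the function names are illustrative, not real compiler internals): the core-style parser re-evaluates the condition on every call, while the “compiled” version resolves it once and emits only the chosen byteorder.

```python
def core_bytesinteger_parse(data, swapped):
    # Core-style: the condition is evaluated on every single parse.
    if swapped:
        data = data[::-1]
    return int.from_bytes(data, byteorder="big")

def make_compiled_parser(swapped):
    # "Compilation": the condition is resolved once, and the generated
    # code contains only the chosen byteorder literal, no branch.
    byteorder = "little" if swapped else "big"
    code = f"lambda data: int.from_bytes(data, byteorder={byteorder!r})"
    return eval(code)

parser = make_compiled_parser(swapped=True)
assert parser(b"\x01\x00") == core_bytesinteger_parse(b"\x01\x00", swapped=True) == 1
```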
Looping over an iterable is slower than a block of code that accesses each item once, because each iteration must fetch another item and also check the termination condition. The loop unrolling technique requires the iterable (or rather, list) to be known at compile time, which is the case with Struct and Sequence instances. Therefore a compiled Struct emits one line per subcon, while the core Struct loops over its subcons.
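A small stdlib-only sketch of the unrolling (the parsers are illustrative, not generated code): the looped version iterates over a field list at run time, the unrolled version is one line per field.

```python
import io
import struct

def parse_looped(stream, subcon_fmts):
    # Core-style: loop over the subcons, fetching and checking each step.
    result = []
    for fmt, size in subcon_fmts:
        result.append(struct.unpack(fmt, stream.read(size))[0])
    return result

def parse_unrolled(stream):
    # Compiled-style: the subcon list was known at compile time,
    # so one line was emitted per field and the loop disappeared.
    return [
        struct.unpack(">B", stream.read(1))[0],
        struct.unpack(">H", stream.read(2))[0],
    ]

blob = b"\x01\x00\x02"
assert parse_looped(io.BytesIO(blob), [(">B", 1), (">H", 2)]) == parse_unrolled(io.BytesIO(blob)) == [1, 2]
```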
Function calls that only defer to another function just waste CPU cycles. This relates specifically to the Renamed class, which in compiled code emits the same code as its subcon. The entire functionality of the Renamed class (maintaining path information) is not supported in compiled code, where it would serve as a mere subconstruct, just deferring to its subcon.
Building two identical dictionaries is slower than building just one. Struct maintains two dictionaries (called obj and context) that differ only by the _ key, but a compiled Struct maintains only one dictionary and removes the _ key before returning it.
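Sketched with plain dicts (the parsers are illustrative, not real construct code), both strategies return the same result, but the second performs half the assignments:

```python
def parse_two_dicts(context):
    # Core-style Struct: obj and context maintained in parallel.
    obj = {}
    ctx = {"_": context}
    obj["num"] = 7
    ctx["num"] = 7
    return obj

def parse_one_dict(context):
    # Compiled-style Struct: one dictionary, "_" removed before returning.
    this = {"_": context}
    this["num"] = 7
    del this["_"]
    return this

assert parse_two_dicts({}) == parse_one_dict({}) == {"num": 7}
```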
This expressions (not lambdas) are expensive to compute in regular code, but something like “this.field” in compiled code is merely one object field lookup. The same applies to len_, obj_, and list_ expressions, since they share their implementation with this expressions.
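A minimal sketch of how a this-expression can boil down to a single key lookup (this is an illustrative toy, not the real construct implementation, which supports chained attributes and arithmetic):

```python
class Path:
    """Toy this-expression: attribute access builds a one-key lookup."""
    def __getattr__(self, name):
        # Return a callable that, given a context, performs one key lookup.
        return lambda context: context[name]

this = Path()
expr = this.field          # built once, at "compile time"
assert expr({"field": 42}) == 42
```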
Container is an implementation of a so-called AttrDict. It captures access to its attributes (field in this.field) and treats it as dictionary key access (this.field becomes this["field"]). However, due to internal CPython drawbacks, capturing attribute access involves some red tape, unlike accessing keys, which is done directly. Therefore a compiled Struct emits lines that assign to Container keys, not attributes.
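A minimal AttrDict sketch (simplified relative to the real Container class) showing the two access paths; the compiled code prefers the direct key form:

```python
class Container(dict):
    """Minimal AttrDict: attribute access is routed to key access."""
    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)
    def __setattr__(self, name, value):
        self[name] = value

c = Container()
c.num = 1          # goes through __setattr__, extra red tape
c["data"] = b"x"   # direct key access, what compiled code emits
assert c.num == 1 and c["num"] == 1 and c.data == b"x"
```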
Empirical evidence¶
The “shortcuts” described above may not seem like much, but they amount to quite a large portion of the actual run-time, in fact about a third (31%) of the entire run-time. Note that this benchmark includes only pure-Python compile-time optimisations.
Notice that results are in microseconds (10**-6).
-------------------------------- benchmark: 158 tests --------------------------------
Name (time in us) Min StdDev
--------------------------------------------------------------------------------------
test_class_array_parse 284.7820 (74.05) 31.0403 (118.46)
test_class_array_parse_compiled 73.6430 (19.15) 10.7624 (41.07)
test_class_greedyrange_parse 325.6610 (84.67) 31.8383 (121.50)
test_class_greedyrange_parse_compiled 300.9270 (78.24) 24.0149 (91.65)
test_class_repeatuntil_parse 10.2730 (2.67) 0.8322 (3.18)
test_class_repeatuntil_parse_compiled 7.3020 (1.90) 1.3155 (5.02)
test_class_string_parse 21.2270 (5.52) 1.3555 (5.17)
test_class_string_parse_compiled 18.9030 (4.91) 1.6023 (6.11)
test_class_cstring_parse 10.9060 (2.84) 1.0971 (4.19)
test_class_cstring_parse_compiled 9.4050 (2.45) 1.6083 (6.14)
test_class_pascalstring_parse 7.9290 (2.06) 0.4959 (1.89)
test_class_pascalstring_parse_compiled 6.6670 (1.73) 0.6601 (2.52)
test_class_struct_parse 43.5890 (11.33) 4.4993 (17.17)
test_class_struct_parse_compiled 18.7370 (4.87) 2.0198 (7.71)
test_class_sequence_parse 20.7810 (5.40) 2.6298 (10.04)
test_class_sequence_parse_compiled 11.9820 (3.12) 3.2669 (12.47)
test_class_union_parse 91.0570 (23.68) 10.2126 (38.97)
test_class_union_parse_compiled 31.9240 (8.30) 3.5955 (13.72)
test_overall_parse 3,200.7850 (832.23) 224.9197 (858.34)
test_overall_parse_compiled 2,229.9610 (579.81) 118.2029 (451.09)
--------------------------------------------------------------------------------------
Motivation, part 2¶
The second part of the optimisation is simply running the generated code on PyPy. Since PyPy does not use any type annotations, there is nothing to discuss in this chapter. The benchmark reflects the same code as in the previous chapter, but run on PyPy 2.7 rather than CPython 3.6.
Empirical evidence¶
Notice that results are in nanoseconds (10**-9).
------------------------------------- benchmark: 152 tests ------------------------------------
Name (time in ns) Min StdDev
-----------------------------------------------------------------------------------------------
test_class_array_parse 11,042.9974 (103.52) 40,792.8559 (46.97)
test_class_array_parse_compiled 9,088.0058 (85.20) 43,001.3909 (49.52)
test_class_greedyrange_parse 14,402.0014 (135.01) 49,834.2047 (57.38)
test_class_greedyrange_parse_compiled 9,801.0059 (91.88) 39,296.4529 (45.25)
test_class_repeatuntil_parse 318.4996 (2.99) 2,469.5524 (2.84)
test_class_repeatuntil_parse_compiled 309.3746 (2.90) 103,425.2134 (119.09)
test_class_string_parse 966.8991 (9.06) 537,241.0095 (618.62)
test_class_string_parse_compiled 726.6994 (6.81) 3,719.2657 (4.28)
test_class_cstring_parse 782.2993 (7.33) 4,111.8970 (4.73)
test_class_cstring_parse_compiled 591.1992 (5.54) 479,164.9746 (551.75)
test_class_pascalstring_parse 465.0911 (4.36) 4,262.4397 (4.91)
test_class_pascalstring_parse_compiled 298.4118 (2.80) 122,279.2150 (140.80)
test_class_struct_parse 2,633.9985 (24.69) 14,654.3095 (16.87)
test_class_struct_parse_compiled 949.7991 (8.90) 4,228.2890 (4.87)
test_class_sequence_parse 1,310.6008 (12.29) 5,811.8046 (6.69)
test_class_sequence_parse_compiled 732.2000 (6.86) 4,703.9483 (5.42)
test_class_union_parse 5,619.9933 (52.69) 30,590.0630 (35.22)
test_class_union_parse_compiled 2,699.9987 (25.31) 15,888.8206 (18.30)
test_overall_parse 1,332,581.9891 (>1000.0) 2,274,995.4192 (>1000.0)
test_overall_parse_compiled 690,380.0095 (>1000.0) 602,697.9721 (694.00)
-----------------------------------------------------------------------------------------------
Motivation, part 3¶
Warning
Benchmarks revealed that PyPy makes the code run much faster than Cython, so the Cython improvements were withdrawn, and the compiler now generates pure Python code that is compatible with Python 2, including PyPy. This chapter is no longer relevant; it remains just for educational purposes.
This chapter talks about the second half of the optimisation, which is due to Cython type annotations and type inference. I should state for the record that I am no expert at Cython, and the following explanations are merely “the way I understand it”. Please take that into account when reading it. Fourth example:
Struct(
    "num1" / Int8ul,
    "num2" / Int24ul,
    "fixedarray1" / Array(3, Int8ul),
    "name1" / CString("utf8"),
)
cdef bytes read_bytes(io, int count):
    if not count >= 0: raise StreamError
    cdef bytes data = io.read(count)
    if not len(data) == count: raise StreamError
    return data

cdef bytes parse_nullterminatedstring(io, int unitsize, bytes finalunit):
    cdef list result = []
    cdef bytes unit
    while True:
        unit = read_bytes(io, unitsize)
        if unit == finalunit:
            break
        result.append(unit)
    return b"".join(result)

def parse_struct_1(io, this):
    this = Container(_ = this)
    try:
        this['num1'] = unpack('<B', read_bytes(io, 1))[0]
        this['num2'] = int.from_bytes(read_bytes(io, 3), byteorder='little', signed=False)
        this['fixedarray1'] = ListContainer((unpack('<B', read_bytes(io, 1))[0]) for i in range(3))
        this['name1'] = (parse_nullterminatedstring(io, 1, b'\x00')).decode('utf8')
        pass
    except StopIteration:
        pass
    del this['_']
    del this['_index']
    return this

def parseall(io, this):
    return parse_struct_1(io, this)

compiled = Compiled(None, None, parseall)
The primary cause of the speedup in Cython is this: if a variable is of a known type, then operations on that variable can skip certain checks. If a variable is a pure Python object, those checks need to be added. A variable is considered to be of known type if either (1) it is annotated, like “cdef bytes data”, or (2) its type is inferred, as when using an annotated function call result, like “parse_nullterminatedstring(…).decode(…)” given “cdef bytes parse_nullterminatedstring(…)”. If a variable is known to be a list, then calling “append” on it doesn't require checking whether that object has such a method or a matching signature (parameters). If a variable is known to be bytes, then “len(data)” can be compiled into the bytes-type length function rather than a general-purpose length function that works on arbitrary objects, and “unit == finalunit” can be compiled into bytes-type equality. If a variable is known to be unicode, then “.decode('utf8')” can be compiled into the str-type implementation. If Cython knows that “struct.unpack” returns only tuples, then “…[0]” compiles into tuple-type getitem (index access). The examples are many, but the pattern is the same: type-specific code is faster than type-general code.
The second cause of the speedup is the special handling of integers. While most annotations like “cdef bytes” refer to specific, albeit Python, types, “cdef int” does not refer to any Python type at all. It represents a C integer, which is allocated on the stack or in registers, unlike the other types, which are allocated on the heap. All operations on C integers are therefore much faster than on Python integers. In the example code, this affects “count >= 0” and “len(data) == count”.
Empirical evidence¶
The micro-benchmarks below show the difference between the core classes and the Cython-compiled classes. Only those where the performance boost was highest are listed (although they also happen to be the most important); some other classes have little speedup, and some have none.
Notice that results are in microseconds (10**-6).
------------------------------- benchmark: 152 tests -------------------------------
Name (time in us) Min StdDev
------------------------------------------------------------------------------------
test_class_array_parse 286.5460 (73.85) 42.8831 (89.84)
test_class_array_parse_compiled 30.7200 (7.92) 6.9577 (14.58)
test_class_greedyrange_parse 320.9860 (82.73) 45.9480 (96.26)
test_class_greedyrange_parse_compiled 262.7010 (67.71) 36.4504 (76.36)
test_class_repeatuntil_parse 10.1850 (2.63) 2.4147 (5.06)
test_class_repeatuntil_parse_compiled 6.8880 (1.78) 1.5471 (3.24)
test_class_string_parse 20.4400 (5.27) 4.4044 (9.23)
test_class_string_parse_compiled 9.1470 (2.36) 2.2427 (4.70)
test_class_cstring_parse 11.2290 (2.89) 1.6216 (3.40)
test_class_cstring_parse_compiled 5.6080 (1.45) 1.0321 (2.16)
test_class_pascalstring_parse 7.8560 (2.02) 1.8567 (3.89)
test_class_pascalstring_parse_compiled 5.8910 (1.52) 0.9466 (1.98)
test_class_struct_parse 44.1300 (11.37) 6.8434 (14.34)
test_class_struct_parse_compiled 16.9070 (4.36) 3.0500 (6.39)
test_class_sequence_parse 21.5420 (5.55) 2.6852 (5.63)
test_class_sequence_parse_compiled 10.1530 (2.62) 2.1645 (4.53)
test_class_union_parse 91.9150 (23.69) 10.7812 (22.59)
test_class_union_parse_compiled 22.5970 (5.82) 15.2649 (31.98)
test_overall_parse 2,126.2570 (548.01) 255.0154 (534.27)
test_overall_parse_compiled 1,124.9560 (289.94) 127.4730 (267.06)
------------------------------------------------------------------------------------
Comparison with Kaitai Struct¶
Kaitai Struct is a very respectable competitor, so I believe a benchmark-based comparison should be presented. Construct and Kaitai have very different capabilities: Kaitai supports about a dozen languages, while Construct only supports Python; Kaitai offers only basic common features, while Construct offers Python-only extras like Numpy and Pickle support; Kaitai does only parsing, while Construct also does building. In a sense, the two libraries are in different categories (like sumo and karate), and there are multiple scenarios where either library would not be usable.
Example used for comparison:
Struct(
    "count" / Int32ul,
    "items" / Array(this.count, Struct(
        "num1" / Int8ul,
        "num2" / Int24ul,
        "flags" / BitStruct(
            "bool1" / Flag,
            "num4" / BitsInteger(3),
            Padding(4),
        ),
        "fixedarray1" / Array(3, Int8ul),
        "name1" / CString("utf8"),
        "name2" / PascalString(Int8ul, "utf8"),
    )),
)
meta:
  id: comparison_1_kaitai
  encoding: utf-8
  endian: le
seq:
  - id: count
    type: u4
  - id: items
    repeat: expr
    repeat-expr: count
    type: item
types:
  item:
    seq:
      - id: num1
        type: u1
      - id: num2_lo
        type: u2
      - id: num2_hi
        type: u1
      - id: flags
        type: flags
      - id: fixedarray1
        repeat: expr
        repeat-expr: 3
        type: u1
      - id: name1
        type: strz
      - id: len_name2
        type: u1
      - id: name2
        type: str
        size: len_name2
    instances:
      num2:
        value: 'num2_hi << 16 | num2_lo'
    types:
      flags:
        seq:
          - id: bool1
            type: b1
          - id: num4
            type: b3
          - id: padding
            type: b4
Surprisingly, Kaitai won the benchmark! Honestly, I am shocked and dismayed that it did. The only explanation I can point to is that Kaitai parses structs into class objects (with attributes) while Construct parses into dictionaries (with keys). However, that one detail seems an unlikely explanation for the huge discrepancy in the benchmark results. Perhaps there is a flaw in the methodology, but until that is proven, Kaitai gets its respects. Congrats.
$ python3.6 comparison_1_construct.py
Timeit measurements:
parsing: 0.1024609069 sec/call
parsing compiled: 0.0410809368 sec/call
$ pypy comparison_1_construct.py
Timeit measurements:
parsing: 0.0108308416 sec/call
parsing compiled: 0.0062594243 sec/call
$ python3.6 comparison_1_kaitai.py
Timeit measurements:
parsing: 0.0250326035 sec/call
$ pypy comparison_1_kaitai.py
Timeit measurements:
parsing: 0.0019435351 sec/call