Apache Arrow, Alignment And Padding

I want to use Apache Arrow because it enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for nat

Solution 1:

Arrow memory is 64-byte aligned, but in your example code the conversion to Pandas/NumPy copies the data, because a nested array of lists is represented differently in Arrow and in NumPy. Arrow uses one buffer that holds the values of all lists and a second buffer that holds the offsets marking where each list starts and ends in that array. NumPy has no native list type, so the result is an outer NumPy array whose elements are other NumPy arrays, stored in the outer array as Python objects.

So when you inspect addresses through the NumPy functions, you see memory allocated by NumPy, not by Arrow. If such an address happens to fall on a 64-byte boundary, that is only by chance.

In the next version (0.9) of pyarrow there will be a buffers property to access the underlying memory addresses. You should then be able to check directly whether the Arrow memory is allocated on a 64-byte-aligned address (it always should be).