Usage with pyarrow parquet #10

Open
tanguycdls opened this issue Apr 15, 2021 · 2 comments
@tanguycdls

Hello, I'm very interested in using the library; however, I'm struggling to apply it to any parquet file other than the dremel example.

from struct2tensor import expression_impl
import struct2tensor as s2t
import pyarrow as pa
import pyarrow.parquet as pq

# Write a single-column parquet file with pyarrow.
tbl = pa.table([pa.array([0, 1])], names=['a'])
with pq.ParquetWriter('/tmp/test', tbl.schema) as writer:
    writer.write_table(tbl)

filenames = ["/tmp/test"]
batch_size = 2

# Build an expression from the parquet schema, project column 'a',
# and pull a single batch of values.
exp = s2t.expression_impl.parquet.create_expression_from_parquet_file(filenames)
ps = exp.project(['a'])

val = s2t.expression_impl.parquet.calculate_parquet_values([ps], exp,
                                                            filenames, batch_size)
for h in val:
    break

This segfaults, after logging the following errors:
2021-04-15 15:30:40.254237: E struct2tensor/kernels/parquet/parquet_reader.cc:198]
The repetition type of the root node was 0, but should be 2. There may be something wrong with your supplied parquet schema. We will treat it as a repeated field.

2021-04-15 15:31:46.428109: W tensorflow/core/framework/dataset.cc:477]
Input of ParquetDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.

I also tried loading the dremel example file with pyarrow and writing it right back out, and I can reproduce the error with that file as well.

How do you advise saving the parquet files?

Thanks for your help!
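
For context, in the parquet format's FieldRepetitionType enum 0 is REQUIRED and 2 is REPEATED, so the first log line suggests the root node of the pyarrow-written schema is marked REQUIRED where struct2tensor expects REPEATED. A minimal way to check what pyarrow actually wrote is to print the physical parquet schema of the file from the reproduction above (a sketch reusing the /tmp/test path, not part of the original report):

import pyarrow.parquet as pq

# Print the physical parquet schema; each node is shown with its
# repetition type (required / optional / repeated).
print(pq.ParquetFile('/tmp/test').schema)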

@andylou2
Contributor

Hi Tanguy,

The dremel example was created with parquet's C++ API [1]. The last time I checked (~2 years ago), pyarrow's parquet writer/reader did not properly support structured data, but this could have changed.

Do you have the full stack trace? The errors you listed are not fatal errors.

[1] https://github.com/apache/parquet-cpp
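
One way to probe whether pyarrow's writer produces the repeated groups struct2tensor expects would be to write a list-typed column and inspect the resulting schema. This is only a sketch of that check, with an illustrative output path, not something from the thread:

import pyarrow as pa
import pyarrow.parquet as pq

# A column whose values are lists of ints; in the parquet schema this is
# encoded as a group containing a repeated field.
tbl = pa.table({'a': pa.array([[0, 1], [2]], type=pa.list_(pa.int64()))})
pq.write_table(tbl, '/tmp/test_repeated.parquet')

# Inspect how pyarrow encoded the repetition types.
print(pq.ParquetFile('/tmp/test_repeated.parquet').schema)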

@tanguycdls
Author

Hello, thanks for the answer!

It's actually a segfault with a core dump. I tried gdb, but I don't have the symbols and sources configured, so the backtrace isn't very clear to me:

#0  _PyErr_GetTopmostException (tstate=0x0) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/errors.c:98
#1  PyErr_SetObject (exception=0x55d759d71d00 <_PyExc_RuntimeError>, value=0x7f74d422b390) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/errors.c:98
#2  0x000055d759b15b4d in PyErr_SetString (exception=0x55d759d71d00 <_PyExc_RuntimeError>, string=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/errors.c:170
#3  0x00007f755e84cb78 in pybind11::detail::translate_exception(std::__exception_ptr::exception_ptr) () from /opt/conda/envs/model/lib/python3.7/site-packages/tensorflow/python/_pywrap_tfe.so
#4  0x00007f755e87ca1b in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) () from /opt/conda/envs/model/lib/python3.7/site-packages/tensorflow/python/_pywrap_tfe.so
#5  0x000055d759b9c427 in _PyMethodDef_RawFastCallKeywords (method=<optimized out>, self=0x7f755ebeb210, args=0x55d75e4c1540, nargs=<optimized out>, kwnames=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Objects/call.c:693
#6  0x000055d759b9dad8 in _PyCFunction_FastCallKeywords (kwnames=<optimized out>, nargs=<optimized out>, args=0x55d75e4c1540, func=0x7f755ebea960) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Objects/call.c:723
#7  call_function (pp_stack=0x7ffeaf5dd3c0, oparg=<optimized out>, kwnames=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:4568
#8  0x000055d759bc874a in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:3093
#9  0x000055d759b0baf2 in PyEval_EvalFrameEx (throwflag=0, f=0x55d75e4c1360) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:3930
#10 _PyEval_EvalCodeWithName (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kwnames=<optimized out>, kwargs=<optimized out>, kwcount=<optimized out>, kwstep=<optimized out>, defs=<optimized out>, defcount=<optimized out>, kwdefs=<optimized out>, closure=<optimized out>, name=<optimized out>, qualname=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:3930
#11 0x000055d759b3a030 in _PyFunction_FastCallKeywords (func=<optimized out>, stack=0x7f74e0065f68, nargs=1, kwnames=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Objects/call.c:433
#12 0x000055d759b9d9c8 in call_function (pp_stack=0x7ffeaf5dd6c0, oparg=<optimized out>, kwnames=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:4616
#13 0x000055d759bc51d9 in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:3139
#14 0x000055d759b39e94 in PyEval_EvalFrameEx (throwflag=0, f=0x7f74e0065de0) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:544
#15 function_code_fastcall (globals=0x7f755c7d1140, nargs=<optimized out>, args=<optimized out>, co=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Objects/call.c:283
#16 _PyFunction_FastCallKeywords (func=<optimized out>, stack=<optimized out>, nargs=<optimized out>, kwnames=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Objects/call.c:408
#17 0x000055d759b9d9c8 in call_function (pp_stack=0x7ffeaf5dd8a0, oparg=<optimized out>, kwnames=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:4616
#18 0x000055d759bc4544 in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:3110
#19 0x000055d759b0cead in PyEval_EvalFrameEx (throwflag=0, f=0x7f74d4387050) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:544
(More stack frames follow...)

On the Python side there is nothing except the log messages right before the crash.

I remember some conversations about pyarrow's ability to store such structured data, but I thought that had been resolved. parquet-cpp, however, now seems to live in the Arrow repository.

I'll try to see if I can understand the difference between the two formats!
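
A simple way to do that comparison is to round-trip the dremel example through pyarrow, as described above, and print both schemas. This is only a sketch; the input path below is a placeholder, not the actual location of the example file:

import pyarrow.parquet as pq

# Round-trip the dremel example through pyarrow and compare the schemas
# before and after; 'dremel_example.parquet' is a placeholder path.
original = pq.ParquetFile('dremel_example.parquet')
pq.write_table(original.read(), '/tmp/dremel_roundtrip.parquet')

print(original.schema)
print(pq.ParquetFile('/tmp/dremel_roundtrip.parquet').schema)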
