-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix missing headers in csv output and enable schema inference in export-pg-from-queries #174
Conversation
...es/IntegrationTest/testExportPgFromQueriesWithStaggeredResults/results/airport/airport-1.csv
Show resolved
Hide resolved
Based on our discussion of how export works it seems like the solution you have might break if the map query has results that progressively add more columns to the CSV. This is what the rewrite step of inferredSchema fixes. So, I think we need to run the inferredSchema rewrite even on mapped queries which don't provide a label. Perhaps defaulting a schema label when none is provided. This would execute the rewrite logic on the intermediate file necessary to make the final result valid CSV. |
I like this suggestion. It's worth noting that the concern regarding queries which produce what I will call non-uniform maps (maps with extra or missing keys) is nothing new, and not made any worse or better by the changes in this PR (schema inference and csv rewrites never worked for queries exports as the results have no label and thus no MasterLabelSchema gets generated). That said I think there is an opportunity here to potentially build an analogue to I will look into it and see if I can get it to work. |
src/main/java/com/amazonaws/services/neptune/ExportPropertyGraphFromGremlinQueries.java
Show resolved
Hide resolved
src/main/java/com/amazonaws/services/neptune/propertygraph/schema/ExportSpecification.java
Show resolved
Hide resolved
src/main/java/com/amazonaws/services/neptune/propertygraph/schema/MasterLabelSchemas.java
Show resolved
Hide resolved
2f7312b
to
ae81e21
Compare
Note: The integration tests currently lack the ability to verify any JSON-based output formats. I have tested a queries-export from air-routes using |
Headers were previously not included as inferSchema was always configured to true. The result of this is that query results were initially written to headerless csv files, which in theory should have been later rewritten to files with headers. However, as there is no schema to rewrite the files to, it gets skipped and users are left with the headerless files. This is fixed by setting inferSchema to false for queries exports without structuredOutput. This way, the results are written straight to a proper csv file with a header.
ae81e21
to
c86a1a3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Issue #, if available:
Background:
Headers have historically always been missing in
export-pg-from-queries
jobs. The underlying reason for this is that schema inference has always technically been enabled for queries exports, however queries exports lacked the ability to accumulate a schema as results come in.When schema inference is enabled, Neptune Export will stream results as they come in to an interim csv file, while also updating an internal schema which describes all of the data seen so far. Once all of the results have been received, a rewrite process is triggered which reads in the interim CSV files, and rewrites them to conform to the discovered schema. The interim csv files have no headers (as they are irrelevant without a complete schema), and may have some inconsistencies such as later rows having more columns than previous rows. These issues are corrected during the rewrite process.
The issue with
export-pg-from-queries
was that it was streaming results to interim csv files as expected, but the rewrite process was effectively a no-op as there was no accumulated schema to drive the re-write. In current versions of Neptune Export, the final csv files output fromexport-pg-from-queries
are the pre-rewrite interim csv files without any headers or guaranteed column alignment.Description of changes:
This PR addresses the above issues by creating a new GraphElementType.queryResults and properly tracking queries stats in a new FileSpecificLabelSchema for these queriesResults. These schemas use the name of each named query in the export job in place of the Label from PG results. This is aligned to how output files are named and formatted.
With the proper tracking of schemas for queriesResults, the rewrite process now functions correctly which results in a header being included when expected and any inconsistencies in the interim csv due to "staggered" results being corrected.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.