Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix missing headers in csv output and enable schema inference in export-pg-from-queries #174

Merged
merged 2 commits into from
Mar 31, 2025

Conversation

Cole-Greer
Copy link
Collaborator

@Cole-Greer Cole-Greer commented Mar 22, 2025

Issue #, if available:

Background:

Headers have historically always been missing in export-pg-from-queries jobs. The underlying reason for this is that schema inference has always technically been enabled for queries exports, however queries exports lacked the ability to accumulate a schema as results come in.

When schema inference is enabled, Neptune Export will stream results as they come in to an interim csv file, while also updating an internal schema which describes all of the data seen so far. Once all of the results have been received, a rewrite process is triggered which reads in the interim CSV files, and rewrites them to conform to the discovered schema. The interim csv files have no headers (as they are irrelevant without a complete schema), and may have some inconsistencies such as later rows having more columns than previous rows. These issues are corrected during the rewrite process.

The issue with export-pg-from-queries was that it was streaming results to interim csv files as expected, but the rewrite process was effectively a no-op as there was no accumulated schema to drive the re-write. In current versions of Neptune Export, the final csv files output from export-pg-from-queries are the pre-rewrite interim csv files without any headers or guaranteed column alignment.

Description of changes:

This PR addresses the above issues by creating a new GraphElementType.queryResults and properly tracking queries stats in a new FileSpecificLabelSchema for these queriesResults. These schemas use the name of each named query in the export job in place of the Label from PG results. This is aligned to how output files are named and formatted.

With the proper tracking of schemas for queriesResults, the rewrite process now functions correctly which results in a header being included when expected and any inconsistencies in the interim csv due to "staggered" results being corrected.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@kmcginnes
Copy link

Based on our discussion of how export works it seems like the solution you have might break if the map query has results that progressively add more columns to the CSV. This is what the rewrite step of inferredSchema fixes.

So, I think we need to run the inferredSchema rewrite even on mapped queries which don't provide a label. Perhaps defaulting a schema label when none is provided. This would execute the rewrite logic on the intermediate file necessary to make the final result valid CSV.

@Cole-Greer
Copy link
Collaborator Author

Based on our discussion of how export works it seems like the solution you have might break if the map query has results that progressively add more columns to the CSV. This is what the rewrite step of inferredSchema fixes.

So, I think we need to run the inferredSchema rewrite even on mapped queries which don't provide a label. Perhaps defaulting a schema label when none is provided. This would execute the rewrite logic on the intermediate file necessary to make the final result valid CSV.

I like this suggestion. It's worth noting that the concern regarding queries which produce what I will call non-uniform maps (maps with extra or missing keys) is nothing new, and not made any worse or better by the changes in this PR (schema inference and csv rewrites never worked for queries exports as the results have no label and thus no MasterLabelSchema gets generated). That said I think there is an opportunity here to potentially build an analogue to MasterLabelSchema which keys off of the query name instead of a label. If this works it might enable proper schema inference for queries results which would improve stability for any queries which are producing non-uniform results.

I will look into it and see if I can get it to work.

@Cole-Greer Cole-Greer marked this pull request as draft March 28, 2025 21:03
@Cole-Greer Cole-Greer force-pushed the queries-csv-with-headers branch from 2f7312b to ae81e21 Compare March 28, 2025 21:24
@Cole-Greer
Copy link
Collaborator Author

Note: The integration tests currently lack the ability to verify any JSON-based output formats. I have tested a queries-export from air-routes using g.V().hasLabel('airport').has('runways', gt(2)).project('code', 'runways', 'city', 'country').by('code').by('runways').by('city').by('country') with --format json and manually verified that the output remained correct (json is not subject to the rewriting process as each record does not need to conform to a single schema). Neither of the NeptuneStreams-based json formats have ever been supported for export-pg-from-queries, I have confirmed that they continue to fail with the same "NotSupported" exceptions as always.

@Cole-Greer Cole-Greer changed the title Fix missing headers in csv output for export-pg-from-queries Fix missing headers in csv output and enable schema inference in export-pg-from-queries Mar 28, 2025
Headers were previously not included as inferSchema was always configured to true. The result of this is that query results were initially written to headerless csv files, which in theory should have been later rewritten to files with headers. However, as there is no schema to rewrite the files to, it gets skipped and users are left with the headerless files. This is fixed by setting inferSchema to false for queries exports without structuredOutput. This way, the results are written straight to a proper csv file with a header.
@Cole-Greer Cole-Greer force-pushed the queries-csv-with-headers branch from ae81e21 to c86a1a3 Compare March 28, 2025 22:11
@Cole-Greer Cole-Greer marked this pull request as ready for review March 28, 2025 22:17
Copy link

@kmcginnes kmcginnes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Cole-Greer Cole-Greer merged commit bf11f69 into develop Mar 31, 2025
4 checks passed
@Cole-Greer Cole-Greer deleted the queries-csv-with-headers branch March 31, 2025 19:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants