Skip to content

add nulls parse option in readExcel #1247

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jun 16, 2025
Merged

add nulls parse option in readExcel #1247

merged 4 commits into from
Jun 16, 2025

Conversation

AndreiKingsley
Copy link
Collaborator

Fixes #1166.

@@ -255,6 +271,8 @@ public fun DataFrame.Companion.readExcel(
* when set to false, it operates as [NameRepairStrategy.MAKE_UNIQUE],
* ensuring unique column names will make the columns be named according to excel columns, like "A", "B", "C" etc.
* for unstructured data.
* @param parseEmptyAsNull when set to true, empty strings in cells are parsed as null (default true).
Copy link
Collaborator

@Jolanrensen Jolanrensen Jun 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may be more representations in strings that can refer to null. maybe it might be better to follow readCsv and other parsing options for this. It will be overkill to add the full ParserOptions as argument (since datetime and locales are already handled by excel), but csv, for instance uses nullStrings: Set<String>= setOf("", "NA", "N/A", "null", "NULL", "None", "none", "NIL", "nil"), so it might make sense to add a nullStrings argument here as well, probably with fewer defaults

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created an issue:
#1248

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think is a good idea to introduce a new non-trivial logic just before release, need more time for implementation and testing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay sure, makes sense

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

speaking off, this is a binary-breaking change. Since we released a beta already, we cannot break the API like that without a fallback mechanism. Could you add the old function signatures back with @Deprecated() at level HIDDEN? That way, we won't break libraries calling the old function :)

@@ -59,6 +59,9 @@ public class Excel : SupportedDataFrameFormat {
DefaultReadExcelMethod(pathRepresentation)
}

private const val MESSAGE_1_1 = "Will be ERROR in 1.1."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HIDDEN >= ERROR already. You can remove this message or say that they are removed in 1.1

rowsCount,
nameRepairStrategy,
firstRowIsHeader,
parseEmptyAsNull,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My 5 cents: how about enum parameter "emptyStringValue" with NULL, KEEP_AS_STRING options?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or, if we're looking to extend this with other null strings #1248, it could be some sort of sealed interface (String) -> String? with companion object { val NULL = ...; val KEEP_AS_STRING ... }

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like something like this too, but it would deviate even more for how we handle parsing String -> null with parse(), ParserOptions, and readCsv(). They now take a simple set of null-representing strings.
Maybe we could change it all to the same system at some point, but until then I'd keep this one as simple as possible.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about enum then?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an enum could work too indeed, as long as we can make a simple @Deprecated ReplaceWith eventually, I'm fine with it :)

@AndreiKingsley AndreiKingsley merged commit f5707a8 into master Jun 16, 2025
5 checks passed
@AndreiKingsley AndreiKingsley deleted the xlsx_null_parse branch June 16, 2025 14:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect XLSX parsing
3 participants