Conversation

@Braalfa Braalfa commented Aug 5, 2025

Feature or Bugfix

  • Bugfix

Detail

  • When to_iceberg is called with a DataFrame that contains new columns, it adds those columns to the table schema but does not upload their values, so a second call is needed to actually write them. This happens because the statement df = df[catalog_cols] drops the new columns from the DataFrame.
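
The column drop described above can be reproduced with plain pandas. This is a minimal sketch, not the library's code: the column names and the contents of catalog_cols are made up for illustration, and the "preserved" variant at the end is only one possible way to keep the new columns, not necessarily the change made in this PR.

```python
import pandas as pd

# Hypothetical stand-in for the columns already registered in the catalog.
catalog_cols = ["c0", "c1"]

# The incoming frame lacks c1 (a "missing column") and brings a new column c2.
df = pd.DataFrame({"c0": [1, 2], "c2": ["a", "b"]})

# Fill the missing column, as the existing code path does.
df["c1"] = None

# Selecting only the catalog columns silently discards c2 and its values.
reordered = df[catalog_cols]
dropped = [c for c in df.columns if c not in reordered.columns]
print(dropped)  # ['c2']

# One possible alternative (an assumption): keep the catalog order first,
# then append any columns that are not in the catalog yet.
new_cols = [c for c in df.columns if c not in catalog_cols]
preserved = df[catalog_cols + new_cols]
print(list(preserved.columns))  # ['c0', 'c1', 'c2']
```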

Relates

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@kukushking
Contributor

Hi @Braalfa, thank you for spotting this. We have a test case for new columns, but it seems to check only the length, not the values. Would you mind also updating the test case to verify that the fix works as intended?

test_athena_to_iceberg_schema_evolution_add_columns

@Braalfa
Author

Braalfa commented Sep 5, 2025

Hi @kukushking! Thanks for reviewing my PR.

I added a DataFrame comparison to the test to make sure the remote values are as expected. I also made one further change to the test: I removed column c1 from the second uploaded DataFrame, because the issue I'm fixing only occurs when the uploaded DataFrame is missing columns (missing_columns) relative to the remote table, which triggers the buggy code:

    if fill_missing_columns_in_df and schema_differences["missing_columns"]:
        for col_name, col_type in schema_differences["missing_columns"].items():
            df[col_name] = None
            df[col_name] = df[col_name].astype(_data_types.athena2pandas(col_type))

        schema_differences["missing_columns"] = {}

        # Ensure that the ordering of the DF is the same as in the catalog.
        # This is required for the INSERT command to work.
        df = df[catalog_cols]

Also, I used c0 as an ordering index, because read_sql_table may return the rows in any order.
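
For readers who want to see the order-independent comparison in isolation, here is a rough sketch using only pandas. The helper name assert_same_rows and the sample data are hypothetical and are not taken from the test suite; the only assumption carried over from the comment above is that c0 uniquely identifies each row.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def assert_same_rows(actual: pd.DataFrame, expected: pd.DataFrame, key: str = "c0") -> None:
    """Compare two frames while ignoring row order, by sorting both on a key column."""
    actual_sorted = actual.sort_values(key).reset_index(drop=True)
    expected_sorted = expected.sort_values(key).reset_index(drop=True)
    assert_frame_equal(actual_sorted, expected_sorted)


# Made-up data: the same rows, returned in a different order.
expected = pd.DataFrame({"c0": [1, 2, 3], "c2": ["a", "b", "c"]})
shuffled = expected.iloc[[2, 0, 1]]

assert_same_rows(shuffled, expected)  # passes: same rows, different order
```

Sorting on a key and resetting the index keeps the check strict about values and dtypes while tolerating whatever row order the query returns.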

@jaidisido
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 5f561c5
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@jaidisido
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: ddeee46
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@jaidisido
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 9b7e614
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@jaidisido
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: 5f561c5
  • Result: FAILED
  • Build Logs (available for 30 days)

@jaidisido
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: ddeee46
  • Result: FAILED
  • Build Logs (available for 30 days)

@jaidisido
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: 9b7e614
  • Result: FAILED
  • Build Logs (available for 30 days)

3 participants