Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

in Google Sheets column droped if first value is null #591

Open
korbash opened this issue Dec 27, 2024 · 3 comments
Open

in Google Sheets column droped if first value is null #591

korbash opened this issue Dec 27, 2024 · 3 comments

Comments

@korbash
Copy link

korbash commented Dec 27, 2024

dlt version

1.4.1

Source name

google_sheets

Describe the problem

Issue:

When working with this table in Google Sheets:

Col0 Col1
a
b c

The second column (Col1) droped by source.

What should happen:

Both columns (Col0 and Col1) should keep their data as entered.

What actually happens:

The second column (Col1) gets dropped.

Expected behavior

No response

Steps to reproduce

load this csv to google sheets
col0, col1
a,
b, c
then read it with
google_spreadsheet(
spreadsheet_url_or_id=url,
credentials=creds,
range_names=["list1!a1:b3"]
)

How you are using the source?

I run this source in production.

Operating system

Linux

Runtime environment

Local

Python version

3.12.3

dlt destination

snowflake

Additional information

No response

@korbash korbash changed the title in Google Sheets drop column if first value is null in Google Sheets column droped if first value is null Dec 27, 2024
@neuromantik33
Copy link

neuromantik33 commented Jan 16, 2025

I have also reproduced this bug. The problem lies in this block of code:

# first row contains headers and is skipped
data_row_metadata = metadata["rowData"][1]["values"]
rows_data = values[1:]

The problem is that it looks at the first row to get the data_row_metadata. If the 1st data row is smaller than the header row, as it is in my case, the column is silently ignored.

Image

Unfortunately this trick cannot work. It would be probably best to simply find the 1st row with all columns, or pad the data_types so that it's length is equal to the number of headers.

@neuromantik33
Copy link

If it helps, this is my very ugly workaround 😅

diff --git a/intra_dlt/helpers/google_sheets/__init__.py b/intra_dlt/helpers/google_sheets/__init__.py
index 5bdb623..5a5db60 100644
--- a/intra_dlt/helpers/google_sheets/__init__.py
+++ b/intra_dlt/helpers/google_sheets/__init__.py
@@ -138,6 +138,10 @@ def google_spreadsheet(
             rows_data = values[1:]
 
         data_types = get_data_types(data_row_metadata)
+        len_diff = len(headers) - len(data_types)
+        if len_diff > 0:
+            for i in range(len_diff):
+                data_types.append(None)
 
         yield dlt.resource(
             process_range(rows_data, headers=headers, data_types=data_types),

@rudolfix
Copy link
Contributor

@neuromantik33 this will prevent an error but also prevent getting correct metadata for the column where the value is missing. A better solution would be to take more metadata rows and find one best matching the list of columns. This is OFC way more work than fix you do here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Planned
Development

No branches or pull requests

3 participants