Conversation

@chalmerlowe (Collaborator) commented Aug 28, 2025

This PR:

  • Adds several helper functions that enable processing of convenience strings (e.g. project_id.dataset_id)
  • Updates the setters and getters for the *SERVICECLIENT attributes
  • Updates docstrings and comments across the board
  • Updates TWO methods associated with the *SERVICECLIENTs to allow discussion about the approach. The methods take in either a request object OR a convenience string as an argument and transmogrify the input so that the underlying service client attribute can be called directly, e.g.:
    • get_dataset()
    • list_datasets()
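
The request-or-string pattern those two methods use can be sketched roughly as follows (a minimal sketch: a plain dict stands in for the generated *Request type, and the function name and argument shapes are illustrative, not the actual implementation):

```python
from typing import Optional


def get_dataset(request: Optional[dict] = None,
                dataset_id: Optional[str] = None) -> dict:
    """Sketch: accept either a prebuilt request or a convenience string."""
    if request is not None and dataset_id is not None:
        raise ValueError("Provide either a request object or 'dataset_id', not both.")
    if dataset_id is not None:
        # Split on the LAST dot so legacy ids like
        # "google.com:my-project.my_dataset" keep the dot in the project id.
        project_id, dataset_part = dataset_id.rsplit(".", 1)
        request = {"project_id": project_id, "dataset_id": dataset_part}
    # The real method would now forward `request` to the underlying
    # DatasetServiceClient; here we just return it.
    return request
```

The same shape would generalize to list_datasets() taking a project_id string.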

NOTES

  1. Getting the approach to these service clients and methods right is important because the microgenerator will be updated to create seven clients and 30+ methods.
  2. The GitHub Actions are being used to help ensure that unit tests pass.
    [Screenshot: GitHub Actions unit tests passing]
  3. The KOKORO tests are failing. This is a known problem and will be dealt with in a separate PR. It should not affect merging into the autogen dev branch.
    [Screenshot: Kokoro test failures]

@chalmerlowe chalmerlowe requested review from a team as code owners August 28, 2025 15:42
@chalmerlowe chalmerlowe requested review from leahecole and removed request for a team August 28, 2025 15:42
@product-auto-label product-auto-label bot added size: l Pull request size is large. api: bigquery Issues related to the googleapis/python-bigquery API. labels Aug 28, 2025
Comment on lines +83 to +86
if user_request is not None and identifier_value is not None:
    raise ValueError(
        f"Provide either a request object or '{identifier_name}', not both."
    )
Contributor

IIRC, there are cases where we merge these two in the existing hand-written client. For example, load jobs can take a string as a destination but merge in the job config object to the final request:

def load_table_from_file(
    self,
    file_obj: IO[bytes],
    destination: Union[Table, TableReference, TableListItem, str],
    rewind: bool = False,
    size: Optional[int] = None,
    num_retries: int = _DEFAULT_NUM_RETRIES,
    job_id: Optional[str] = None,
    job_id_prefix: Optional[str] = None,
    location: Optional[str] = None,
    project: Optional[str] = None,
    job_config: Optional[LoadJobConfig] = None,
    timeout: ResumableTimeoutType = DEFAULT_TIMEOUT,
) -> job.LoadJob:

I actually haven't thought too much about how the non-query jobs fit into this design, though. I suppose the user needs to specify more than just an identifier for all of the job types, so this method wouldn't apply?

Note: In addition to query jobs, load jobs using jobs.insert REST API will need a bit of handwritten magic to support load from local data via "resumable media uploads" (https://cloud.google.com/bigquery/docs/reference/api-uploads). I imagine we're planning on providing a separate hand-written helper for this, similar to queries? Actually do we know if the GAPICs even support the resumable upload API? AFAIK, it's only used in BigQuery, Cloud Storage, and Google Drive APIs. CC @parthea

Collaborator Author
@chalmerlowe chalmerlowe Sep 3, 2025

How to handle the Query experience is being designed by someone else and is not fully fleshed out for the python libraries.

I would like to defer this as out of scope for the alpha release.

DatasetIdentifier = Union[str, dataset_reference.DatasetReference]

# TODO: This variable is here to simplify prototyping, etc.
PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT")
Contributor

Collaborator Author

That upgrade is handled in this separate PR.

self._clients: Dict[str, object] = {}
self._credentials = credentials
self._client_options = client_options
self.project = PROJECT_ID
Contributor

Might want to upgrade this to a @property so that it can be included in public documentation (and also to signal that it is intended to be read-only).

Collaborator Author

That upgrade is handled in this separate PR.

"""

self._clients: Dict[str, object] = {}
self._credentials = credentials
Contributor

We'll want to call credentials, default_project_id = google.auth.default() (https://googleapis.dev/python/google-auth/latest/reference/google.auth.html#google.auth.default) if the credentials aren't set so that we don't repeat the auth flow for each sub-client construction.
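
That suggestion might look like the following sketch (hedged: resolve_credentials is an illustrative name, not part of the PR, and the import is deferred so that explicitly supplied credentials never trigger the ADC flow):

```python
from typing import Optional, Tuple


def resolve_credentials(credentials=None) -> Tuple[object, Optional[str]]:
    """Resolve credentials exactly once, falling back to Application
    Default Credentials, so every sub-client can share the result
    instead of repeating the auth flow."""
    if credentials is None:
        import google.auth  # deferred: only runs when no credentials were given
        credentials, default_project_id = google.auth.default()
        return credentials, default_project_id
    return credentials, None
```

The resolved pair would then be stored on the central client and handed to each sub-client constructor.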

Collaborator Author

That upgrade is handled in this separate PR.

Comment on lines +79 to +81
client_options:
A dictionary of client options to pass to the underlying
service clients.
Contributor

The type above also allows for a single client_options.

Collaborator Author
@chalmerlowe chalmerlowe Sep 3, 2025

For the alpha, I may leave out the client_options. It will depend on schedule. If we keep it, I will ensure that this nuance is captured in the code.

def __init__(
    self,
    *,
    credentials: Optional[auth_credentials.Credentials] = None,
Contributor

In the past, project is a pretty common input to the client constructor. This is especially important in environments like https://colab.research.google.com that don't have a project associated with the environment.

Likewise, client_info is another important one. That's how applications can amend the user-agent, which is very important for BQ to track usage attributable to open source "connectors" as well as partner company usage. I suppose we could guide such users to use the raw clients then, but if so maybe client_options falls in the same bucket of "advanced" features and should be excluded?

Speaking of commonly used arguments: Fun fact, the Kaggle team was (is?) using _http to provide a custom proxy to BigQuery for their special Kaggle free tier that included BQML, unlike the BQ Sandbox. Might be worth reaching out to them to come up with an alternative if they're still doing that. See: https://www.kaggle.com/discussions/general/50159 Note that 5 TB is 5x more than the BQ Sandbox 1 TB. That post was from 8 years ago, though, so I don't know if that still applies.

Collaborator Author
@chalmerlowe chalmerlowe Sep 3, 2025

Thank you for these insights into what customers commonly pass into the client constructor.
I will see what I can do to include project in the client constructor.
I may reserve client_info until the beta or release candidates due to schedule.

Collaborator Author
@chalmerlowe chalmerlowe Sep 3, 2025

For those who are reading, this may provide some good context:

Right now, if you ran the GAPIC generated code and wanted to get_dataset(), this is what that would look like:

# Create a client
from google.cloud.bigquery_v2 import bigquery_v2
client = bigquery_v2.DatasetServiceClient() # default DatasetServiceClient

# Initialize request argument(s)
request = bigquery_v2.GetDatasetRequest(
    project_id="project_id_value",
    dataset_id="dataset_id_value",
)

# Make the request
#     NOTE: due to the protobuf definition, there is no way to
#     provide a "project_id_value.dataset_id_value" string to this method.
response = client.get_dataset(request=request)

As part of this alpha, we are trying to enable one basic transmogrification: allow a user to continue to be able to supply a "project_id_value.dataset_id_value" string to the method (if this proves useful and universally generatable, other convenience transformers will follow).

This is done by injecting several helper functions that can invisibly accept the string and create a *Request object for the user.

# Create a client
from google.cloud.bigquery_v2 import bigquery_v2
bqclient = bigquery_v2.BigQueryClient() # my new and improved hotness

# Initialize the string value 
dataset_id = "project_id_value.dataset_id_value"

# Make the request
#     Inside the centralized_client version of get_dataset, it accepts
#     a "project_id_value.dataset_id_value" string and the helper functions
#     break it apart and create a bigquery_v2.GetDatasetRequest that is passed
#     to the dataset_service_client's version of get_dataset.
#     This happens invisibly to the user.

response = bqclient.get_dataset(dataset_id)

Collaborator Author

Including project is handled in this separate PR.

For the moment, I have not made any attempts to also process client_info.

"""
if isinstance(dataset_id, str):
project_id, dataset_id_str = self._parse_dataset_path(dataset_id)
return {"project_id": project_id, "dataset_id": dataset_id_str}
Contributor

Curious to see snake_case here. IIRC the REST API mostly uses camel-case, but per https://protobuf.dev/programming-guides/json/ both should be acceptable.

Collaborator Author

This is not information that is sent directly to the API.
It's a dictionary shared between helper functions.
As noted above, we send a *Request object per the DatasetServiceClient.get_dataset() method signature. The helper functions accept a string and invisibly create a *Request object for the user.

I will include a note to this effect in the docstring of the helper funcs.

Collaborator Author

Done.

if isinstance(dataset_id, str):
    project_id, dataset_id_str = self._parse_dataset_path(dataset_id)
    return {"project_id": project_id, "dataset_id": dataset_id_str}
elif isinstance(dataset_id, dataset_reference.DatasetReference):
Contributor

Collaborator Author

This is not intended to be sent to the API. It is internal use only.


def _parse_dataset_id_to_dict(self, dataset_id: DatasetIdentifier) -> dict:
    """
    Helper to create a request dictionary from a project_id and dataset_id.
Contributor

By request dictionary, do we mean the canonical JSON representation of the protobuf? https://protobuf.dev/programming-guides/json/

Collaborator Author
@chalmerlowe chalmerlowe Sep 3, 2025

No. Internal use only; it is passed between helper functions. I will update the docstrings.

    A tuple of (project_id, dataset_id).
    """
    if "." in dataset_path:
        # Use rsplit to handle legacy paths like `google.com:my-project.my_dataset`.
Contributor

For my education, is google.com:my-project.my_dataset supposed to yield ['google.com:my-project', 'my_dataset'] or ['my-project', 'my_dataset']?
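
For what it's worth, rsplit(".", 1) splits on the last dot only, so it yields the former:

```python
# rsplit with maxsplit=1 splits on the LAST dot, leaving the dot
# inside "google.com" attached to the project id.
parts = "google.com:my-project.my_dataset".rsplit(".", 1)
print(parts)  # → ['google.com:my-project', 'my_dataset']
```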

def list_datasets(
    self,
    request: Optional[Union[dataset.ListDatasetsRequest, dict]] = None,
    project_id: Optional[str] = None,
Contributor

Why is project_id before the star sign and request after? Is there any special consideration about order here?

Contributor

Perhaps we can separate the changes in this file related to test setup to a new PR, and use this PR just for things related to the CRUD methods for datasets.

@chalmerlowe
Collaborator Author

Overtaken by events. Superseded by another PR.

@chalmerlowe
Collaborator Author

This PR is overtaken by events. Other PRs are going to be used to include this content. Closing.
