feat: update BigQueryClient methods #2273
Conversation
```python
if user_request is not None and identifier_value is not None:
    raise ValueError(
        f"Provide either a request object or '{identifier_name}', not both."
    )
```
IIRC, there are cases where we merge these two in the existing hand-written client. For example, load jobs can take a string as a destination but merge in the job config object to the final request:
python-bigquery/google/cloud/bigquery/client.py, lines 2577 to 2590 in ef2740a:

```python
def load_table_from_file(
    self,
    file_obj: IO[bytes],
    destination: Union[Table, TableReference, TableListItem, str],
    rewind: bool = False,
    size: Optional[int] = None,
    num_retries: int = _DEFAULT_NUM_RETRIES,
    job_id: Optional[str] = None,
    job_id_prefix: Optional[str] = None,
    location: Optional[str] = None,
    project: Optional[str] = None,
    job_config: Optional[LoadJobConfig] = None,
    timeout: ResumableTimeoutType = DEFAULT_TIMEOUT,
) -> job.LoadJob:
```
I actually haven't thought too much about how the non-query jobs fit into this design, though. I suppose the user needs to specify more than just an identifier for all of the job types, so this method wouldn't apply?
Note: In addition to query jobs, load jobs using jobs.insert REST API will need a bit of handwritten magic to support load from local data via "resumable media uploads" (https://cloud.google.com/bigquery/docs/reference/api-uploads). I imagine we're planning on providing a separate hand-written helper for this, similar to queries? Actually do we know if the GAPICs even support the resumable upload API? AFAIK, it's only used in BigQuery, Cloud Storage, and Google Drive APIs. CC @parthea
How to handle the query experience is being designed by someone else and is not fully fleshed out for the Python libraries.
I would like to defer this as out of scope for the alpha release.
```python
DatasetIdentifier = Union[str, dataset_reference.DatasetReference]


# TODO: This variable is here to simplify prototyping, etc.
PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT")
```
Note: When using `google.auth.default()`, it will read this variable for you.
That upgrade is handled in this separate PR.
```python
self._clients: Dict[str, object] = {}
self._credentials = credentials
self._client_options = client_options
self.project = PROJECT_ID
```
Might want to upgrade this to a `@property` so that it can be included in the public documentation (and also to signal that it should be treated as read-only).
That upgrade is handled in this separate PR.
""" | ||
|
||
self._clients: Dict[str, object] = {} | ||
self._credentials = credentials |
We'll want to call `credentials, default_project_id = google.auth.default()` (https://googleapis.dev/python/google-auth/latest/reference/google.auth.html#google.auth.default) if the credentials aren't set, so that we don't repeat the auth flow for each sub-client construction.
That upgrade is handled in this separate PR.
```python
client_options:
    A dictionary of client options to pass to the underlying
    service clients.
```
The type above also allows for a single `client_options`.
For the alpha, I may leave out `client_options`; it will depend on schedule. If we keep it, I will ensure that this nuance is captured in the code.
```python
def __init__(
    self,
    *,
    credentials: Optional[auth_credentials.Credentials] = None,
```
In the past, `project` has been a pretty common input to the client constructor. This is especially important in environments like https://colab.research.google.com that don't have a project associated with the environment.

Likewise, `client_info` is another important one. That's how applications can amend the user-agent, which is very important for BQ to track usage attributable to open source "connectors" as well as partner company usage. I suppose we could guide such users to the raw clients, but if so, maybe `client_options` falls in the same bucket of "advanced" features and should be excluded?

Speaking of commonly used arguments: fun fact, the Kaggle team was (is?) using `_http` to provide a custom proxy to BigQuery for their special Kaggle free tier that included BQML, unlike the BQ Sandbox. It might be worth reaching out to them to come up with an alternative if they're still doing that. See: https://www.kaggle.com/discussions/general/50159 Note that 5 TB is 5x more than the BQ Sandbox's 1 TB. That post is from 8 years ago, though, so I don't know if it still applies.
Thank you for these insights into what customers commonly pass into the client constructor.

I will see what I can do to include `project` in the client constructor.

I may reserve `client_info` until the beta or release candidates due to schedule.
For those who are reading, this may provide some good context.

Right now, if you ran the GAPIC-generated code and wanted to call `get_dataset()`, this is what it would look like:
```python
# Create a client
from google.cloud.bigquery_v2 import bigquery_v2

client = bigquery_v2.DatasetServiceClient()  # default DatasetServiceClient

# Initialize request argument(s)
request = bigquery_v2.GetDatasetRequest(
    project_id="project_id_value",
    dataset_id="dataset_id_value",
)

# Make the request
# NOTE: due to the protobuf definition, there is no way to
# provide a "project_id_value.dataset_id_value" string to this method.
response = client.get_dataset(request=request)
```
As part of this alpha, we are trying to enable one basic transmogrification: allow a user to continue to be able to supply a "project_id_value.dataset_id_value" string to the method (if this proves useful and universally generatable, other convenience transformers will follow).

This is done by injecting several helper functions that can invisibly accept the string and create a `*Request` object for the user.
```python
# Create a client
from google.cloud.bigquery_v2 import bigquery_v2

bqclient = bigquery_v2.BigQueryClient()  # my new and improved hotness

# Initialize the string value
dataset_id = "project_id_value.dataset_id_value"

# Make the request
# Inside the centralized_client version of get_dataset, it accepts
# a "project_id_value.dataset_id_value" string and the helper functions
# break it apart and create a bigquery_v2.GetDatasetRequest that is passed
# to the dataset_service_client's version of get_dataset.
# This happens invisibly to the user.
response = bqclient.get_dataset(dataset_id)
```
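For illustration, here is a minimal, self-contained sketch of what such a helper chain might look like. The names here (`GetDatasetRequest`, `_dataset_request_from_id`, `get_dataset`) are stand-ins for the real generated classes and methods, and the real client would forward the built request to the underlying service client rather than returning it:

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class GetDatasetRequest:
    # Stand-in for the generated bigquery_v2.GetDatasetRequest message.
    project_id: str
    dataset_id: str


def _dataset_request_from_id(dataset_id: str) -> GetDatasetRequest:
    # rsplit at the LAST dot so legacy "domain:project.dataset" paths keep
    # the colon-qualified project intact.
    project_id, dataset_part = dataset_id.rsplit(".", 1)
    return GetDatasetRequest(project_id=project_id, dataset_id=dataset_part)


def get_dataset(request: Union[GetDatasetRequest, str]) -> GetDatasetRequest:
    # Invisibly upgrade a plain "project.dataset" string to a request object.
    if isinstance(request, str):
        request = _dataset_request_from_id(request)
    # A real client would now call the service client with this request;
    # the sketch just returns it so the transformation is visible.
    return request
```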
Including `project` is handled in this separate PR.

For the moment, I have not made any attempt to also process `client_info`.
google/cloud/bigquery_v2/services/centralized_service/client.py
""" | ||
if isinstance(dataset_id, str): | ||
project_id, dataset_id_str = self._parse_dataset_path(dataset_id) | ||
return {"project_id": project_id, "dataset_id": dataset_id_str} |
Curious to see `snake_case` here. IIRC the REST API mostly uses camelCase, but per https://protobuf.dev/programming-guides/json/ both should be acceptable.
This is not information that is sent directly to the API; it's a dictionary shared between helper functions.

As noted above, we send a `*Request` object per the `DatasetServiceClient.get_dataset()` method signature. The helper functions accept a string and invisibly create a `*Request` object for the user.

I will include a note to this effect in the docstring of the helper funcs.
Done.
```python
if isinstance(dataset_id, str):
    project_id, dataset_id_str = self._parse_dataset_path(dataset_id)
    return {"project_id": project_id, "dataset_id": dataset_id_str}
elif isinstance(dataset_id, dataset_reference.DatasetReference):
```
In this case, can't we use the https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html#google.protobuf.json_format.MessageToDict or the proto-plus equivalent?
This is not intended to be sent to the API. It is internal use only.
```python
def _parse_dataset_id_to_dict(self, dataset_id: DatasetIdentifier) -> dict:
    """
    Helper to create a request dictionary from a project_id and dataset_id.
```
By request dictionary, do we mean the canonical JSON representation of the protobuf? https://protobuf.dev/programming-guides/json/
No. Internal use only. Passed between helper functions, will update the docstrings.
Co-authored-by: Tim Sweña (Swast) <[email protected]>
```python
    A tuple of (project_id, dataset_id).
    """
    if "." in dataset_path:
        # Use rsplit to handle legacy paths like `google.com:my-project.my_dataset`.
```
For my education, is `google.com:my-project.my_dataset` supposed to yield `['google.com:my-project', 'my_dataset']` or `['my-project', 'my_dataset']`?
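Assuming the implementation uses `str.rsplit(".", 1)` as the code comment suggests, the split happens at the last dot, so the colon-qualified legacy project stays intact:

```python
# str.rsplit with maxsplit=1 splits at the LAST dot, so the legacy
# "domain:project" prefix survives as a single project identifier.
path = "google.com:my-project.my_dataset"
project_id, dataset_id = path.rsplit(".", 1)
print(project_id)  # google.com:my-project
print(dataset_id)  # my_dataset
```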
```python
def list_datasets(
    self,
    request: Optional[Union[dataset.ListDatasetsRequest, dict]] = None,
    project_id: Optional[str] = None,
```
Why is `project_id` before the star sign and `request` after? Is there any special consideration about the order here?
Perhaps we can separate the changes in this file related to test setup into a new PR, and use this PR just for things related to the CRUD methods for datasets.

Overtaken by events. Superseded by another PR.

This PR is overtaken by events. Other PRs are going to be used to include this content. Closing.
This PR:

- accepts `project_id.dataset_id` convenience strings
- adds `*SERVICECLIENT` attributes (`*SERVICECLIENTs`) to allow discussion about the approach. The methods take in either a request object OR a convenience string as an argument and transmogrify the input so that the underlying service client attribute can be called directly, e.g. `get_dataset()`, `list_datasets()`

NOTES

- targets the `autogen` dev branch.