Skip to content

[core]python: support pfvs in python #5930

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 37 commits into from
Jul 24, 2025

Conversation

jerry-024
Copy link
Contributor

@jerry-024 jerry-024 commented Jul 21, 2025

Purpose

support pfvs in python: implements fsspec

Tests

  • PVFSTestCase.test_arrow
  • PVFSTestCase.test_ray
  • PVFSTestCase.test_api

API and Format

Documentation

@jerry-024 jerry-024 marked this pull request as draft July 21, 2025 03:38
@jerry-024 jerry-024 force-pushed the support_pvfs_python branch from 3b792ca to ddc23d1 Compare July 22, 2025 07:58
@jerry-024 jerry-024 marked this pull request as ready for review July 23, 2025 06:01
for table_name in self.rest_api.list_tables(database_name):
self.rest_api.drop_table(Identifier.create(database_name, table_name))
self.rest_api.drop_database(database_name)
return True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here recursive drop database is not right. Rest api drop database will just delete database, should check table existence when recursive is false. We can take RESTCatalog.dropDatabase as reference.

for table_name in self.rest_api.list_tables(database_name):
self.rest_api.drop_table(Identifier.create(database_name, table_name))
self.rest_api.drop_database(database_name)
return True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems rmdir should only support delete empty dir. Can we make a double check of its behavior?

self._create_object_table(pvfs_identifier)
except AlreadyExistsException:
pass
table = self.rest_api.get_table(table_identifier)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can optimize by first get_table, and then create if table not exist, in order to reduce api calls in most cases.

f"created is not supported for path: {path}"
)
elif isinstance(pvfs_identifier, PVFSDatabaseIdentifier):
return self.rest_api.get_database(pvfs_identifier.name).created_at
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to https://filesystem-spec.readthedocs.io/en/latest/api.html, created should return datetime.datetime, created_at is a long value in millisecond.

elif isinstance(pvfs_identifier, PVFSTableIdentifier):
table_identifier = Identifier.create(pvfs_identifier.database, pvfs_identifier.name)
if pvfs_identifier.sub_path is None:
return self.rest_api.get_table(table_identifier).created_at
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should convert to datetime.datetime.

f"modified is not supported for path: {path}"
)
elif isinstance(pvfs_identifier, PVFSDatabaseIdentifier):
return self.rest_api.get_database(pvfs_identifier.name).updated_at
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should convert to datetime.datetime.

elif isinstance(pvfs_identifier, PVFSTableIdentifier):
table_identifier = Identifier.create(pvfs_identifier.database, pvfs_identifier.name)
if pvfs_identifier.sub_path is None:
return self.rest_api.get_table(table_identifier).updated_at
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should convert to datetime.datetime.

@timmyyao
Copy link
Contributor

It is better to set user agent. In Java SDK, we set RESTCatalogOptions.HTTP_USER_AGENT as HadoopPVFS. So here maybe we can use PythonPVFS?

Copy link
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename paimon_options.py to config.py. And remove pvfs_config.py.

Copy link
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@JingsongLi JingsongLi merged commit 76f6223 into apache:master Jul 24, 2025
22 checks passed
@jerry-024 jerry-024 deleted the support_pvfs_python branch July 24, 2025 15:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants