Core: REST Scan Planning Task Implementation #13400
amogh-jahagirdar
left a comment
I'm still going through things, but I do believe this works, at least from a correctness perspective. I still need to give more thought to cancellation, client/server backpressure, and how this would fit engines that start task consumption/execution immediately during planning (like Trino).
Thank you for the review. At present I have only made the E2E machinery work, with the later parser changes. In addition to the points you mentioned, I was also thinking about this from the POV:
Update: I am working on refactoring this a bit more.
My understanding is that ParallelIterable could be helpful here, since it is aware of both the consumer and the producer: iceberg/core/src/main/java/org/apache/iceberg/util/ParallelIterable.java, lines 200 to 205 in be03c99.
Agreed, this needs more thorough thought from the server POV as well. From my cursory reading of the Trino source code (I am fairly new to it), https://github.com/trinodb/trino/blob/master/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplitManager.java#L149, generating splits while consuming them should mostly work, as this seems to be built into the engine itself. For engines that need all the splits computed first, we have no choice but to consume everything.
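To make the producer/consumer point concrete, here is a minimal sketch of client-side backpressure using a bounded queue. It does not reproduce ParallelIterable's internals; `BoundedTaskPipe`, `drain`, and the use of plain strings as stand-ins for scan tasks are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical illustration of bounded-queue backpressure; not Iceberg's ParallelIterable. */
final class BoundedTaskPipe {
  private BoundedTaskPipe() {}

  /** Streams tasks through a queue of the given capacity (must be >= 1) and returns them in order. */
  static List<String> drain(List<String> tasks, int capacity) {
    // An empty Optional marks the end of the stream.
    BlockingQueue<Optional<String>> queue = new ArrayBlockingQueue<>(capacity);

    Thread producer = new Thread(() -> {
      try {
        for (String task : tasks) {
          // put() blocks while the queue is full, so a slow consumer throttles the producer.
          queue.put(Optional.of(task));
        }
        queue.put(Optional.empty());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    producer.setDaemon(true);
    producer.start();

    List<String> consumed = new ArrayList<>();
    try {
      Optional<String> next = queue.take();
      while (next.isPresent()) {
        consumed.add(next.get());
        next = queue.take();
      }
    } catch (InterruptedException e) {
      // Treat interruption as cancellation: stop consuming and stop the producer too.
      Thread.currentThread().interrupt();
      producer.interrupt();
    }
    return consumed;
  }
}
```

In this sketch the producer never holds more than `capacity` tasks ahead of the consumer, which is the property an engine that consumes splits during planning would rely on. An engine that needs every split up front would simply drain the whole queue before starting execution.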
    case SUBMITTED:
      try {
        // TODO: if we want to add some jitter here to avoid thundering herd.
        Thread.sleep(FETCH_PLANNING_SLEEP_DURATION_MS);
We should maybe try to avoid "busy waiting" here and instead use a higher-level concurrency construct to achieve the same thing.
Maybe this entire block should use the functionality from Tasks, where you can also specify an exponential backoff.
The retry behaviour in Tasks is wired to exception handling. I can throw an in-progress exception to make it work with Tasks, though I think throwing an exception might not be the right thing to do.
Failsafe has a good retry policy, and I see that we already use it in the aws module. Please let me know if you are open to introducing it in core; we could then use the library for exponential backoff.
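For reference, the options discussed in this thread can be compared against a small hand-rolled loop. The following is only a sketch of polling with capped exponential backoff and full jitter that does not rely on exceptions for the in-progress case. It uses neither Tasks nor Failsafe, and `PlanPoller`, `Status`, `backoffMs`, and `pollUntilDone` are hypothetical names, not part of the PR.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

/** Hypothetical polling helper; names are illustrative and not part of Iceberg. */
final class PlanPoller {
  /** Stand-in for the planning status returned by the server. */
  enum Status { SUBMITTED, COMPLETED, FAILED, CANCELLED }

  private PlanPoller() {}

  /** Upper bound on the delay for an attempt: min(maxMs, baseMs * 2^attempt). Assumes a small baseMs. */
  static long backoffMs(int attempt, long baseMs, long maxMs) {
    int shift = Math.min(Math.max(attempt, 0), 20); // cap the exponent to avoid overflow
    return Math.min(maxMs, baseMs << shift);
  }

  /** Polls until the status leaves SUBMITTED, sleeping a random time in [0, backoff] between polls. */
  static Status pollUntilDone(Supplier<Status> fetch, int maxAttempts, long baseMs, long maxMs) {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      Status status = fetch.get();
      if (status != Status.SUBMITTED) {
        return status; // terminal state reached without any exception-driven retry
      }
      long cap = backoffMs(attempt, baseMs, maxMs);
      try {
        // Full jitter spreads clients out and avoids a thundering herd.
        Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1));
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("Interrupted while waiting for planning", e);
      }
    }
    throw new IllegalStateException("Planning did not finish after " + maxAttempts + " attempts");
  }
}
```

A caller would pass a supplier that performs the fetch-planning-result request, for example `PlanPoller.pollUntilDone(() -> fetchStatus(planId), 10, 100, 5_000)`, where `fetchStatus` is again a placeholder. This still blocks the calling thread between polls; a scheduled executor would be one way to avoid that.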
nastra
left a comment
I did another pass and left a few more comments, focusing mostly on RESTTable and RESTTableScan.
I believe we also need some tests with a REST server that doesn't support the new endpoints, where we verify that the correct errors come back depending on which endpoints the server supports. Basically, you'd have a custom Adapter that throws or not when a particular endpoint is called.
Maybe we should extract all of the server-side scan planning tests into a separate class, because TestRESTCatalog is hitting almost 4k LOC.
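The custom adapter idea could look roughly like the sketch below. It is a self-contained stand-in and not the catalog adapter used in the existing tests: `SelectiveAdapter`, `execute`, and the endpoint labels are hypothetical placeholders, and a real test would wrap the actual adapter and assert on the actual error type.

```java
import java.util.Set;
import java.util.function.Function;

/** Hypothetical test double: forwards supported endpoints and rejects everything else. */
final class SelectiveAdapter {
  private final Set<String> supportedEndpoints;
  private final Function<String, String> delegate;

  SelectiveAdapter(Set<String> supportedEndpoints, Function<String, String> delegate) {
    this.supportedEndpoints = Set.copyOf(supportedEndpoints);
    this.delegate = delegate;
  }

  /** Forwards the call to the delegate only if the endpoint label is in the supported set. */
  String execute(String endpoint) {
    if (!supportedEndpoints.contains(endpoint)) {
      throw new UnsupportedOperationException("Server does not support endpoint: " + endpoint);
    }
    return delegate.apply(endpoint);
  }
}
```

Each test can then construct the adapter with a different supported set, for example only the plan endpoint, and verify that the client surfaces the expected error when it reaches an unsupported one.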
🥞 Stacked PR
Please use this link for viewing incremental changes.
Current Stack status:
* Add Rest Model and Parsers [Merged]
About the change
[1] Adds routes for the scan planning API
[2] Adds RESTTable and RESTTableScan, which can call the scan endpoint
[3] Adds ScanIterable, which uses ParallelIterable to fetch scan tasks
Testing
Added unit tests for the E2E request/response loop.