Contributor

@singhpk234 singhpk234 commented Jun 26, 2025

🥞 Stacked PR

Please use this link for viewing incremental changes.

Current Stack status:

About the change

[1] Add routes for the scan Planning API
[2] Adds RestTable and RestTableScan which can call the scan endpoint
[3] ScanIterable which uses ParallelIterable to fetch scan tasks

Testing

Added unit tests for the end-to-end request/response loop.

@github-actions github-actions bot added the core label Jun 26, 2025
@singhpk234 singhpk234 force-pushed the feature/part-2-core-integ branch from 3624aee to c55e0e4 Compare June 27, 2025 15:23
@singhpk234 singhpk234 closed this Jun 27, 2025
@singhpk234 singhpk234 reopened this Jun 27, 2025
@singhpk234 singhpk234 force-pushed the feature/part-2-core-integ branch from c55e0e4 to 0f225bf Compare July 11, 2025 23:50
@singhpk234 singhpk234 closed this Jul 12, 2025
@singhpk234 singhpk234 reopened this Jul 12, 2025
@singhpk234 singhpk234 marked this pull request as ready for review July 13, 2025 18:46
@singhpk234 singhpk234 force-pushed the feature/part-2-core-integ branch 2 times, most recently from 8f6e519 to 49a1392 Compare August 15, 2025 23:57
@singhpk234 singhpk234 marked this pull request as draft August 15, 2025 23:57
@singhpk234 singhpk234 marked this pull request as ready for review August 17, 2025 09:56
@singhpk234 singhpk234 force-pushed the feature/part-2-core-integ branch from 49a1392 to d752e0a Compare August 17, 2025 20:22
@amogh-jahagirdar amogh-jahagirdar changed the title Part 2: Integrate Scan Planning to Core Core: REST Scan Planning Task Implementation Aug 18, 2025
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

I'm still going through things, but I do believe this works, at least from a correctness perspective. I still need to give some more thought to cancellation, client/server backpressure and how this would fit in for engines which immediately start task consumption/execution during planning (like Trino).

Contributor Author

singhpk234 commented Aug 18, 2025

Thank you for the review. At present I have just made the E2E machinery work with the later parser changes. In addition to the points you mentioned, I was also thinking about this from the following points of view:

  1. Can the server force the client to call the scan planning API?
  2. Can the server tell the client what the interval between fetch-scan-tasks calls should be? (I think this is what you refer to as backpressure in your comment above.)
  3. I keep coming back to the feature "[CORE] Support file filtering based on schema #4842". It may be orthogonal,
    but with schema evolution we certainly don't want to send back data file objects that certainly don't have the column.

@singhpk234 singhpk234 force-pushed the feature/part-2-core-integ branch from 52b3957 to e3de068 Compare August 29, 2025 19:03
Contributor Author

singhpk234 commented Sep 4, 2025

Update: Working on refactoring this a bit more.

> client/server backpressure and how this would fit in for engines which immediately start task consumption/execution during planning (like Trino)

My understanding is that ParallelIterable could be helpful here, as it is aware of both the consumer and the producer
and handles backpressure via yields:

// If the consumer is processing records more slowly than the producers, the producers will
// eventually fill the queue and yield, returning continuations. Continuations and new tasks
// are started by checkTasks(). The check here prevents us from restarting continuations or
// starting new tasks before the queue is emptied. Restarting too early would lead to tasks
// yielding very quickly (CPU waste on scheduling).
if (!queue.isEmpty()) {

Agreed, we need to think this through more thoroughly, also from the server's point of view.

From my cursory reading of the Trino source code (I am fairly new to it), https://github.com/trinodb/trino/blob/master/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplitManager.java#L149, generating splits while they are being consumed should mostly work, as this seems to be built into the engine itself. For engines that need all the splits computed first, we have no choice but to consume everything.
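
To make the backpressure idea concrete, here is a minimal sketch using a plain bounded queue. This is only an illustration and not how ParallelIterable is implemented (it yields continuations instead of blocking the producer thread). The class name, the `DONE` marker, and the task strings are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedQueueBackpressure {
  // Hypothetical end-of-stream marker; real scan tasks would not be strings.
  private static final String DONE = "__done__";

  /** Pushes {@code total} tasks through a queue of size {@code capacity}; returns what was consumed. */
  public static List<String> run(int total, int capacity) {
    BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);

    Thread producer = new Thread(() -> {
      try {
        for (int i = 0; i < total; i++) {
          // put() blocks while the queue is full, so a slow consumer
          // automatically slows the producer down (backpressure).
          queue.put("task-" + i);
        }
        queue.put(DONE);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    producer.start();

    List<String> consumed = new ArrayList<>();
    try {
      while (true) {
        String task = queue.take(); // blocks while the queue is empty
        if (DONE.equals(task)) {
          break;
        }
        consumed.add(task);
      }
      producer.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while consuming tasks", e);
    }
    return consumed;
  }

  public static void main(String[] args) {
    System.out.println(run(5, 2)); // prints [task-0, task-1, task-2, task-3, task-4]
  }
}
```

An engine that consumes splits as they arrive would sit on the `take()` side, while an engine that needs every split up front would simply drain the queue to the end before starting execution.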

@sfc-gh-prsingh sfc-gh-prsingh force-pushed the feature/part-2-core-integ branch 2 times, most recently from 540a4d6 to 08b1ce4 Compare September 9, 2025 23:52
@singhpk234 singhpk234 force-pushed the feature/part-2-core-integ branch 2 times, most recently from 69d6ac1 to b85d8e6 Compare September 16, 2025 22:39
case SUBMITTED:
try {
// TODO: if we want to add some jitter here to avoid thundering herd.
Thread.sleep(FETCH_PLANNING_SLEEP_DURATION_MS);
Contributor

We should probably avoid "busy waiting" here and instead use a higher-level concurrency construct to achieve the same thing.

Contributor

Maybe this entire block should use the functionality from Tasks, where you can also specify an exponential backoff.

Contributor Author

@singhpk234 singhpk234 Nov 25, 2025

The retry behaviour in Tasks is wired to exception handling. I could throw an in-progress exception to make it work with Tasks, though I think throwing an exception might not be the right thing to do.
Failsafe has a good retry policy, and I see that we already use it in the aws module. Please let me know if you are open to introducing it in core; we could then use the library for exponential backoff.
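
For reference, a minimal sketch of what status polling with capped exponential backoff and jitter could look like without using exceptions for control flow. It uses neither Tasks nor Failsafe; `PlanStatusPoller`, the `Status` enum, and all parameters are hypothetical names for illustration, and the sleeper is injected only so the loop can be tested without real waiting.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.LongConsumer;
import java.util.function.Supplier;

public class PlanStatusPoller {
  // Hypothetical status values, loosely mirroring the SUBMITTED case in the diff.
  public enum Status { SUBMITTED, COMPLETED, FAILED }

  /** Upper bound for the delay after 0-based {@code attempt}: min(maxMs, baseMs * 2^attempt). */
  public static long backoffMs(int attempt, long baseMs, long maxMs) {
    int shift = Math.min(attempt, 30); // cap the shift; assumes a small, non-negative baseMs
    return Math.min(maxMs, baseMs << shift);
  }

  /** Polls until the status leaves SUBMITTED or {@code maxAttempts} polls have been made. */
  public static Status poll(
      Supplier<Status> fetchStatus,
      int maxAttempts,
      long baseMs,
      long maxMs,
      LongConsumer sleeper) {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      Status status = fetchStatus.get();
      if (status != Status.SUBMITTED) {
        return status; // terminal state, no exception needed
      }
      long cap = backoffMs(attempt, baseMs, maxMs);
      // "Full jitter": pick a random delay in [0, cap] to avoid a thundering herd.
      sleeper.accept(ThreadLocalRandom.current().nextLong(cap + 1));
    }
    return Status.SUBMITTED; // still pending; the caller decides whether to give up
  }

  /** Sleeps for {@code ms}, restoring the interrupt flag if interrupted. */
  public static void sleepQuietly(long ms) {
    try {
      Thread.sleep(ms);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Fake server: reports SUBMITTED twice, then COMPLETED.
    Status result = poll(
        () -> ++calls[0] < 3 ? Status.SUBMITTED : Status.COMPLETED,
        10, 10, 100, PlanStatusPoller::sleepQuietly);
    System.out.println(result + " after " + calls[0] + " calls"); // prints COMPLETED after 3 calls
  }
}
```

The trade-off versus a library is that the loop stays explicit and easy to reason about, but cancellation and a server-provided retry interval would have to be added by hand.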

Contributor

@nastra nastra left a comment

I did another pass and left a few more comments, focusing mostly on RESTTable and RESTTableScan.

I believe we also need some tests with a REST server that doesn't support the new endpoints, where we verify that the correct errors come back depending on which endpoints the server supports. Basically, you'd have a custom Adapter that throws or not when a particular endpoint is called.

Maybe we should extract all of the server-side scan planning tests into a separate class, because TestRESTCatalog is hitting almost 4k LOC.

@sfc-gh-prsingh sfc-gh-prsingh force-pushed the feature/part-2-core-integ branch 4 times, most recently from e41f25c to e63d45e Compare November 26, 2025 03:39
@sfc-gh-prsingh sfc-gh-prsingh force-pushed the feature/part-2-core-integ branch 2 times, most recently from 1e53999 to 40c9501 Compare November 27, 2025 00:00
@sfc-gh-prsingh sfc-gh-prsingh force-pushed the feature/part-2-core-integ branch from 40c9501 to 01d4a24 Compare November 27, 2025 00:54
4 participants