Enable querying policy-enabled table in MSQ, and use RestrictedDataSource as a base in DataSourceAnalysis. #17666

Open · wants to merge 6 commits into master

Conversation

@cecemei (Contributor) commented Jan 25, 2025

Description

This PR enables querying policy-enabled tables in MSQ.


Key changed/added classes in this PR
  • DataSourceAnalysis: getBaseTableDataSource can now return the base table of a RestrictedDataSource. This is more robust than using the underlying table directly as the base.
  • DruidQuery: can also be created via withPolicies, which applies policy restrictions to the original query.
  • MSQTaskQueryMaker: applies restrictions to the DruidQuery instead of throwing a permission error.
  • DataSourcePlan: can now handle RestrictedDataSource.
  • RestrictedInputNumberDataSource: a new class that wraps an InputNumberDataSource with a policy; its segment map function can be used to create a RestrictedSegment (see the sketch after this list).
  • RunWorkOrder: a few refactors to make the code clearer, with no behavior change. In ShufflePipelineBuilder.build(), it was previously unclear that the channel future should only be returned once the result future is ready. The sanity check is also moved to OutputChannels.
  • BaseLeafFrameProcessorFactory.makeProcessors(): previously, makeSegmentMapFnProcessor could return null; after this PR it cannot. The difference is that the segment map function from query.dataSource is now handled in ChainedProcessorManager for all queries (previously only join queries).
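
To make the RestrictedInputNumberDataSource bullet concrete, here is a minimal, self-contained sketch of the wrapping pattern. Every name below (SimplePolicy, SimpleSegment, RestrictedInputNumberSketch, etc.) is a simplified stand-in invented for illustration; none of it is Druid's actual DataSource, Policy, or SegmentReference API, and the real implementation in this PR will differ.

```java
import java.util.function.Function;

// Minimal sketch, not Druid code: a data source identified by an input number, paired
// with a policy, whose segment map function wraps every segment into a restricted one.
public class RestrictedInputNumberSketch
{
  interface SimplePolicy
  {
    boolean allowsRow(String row);
  }

  static class SimpleSegment
  {
    final String name;

    SimpleSegment(String name)
    {
      this.name = name;
    }
  }

  // A "restricted" segment simply remembers the policy it must be read through.
  static class RestrictedSimpleSegment extends SimpleSegment
  {
    final SimplePolicy policy;

    RestrictedSimpleSegment(SimpleSegment delegate, SimplePolicy policy)
    {
      super(delegate.name);
      this.policy = policy;
    }
  }

  final int inputNumber;
  final SimplePolicy policy;

  RestrictedInputNumberSketch(int inputNumber, SimplePolicy policy)
  {
    this.inputNumber = inputNumber;
    this.policy = policy;
  }

  // The segment map function applies the policy by wrapping each segment.
  Function<SimpleSegment, SimpleSegment> createSegmentMapFunction()
  {
    return segment -> new RestrictedSimpleSegment(segment, policy);
  }

  public static void main(String[] args)
  {
    final RestrictedInputNumberSketch ds =
        new RestrictedInputNumberSketch(0, row -> row.startsWith("US"));
    final SimpleSegment mapped =
        ds.createSegmentMapFunction().apply(new SimpleSegment("segment-0"));
    System.out.println(mapped instanceof RestrictedSimpleSegment); // prints: true
  }
}
```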

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@github-actions bot added the Area - Batch Ingestion, Area - Querying, and Area - MSQ (for multi stage queries - https://github.com/apache/druid/issues/12262) labels Jan 25, 2025
@cecemei changed the title from "Enable RestrictedDataSource as a base in DataSourceAnalysis, and enab…" to "Enable querying policy-enabled table in MSQ, and use RestrictedDataSource as a base in DataSourceAnalysis." Jan 25, 2025
@cecemei marked this pull request as ready for review January 28, 2025 04:08
@clintropolis (Member) left a comment


had a first pass and have some questions and thoughts.

Also, maybe you could try to avoid reformatting entire files; all of these unrelated formatting changes make review harder than it should be. I know it's just the tooling doing it to adhere to the style rules, but my preference at least would be to make these cosmetic changes, as you notice them, in a standalone PR to keep reviews simple.

*/
private <FactoryType extends FrameProcessorFactory<ProcessorReturnType, ManagerReturnType, ExtraInfoType>, ProcessorReturnType, ManagerReturnType, ExtraInfoType> void makeAndRunWorkProcessors()
throws IOException
private <FactoryT extends FrameProcessorFactory<ProcessorReturnT, ManagerReturnT, ExtraInfoT>, ProcessorReturnT, ManagerReturnT, ExtraInfoT>

nit: why change these names? just seems to add extra noise to the PR

Comment on lines +435 to +437
final int channelSize = outputChannels.getAllChannels().size();
final int parallelismBoundedByChannelSize = channelSize == 0 ? parallelism : Math.min(parallelism, channelSize);
final int maxOutstandingProcessors = Math.max(1, parallelismBoundedByChannelSize);

nit: the old code seemed clearer and had comments that were nice
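
For reference, a standalone sketch of the bounding logic quoted above, with the variable names taken from the snippet (the surrounding MSQ context is assumed, not shown):

```java
public class ParallelismBoundExample
{
  /**
   * Bounds the requested parallelism by the number of output channels, while always
   * allowing at least one outstanding processor. With zero channels, the requested
   * parallelism is used as-is.
   */
  static int computeMaxOutstandingProcessors(final int parallelism, final int channelSize)
  {
    final int parallelismBoundedByChannelSize =
        channelSize == 0 ? parallelism : Math.min(parallelism, channelSize);
    return Math.max(1, parallelismBoundedByChannelSize);
  }

  public static void main(String[] args)
  {
    System.out.println(computeMaxOutstandingProcessors(4, 0)); // 4: no channels, keep requested parallelism
    System.out.println(computeMaxOutstandingProcessors(4, 2)); // 2: bounded by the channel count
    System.out.println(computeMaxOutstandingProcessors(0, 2)); // 1: never fewer than one processor
  }
}
```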

workerToTaskIds.compute(i, (workerId, taskIds) -> {
  if (taskIds == null) {
    taskIds = new ArrayList<>();
  }
  taskIds.add(task.getId());
  return taskIds;
});
workerToTaskIds.computeIfAbsent(i, (unused) -> (new ArrayList<>())).add(task.getId());

this isn't equivalent; previously it would always add the taskId to the worker, now it only adds if the worker isn't there. Is that ok?
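
For context on the two Map idioms being discussed, here is a standalone comparison in plain JDK code (not taken from the PR); Map.computeIfAbsent returns the existing or newly created value, so add() runs on every call in both forms:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ComputeIfAbsentExample
{
  public static void main(String[] args)
  {
    final Map<Integer, List<String>> viaCompute = new HashMap<>();
    final Map<Integer, List<String>> viaComputeIfAbsent = new HashMap<>();

    for (final String taskId : List.of("task-a", "task-b")) {
      // Original form: create the list if missing, append, and return the updated list.
      viaCompute.compute(0, (workerId, taskIds) -> {
        if (taskIds == null) {
          taskIds = new ArrayList<>();
        }
        taskIds.add(taskId);
        return taskIds;
      });

      // New form: computeIfAbsent returns the existing (or newly created) list,
      // and add() is then invoked on it unconditionally.
      viaComputeIfAbsent.computeIfAbsent(0, unused -> new ArrayList<>()).add(taskId);
    }

    System.out.println(viaCompute);         // {0=[task-a, task-b]}
    System.out.println(viaComputeIfAbsent); // {0=[task-a, task-b]}
  }
}
```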

@@ -198,6 +198,7 @@ private Function<SegmentReference, SegmentReference> createSegmentMapFunction()

DataSource inlineChannelData(final DataSource originalDataSource)
{
// TODO: need to handle RestrictedInputNumberDataSource here

we typically don't leave TODO comments in the code; either do the thing, or leave a bigger comment explaining the problem that needs to be addressed in the future

* join tree.
*/
@JsonTypeName("restrictedInputNumber")
public class RestrictedInputNumberDataSource implements DataSource

should this be InputNumberRestrictedDataSource instead?

Comment on lines +1004 to +1012
* Computes a native druid query, must be called from the constructor. The returned query will be one of following:
* <ul>
* <li> {@link GroupByQuery}
* <li> {@link WindowOperatorQuery}
* <li> {@link TimeBoundaryQuery}
* <li> {@link TimeseriesQuery}
* <li> {@link TopNQuery}
* <li> {@link ScanQuery}
* </ul>

nit: afaik we do not publish javadocs, so these could just be a plain list

/**
* Returns an updated {@link DruidQuery} based on the policy restrictions on tables.
*/
public DruidQuery withPolicies(Map<String, Optional<Policy>> policyMap)

this method seems off to me, like why isn't dataSource updated as well? I think calling this makes a DruidQuery that is in a bit of a strange state, and I think it's basically only chill with the way that MSQTaskQueryMaker works, since it only really modifies the query part of the DruidQuery.

I'm not sure what is better to do here, since it is maybe odd that MSQ is using DruidQuery directly for stuff while this is filled with planner stuff; need to think about it a bit.

Comment on lines -343 to -350
if (query.getDataSource().getAnalysis().isJoin()) {
  // Joins may require significant computation to compute the segmentMapFn. Offload it to a processor.
  return new SimpleSegmentMapFnProcessor(query);
} else {
  // Non-joins are expected to have cheap-to-compute segmentMapFn. Do the computation in the factory thread,
  // without offloading to a processor.
  return null;
}

why this change? I guess the result is that above we always need a ChainedProcessorManager; is that necessary?

@@ -1187,6 +1149,11 @@ public OutputChannels getOutputChannels()
{
  return outputChannels;
}

public ListenableFuture<OutputChannels> waitResultReadyAndGetSanityCheckedChannels()

I think this could just be called waitResultReady

/**
* Verifies there is exactly one channel per partition.
*/
public OutputChannels sanityCheck()

maybe we should call this verify() or verifySingleChannel()?
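
As a side note, the check described by that Javadoc can be sketched standalone. The code below uses plain ints in place of Druid's OutputChannel type and is only an illustration of the "exactly one channel per partition" idea, not the actual OutputChannels implementation:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ChannelPartitionCheck
{
  /**
   * Throws if any partition number appears more than once, mirroring the
   * "exactly one channel per partition" check described in the Javadoc above.
   */
  static void verifySingleChannelPerPartition(List<Integer> partitionNumbers)
  {
    final Set<Integer> seen = new HashSet<>();
    for (final int partition : partitionNumbers) {
      if (!seen.add(partition)) {
        throw new IllegalStateException("Expected exactly one channel for partition [" + partition + "]");
      }
    }
  }

  public static void main(String[] args)
  {
    verifySingleChannelPerPartition(List.of(0, 1, 2));   // passes: each partition appears once

    try {
      verifySingleChannelPerPartition(List.of(0, 1, 1)); // fails: partition 1 appears twice
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```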

Labels: Area - Batch Ingestion, Area - MSQ (for multi stage queries - https://github.com/apache/druid/issues/12262), Area - Querying
2 participants