DRILL-4706: Fragment planning causes Drillbits to read remote chunks when local copies are available. #639
base: master
Conversation
Updated the JIRA with details on how the current algorithm works, why remote reads were happening, and the details of the new algorithm.
@vkorukanti if you don't mind, can you review this?
Yet to review "LocalAffinityFragmentParallelizer.java"
@@ -75,6 +78,7 @@ public EndpointAffinity(final DrillbitEndpoint endpoint, final double affinity,
    this.affinity = affinity;
    this.mandatory = mandatory;
    this.maxWidth = maxWidth;
    this.numLocalWorkUnits = 0;
Not needed. By default it will always be initialized to 0
@@ -530,6 +534,7 @@ public RowGroupInfo(@JsonProperty("path") String path, @JsonProperty("start") lo
    this.rowGroupIndex = rowGroupIndex;
    this.rowCount = rowCount;
    this.numRecordsToRead = rowCount;
    this.preferredEndpoint = null;
Not required.
}

// Get the list of endpoints which have maximum (equal) data.
List<DrillbitEndpoint> topEndpoints = endpointByteMap.getTopEndpoints();
It took me a while to understand the algorithm below just by reading the code. It would be helpful to name the variables better here and add comments explaining the different sections. For example, renaming as below might help:
- "topEndPoints" to "maxRGDataEndPoints"
- "minBytes" to "assignedBytesOnPickedNode"
- "numBytes" to "assignedBytesOnCurrEndpoint"
- "endpoint" to "currEndpoint"
As per my understanding, lines 864 to 892 represent one section with the following logic (a simplified sketch follows this list):
- For each row group, assign a drillbit from the topEndPoints list such that the chosen one is the least loaded in terms of work units.
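For illustration, here is a minimal sketch of the "pick the least loaded of the top endpoints" step described above. The class and the String-keyed map are hypothetical simplifications (String stands in for DrillbitEndpoint); this is not the actual Drill code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified illustration of the selection step: among the
// endpoints that hold the most data for a row group, pick the one that
// currently has the fewest assigned work units, then record the assignment.
class LeastLoadedPicker {
  // endpoint -> number of work units already assigned to it
  private final Map<String, Integer> numEndpointAssignments = new HashMap<>();

  String pickEndpoint(List<String> maxRGDataEndpoints) {
    String picked = null;
    int minAssigned = Integer.MAX_VALUE;
    for (String currEndpoint : maxRGDataEndpoints) {
      int assigned = numEndpointAssignments.getOrDefault(currEndpoint, 0);
      if (assigned < minAssigned) {
        minAssigned = assigned;
        picked = currEndpoint;
      }
    }
    if (picked != null) {
      numEndpointAssignments.merge(picked, 1, Integer::sum);
    }
    return picked;
  }
}
```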
if (numEndpointAssignments.containsKey(endpoint)) {
  epAff.setNumLocalWorkUnits(numEndpointAssignments.get(endpoint));
} else {
  epAff.setNumLocalWorkUnits(0);
"else" condition is not required since by default it will be set to 0
  epAff.setNumLocalWorkUnits(0);
  }
}
Please remove extra space. Please review other places as well.
}

Integer assignment = iteratorWrapper.iter.next();
iteratorWrapper.count++;
Shouldn't we check here whether "iteratorWrapper.count" exceeds "iteratorWrapper.maxCount"?
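A hedged sketch of the kind of bound the question is asking about; the wrapper below is illustrative only and mirrors the snippet's field names rather than the actual class in the patch.

```java
import java.util.Iterator;
import java.util.List;

// Illustrative only: an iterator wrapper that refuses to hand out more
// assignments than its maxCount, which is the check being suggested above.
class BoundedIteratorWrapper {
  final Iterator<Integer> iter;
  final int maxCount;  // upper bound on assignments this endpoint may take
  int count;           // assignments handed out so far

  BoundedIteratorWrapper(List<Integer> assignments, int maxCount) {
    this.iter = assignments.iterator();
    this.maxCount = maxCount;
  }

  // Returns the next assignment, or null once the cap is reached so the
  // caller can fall back to another endpoint.
  Integer nextAssignment() {
    if (count >= maxCount || !iter.hasNext()) {
      return null;
    }
    count++;
    return iter.next();
  }
}
```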
Some initial comments.
The issue is regarding assigning fragments based on strict locality. So why is the parallelization logic affected, and not exclusively locality?
Please add unit tests; see TestHardAffinityFragmentParallelizer. Examples would simplify understanding this code.
@@ -145,6 +145,11 @@ public EndpointByteMap getByteMap() {
  public int compareTo(CompleteWork o) {
    return 0;
  }

  @Override
  public DrillbitEndpoint getPreferredEndpoint() {
Can you add a TODO here?
// 6: Finally make sure the width is at least one
width = Math.max(1, width);

List<DrillbitEndpoint> endpointPool = Lists.newArrayList();
Make this final (and use final generously wherever possible).
while(totalAssigned < width) {
  int assignedThisRound = 0;
  for (DrillbitEndpoint ep : endpointPool) {
    if (remainingEndpointAssignments.get(ep) > 0 &&
Get the value once into a local variable (and reuse it).
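A small sketch of the suggested refactor, with the map value fetched once into a local variable and reused; types are simplified (String stands in for DrillbitEndpoint) and the surrounding bookkeeping is omitted.

```java
import java.util.List;
import java.util.Map;

// Illustrative only: the lookup is done once per endpoint and reused instead
// of calling get() repeatedly inside the condition and the update.
class AssignmentRound {
  static int assignOneRound(List<String> endpointPool,
                            Map<String, Integer> remainingEndpointAssignments) {
    int assignedThisRound = 0;
    for (String ep : endpointPool) {
      final int remaining = remainingEndpointAssignments.getOrDefault(ep, 0);
      if (remaining > 0) {
        remainingEndpointAssignments.put(ep, remaining - 1);
        assignedThisRound++;
      }
    }
    return assignedThisRound;
  }
}
```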
// This is for the case where drillbits are not running on endPoints which have data.
// Allocate them from the active endpoint pool.
int totalUnAssigned =
So this parallelizer is not strictly local? Why not fail?
I got all unit and regression tests to pass with localAffinity=true; that would not be possible if this algorithm failed in that case. Also, we do this only for the case when drillbits are not running on the nodes which have the data.
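A minimal sketch of the fallback being discussed, assuming a simple round robin over the active endpoint pool; this is not the actual Drill implementation, and String stands in for DrillbitEndpoint.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: fragments that could not be placed on data-local nodes
// (because no drillbit runs there) are handed out round robin across the
// active endpoint pool.
class FallbackAssigner {
  static List<String> assignRemaining(int totalUnassigned, List<String> activePool) {
    List<String> assignments = new ArrayList<>();
    for (int i = 0; i < totalUnassigned; i++) {
      assignments.add(activePool.get(i % activePool.size()));
    }
    return assignments;
  }
}
```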
Map<DrillbitEndpoint, Long> numAssignedBytes = Maps.newHashMap();

// Do this for 2 iterations to adjust node assignments after first iteration.
int numIterartions = 2;
Iterartions -> iterations
 * This is for Parquet Scan Fragments only. Fragment placement is done preferring strict
 * data locality.
 */
public class LocalAffinityFragmentParallelizer implements FragmentParallelizer {
When to use this vs HardAffinityFragmentParallelizer?
Updated with all review comments taken care of. Added TestLocalAffinityFragmentParallelizer.java, which has a bunch of test cases with examples.
"Some initial comments. The issue is regarding assigning fragments based on strict locality. So why is the parallelization logic affected, and not exclusively locality?" The parallelization logic is affected because it decides how many fragments to run on each node, and that is dependent on locality.
Hmm, the answer seems like a rephrasing of the question. Sorry, I misspoke. Better asked: the issue is regarding assigning work to fragments based on strict locality (deciding which fragment does what). So why is the parallelization logic (deciding how many fragments) affected?
Parallelization logic is affected for the following reasons:
  DEP3, 8,
  DEP4, 8,
  DEP5, 8);
// Expect the fragment parallelization to be 80 (16 * 5)
Wrong comment. Should be 40.
}

// Keep allocating from endpoints in a round robin fashion upto
// max(targetAllocation, maxwidthPerNode) for each endpoint and
Wrong comment. We assign until we reach the limit of maxWidthPerNode.
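For reference, a hedged sketch of round-robin allocation that stops at the maxWidthPerNode limit for each endpoint, as the corrected comment describes; names and types are simplified (String stands in for DrillbitEndpoint) and this is not the code in the patch.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: keep handing out one fragment per endpoint per round,
// skipping endpoints that have reached maxWidthPerNode, until the requested
// width is reached or every endpoint is at its cap.
class CappedRoundRobin {
  static Map<String, Integer> allocate(int width, int maxWidthPerNode,
                                       List<String> endpoints) {
    Map<String, Integer> perNode = new HashMap<>();
    int totalAssigned = 0;
    while (totalAssigned < width) {
      int assignedThisRound = 0;
      for (String ep : endpoints) {
        int current = perNode.getOrDefault(ep, 0);
        if (current < maxWidthPerNode && totalAssigned < width) {
          perNode.put(ep, current + 1);
          totalAssigned++;
          assignedThisRound++;
        }
      }
      if (assignedThisRound == 0) {
        break; // every endpoint is at its cap; cannot assign more
      }
    }
    return perNode;
  }
}
```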
  40 /* globalMaxWidth */),
  ImmutableList.of(DEP1, DEP2, DEP3, DEP4, DEP5));
// The parallelization maxWidth (80) is more than globalMaxWidth(40).
// Expect the fragment parallelization to be 40 (7 + 8 + 8 + 8 + 9)
It would be great to mention that DEP5 gets 9 fragments instead of DEP4 since it has more localWorkUnits. We do favor nodes with more localWorkUnits.
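To make the 7 + 8 + 8 + 8 + 9 split concrete, here is one way such a distribution can arise: fragments are split in proportion to each endpoint's local work units, and any leftover goes to the endpoints with the most local work units. This is a hedged sketch, not the parallelizer's actual code; the names and the tie-breaking rule are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: proportional split of `width` fragments, with leftovers
// favoring endpoints that have more local work units. Assumes the total
// number of local work units is greater than zero.
class ProportionalSplit {
  static Map<String, Integer> split(int width, Map<String, Integer> localWorkUnits) {
    int totalUnits = localWorkUnits.values().stream().mapToInt(Integer::intValue).sum();
    Map<String, Integer> result = new LinkedHashMap<>();
    int assigned = 0;
    for (Map.Entry<String, Integer> e : localWorkUnits.entrySet()) {
      int share = (int) Math.floor((double) width * e.getValue() / totalUnits);
      result.put(e.getKey(), share);
      assigned += share;
    }
    // Hand leftover fragments to the endpoints with the most local work units.
    List<String> byUnitsDesc = new ArrayList<>(localWorkUnits.keySet());
    byUnitsDesc.sort((a, b) -> localWorkUnits.get(b) - localWorkUnits.get(a));
    for (int i = 0; assigned < width; i = (i + 1) % byUnitsDesc.size()) {
      String ep = byUnitsDesc.get(i);
      result.put(ep, result.get(ep) + 1);
      assigned++;
    }
    return result;
  }
}
```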
…when local copies are available. New fragment placement algorithm based on locality of data.
Merged with latest code. All review comments taken care of. All tests pass with the option
@ppadma was this merged? I don't see a
Even though it is old, this PR is still very much relevant and a useful feature to have in Drill for certain use cases/scenarios.