Enhancement 8277989680: symbol concatenation poc #2142


Open · wants to merge 9 commits into master from enhancement/8277989680/symbol-concatenation-poc

Conversation


@alexowens90 (Collaborator) commented Jan 27, 2025

Reference Issues/PRs

8277989680

What does this implement or fix?

Implements symbol concatenation. Both inner and outer joins over columns are supported. Expected usage:

# Read requests can contain the usual as_of, date_range, columns, etc. arguments
lazy_dfs = lib.read_batch([read_request_1, read_request_2, ...])
# Optionally apply processing to all (or individual) constituent lazy dataframes here; it will be applied before the join
lazy_dfs = lazy_dfs[lazy_dfs["col"].notnull()]
# Join here
lazy_df = adb.concat(lazy_dfs)
# Perform more processing if desired
lazy_df = lazy_df.resample("15min").agg({"col": "mean"})
# Collect result
res = lazy_df.collect()
# res contains a list of VersionedItems (with data=None) from the constituent symbols that went into the join, and a data member with the joined Series/DataFrame

See test_symbol_concatenation.py for thorough examples of how the API works.
For outer joins, if a column is not present in one of the input symbols, the missing values are backfilled with the same type-specific behaviour used for dynamic schema.
Not all symbols can be concatenated together. Attempting to concatenate any of the following will throw an exception:

  • a Series with a DataFrame
  • Different index types, including multiindexes with different numbers of levels
  • Incompatible column types, e.g. if col has type INT64 in one symbol and is a string column in another. This only applies if the column would appear in the result, which is always the case for all columns with an outer join, but may not always be for inner joins.

Where possible, the implementation is permissive about what can be joined, producing as sensible an output as possible:

  • Joining two or more Series with different names that are otherwise compatible will produce a Series with no name
  • Joining two or more timeseries where the indexes have different names will produce a timeseries with an unnamed index
  • Joining two or more timeseries where the indexes have different timezones will produce a timeseries with a UTC index
  • Joining two or more multiindexed Series/DataFrames where the levels have compatible types but different names will produce a multiindexed Series/DataFrame with unnamed levels where they differed between some of the inputs.
  • Joining two or more Series/DataFrames that all have a RangeIndex is supported. If the index step does not match across all of the inputs, the output will have a RangeIndex with start=0 and step=1. This differs from Pandas, which converts to an Int64 index in this case, so a warning is logged when it happens.

The only known major limitation is that all of the symbols being joined together (after any pre-join processing) must fit into memory. Relaxing this constraint would require much more sophisticated query planning than we currently support, in which the per-symbol pre-join clauses, the join itself, and any post-join clauses are all taken into account when scheduling both IO and individual processing tasks.

@alexowens90 alexowens90 marked this pull request as draft January 27, 2025 10:02
@alexowens90 alexowens90 self-assigned this Jan 27, 2025
@alexowens90 alexowens90 added the "enhancement (New feature or request)" label Jan 27, 2025
@alexowens90 alexowens90 force-pushed the enhancement/8277989680/symbol-concatenation-poc branch from 831b364 to 6b32843 on March 17, 2025
@alexowens90 alexowens90 force-pushed the enhancement/8277989680/symbol-concatenation-poc branch from 4001a25 to b5843c0 on April 16, 2025
@alexowens90 alexowens90 force-pushed the enhancement/8277989680/symbol-concatenation-poc branch from b5843c0 to 0c81995 on April 17, 2025
@alexowens90 alexowens90 added the "minor (Feature change, should increase minor version)" label Apr 17, 2025
@alexowens90 alexowens90 changed the title from "WIP Enhancement 8277989680: symbol concatenation poc" to "Enhancement 8277989680: symbol concatenation poc" Apr 17, 2025
@alexowens90 alexowens90 marked this pull request as ready for review April 17, 2025 15:00
@@ -76,8 +83,14 @@ inline ReadResult create_python_read_result(
util::print_total_mem_usage(__FILE__, __LINE__, __FUNCTION__);

const auto& desc_proto = result.desc_.proto();
std::variant<arcticdb::proto::descriptors::UserDefinedMetadata, std::vector<arcticdb::proto::descriptors::UserDefinedMetadata>> metadata;
if (user_meta.has_value()) {
metadata = *user_meta;

Collaborator:

Suggested change
metadata = *user_meta;
metadata = *std::move(user_meta);
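
For context, a minimal self-contained sketch of why the suggested change avoids a copy: dereferencing an rvalue std::optional yields T&&, so the contained value is moved rather than copied.

#include <cassert>
#include <optional>
#include <string>

int main() {
    std::optional<std::string> user_meta{std::string(1000, 'x')};
    // *user_meta yields std::string& (assignment copies);
    // *std::move(user_meta) yields std::string&& (assignment moves).
    std::string metadata = *std::move(user_meta);
    assert(metadata.size() == 1000);
    // The optional still reports a value, but its string is moved-from.
    assert(user_meta.has_value());
    return 0;
}
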

return {version, std::move(python_frame), desc_proto.normalization(),
desc_proto.user_meta(), desc_proto.multi_key_meta(), std::move(result.keys_)};
metadata, desc_proto.multi_key_meta(), std::move(result.keys_)};

Collaborator:

I think we can also std::move(metadata), but for it to make sense we would have to change the ReadResult constructor to take an rvalue reference, a forwarding reference, or the parameter by value.
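
A minimal sketch of the by-value sink pattern this comment describes, with simplified stand-in types (the real ReadResult has more members):

#include <string>
#include <utility>
#include <variant>
#include <vector>

using UserDefinedMetadata = std::string;  // stand-in for the proto type
using MetadataVariant = std::variant<UserDefinedMetadata, std::vector<UserDefinedMetadata>>;

struct ReadResult {
    // Taking the variant by value lets the caller std::move into it, so the
    // payload is moved rather than copied all the way into the member.
    explicit ReadResult(MetadataVariant metadata) : metadata_(std::move(metadata)) {}
    MetadataVariant metadata_;
};

int main() {
    MetadataVariant metadata = UserDefinedMetadata(1000, 'x');
    ReadResult result{std::move(metadata)};  // moved, not copied
    return 0;
}
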

@@ -22,7 +22,14 @@ struct FrameSliceMap {

FrameSliceMap(std::shared_ptr<PipelineContext> context, bool dynamic_schema) :
context_(std::move(context)) {

const entity::StreamDescriptor& descriptor = context_->descriptor();
const auto required_fields_count = [&]() {

Collaborator:

Was there any reason not to use a ternary here?
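
For illustration, a sketch of the ternary form, assuming the lambda just selects between two counts (the exact operands below are hypothetical):

const auto required_fields_count = dynamic_schema
        ? descriptor.field_count()            // hypothetical operand
        : descriptor.index().field_count();   // hypothetical operand
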

const bool first_col_slice = first_col == 0;
// Skip the "true" index fields (i.e. those stored in every column slice) if we are not in the first column slice
// Second condition required to avoid underflow when subtracting one unsigned value from another
const bool required_field =

Collaborator:

I think this will be a bit easier to read if the expression is split into two variables.

((first_col_slice ? 0 : descriptor.index().field_count()) <= field.index) &&
(required_fields_count >= first_col) &&
(field.index < required_fields_count - first_col);
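// For illustration, the two-variable split suggested above might look like
// this (names are suggestions; the logic is unchanged):
// const bool past_index_fields =
//     (first_col_slice ? 0 : descriptor.index().field_count()) <= field.index;
// // Underflow guard first, then the upper bound on the field index.
// const bool within_required_range =
//     required_fields_count >= first_col &&
//     field.index < required_fields_count - first_col;
// const bool required_field = past_index_fields && within_required_range;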
// If required_field is true, this is a required column in the output. The name in slice stream

Collaborator:

I'm a bit confused.

  1. If a field is not required why do we insert it?
  2. Are name mismatches allowed only for index columns?

@@ -657,6 +660,37 @@ std::shared_ptr<std::vector<folly::Future<std::vector<EntityId>>>> schedule_firs
return futures;
}

folly::Future<std::vector<EntityId>> schedule_remaining_iterations(
std::vector<std::vector<EntityId>>&& entity_ids_vec,
std::shared_ptr<std::vector<std::shared_ptr<Clause>>> clauses

Collaborator:

Should we check that clauses != nullptr?
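
A minimal sketch of the guard being asked about (using a plain exception; the codebase's own check utilities may be more appropriate):

// clauses is dereferenced below via clauses->front(), so fail fast here.
if (!clauses || clauses->empty()) {
    throw std::invalid_argument("schedule_remaining_iterations: clauses must be non-null and non-empty");
}
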

auto next_units_of_work = clauses->front()->structure_for_processing(std::move(entity_id_vectors));

std::vector<folly::Future<std::vector<EntityId>>> work_futures;
for(auto&& unit_of_work : next_units_of_work) {

Collaborator:

The && will have no effect here; & is enough.
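
For context, a self-contained sketch of the point: when iterating an lvalue container in a range-for, auto&& deduces to an lvalue reference anyway, so auto& expresses the same thing more directly.

#include <vector>

int main() {
    std::vector<int> next_units_of_work{1, 2, 3};
    // auto&& deduces to int& here because the elements are lvalues...
    for (auto&& unit_of_work : next_units_of_work) { unit_of_work += 1; }
    // ...so auto& is equivalent and clearer about intent.
    for (auto& unit_of_work : next_units_of_work) { unit_of_work += 1; }
    return 0;
}
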

});
}

return std::move(entity_ids_vec_fut).thenValueInline([](std::vector<std::vector<EntityId>>&& entity_id_vectors) {

Collaborator:

nit: I think there's no need for a lambda here; flatten_entities can be passed directly:

std::move(entity_ids_vec_fut).thenValueInline(flatten_entities);

std::shared_ptr<RowRange>,
std::shared_ptr<ColRange>>(*component_manager, processed_entity_ids);

if (std::any_of(read_query->clauses_.begin(),

Collaborator:

We can use std::ranges::any_of to avoid explicitly passing begin/end iterators.
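
A sketch of the ranges form (C++20); the predicate below is a hypothetical stand-in for whatever the original lambda checks:

#include <algorithm>  // std::ranges::any_of

// Equivalent to std::any_of(read_query->clauses_.begin(), read_query->clauses_.end(), pred),
// without spelling out the iterators.
const bool any_matching_clause = std::ranges::any_of(
    read_query->clauses_,
    [](const auto& clause) { return clause != nullptr; });
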

* processing unit collected into a single ProcessingUnit. Slices contained within a single ProcessingUnit are processed
* within a single thread.
*
* The processing of a ProcessingUnit is scheduled via the Async Store. Within a single thread, the

Collaborator:

Does this mean that a single thread will read all slices? Wouldn't it be better to try to make this read parallel?

Labels: enhancement (New feature or request), minor (Feature change, should increase minor version)
2 participants