-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add data type/schema field/schema #31
Conversation
Note: we could simplify the data type definition greatly if we adopt something like Arrow Java/cuDF. Type would be a POD containing type ID and fields for all parameterized types (decimal scale, fixed length, etc.). Field would contain a Type and a (possibly empty/invalid) list of nested fields (to handle things like Struct). This loses type safety but removes a layer of indirection/allocations. |
For here I've chosen something somewhat resembling Arrow C++. However I think Schema and Field (unlike Arrow) should not always be wrapped in smart pointers. Type is still in a smart pointer for type erasure. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, left a few comments
I added Primitive/NestedType; I'll start filling out definitions and the rest of the types next, along with adding the visitors to iceberg-arrow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great to see progress on this area, @lidavidm ! I went through for my own education and left some comments or questions. Let me know what you think!
|
||
#pragma once | ||
|
||
/// \file iceberg/schema.h |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with this style of commenting. Could you please explain me why including the file name into a comment needed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not done with it yet, this is for Doxygen which will generate a page to document the header.
/// | ||
/// Schemas are identified by a unique ID for the purposes of schema | ||
/// evolution. | ||
[[nodiscard]] int32_t schema_id() const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The functions within this class may not follow the same naming convention: See schema_id vs ToString, Equals
I found the same mismatch below too, so I'm wondering if it's me missing something here.
Update: I've just checked the style guide, and I guess this is fine: "accessors and mutators may be named like variables"
https://google.github.io/styleguide/cppguide.html#Function_Names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Google style has trivial accessors using snake_case.
src/iceberg/schema.h
Outdated
[[nodiscard]] std::string ToString() const; | ||
|
||
/// \brief Compare two schemas for equality. | ||
[[nodiscard]] bool Equals(const Schema& other) const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have operators for equality, can't we make the Equals() function private? I don't see the reason to expose both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be useful to add overloads later (e.g. if we want to ignore metadata or IDs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then we can extend this later. I believe in adding what is required ATM, and if there is a need later, we can still extend.
Also, we are defining a public interface here, should keep it as simple as possible. Exposing the Equals method seems a redundancy for me ATM.
src/iceberg/type.h
Outdated
namespace iceberg { | ||
|
||
/// \brief A type combined with a name. | ||
class ICEBERG_EXPORT Field { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think similarly to Schema, I'd put Field into a separate file. Now this is in type.h but Field isn't really a type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a bit of a circular dependency between Field (which wants Type) and StructType (which wants Field). Unless (again) you see the alternate design in the first two comments of this PR which puts child Fields always on the Field (this means that Field has a potentially redundant...field, but avoids nesting of types - I personally slightly prefer this but it's conceptually dependent on whether you model the type itself as nested or the field as the nested object)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will take a look a bit later.
Generally asking, can't we solve the circular dependency with forward declarations and using pointers/references?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can if we continue to type-erase Type, but not if we want it to not be behind a smart pointer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Putting them behind smart pointers makes sense. See my other comment. And the this could go into a separate file.
Meanwhile I'm sketching some code for the PartitionSpec and PartitionField and I'm wondering if this one here is intuitive to be called Field, or could be better called SchemaField (considering there is going to be a PartitionField too).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll rename
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you suppose PartitionField
to inherit SchemaField
or it is totally a separate thing? @gaborkaszab
src/iceberg/schema.h
Outdated
[[nodiscard]] std::string ToString() const; | ||
|
||
/// \brief Compare two schemas for equality. | ||
[[nodiscard]] bool Equals(const Schema& other) const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then we can extend this later. I believe in adding what is required ATM, and if there is a need later, we can still extend.
Also, we are defining a public interface here, should keep it as simple as possible. Exposing the Equals method seems a redundancy for me ATM.
src/iceberg/schema.h
Outdated
/// evolution. | ||
class ICEBERG_EXPORT Schema : public StructType { | ||
public: | ||
Schema(int32_t schema_id, std::vector<Field> fields); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I saw in the Java implementation, is that there is a name to field ID mapping there, and there are accessors getting Field(s) by ID or by name. I think that would be useful here too.
/// A schema is a list of typed columns, along with a unique integer ID. A | ||
/// Table may have different schemas over its lifetime due to schema | ||
/// evolution. | ||
class ICEBERG_EXPORT Schema : public StructType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
: public StructType
I don't think that Schema should derive from StructType. I get that it will hold a list of Fields such as StructType, but still, Schema in general is not a Type, would be pretty unnatural to derive from it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Note that various type systems do actually just represent a schema as a struct type, including nanoarrow/Arrow C Data Interface. At the very least they are isomorphic and you can see methods that convert between them often. That said I don't feel strongly about this.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also inclined to inherit Schema from StructType. In practice this inheritance avoid an unnecessary conversion from Schema to StructType to implement type visitors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious: Could visitors work with Schema if we had a Schema::asStructType or such? Doesn't seem a big deal to do that conversion for me and then the class hierarchy would be more intuitive in my opinion. Or I miss something :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would also work, it would just add the conversion, and then duplicating all the methods. No big deal either way to me at least.
(I suppose the OOP approach is to add a StructLike interface that both Struct and Schema implement.)
src/iceberg/type.h
Outdated
namespace iceberg { | ||
|
||
/// \brief A type combined with a name. | ||
class ICEBERG_EXPORT Field { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will take a look a bit later.
Generally asking, can't we solve the circular dependency with forward declarations and using pointers/references?
Just to make sure @gaborkaszab @wgtmac: are we ok with the Arrow-style type representation here? (Types are represented by a class hierarchy, erased behind smart pointers; nested types store the fields in the type object, not the field object) The alternatives are:
|
Arrow-style does let you do a bunch of compile-time metaprogramming (e.g. see arrow::TypeTraits) |
/// \brief Is this a primitive type (may not have child fields)? | ||
[[nodiscard]] virtual bool is_primitive() const = 0; | ||
|
||
/// \brief Is this a nested type (may have child fields)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What raise my interests is that:
- Does iceberg has a "void" ( or null ) bottom type?
- Does iceberg allowed to define an empty structure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is unknown
.
I don't think the spec says anything either way about an empty struct. I didn't check any implementations. If allowed, you could think of it as a void
type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I quickly checked the Java code, and apparently there is nothing preventing us from creating a StructType with zero fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The UnknownType
will be part of the V3 specification.
An empty struct is pretty common (for example, an unpartitioned table has an empty struct as a partition).
I vote +1 for the Arrow-style. I haven't checked the |
LGTM |
If I get it right, Arrow-style mes sense for me too. |
e0d6e5e
to
a32ed60
Compare
I've implemented all the actual method bodies. I've also added a trait for ToString/std::format formatting (might as well use the new C++ features if we're requiring C++20). Although it appears clang is not happy about what I did. I've elected to make getters fallible (e.g. GetFieldById) via |
eb74e90
to
40a18b1
Compare
Ok, hopefully that fixes macOS. It appears C++20 support on the version of clang shipped by macOS is still flaky: https://godbolt.org/z/W9rTexc13 Maybe we want to avoid C++20 (sadly) or maybe we need to encourage people to get a newer clang on macOS. (It appears to be clang 15, whereas the latest is clang 19.) Or maybe it's just an issue with the version the CI environment has. |
I'll add some unit tests as long as the API looks good to everyone. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The interface looks good to me. Thanks!
std::shared_ptr<Type> type); | ||
|
||
/// \brief Get the field ID. | ||
[[nodiscard]] int32_t field_id() const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it an overkill to add [[nodiscard]]
to these accessors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it hurts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, it is just personal flavor. I have commented at #32 (comment)
I added a bunch of unit tests + fixed some copy-paste errors I discovered in the process. |
: field_id_(field_id), | ||
name_(std::move(name)), | ||
type_(std::move(type)), | ||
optional_(optional) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering why python moved to required
instead of optional
here:
https://github.com/apache/iceberg-python/blob/main/pyiceberg/types.py#L308
The change from optional
to required
seemed intentional here, @Fokko any thoughts?
BTW it seems Java went for optional
too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is because the spec serializes it as required
: https://iceberg.apache.org/spec/#schemas This way it is more straightforward to serialize it using the library that we use (Pydantic). See the original issue: apache/iceberg#4985. In PyIceberg we have a property called optional
as well, so you can choose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work!
src/iceberg/type.cc
Outdated
field.field_id(), it->second, index)); | ||
} | ||
|
||
index++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: ++index;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
class Schema; | ||
class SchemaField; | ||
class StringType; | ||
class StructType; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: redundant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
src/iceberg/schema.h
Outdated
|
||
/// \brief Get the schema ID. | ||
/// | ||
/// Schemas are identified by a unique ID for the purposes of schema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: should this be A Schema is identified by a unique ID
or Schemas are identified by unique IDs
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to "A schema..."
src/iceberg/type.cc
Outdated
return other.type_id() == TypeId::kBoolean; | ||
} | ||
|
||
TypeId Int32Type::type_id() const { return TypeId::kInt32; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use Int/Long/Float/Double instead? I guess this arrow style, but I think we should be consistent with spec? I checked iceberg and pyiceberg, they do the spec way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
src/iceberg/type.cc
Outdated
return precision_ == decimal.precision_ && scale_ == decimal.scale_; | ||
} | ||
|
||
TypeId TimeType::type_id() const { return TypeId::kTime; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: place TimeType after DateType
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reordered headers/impls
[[nodiscard]] virtual bool is_primitive() const = 0; | ||
|
||
/// \brief Is this a nested type (may have child fields)? | ||
[[nodiscard]] virtual bool is_nested() const = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this redundant with is_primitive()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me it's clearer to be able to say is_nested
or is_primitive
and they can just be implemented in terms of each other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let's keep both.
src/iceberg/type.h
Outdated
bool Equals(const Type& other) const override; | ||
}; | ||
|
||
/// \brief A data type representing a 32-bit (single precision) float. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
32-bit IEEE 754 floating point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
src/iceberg/type.h
Outdated
bool Equals(const Type& other) const override; | ||
}; | ||
|
||
/// \brief A data type representing a 64-bit (double precision) float. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
64-bit IEEE 754 floating point
It will great we just copy past the type description from the spec. The same applied to other types.
bool Equals(const Type& other) const override; | ||
|
||
private: | ||
int32_t length_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iceberg-rust has the length of u64
[1], int32_t
can represent about 2 gigabytes, maybe we should change to int64_t
to be compatible with other language implementation?
[1] https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/spec/datatypes.rs#L241
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Java uses a 32 bit int. https://github.com/apache/iceberg/blob/d693f83b6ad6e306550d117ff0bf0014356500b2/api/src/main/java/org/apache/iceberg/types/Types.java#L360
Are the Iceberg maintainers interested in making the spec actually lay out these details? I've mostly copied Java since the spec itself is extremely vague.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The int32 is fine. No one should go beyond 2GB values for a single value.
src/iceberg/type_fwd.h
Outdated
kTime, | ||
kTimestamp, | ||
kTimestampTz, | ||
kBinary, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: list these the same order as spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the spec actually canonical? It seems the implementations all disagree...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And the spec doesn't actually have any single listing of all the types (since nested types are split off), so I am electing to put the nested types first to be consistent with the spec (in as far as it defines any order)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the order
|
||
#include "iceberg/schema.h" | ||
|
||
#include <format> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it good practice to IWYU?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not that familiar with IWYU yet...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use std::format in the .cc file so we should include format, even if other headers happen to include it already
|
||
#pragma once | ||
|
||
/// \file iceberg/util/formatter.h |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this file be renamed to formatter_internal.h
? We don't have to export and install this file.
FYI: I have followed the same pattern from Apache Arrow: https://github.com/apache/iceberg-cpp/blob/main/cmake_modules/IcebergBuildUtils.cmake#L251
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, you do need this file externally if you yourself wish to use std::format.
}; | ||
|
||
/// \brief A data type that has child fields. | ||
class ICEBERG_EXPORT NestedType : public Type { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a function for getting number of fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fields().size()
, or do you think a separate accessor is useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's enough I think.
src/iceberg/type.h
Outdated
/// \brief Get a field by index. | ||
[[nodiscard]] virtual std::optional<std::reference_wrapper<const SchemaField>> | ||
GetFieldByIndex(int32_t index) const = 0; | ||
/// \brief Get a field by name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this case sensitive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
for (const auto& field : fields_) { | ||
auto [it, inserted] = field_id_to_index_.try_emplace(field.field_id(), index); | ||
if (!inserted) { | ||
throw std::runtime_error( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we define IcebergException
or even a more specific exception to replace it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we using exceptions in general? I am only using an exception here because this is a constructor. We could also hide the constructor, or change to assertions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, my point is that a more specific exception is better than the std::runtime_error
.
BTW, should we build field_id_to_index_
lazily when the particular getter is called for the 1st time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's discussed above, but it complicates the implementation a lot
|
||
TypeId StructType::type_id() const { return TypeId::kStruct; } | ||
std::string StructType::ToString() const { | ||
std::string repr = "struct<\n"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a specific reason to use a separate line for each sub-field? An alternative is to use the Hive-style where all fields are concatenated in a single line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That gets messy IMO once there are more than a couple fields
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to support restoring a Schema object by deserializing the formatted string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather add JSON serializers than try to parse the repr
} | ||
std::optional<std::reference_wrapper<const SchemaField>> StructType::GetFieldByName( | ||
std::string_view name) const { | ||
// TODO: what is the right behavior if there are duplicate names? (Are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that the spec is unclear about this so the behavior is undefined and implementation-specific.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seem to be a lot of spec issues...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please consider the Java implementation as the reference one. There we raise a ValidationException
when we see a duplicate name:
There seem to be a lot of spec issues...
@lidavidm Please elaborate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this in the written spec? Or is the canonical-ness of the Java implementation documented somewhere? Just so I know what to reference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also see above e.g. #31 (comment)
#31 (comment)
possibly I overlooked something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The spec describes the table format itself, for the behavioral things, we tend to check the Java implementation (considered the reference implementation). If we would need to formalize all this as well, then the spec would become unreadable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, at least something like "duplicate field names are not allowed" should be part of the spec, no? I wouldn't really qualify that as "behavioral" any more than "decimal precision is limited to 38" is behavioral
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since Iceberg comes from the database world, not having duplicate names is more or less implied to me, but feel free to raise a PR to add this to the spec itself.
I was surprised to see that this is allowed:
>>> import pyarrow as pa
>>> pa.table([
... pa.array(["foo", "bar"]),
... pa.array(["Parrot", "Dog"])
... ], names=['animals', 'animals'])
pyarrow.Table
animals: string
animals: string
----
animals: [["foo","bar"]]
animals: [["Parrot","Dog"]]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was surprised to see that this is allowed:
I think Arrow is different. It is totally valid to project the same column more than once in an Arrow Table or RecordBatch. For example, returning query result via Arrow for select animals, animals from foo.bar
;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Is anything else required here? |
+1 to merge this PR, thanks for working on this. |
@Fokko @Xuanwo I think this is ready to merge. cc @gaborkaszab in case there is any comment. |
I got a bit busy with other stuff, so wouldn't block this PR for me to find time here. Thanks for these changes! |
return std::format("{} ({}): {}{}", name_, field_id_, *type_, | ||
optional_ ? "" : " (required)"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it is just me, but I'm not a big fan of the implied (optional)
} | ||
return fields_[index]; | ||
} | ||
std::optional<std::reference_wrapper<const SchemaField>> StructType::GetFieldByName( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also want to distinguish upper- and lower-case, or start with case-sensitive here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring states this is case-sensitive already. Should we add something like Java's StructType#caseInsensitiveField
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there is consensus, let's move this forward 🚀
Thanks @Fokko I'll follow up to correct those things |
Implement the API for data types and schemas.