Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add data type/schema field/schema #31

Merged
merged 13 commits into from
Feb 3, 2025
Merged

Add data type/schema field/schema #31

merged 13 commits into from
Feb 3, 2025

Conversation

lidavidm
Copy link
Member

@lidavidm lidavidm commented Jan 16, 2025

Implement the API for data types and schemas.

@lidavidm
Copy link
Member Author

Note: we could simplify the data type definition greatly if we adopt something like Arrow Java/cuDF. Type would be a POD containing type ID and fields for all parameterized types (decimal scale, fixed length, etc.). Field would contain a Type and a (possibly empty/invalid) list of nested fields (to handle things like Struct). This loses type safety but removes a layer of indirection/allocations.

@lidavidm
Copy link
Member Author

For here I've chosen something somewhat resembling Arrow C++. However I think Schema and Field (unlike Arrow) should not always be wrapped in smart pointers. Type is still in a smart pointer for type erasure.

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, left a few comments

@lidavidm
Copy link
Member Author

I added Primitive/NestedType; I'll start filling out definitions and the rest of the types next, along with adding the visitors to iceberg-arrow.

Copy link
Collaborator

@gaborkaszab gaborkaszab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great to see progress on this area, @lidavidm ! I went through for my own education and left some comments or questions. Let me know what you think!


#pragma once

/// \file iceberg/schema.h
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with this style of commenting. Could you please explain me why including the file name into a comment needed here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not done with it yet, this is for Doxygen which will generate a page to document the header.

///
/// Schemas are identified by a unique ID for the purposes of schema
/// evolution.
[[nodiscard]] int32_t schema_id() const;
Copy link
Collaborator

@gaborkaszab gaborkaszab Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The functions within this class may not follow the same naming convention: See schema_id vs ToString, Equals
I found the same mismatch below too, so I'm wondering if it's me missing something here.

Update: I've just checked the style guide, and I guess this is fine: "accessors and mutators may be named like variables"
https://google.github.io/styleguide/cppguide.html#Function_Names

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Google style has trivial accessors using snake_case.

[[nodiscard]] std::string ToString() const;

/// \brief Compare two schemas for equality.
[[nodiscard]] bool Equals(const Schema& other) const;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have operators for equality, can't we make the Equals() function private? I don't see the reason to expose both.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be useful to add overloads later (e.g. if we want to ignore metadata or IDs)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we can extend this later. I believe in adding what is required ATM, and if there is a need later, we can still extend.
Also, we are defining a public interface here, should keep it as simple as possible. Exposing the Equals method seems a redundancy for me ATM.

namespace iceberg {

/// \brief A type combined with a name.
class ICEBERG_EXPORT Field {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think similarly to Schema, I'd put Field into a separate file. Now this is in type.h but Field isn't really a type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a bit of a circular dependency between Field (which wants Type) and StructType (which wants Field). Unless (again) you see the alternate design in the first two comments of this PR which puts child Fields always on the Field (this means that Field has a potentially redundant...field, but avoids nesting of types - I personally slightly prefer this but it's conceptually dependent on whether you model the type itself as nested or the field as the nested object)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will take a look a bit later.
Generally asking, can't we solve the circular dependency with forward declarations and using pointers/references?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can if we continue to type-erase Type, but not if we want it to not be behind a smart pointer.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Putting them behind smart pointers makes sense. See my other comment. And the this could go into a separate file.
Meanwhile I'm sketching some code for the PartitionSpec and PartitionField and I'm wondering if this one here is intuitive to be called Field, or could be better called SchemaField (considering there is going to be a PartitionField too).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll rename

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you suppose PartitionField to inherit SchemaField or it is totally a separate thing? @gaborkaszab

[[nodiscard]] std::string ToString() const;

/// \brief Compare two schemas for equality.
[[nodiscard]] bool Equals(const Schema& other) const;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we can extend this later. I believe in adding what is required ATM, and if there is a need later, we can still extend.
Also, we are defining a public interface here, should keep it as simple as possible. Exposing the Equals method seems a redundancy for me ATM.

/// evolution.
class ICEBERG_EXPORT Schema : public StructType {
public:
Schema(int32_t schema_id, std::vector<Field> fields);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I saw in the Java implementation, is that there is a name to field ID mapping there, and there are accessors getting Field(s) by ID or by name. I think that would be useful here too.

/// A schema is a list of typed columns, along with a unique integer ID. A
/// Table may have different schemas over its lifetime due to schema
/// evolution.
class ICEBERG_EXPORT Schema : public StructType {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

: public StructType
I don't think that Schema should derive from StructType. I get that it will hold a list of Fields such as StructType, but still, Schema in general is not a Type, would be pretty unnatural to derive from it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Note that various type systems do actually just represent a schema as a struct type, including nanoarrow/Arrow C Data Interface. At the very least they are isomorphic and you can see methods that convert between them often. That said I don't feel strongly about this.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also inclined to inherit Schema from StructType. In practice this inheritance avoid an unnecessary conversion from Schema to StructType to implement type visitors.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious: Could visitors work with Schema if we had a Schema::asStructType or such? Doesn't seem a big deal to do that conversion for me and then the class hierarchy would be more intuitive in my opinion. Or I miss something :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would also work, it would just add the conversion, and then duplicating all the methods. No big deal either way to me at least.

(I suppose the OOP approach is to add a StructLike interface that both Struct and Schema implement.)

namespace iceberg {

/// \brief A type combined with a name.
class ICEBERG_EXPORT Field {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will take a look a bit later.
Generally asking, can't we solve the circular dependency with forward declarations and using pointers/references?

@lidavidm
Copy link
Member Author

Just to make sure @gaborkaszab @wgtmac: are we ok with the Arrow-style type representation here? (Types are represented by a class hierarchy, erased behind smart pointers; nested types store the fields in the type object, not the field object)

The alternatives are:

  • cuDF style: type objects are minimal, just a type ID, and there is no hierarchy, just the base DataType class which is not type-erased and therefore there is no need for a smart pointer. As a trade-off, nested fields have to be extracted from the field and the base DataType has to have fields for all possible parameterized types
  • arrow-java style: there is still a hierarchy of type objects, but child fields are stored on the Field, not the DataType. Avoids a conceptual dependency cycle between DataType and Field
  • variant style: like arrow-java, but instead of type erasure and a type hierarchy (or conversely, like cuDF but avoids redundant fields), we have a std::variant with all possibilities. Avoids boxing but the object is larger (in e.g. a vector)

@lidavidm
Copy link
Member Author

Arrow-style does let you do a bunch of compile-time metaprogramming (e.g. see arrow::TypeTraits)

/// \brief Is this a primitive type (may not have child fields)?
[[nodiscard]] virtual bool is_primitive() const = 0;

/// \brief Is this a nested type (may have child fields)?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What raise my interests is that:

  1. Does iceberg has a "void" ( or null ) bottom type?
  2. Does iceberg allowed to define an empty structure?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is unknown.

I don't think the spec says anything either way about an empty struct. I didn't check any implementations. If allowed, you could think of it as a void type.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I quickly checked the Java code, and apparently there is nothing preventing us from creating a StructType with zero fields.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The UnknownType will be part of the V3 specification.

An empty struct is pretty common (for example, an unpartitioned table has an empty struct as a partition).

@wgtmac
Copy link
Member

wgtmac commented Jan 20, 2025

I vote +1 for the Arrow-style. I haven't checked the cuDF style in detail but it looks more like the C-style. The extra benefit of Arrow-style is that it might be simpler to implement the Iceberg <=> Arrow schema conversion.

@mapleFU
Copy link
Member

mapleFU commented Jan 20, 2025

LGTM

@gaborkaszab
Copy link
Collaborator

Just to make sure @gaborkaszab @wgtmac: are we ok with the Arrow-style type representation here? (Types are represented by a class hierarchy, erased behind smart pointers; nested types store the fields in the type object, not the field object)

If I get it right, Arrow-style mes sense for me too.

@lidavidm lidavidm force-pushed the schema branch 3 times, most recently from e0d6e5e to a32ed60 Compare January 21, 2025 05:33
@lidavidm
Copy link
Member Author

lidavidm commented Jan 21, 2025

I've implemented all the actual method bodies. I've also added a trait for ToString/std::format formatting (might as well use the new C++ features if we're requiring C++20). Although it appears clang is not happy about what I did.

I've elected to make getters fallible (e.g. GetFieldById) via optional<reference_wrapper<T>> which looks a bit messy but should be safer than returning T const& and throwing, or returning T const* and admitting nullptr.

@lidavidm lidavidm force-pushed the schema branch 2 times, most recently from eb74e90 to 40a18b1 Compare January 21, 2025 06:01
@lidavidm
Copy link
Member Author

Ok, hopefully that fixes macOS. It appears C++20 support on the version of clang shipped by macOS is still flaky:

https://godbolt.org/z/W9rTexc13

Maybe we want to avoid C++20 (sadly) or maybe we need to encourage people to get a newer clang on macOS. (It appears to be clang 15, whereas the latest is clang 19.) Or maybe it's just an issue with the version the CI environment has.

@lidavidm lidavidm changed the title WIP: Add headers for type/field/schema Add data type/schema field/schema Jan 22, 2025
@lidavidm lidavidm marked this pull request as ready for review January 22, 2025 06:27
@lidavidm
Copy link
Member Author

I'll add some unit tests as long as the API looks good to everyone.

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interface looks good to me. Thanks!

std::shared_ptr<Type> type);

/// \brief Get the field ID.
[[nodiscard]] int32_t field_id() const;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it an overkill to add [[nodiscard]] to these accessors?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it hurts

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it is just personal flavor. I have commented at #32 (comment)

@lidavidm
Copy link
Member Author

I added a bunch of unit tests + fixed some copy-paste errors I discovered in the process.

: field_id_(field_id),
name_(std::move(name)),
type_(std::move(type)),
optional_(optional) {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering why python moved to required instead of optional here:
https://github.com/apache/iceberg-python/blob/main/pyiceberg/types.py#L308
The change from optional to required seemed intentional here, @Fokko any thoughts?
BTW it seems Java went for optional too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is because the spec serializes it as required: https://iceberg.apache.org/spec/#schemas This way it is more straightforward to serialize it using the library that we use (Pydantic). See the original issue: apache/iceberg#4985. In PyIceberg we have a property called optional as well, so you can choose.

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!

field.field_id(), it->second, index));
}

index++;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: ++index;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

class Schema;
class SchemaField;
class StringType;
class StructType;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: redundant

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


/// \brief Get the schema ID.
///
/// Schemas are identified by a unique ID for the purposes of schema
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should this be A Schema is identified by a unique ID or Schemas are identified by unique IDs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to "A schema..."

return other.type_id() == TypeId::kBoolean;
}

TypeId Int32Type::type_id() const { return TypeId::kInt32; }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use Int/Long/Float/Double instead? I guess this arrow style, but I think we should be consistent with spec? I checked iceberg and pyiceberg, they do the spec way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

return precision_ == decimal.precision_ && scale_ == decimal.scale_;
}

TypeId TimeType::type_id() const { return TypeId::kTime; }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: place TimeType after DateType

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reordered headers/impls

[[nodiscard]] virtual bool is_primitive() const = 0;

/// \brief Is this a nested type (may have child fields)?
[[nodiscard]] virtual bool is_nested() const = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this redundant with is_primitive()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me it's clearer to be able to say is_nested or is_primitive and they can just be implemented in terms of each other.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's keep both.

bool Equals(const Type& other) const override;
};

/// \brief A data type representing a 32-bit (single precision) float.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

32-bit IEEE 754 floating point

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

bool Equals(const Type& other) const override;
};

/// \brief A data type representing a 64-bit (double precision) float.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

64-bit IEEE 754 floating point
It will great we just copy past the type description from the spec. The same applied to other types.

bool Equals(const Type& other) const override;

private:
int32_t length_;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iceberg-rust has the length of u64[1], int32_t can represent about 2 gigabytes, maybe we should change to int64_t to be compatible with other language implementation?

[1] https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/spec/datatypes.rs#L241

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java uses a 32 bit int. https://github.com/apache/iceberg/blob/d693f83b6ad6e306550d117ff0bf0014356500b2/api/src/main/java/org/apache/iceberg/types/Types.java#L360

Are the Iceberg maintainers interested in making the spec actually lay out these details? I've mostly copied Java since the spec itself is extremely vague.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The int32 is fine. No one should go beyond 2GB values for a single value.

kTime,
kTimestamp,
kTimestampTz,
kBinary,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: list these the same order as spec.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the spec actually canonical? It seems the implementations all disagree...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the spec doesn't actually have any single listing of all the types (since nested types are split off), so I am electing to put the nested types first to be consistent with the spec (in as far as it defines any order)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the order


#include "iceberg/schema.h"

#include <format>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is redundant.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it good practice to IWYU?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not that familiar with IWYU yet...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use std::format in the .cc file so we should include format, even if other headers happen to include it already


#pragma once

/// \file iceberg/util/formatter.h
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this file be renamed to formatter_internal.h? We don't have to export and install this file.

FYI: I have followed the same pattern from Apache Arrow: https://github.com/apache/iceberg-cpp/blob/main/cmake_modules/IcebergBuildUtils.cmake#L251

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, you do need this file externally if you yourself wish to use std::format.

};

/// \brief A data type that has child fields.
class ICEBERG_EXPORT NestedType : public Type {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a function for getting number of fields?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fields().size(), or do you think a separate accessor is useful?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's enough I think.

/// \brief Get a field by index.
[[nodiscard]] virtual std::optional<std::reference_wrapper<const SchemaField>>
GetFieldByIndex(int32_t index) const = 0;
/// \brief Get a field by name.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this case sensitive?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

for (const auto& field : fields_) {
auto [it, inserted] = field_id_to_index_.try_emplace(field.field_id(), index);
if (!inserted) {
throw std::runtime_error(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we define IcebergException or even a more specific exception to replace it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we using exceptions in general? I am only using an exception here because this is a constructor. We could also hide the constructor, or change to assertions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, my point is that a more specific exception is better than the std::runtime_error.

BTW, should we build field_id_to_index_ lazily when the particular getter is called for the 1st time?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's discussed above, but it complicates the implementation a lot


TypeId StructType::type_id() const { return TypeId::kStruct; }
std::string StructType::ToString() const {
std::string repr = "struct<\n";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific reason to use a separate line for each sub-field? An alternative is to use the Hive-style where all fields are concatenated in a single line.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That gets messy IMO once there are more than a couple fields

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to support restoring a Schema object by deserializing the formatted string?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather add JSON serializers than try to parse the repr

}
std::optional<std::reference_wrapper<const SchemaField>> StructType::GetFieldByName(
std::string_view name) const {
// TODO: what is the right behavior if there are duplicate names? (Are
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that the spec is unclear about this so the behavior is undefined and implementation-specific.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seem to be a lot of spec issues...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider the Java implementation as the reference one. There we raise a ValidationException when we see a duplicate name:

https://github.com/apache/iceberg/blob/507e2a9e39f93356bdc0dae1521bbdc46629681c/api/src/main/java/org/apache/iceberg/types/IndexByName.java#L199-L205

There seem to be a lot of spec issues...

@lidavidm Please elaborate

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this in the written spec? Or is the canonical-ness of the Java implementation documented somewhere? Just so I know what to reference

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also see above e.g. #31 (comment)
#31 (comment)

possibly I overlooked something

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spec describes the table format itself, for the behavioral things, we tend to check the Java implementation (considered the reference implementation). If we would need to formalize all this as well, then the spec would become unreadable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, at least something like "duplicate field names are not allowed" should be part of the spec, no? I wouldn't really qualify that as "behavioral" any more than "decimal precision is limited to 38" is behavioral

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since Iceberg comes from the database world, not having duplicate names is more or less implied to me, but feel free to raise a PR to add this to the spec itself.

I was surprised to see that this is allowed:

>>> import pyarrow as pa
>>> pa.table([
...     pa.array(["foo", "bar"]),
...     pa.array(["Parrot", "Dog"])
... ], names=['animals', 'animals'])
pyarrow.Table
animals: string
animals: string
----
animals: [["foo","bar"]]
animals: [["Parrot","Dog"]]

Copy link
Member

@wgtmac wgtmac Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was surprised to see that this is allowed:

I think Arrow is different. It is totally valid to project the same column more than once in an Arrow Table or RecordBatch. For example, returning query result via Arrow for select animals, animals from foo.bar;

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@lidavidm
Copy link
Member Author

lidavidm commented Feb 2, 2025

Is anything else required here?

@zhjwpku
Copy link
Collaborator

zhjwpku commented Feb 2, 2025

Is anything else required here?

+1 to merge this PR, thanks for working on this.

@wgtmac
Copy link
Member

wgtmac commented Feb 3, 2025

@Fokko @Xuanwo I think this is ready to merge. cc @gaborkaszab in case there is any comment.

@gaborkaszab
Copy link
Collaborator

@Fokko @Xuanwo I think this is ready to merge. cc @gaborkaszab in case there is any comment.

I got a bit busy with other stuff, so wouldn't block this PR for me to find time here. Thanks for these changes!

@Fokko Fokko requested review from raulcd, Fokko and gaborkaszab February 3, 2025 16:05
Comment on lines +55 to +56
return std::format("{} ({}): {}{}", name_, field_id_, *type_,
optional_ ? "" : " (required)");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it is just me, but I'm not a big fan of the implied (optional)

}
return fields_[index];
}
std::optional<std::reference_wrapper<const SchemaField>> StructType::GetFieldByName(
Copy link
Contributor

@Fokko Fokko Feb 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also want to distinguish upper- and lower-case, or start with case-sensitive here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring states this is case-sensitive already. Should we add something like Java's StructType#caseInsensitiveField?

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there is consensus, let's move this forward 🚀

@Fokko Fokko merged commit ebacc54 into apache:main Feb 3, 2025
6 checks passed
@lidavidm
Copy link
Member Author

lidavidm commented Feb 4, 2025

Thanks @Fokko I'll follow up to correct those things

@lidavidm lidavidm deleted the schema branch February 4, 2025 00:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants