Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-36518: [Java] Fix ArrowFlightJdbcTimeStampVectorAccessor to return Timestamp objects with date and time that corresponds with local time instead of UTC date and time #43149

Closed
wants to merge 10 commits into from

Conversation

scruz-denodo
Copy link

@scruz-denodo scruz-denodo commented Jul 4, 2024

Rationale for this change

I opened this PR for continuing the work done at this one #36519.
I copy the description from there.

When calling the getTimestamp method from the ArrowFlightJdbcTimeStampVectorAccessor class, the timezone of the Timestamp object returned is incorrect. The timestamp itself appears to be in GMT/UTC time but the timezone field of the Timestamp object is populated with the timezone of the JDBC client instead.

Example

timestamp on db: 2021-03-28T00:15:00.000 (UTC)

timezone of JDBC client: PST/PDT (Vancouver Time)
Calendar cal <- Calendar with UTC timezone
calling getTimestamp(cal) on a result set will return a timestamp like this: 2021-03-28T00:15:00.000 (PST/PDT)
where 2021-03-28T00:15:00.000 appears to be in UTC time but the timezone of the object itself is PST/PDT

What changes are included in this PR?

  • Fixed: Correct behaviour is that the calendar object is used to extend the data into the timezone of the calendar object, essentially asserting that the data is at the timezone defined in the calendar object and returns a time-related object that has local date and time and local timezone
  • Fixed: for vectors that do store timezone information (ie. TimestampVector), the getter methods will use the timezone defined in vector as the timezone assertion and ignores the calendar object if one was passed in
  • Fixed: applyCalendarOffset now correctly creates a corresponding time-related object using the supplied calendar timezone and returns a time-related object that has local date and time and local timezone

Are these changes tested?

Many of the time related tests also face this issue and this fixes them all. Also tested the driver jar in a JDBC client and behaviour is fixed.

Also tested behavior with Oracle and SQLServer.

Test table

-- sqlserver
create table ts_table ( 
	c1 int,
	c2 datetime2 
);
insert into ts_table (c1, c2) values (1, '2005-06-29 19:19:41');

-- oracle
create table ts_table ( 
	c1 int,
	c2 timestamp
);
insert into ts_table (c1, c2) values (1, to_timestamp('2005-06-29 19:19:41', 'YYYY-MM-dd HH24:MI:SS'));

Attached TestTs.zip with sample class for testing the behavior.

Obtained results (local time zone is Europe/Madrid)

Oracle...
Timestamp at default tz (Central European Standard Time): 2005-06-29 19:19:41.0
Timestamp at America/Los_Angeles tz: 2005-06-30 04:19:41.0

SQL Server...
Timestamp at default tz (Central European Standard Time): 2005-06-29 19:19:41.0
Timestamp at America/Los_Angeles tz: 2005-06-30 04:19:41.0

Flight SQL (16.1.0)
Timestamp at default tz (Central European Standard Time): 2005-06-29 17:19:41.0
Timestamp at America/Los_Angeles tz: 2005-06-30 02:19:41.0

Flight SQL (fix )
Timestamp at default tz (Central European Standard Time): 2005-06-29 19:19:41.0
Timestamp at America/Los_Angeles tz: 2005-06-30 04:19:41.0

Note than result with America/Los_Angeles prints the timestamp on my machine (which is at Europe/Madrid) with a difference of 9 hours. That is correct, Europe/Madrid is 9 hours forward than America/Los_Angeles`.

With Flight-SQL JDBC driver 16.1.0 version results are wrong. With these changes returned timestamp is fine.

Are there any user-facing changes?

TestTs.zip

jhmannok and others added 9 commits July 4, 2024 15:40
 - Old behaviour treats the calendar object passed into the corresponding methods to be the timezone to convert the data into, which is incorrect according to what is defined in the JDBC API
 - Fixed: Correct behaviour is that the calendar object is used to extend the data into the timezone of the calendar object, essentially asserting that the data is at the timezone defined in the calendar object
 - Fixed: for vectors that do store timezone information (ie. TimestampVector), the getter methods will use the timezone defined in vector as the timezone assertion and ignores the calendar object if one was passed in

(cherry picked from commit 9d02bfd)
(cherry picked from commit 77e037b)
…Test

- Made timezone converting logic more readable in ArrowFlightJdbcTimeStampVectorAccessor
- Fixed checkstyle issues

(cherry picked from commit 5887433)
- Refactored getTimestampForVector() to be more concise and readable

(cherry picked from commit 1ff3fc6)
…one an if statement and changed all nonNull usage to compare with null explicitly

ArrowFlightJdbcTimeStampVectorAccessorTest: Modified test so system timezone is compared more explicitly

(cherry picked from commit f408486)
@scruz-denodo scruz-denodo requested a review from lidavidm as a code owner July 4, 2024 15:19
Copy link

github-actions bot commented Jul 4, 2024

⚠️ GitHub issue #36518 has no components, please add labels for components.

@scruz-denodo
Copy link
Author

Hi, I changed to draft the PR. I was taking a look now to date types and it is returning wrong results. The changes affects also to that type.

…return Timestamp objects with date and time that corresponds with local time instead of UTC date and time

Fixed truncation of millis at DateTimeUtils
Fixed date values returned when the vector was DateVector which was affected by changes at common method.
@scruz-denodo
Copy link
Author

Fixed the proble with Date. Also a truncation to seconds which was affecting to times.

@scruz-denodo scruz-denodo marked this pull request as ready for review July 5, 2024 11:48
Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CC @jduo as well


final LocalDateTime localDateTime =
DateUtility.getLocalDateTimeFromEpochMilli(this.timeUnit.toMillis(holder.value));
final ZoneId defaultTimeZone = Calendar.getInstance().getTimeZone().toZoneId();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to use the system locale? It should just be UTC (even that is strictly wrong; the date is unzoned by default and only zoned when a calendar comes into play)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or does JDBC expect that it will be in the system timezone?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, problem here is that java.sql.Timestamp and java.sql.Date only contains a long value with milliseconds. You have to tell java how to print this long, in what timezone.

So, the method has to follow the same approach done at org.apache.arrow.driver.jdbc.accessor.impl.calendar.ArrowFlightJdbcTimeStampVectorAccessor#getLocalDateTime

  • first putting the received value (which is in UTC) at the expected time zone. The expected time zone is the default one for the JVM or the time zone specified when calling to java.sql.ResultSet#getDate(java.lang.String, java.util.Calendar). This is done on this line.
    .atZone(sourceTimeZone)
  • then, the value is adjusted to the JVM timezone.
    .withZoneSameInstant(defaultTimeZone)

When this method is used java.sql.ResultSet#getDate(java.lang.String, java.util.Calendar), it is expected that the caller prints the date returned on the specified time zone. Something like:

final Date dateValue = rs.getDate("date_field", Calendar.getInstance(TimeZone.getTimeZone("Asia/Tokyo")));
final SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
sdf.setTimeZone(TimeZone.getTimeZone("Asia/Tokyo"));
System.out.println("Received date: " +sdf.format(dateValue));

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I found a little time to look into this finally.

I still don't see why the system timezone has to come into play.

Arrow date is with reference to the UNIX epoch, i.e. UTC. Java Date is with reference to GMT. As the Java docs themselves state, they are technically different but practically the same.

(1) If there is no calendar given, I would expect there to be no datetime arithmetic done, because they are in the same time zone (effectively). It happens to work out here because we convert to/from the system timezone, so the arithmetic here should cancel itself out, but I'm not sure if that will actually be the case around daylight savings time boundaries.

(2) If there is a calendar given, JDBC is frustratingly vague about the intention of the parameter. It seems most sources interpret it as, "treat the date value as a local date (without reference to any timezone) and give me a timestamp that would display that date (at midnight) in the given timezone". In that case, again, I don't see why the system timezone should come into play.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,
initially, you have a timestamp at database which does not have an associated time zone. It can be in any of them.

Check the example I attached before. You have '2005-06-29 19:19:41' at database.

  • your JVM is in America/Los_Angeles and you invoke the getTimestamp(int columnIndex). At ArrowFlightJdbcTimeStampVectorAccessor you receives that exact timestamp in UTC. But, you have to translate it to the same timestamp value but at America/Los_Angeles (the default JVM timezone).

    If you print the obtained timestamp, you will see the same value that was at database.
    Something like:
    2005-06-29 19:19:41.0

  • your JVM is in America/Los_Angeles and you invoke the getTimestamp(int columnIndex, Calendar cal) specifying Asia/Tokyo. Same as before, you receive the timestamp in UTC at ArrowFlightJdbcTimeStampVectorAccessor, but now you have to put that same timestamp in Asia/Tokyo time zone, because is the one specified at the getTimestamp invokation.

    When you print the obtained timestamp, you will see this:
    2005-06-29 03:19:41.0

    This is because the timestamp is printed following your JVM timezone (America/Los_Angeles). But note that it is the expected value because you retrieved the timestamp using getTimestamp(index, Asia/Tokyo-calendar). So, you specified that the timestamp '2005-06-29 19:19:41' is at the time zone Asia/Tokyo.

    If you check the difference, there are 16 hours less, because America/Los_Angeles is at GMT-7 and Asia/Tokyo is at UTC+9. This is why the timestamp is printed as 2005-06-29 03:19:41.0, America/Los_Angeles value, which is '2005-06-29 19:19:41.0' at Asia/Tokyo

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeesh, system timezone appears to be a convention (e.g. this StackOverflow answer) but doesn't appear to be formally specified. I still think that, given Arrow actually specifies the epoch for its values, introducing the system timezone merely to be consistent with databases that appear not to store this information is strictly wrong.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it could depend on how you want to read this sentence...

To conform with the definition of SQL DATE, the millisecond values wrapped by a java.sql.Date instance must be 'normalized' by setting the hours, minutes, seconds, and milliseconds to zero in the particular time zone with which the instance is associated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ResultSet#getDate never specifies the time zone when a calendar is not provided, so I guess the question is whether we should use the underlying value's time zone or the system time zone

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I will be out for a few days. I will try to answer questions when I return.
In the meantime, I would like to comment:

Yeesh, system timezone appears to be a convention (e.g. this StackOverflow answer) but doesn't appear to be formally specified. I still think that, given Arrow actually specifies the epoch for its values, introducing the system timezone merely to be consistent with databases that appear not to store this information is strictly wrong.

It is true that documentation is not 100% clear, but it is also true that other JDBC drivers from important vendors work like that, using the default JVM timezone. So, a generic JDBC client will expect the data on that way.

I'm talking about Date here still.
(1) Arrow's Date is defined in terms of the Unix epoch (i.e. UTC), and (2) java.sql.Date is also defined in terms of GMT. So it doesn't seem like the system timezone should be relevant.

I will check again the code. But the problem is the same. The java.sql.Date has also hours, minutes and seconds. So, the Arrow JDBC driver retrieves the value in UTC, ok, but when a new Date is created it is done on the default JVM timezone. So, it can generate a different date than the expected. This is why a similar translation than the applied with Timestamp is required.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note that python adbc driver is also working as proposed fix.

(venv) >test_dates.py

SQL: SELECT * FROM ts_table
  c1                  c2
0  1 **2005-06-29 19:19:41**
Time 0 seconds

test_date.zip

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Jul 8, 2024
@scruz-denodo
Copy link
Author

Hi, I see current status as awaiting changes. What are the required changes?

@lidavidm
Copy link
Member

The bot does that anytime I comment, it's waiting on me (or James) to have enough time to review thoroughly

@scruz-denodo
Copy link
Author

The bot does that anytime I comment, it's waiting on me (or James) to have enough time to review thoroughly

Ok thanks. I was only asking just in case you were waiting for changes from my side.

@lidavidm
Copy link
Member

The Java code was moved to a different repo, see apache/arrow-java#464

@lidavidm lidavidm closed this Jan 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants