-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-36518: [Java] Fix ArrowFlightJdbcTimeStampVectorAccessor to return Timestamp objects with date and time that corresponds with local time instead of UTC date and time #43149
Conversation
- Old behaviour treats the calendar object passed into the corresponding methods to be the timezone to convert the data into, which is incorrect according to what is defined in the JDBC API - Fixed: Correct behaviour is that the calendar object is used to extend the data into the timezone of the calendar object, essentially asserting that the data is at the timezone defined in the calendar object - Fixed: for vectors that do store timezone information (ie. TimestampVector), the getter methods will use the timezone defined in vector as the timezone assertion and ignores the calendar object if one was passed in (cherry picked from commit 9d02bfd)
(cherry picked from commit 77e037b)
…Test - Made timezone converting logic more readable in ArrowFlightJdbcTimeStampVectorAccessor - Fixed checkstyle issues (cherry picked from commit 5887433)
- Refactored getTimestampForVector() to be more concise and readable (cherry picked from commit 1ff3fc6)
…one an if statement and changed all nonNull usage to compare with null explicitly ArrowFlightJdbcTimeStampVectorAccessorTest: Modified test so system timezone is compared more explicitly (cherry picked from commit f408486)
(cherry picked from commit beb4ea6)
|
Hi, I changed to draft the PR. I was taking a look now to |
…return Timestamp objects with date and time that corresponds with local time instead of UTC date and time Fixed truncation of millis at DateTimeUtils Fixed date values returned when the vector was DateVector which was affected by changes at common method.
Fixed the proble with Date. Also a truncation to seconds which was affecting to times. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CC @jduo as well
|
||
final LocalDateTime localDateTime = | ||
DateUtility.getLocalDateTimeFromEpochMilli(this.timeUnit.toMillis(holder.value)); | ||
final ZoneId defaultTimeZone = Calendar.getInstance().getTimeZone().toZoneId(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want to use the system locale? It should just be UTC (even that is strictly wrong; the date is unzoned by default and only zoned when a calendar comes into play)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or does JDBC expect that it will be in the system timezone?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, problem here is that java.sql.Timestamp
and java.sql.Date
only contains a long value with milliseconds. You have to tell java how to print this long, in what timezone.
So, the method has to follow the same approach done at org.apache.arrow.driver.jdbc.accessor.impl.calendar.ArrowFlightJdbcTimeStampVectorAccessor#getLocalDateTime
- first putting the received value (which is in UTC) at the expected time zone. The expected time zone is the default one for the JVM or the time zone specified when calling to
java.sql.ResultSet#getDate(java.lang.String, java.util.Calendar)
. This is done on this line.
.atZone(sourceTimeZone)
- then, the value is adjusted to the JVM timezone.
.withZoneSameInstant(defaultTimeZone)
When this method is used java.sql.ResultSet#getDate(java.lang.String, java.util.Calendar)
, it is expected that the caller prints the date returned on the specified time zone. Something like:
final Date dateValue = rs.getDate("date_field", Calendar.getInstance(TimeZone.getTimeZone("Asia/Tokyo")));
final SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
sdf.setTimeZone(TimeZone.getTimeZone("Asia/Tokyo"));
System.out.println("Received date: " +sdf.format(dateValue));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I found a little time to look into this finally.
I still don't see why the system timezone has to come into play.
Arrow date is with reference to the UNIX epoch, i.e. UTC. Java Date is with reference to GMT. As the Java docs themselves state, they are technically different but practically the same.
(1) If there is no calendar given, I would expect there to be no datetime arithmetic done, because they are in the same time zone (effectively). It happens to work out here because we convert to/from the system timezone, so the arithmetic here should cancel itself out, but I'm not sure if that will actually be the case around daylight savings time boundaries.
(2) If there is a calendar given, JDBC is frustratingly vague about the intention of the parameter. It seems most sources interpret it as, "treat the date value as a local date (without reference to any timezone) and give me a timestamp that would display that date (at midnight) in the given timezone". In that case, again, I don't see why the system timezone should come into play.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi,
initially, you have a timestamp at database which does not have an associated time zone. It can be in any of them.
-
when you invoke the method getTimestamp(int columnIndex), you are saying that the timestamp you are getting is at the default timezone from the JVM.
-
when you invoke the method getTimestamp(int columnIndex, Calendar cal), you are saying that the timestamp you are getting is at the timezone specified at input
cal
parameter.
Check the example I attached before. You have '2005-06-29 19:19:41' at database.
-
your JVM is in
America/Los_Angeles
and you invoke the getTimestamp(int columnIndex). AtArrowFlightJdbcTimeStampVectorAccessor
you receives that exact timestamp in UTC. But, you have to translate it to the same timestamp value but atAmerica/Los_Angeles
(the default JVM timezone).If you print the obtained timestamp, you will see the same value that was at database.
Something like:
2005-06-29 19:19:41.0
-
your JVM is in
America/Los_Angeles
and you invoke the getTimestamp(int columnIndex, Calendar cal) specifyingAsia/Tokyo
. Same as before, you receive the timestamp in UTC atArrowFlightJdbcTimeStampVectorAccessor
, but now you have to put that same timestamp inAsia/Tokyo
time zone, because is the one specified at thegetTimestamp
invokation.When you print the obtained timestamp, you will see this:
2005-06-29 03:19:41.0
This is because the timestamp is printed following your JVM timezone (America/Los_Angeles). But note that it is the expected value because you retrieved the timestamp using
getTimestamp(index, Asia/Tokyo-calendar)
. So, you specified that the timestamp '2005-06-29 19:19:41' is at the time zoneAsia/Tokyo
.If you check the difference, there are 16 hours less, because
America/Los_Angeles
is at GMT-7 andAsia/Tokyo
is at UTC+9. This is why the timestamp is printed as2005-06-29 03:19:41.0
,America/Los_Angeles
value, which is '2005-06-29 19:19:41.0' atAsia/Tokyo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeesh, system timezone appears to be a convention (e.g. this StackOverflow answer) but doesn't appear to be formally specified. I still think that, given Arrow actually specifies the epoch for its values, introducing the system timezone merely to be consistent with databases that appear not to store this information is strictly wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose it could depend on how you want to read this sentence...
To conform with the definition of SQL DATE, the millisecond values wrapped by a java.sql.Date instance must be 'normalized' by setting the hours, minutes, seconds, and milliseconds to zero in the particular time zone with which the instance is associated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ResultSet#getDate never specifies the time zone when a calendar is not provided, so I guess the question is whether we should use the underlying value's time zone or the system time zone
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, I will be out for a few days. I will try to answer questions when I return.
In the meantime, I would like to comment:
Yeesh, system timezone appears to be a convention (e.g. this StackOverflow answer) but doesn't appear to be formally specified. I still think that, given Arrow actually specifies the epoch for its values, introducing the system timezone merely to be consistent with databases that appear not to store this information is strictly wrong.
It is true that documentation is not 100% clear, but it is also true that other JDBC drivers from important vendors work like that, using the default JVM timezone. So, a generic JDBC client will expect the data on that way.
I'm talking about Date here still.
(1) Arrow's Date is defined in terms of the Unix epoch (i.e. UTC), and (2) java.sql.Date is also defined in terms of GMT. So it doesn't seem like the system timezone should be relevant.
I will check again the code. But the problem is the same. The java.sql.Date
has also hours, minutes and seconds. So, the Arrow JDBC driver retrieves the value in UTC, ok, but when a new Date
is created it is done on the default JVM timezone. So, it can generate a different date than the expected. This is why a similar translation than the applied with Timestamp
is required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also note that python adbc driver is also working as proposed fix.
(venv) >test_dates.py
SQL: SELECT * FROM ts_table
c1 c2
0 1 **2005-06-29 19:19:41**
Time 0 seconds
Hi, I see current status as |
The bot does that anytime I comment, it's waiting on me (or James) to have enough time to review thoroughly |
Ok thanks. I was only asking just in case you were waiting for changes from my side. |
The Java code was moved to a different repo, see apache/arrow-java#464 |
Rationale for this change
I opened this PR for continuing the work done at this one #36519.
I copy the description from there.
When calling the getTimestamp method from the ArrowFlightJdbcTimeStampVectorAccessor class, the timezone of the Timestamp object returned is incorrect. The timestamp itself appears to be in GMT/UTC time but the timezone field of the Timestamp object is populated with the timezone of the JDBC client instead.
Example
timestamp on db: 2021-03-28T00:15:00.000 (UTC)
timezone of JDBC client: PST/PDT (Vancouver Time)
Calendar cal <- Calendar with UTC timezone
calling getTimestamp(cal) on a result set will return a timestamp like this: 2021-03-28T00:15:00.000 (PST/PDT)
where 2021-03-28T00:15:00.000 appears to be in UTC time but the timezone of the object itself is PST/PDT
What changes are included in this PR?
Are these changes tested?
Many of the time related tests also face this issue and this fixes them all. Also tested the driver jar in a JDBC client and behaviour is fixed.
Also tested behavior with Oracle and SQLServer.
Test table
Attached
TestTs.zip
with sample class for testing the behavior.Obtained results (local time zone is
Europe/Madrid
)Note than result with
America/Los_Angeles
prints the timestamp on my machine (which is at Europe/Madrid) with a difference of 9 hours. That is correct,Europe/Madrid is 9 hours forward than
America/Los_Angeles`.With Flight-SQL JDBC driver 16.1.0 version results are wrong. With these changes returned timestamp is fine.
Are there any user-facing changes?
TestTs.zip