Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ref(reader): Improve abstractions around column transformation #812

Merged
merged 5 commits into from
Mar 4, 2020

Conversation

tkaemming
Copy link
Contributor

Second take at #800, now with timezone support. From the original PR description:

This transformation logic is going to need to be shared between the native reader and the HTTP reader, but the actual transformation logic for these types will be different since the native reader is dealing with Python objects, and the HTTP reader will be dealing with strings. This splits the transformation functions themselves from the code that manages the result mutation.

@tkaemming tkaemming requested a review from a team March 4, 2020 18:26
snuba/reader.py Outdated
Comment on lines 52 to 59
transformer = next(
(
transformer
for pattern, transformer in column_transformations
if pattern.match(column["type"])
),
None,
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same caveat still applies here: #800 (comment)

Comment on lines +155 to +156
# XXX: If the original value had a valid nonzero UTC offset, this will
# result in an incorrect value being returned.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of those "shouldn't happen" (but probably will eventually) comments. Mark my words.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you figure out why some dates had timezone from clickhouse and others did not ? I am curious.
Also should we log warning in case the result we are returning is incorrect because there was a valid non zero UTC offset in the parameter ?

Copy link
Contributor Author

@tkaemming tkaemming Mar 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you figure out why some dates had timezone from clickhouse and others did not ? I am curious.

It seems like certain functions that manipulate the time value result in the time having the timezone attached, as in these examples:

SELECT now() LIMIT 1 FORMAT JSON
{
	"meta":
	[
		{
			"name": "now()",
			"type": "DateTime"
		}
	],
	"data":
	[
		{
			"now()": "2020-03-04 22:57:42"
		}
	]
}

vs.

SELECT toStartOfHour(now()) LIMIT 1 FORMAT JSON
{
	"meta":
	[
		{
			"name": "toStartOfHour(now())",
			"type": "DateTime('Etc\/UTC')"
		}
	],
	"data":
	[
		{
			"toStartOfHour(now())": "2020-03-04 22:00:00"
		}
	]
}

In our case, this means there is going to be a difference between the values returned by the time synthetic column added by the time series dataset code, and the timestamp returned from a single row.

Also should we log warning in case the result we are returning is incorrect because there was a valid non zero UTC offset in the parameter ?

I thought about this too. I don't like that this will be silently adjusting (read: corrupting) data, but also the idea of logging within the value transformer seems like a bad idea since it'll put the log in a tight loop. This also seems to be inconsistently named between environments (in production it is Universal whereas I see Etc/UTC in my local environment.)

I think that the real answer is to just treat tz-naive and tz-aware datetimes differently. If there's no tzinfo, assume UTC and if there is tzinfo, either ensure (via asserting) that it's UTC, or always localize it to UTC regardless.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the real answer is to just treat tz-naive and tz-aware datetimes differently. If there's no tzinfo, assume UTC and if there is tzinfo, either ensure (via asserting) that it's UTC, or always localize it to UTC regardless.

I'm going to do this as a follow up just to do some semi-dangerous thing at a time.

Comment on lines +155 to +156
# XXX: If the original value had a valid nonzero UTC offset, this will
# result in an incorrect value being returned.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you figure out why some dates had timezone from clickhouse and others did not ? I am curious.
Also should we log warning in case the result we are returning is incorrect because there was a valid non zero UTC offset in the parameter ?

@tkaemming tkaemming merged commit cc9b320 into master Mar 4, 2020
@tkaemming tkaemming deleted the column-type-transformation branch March 4, 2020 23:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants