ref: Add HTTPReader #406
Conversation
Remark about error conditions: when we send an invalid query to ClickHouse, it replies with HTTP 500 via the HTTP interface. We will have to properly parse the response in these cases as well, otherwise those error messages, which are now visible in Sentry, will be lost.
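To illustrate the concern, here is a minimal sketch of surfacing the server-provided error text instead of dropping it. The `ClickHouseError` class and the function name are hypothetical, not Snuba's actual API; only the behavior of ClickHouse returning a plain-text error body on non-2xx HTTP responses is taken from the discussion.

```python
# Hypothetical sketch: keep the ClickHouse error text so it still shows up in Sentry.
# `ClickHouseError` and `raise_for_clickhouse_error` are illustrative names.
class ClickHouseError(Exception):
    pass


def raise_for_clickhouse_error(status: int, body: bytes) -> None:
    """Raise with the server-provided message on non-2xx responses."""
    if status // 100 != 2:
        # ClickHouse puts a plain-text description (e.g. "Code: 62, ...")
        # in the body of error responses over the HTTP interface.
        message = body.decode("utf-8", errors="replace").strip()
        raise ClickHouseError(f"HTTP {status}: {message}")
```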
This is "ready for review" in the sense that it is ready for some thought about the questions noted inline.
snuba/reader.py (Outdated)

    def __init__(
        self, host: str, port: int, options: Optional[Mapping[str, str]] = None
    ):
        self.__base_url = f"http://{host}:{port}/"
Python 3. 🙌
snuba/reader.py (Outdated)

    response = requests.post(
        urljoin(self.__base_url, "?" + urlencode(parameters)),
        data=query.encode("utf-8"),
    )
This could probably use straight urllib3, but I used requests here for consistency with the HTTPBatchWriter.
Not anymore @tkaemming.
💀
Used 97b36c7 as the pattern for this change. Probably need to take a closer look at request headers.
snuba/util.py (Outdated)

@@ -444,7 +444,11 @@ def raw_query(body, sql, client, timer, stats=None):
     query_settings['max_threads'] = 1

     try:
-        result = NativeDriverReader(client).execute(
+        result = HTTPReader(
We'll probably need to find a way to switch these out at runtime, since this is a bit more touchy than the writer change. The only way I've thought of to test this (short of Matt doing some wacky Kubernetes stuff) is to sample requests (randomly?) based on dynamic configuration. Looking for additional ideas here.
What about doing something like the serviceDelegator and running both for a small number of queries? That would give you more solid validation in a shorter time. The fact that the reader is a class of its own helps:

class DelegatingReader:
    def __init__(self, native, http):
        self.native = native
        self.http = http

    def execute(self, query):
        if random.random() < settings.EXPERIMENT_SAMPLE:
            native_result = self.native.execute(query)
            http_result = self.http.execute(query)
            # compare and log any discrepancy; return the native
            # result if the discrepancy is too big
            return native_result
        elif settings.PROD_READER == 'native':
            return self.native.execute(query)
        else:
            return self.http.execute(query)

You can add a fallback to native in case of an error on HTTP so we do not take down Snuba.
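The fallback mentioned above could look roughly like the following self-contained sketch. The reader objects are assumed to expose a compatible `execute(query)`; the class name `FallbackReader` is made up for illustration.

```python
import logging

logger = logging.getLogger(__name__)


class FallbackReader:
    """Sketch: try the HTTP reader first, fall back to native on any error.

    `native` and `http` are assumed to expose a compatible execute(query).
    """

    def __init__(self, native, http):
        self.__native = native
        self.__http = http

    def execute(self, query):
        try:
            return self.__http.execute(query)
        except Exception:
            # Log and fall back so a broken HTTP path does not take down Snuba.
            logger.exception("HTTP reader failed, falling back to native")
            return self.__native.execute(query)
```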
One issue with this is that we don't run uwsgi with threads enabled. I don't specifically recall why (#312) but I think we had to turn them off for some reason?
If we just want to spot check the results, we could possibly try running some sample of queries that were logged to Kafka with a consumer that runs them with both readers? This would also be outside of the request/response cycle, which I think is a benefit in this case?
We do run with threads now. We explicitly do it in the Kubernetes config, and as part of the mywsgi changes we’re making that the default. It’s silly to run with the GIL disabled.
Well then.
Considering we are already using the HTTP interface for writes, I would try to keep this experiment simple. I can think of three possible issues we are trying to protect against:
- Results are incorrect. For this we need to run both queries and compare, either within the API call or outside it by replaying the queries logged to Kafka. If we are already logging to Kafka, why not write a quick consumer that consumes that topic, runs both queries, and compares? That seems quicker than introducing a DelegatingReader in Snuba.
- Performance is dismal. For this we could (I don't know how much of a pain it is with our Kube deployment) do a canary instead: deploy the API with the HTTP reader on just a few nodes and monitor those?
- The HTTP reader makes Snuba unstable. See "performance is dismal": if we could do a canary, we would not need a DelegatingReader for this either.
What is concerning you the most?
I'm not super worried about performance or stability, since I think that would show up pretty quickly in the metrics or in Sentry and a rollback would be pretty straightforward. My biggest concern is subtle inconsistencies in the data returned by the two drivers that cause weird behavior that isn't easily identified by any specific metric close to the source of the problem.
I think the only drawback of running a canary for performance and stability sanity checking would be the quantity of work required to enable that, which I can't speak to. Maybe @mattrobenolt could provide some context here?
If we are already logging to Kafka, why not write a quick consumer that consumes that topic, runs both queries, and compares? That seems quicker than introducing a DelegatingReader in Snuba
Yeah, I agree this seems like the lowest impact way to do that kind of comparison right now.
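The comparison consumer discussed above could be sketched as follows. The Kafka consumption itself is elided; this only shows the replay-and-compare core, with reader objects and the result shape assumed for illustration.

```python
# Sketch of the offline comparison: feed queries (e.g. consumed from the
# query log topic) through both readers and collect any mismatches.
# Reader objects are assumed to expose execute(query); result equality is
# a naive stand-in for a real structural comparison.
def compare_readers(queries, native, http):
    discrepancies = []
    for query in queries:
        native_result = native.execute(query)
        http_result = http.execute(query)
        if native_result != http_result:
            discrepancies.append((query, native_result, http_result))
    return discrepancies
```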
It wouldn't be hard if it's behind some environment variable that can be toggled. I can then just run 1 or more Pods with this on or whatever.
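The environment-variable toggle could be as small as this sketch. The variable name `SNUBA_USE_HTTP_READER` and the function are made up for illustration; only the idea of flipping individual Pods comes from the comment above.

```python
import os

# Hypothetical sketch: pick the reader implementation from an environment
# variable so a subset of Pods can be switched over independently.
def choose_reader(native, http):
    if os.environ.get("SNUBA_USE_HTTP_READER", "0") == "1":
        return http
    return native
```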
Are you planning to also remove all the readonly accesses to ClickhousePool and replace them with the HTTPReader?
Are you planning to also remove all the readonly accesses to ClickhousePool and replace them with the HTTPReader?
Eventually, yes — not in a huge hurry to get to this but it should be done at some point for consistency, I think.
1e1401f to cc69f00
snuba/reader.py (Outdated)

        # TODO: Distinguish between ClickHouse errors and other HTTP errors.
        if response.status // 100 != 2:
            raise HTTPError(f"{response.status} Unexpected")
I think we should include the error message in the response payload before this goes to production, otherwise we would not be able to see ClickHouse errors in prod. Also, not having the ClickHouse error would make things very hard to troubleshoot if something goes wrong when you roll this out.
snuba/util.py (Outdated)

@@ -444,7 +444,11 @@ def raw_query(body, sql, client, timer, stats=None):
     query_settings['max_threads'] = 1

     try:
-        result = NativeDriverReader(client).execute(
+        result = HTTPReader(
I think the only blocker is the rollout plan.
    def execute(
        self,
        query: str,
        settings: Optional[Mapping[str, str]] = None,
Is there a reason why the writer does not take settings per query (in the write method) while the reader does? If not, I would add them in both for consistency in the API.
I don't think there was a need for it at the call site; I'm fine with adding it, though.
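Threading per-call settings through to the request could look like this sketch, so the writer's signature matches the reader's. The helper name and URL layout are illustrative; only the pattern of encoding settings as query-string parameters (as the diff above does for the reader) is taken from the discussion.

```python
from typing import Mapping, Optional
from urllib.parse import urlencode

# Hypothetical helper: merge optional per-call settings into the request URL,
# mirroring how the reader passes its parameters to the ClickHouse HTTP port.
def build_url(base_url: str, settings: Optional[Mapping[str, str]] = None) -> str:
    parameters = dict(settings or {})
    return base_url + "?" + urlencode(parameters)
```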
snuba/clickhouse/http.py (Outdated)

            body=query.encode("utf-8"),
        )

        handle_clickhouse_response(response)
Nit on naming: since this method either does nothing to the response or throws, why not follow a naming convention like the raise_for_status method? We could rename this to something like assert_successful_response, which implies it will throw.
Makes sense, will come up with something better here.
        handle_clickhouse_response(response)

        result = json.loads(response.data.decode("utf-8"))
        del result["statistics"]
You are removing these just so as not to expose ClickHouse internals, is that correct?
Sort of, yeah — this is just to make the result structure consistent between both implementations (the native driver doesn't contain these fields.)
I think some of them might be helpful to use internally later (and can be fetched through other APIs with the native driver) but I want to avoid changing the return type too much here with this change. I also think we should probably try to avoid exposing them externally if possible.
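The normalization described above can be sketched as pruning the FORMAT JSON payload down to what the native driver also produces. Which keys to drop beyond "statistics" is illustrative; the key names follow ClickHouse's JSON output format.

```python
import json

# Sketch: trim the ClickHouse FORMAT JSON response to a shape consistent with
# the native driver's result. The exact set of pruned keys is illustrative.
def normalize_result(raw: bytes) -> dict:
    result = json.loads(raw.decode("utf-8"))
    for key in ("statistics", "rows", "rows_before_limit_at_least"):
        result.pop(key, None)
    return result
```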
@@ -478,7 +478,7 @@ def raw_query(body, sql, client, timer, stats=None):
     error = str(ex)
     status = 500
     logger.exception("Error running query: %s\n%s", sql, error)
-    if isinstance(ex, ClickHouseError):
+    if isinstance(ex, (NativeDriverClickHouseError, HTTPDriverClickHouseError)):
This is probably a smell that, at this point, the Reader doesn't raise a consistent error type.
        settings.CLICKHOUSE_HOST,
        settings.CLICKHOUSE_HTTP_PORT,
        {'output_format_json_quote_64bit_integers': '0'},
    )
Going to make this switchable, just haven't decided how yet.
This is also another thing that will require per-dataset configuration (since not every dataset is going to use the same host) but that's nothing new.
Also, the motivation for moving this to the module level is that the connection pool should probably only be instantiated once per process; doing this in raw_query probably would have created a lot of unnecessary connections.
Also going to need to make sure this is set to use
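The once-per-process instantiation mentioned above can be sketched with a memoized factory. `PooledReader` is a stand-in for the HTTPReader in this PR, used only so the example is self-contained.

```python
from functools import lru_cache


class PooledReader:
    """Stand-in for the HTTPReader in this PR; would own a connection pool."""

    def __init__(self, host: str, port: int):
        self.base_url = f"http://{host}:{port}/"


@lru_cache(maxsize=None)
def get_reader(host: str, port: int) -> PooledReader:
    # lru_cache makes this a per-process singleton per (host, port), so
    # repeated calls reuse the same reader (and thus the same pool) rather
    # than creating new connections on every request.
    return PooledReader(host, port)
```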
This has been replaced by #819, which is a much cleaner implementation of the same ideas.
No description provided.