feat: Add ability to set custom host per dataset #844
Conversation
The use case for this is the querylog dataset which will have only one host.
Changes pushed: b418b4e to b904492
I think there are a couple of issues to address
snuba/settings.py (Outdated)

# Overrides the default values for the specified dataset
CLICKHOUSE_HOST_BY_DATASET: Mapping[str, str] = {}
CLICKHOUSE_PORT_BY_DATASET: Mapping[str, int] = {}
CLICKHOUSE_HTTP_PORT_BY_DATASET: Mapping[str, int] = {}
could we please have a Mapping[str, NamedTuple] like:
CLICKHOUSE_DATASET_CONNECTIONS = {
    "events": ClickhouseConnectionConfig(host="asdasd", port=1234),
}
CLICKHOUSE_DEFAULT_CONNECTION = ClickhouseConnectionConfig(....)
This will make it structurally impossible to have an invalid configuration where the dataset is present but the port is missing. At the same time it will make it easier to add new parameters if needed.
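A minimal sketch of the shape this suggestion describes. The `connection_for` helper, the example hosts, and the "querylog" entry are illustrative assumptions, not the actual snuba settings:

```python
from typing import Mapping, NamedTuple


class ClickhouseConnectionConfig(NamedTuple):
    """Connection parameters for one ClickHouse node."""
    host: str
    port: int


# Hypothetical per-dataset overrides plus a process-wide default.
CLICKHOUSE_DATASET_CONNECTIONS: Mapping[str, ClickhouseConnectionConfig] = {
    "querylog": ClickhouseConnectionConfig(host="clickhouse-querylog", port=9000),
}
CLICKHOUSE_DEFAULT_CONNECTION = ClickhouseConnectionConfig(host="localhost", port=9000)


def connection_for(dataset_name: str) -> ClickhouseConnectionConfig:
    # A dataset either has a complete (host, port) pair or falls back to
    # the default; a host without a port is unrepresentable by construction.
    return CLICKHOUSE_DATASET_CONNECTIONS.get(dataset_name, CLICKHOUSE_DEFAULT_CONNECTION)
```

Because the tuple bundles host and port together, the "dataset present but port missing" state that three parallel dicts allow simply cannot be expressed.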
reader: Reader[ClickhouseQuery] = NativeDriverReader(clickhouse_ro)
Making the reader reusable was intentional. This change reverts that.
There is some context here.
#780
I think you can decide whether to create all of them upfront or to lazily initialize the Reader and Writer objects, but they should be reusable and not instantiated on every request.
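The lazy option mentioned above can be sketched like this. `DatasetConnections` and `FakeReader` are hypothetical stand-ins (the real reader would be a `NativeDriverReader` wrapping a `ClickhousePool`); the point is only the build-once, reuse-forever pattern:

```python
from typing import Optional


class FakeReader:
    """Stand-in for a reader such as NativeDriverReader."""


class DatasetConnections:
    """Builds the reader on first use, then hands out the same instance."""

    def __init__(self) -> None:
        # Nothing is connected at construction time.
        self.__reader: Optional[FakeReader] = None

    def get_reader(self) -> FakeReader:
        if self.__reader is None:
            # Instantiated exactly once, on the first call.
            self.__reader = FakeReader()
        return self.__reader
```

Every caller gets the same reader object, so per-query instantiation (and the cost or state loss that comes with it) is avoided.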
I'd like to make this reader a property of dataset, but need to figure out how to do this and avoid the circular dependency NativeDriverReader <- Dataset <- ClickhouseQuery <- NativeDriverReader
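One common way to break an import cycle like NativeDriverReader <- Dataset <- ClickhouseQuery <- NativeDriverReader in Python is to confine one edge of the cycle to type annotations only, so the module is never imported at runtime. This is a generic sketch with a hypothetical module path, not the solution the PR actually took:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported only while type checking; skipped at runtime, so the
    # runtime import cycle is broken. The path is hypothetical.
    from snuba.datasets.dataset import Dataset  # noqa: F401


class ClickhouseQuery:
    # The annotation is a string, so no runtime reference to Dataset
    # is needed when this class is defined.
    def __init__(self, dataset: "Dataset") -> None:
        self.dataset = dataset
```

This only helps when the dependency is needed for annotations; if the class is used at runtime (as with column_expr, discussed below), moving code into a separate module, as was eventually done here, is the other standard fix.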
I haven't looked closely enough at this to be able to give any specific feedback/direction on how to get there, but if it's useful here, my thinking was that the reader and writer would eventually be associated with a storage instead of a dataset. I don't know how this plays into the broader context of joins, though.
@lynnagara you are right, I forgot about that. ClickhouseQuery still depends on dataset till the AST is done because, guess what? column_expr
https://github.com/getsentry/snuba/blob/master/snuba/clickhouse/query.py#L51
It is not there to stay but I don't think you can remove that dependency just yet.
Sorry about the confusion.
No worries, I "solved" this by evicting the DictQuery to a separate file.
> I haven't looked closely enough at this to be able to give any specific feedback/direction on how to get there, but if it's useful here, my thinking was that the reader and writer would eventually be associated with a storage instead of a dataset. I don't know how this plays into the broader context of joins, though.
After more thought, I think I was wrong about this — the reader has to be able to reference multiple storages to support joins, so associating the reader with the storage is an oversimplification.
If we introduce clusters (or databases, or whatever a better noun is) to the data model like this (highlighted section is new), it would probably make sense to associate the reader with the cluster instead to allow joins:
This makes sense (the cluster idea). It is critical to figure out sooner rather than later how it fits into the query execution strategy. The code that loads the reader today does not know about storages (this decision happens earlier): https://github.com/getsentry/snuba/blob/master/snuba/web/query.py#L23.
Joins or not, this has to be solved: the last data structure that knows about the storages is the StorageQueryPlanBuilder, so there is a gap as soon as we have multiple clusters. The StorageQueryPlanBuilder, though, picks the execution strategy, so it is meant to influence a lot of parameters of query execution. I would start from there to see how to let the query execution code pick the right cluster.
snuba/datasets/cdc/groupassignee.py (Outdated)
def get_bulk_loader(self, source: BulkLoadSource, dest_table: str, clickhouse_client: ClickhousePool):
    return SingleTableBulkLoader(
        source=source,
        source_table=self.__postgres_table,
        dest_table=dest_table,
        clickhouse_client=clickhouse_client,
        row_processor=lambda row: GroupAssigneeRow.from_bulk(row).to_clickhouse(),
This does not sound right. If the ClickHouse pool is now a property of a dataset (indirectly, but it is still conceptually a property of the dataset), why would the TableWriter need to be fed the right ClickHouse client?
At the very least we should initialize the TableWriter with the right clickhouse_client, which is basically defined there and should not change.
But more broadly, since with this change you are effectively making the ClickHouse client a property of the dataset, I would advise that you try to make the Storage class (which is now committed) provide the Reader, and the WritableStorage provide the Writer, each properly initialized inside.
I believe we should have access to the storage and to the dataset in every place where we now need this, but I may be wrong.
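A rough sketch of the division of responsibilities the reviewer is suggesting. The class and method names mirror the ones mentioned in the comment, but the bodies are illustrative placeholders, not snuba's actual implementation:

```python
class Reader:
    """Placeholder for a ClickHouse reader implementation."""


class Writer:
    """Placeholder for a batch writer implementation."""


class Storage:
    """Owns the read path: the storage knows how it is read."""

    def __init__(self, reader: Reader) -> None:
        self.__reader = reader

    def get_reader(self) -> Reader:
        return self.__reader


class WritableStorage(Storage):
    """Adds the write path on top of a readable storage."""

    def __init__(self, reader: Reader, writer: Writer) -> None:
        super().__init__(reader)
        self.__writer = writer

    def get_writer(self) -> Writer:
        return self.__writer
```

With this split, code that only reads never sees a writer, and the connection objects are initialized where the storage is defined rather than threaded through by callers.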
A few thoughts from my perspective, in no particular order:
I've moved the connection/reader into the dataset now; they are created on initialisation and reused thereafter. I'm reluctant to push it into Storage right now since I think that will need a large refactor and Filippo has ongoing work on this topic. They can still be overridden in the optimize/cleanup commands (I assume that will still be useful and serve a purpose that cannot just come from settings).
Not sure whether you are planning to move the reader/writer into the storage, the comments should apply anyway.
logger = logging.getLogger("snuba.clickhouse")

class ClickhousePool(object):
Was this move also motivated by a circular dependency to break? If yes, which one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same one, since it was in the same module as NativeDriverReader before
def get_clickhouse_rw(self) -> ClickhousePool:
    return self.__clickhouse_rw
Whether you move the writers/readers into the storage abstraction in this PR or not, I think the writer at least should come from the TableWriter (either directly or through the stream_loader and replacer; I suspect there will be some debate there). Adding it to the dataset at this point reinstates a direct dependency from the consumer/replacer to the dataset object, while we are almost done making the two depend on the dataset only to retrieve the writer and replacer.
    clickhouse_connection_config.port,
    client_settings={"readonly": True},
)
self.__reader = NativeDriverReader(self.__clickhouse_ro)
I think this is still going to be a problem with the HTTP reader (which is stateful and thus was supposed to be shared), for two reasons:
- Now you would not have a single instance but one per dataset.
- There is no structural guarantee that datasets are singletons. With the current architecture they are supposed to be fully stateless and not make this assumption; they just happen to be singletons, but that should not be taken for granted, since nothing guarantees it. It is a guarantee we could consider adding if useful, but it is not the case now.
So an easy way out is to instantiate the reader singleton elsewhere and reference it from the dataset; that preserves the current state, and the number of instances of a dataset stops mattering.
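The "instantiate elsewhere and reference it" idea can be sketched with a process-wide cached factory. `HttpReader`, `get_http_reader`, and this `Dataset` are hypothetical names used only to show the pattern:

```python
from functools import lru_cache


class HttpReader:
    """Stand-in for the stateful HTTP reader that must be shared."""


@lru_cache(maxsize=None)
def get_http_reader() -> HttpReader:
    # Exactly one instance per process, regardless of how many
    # Dataset objects exist or how often they are re-created.
    return HttpReader()


class Dataset:
    def get_reader(self) -> HttpReader:
        # The dataset references the shared singleton instead of owning one,
        # so datasets stay stateless and need not be singletons themselves.
        return get_http_reader()
```

Two freshly constructed datasets still share the same reader, which is the property the review comment is after.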
I would move this into the Storage, but I suspect one per storage is still unacceptable. In that case, we can have a mapping between storages and clusters and have one reader per cluster.
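One reader per cluster via a storage-to-cluster mapping could look like the following. The mapping contents and all names here are hypothetical, chosen only to illustrate the sharing behaviour:

```python
from functools import lru_cache


class ClusterReader:
    """Placeholder reader tied to one cluster."""

    def __init__(self, cluster: str) -> None:
        self.cluster = cluster


# Hypothetical storage -> cluster mapping; real contents would come
# from configuration.
STORAGE_CLUSTERS = {
    "events": "main",
    "transactions": "main",
    "querylog": "querylog",
}


@lru_cache(maxsize=None)
def reader_for_cluster(cluster: str) -> ClusterReader:
    # One reader instance per cluster, created on first request.
    return ClusterReader(cluster)


def reader_for_storage(storage: str) -> ClusterReader:
    return reader_for_cluster(STORAGE_CLUSTERS[storage])
```

Storages that live on the same cluster transparently share a reader, while storages on different clusters get distinct ones.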
@@ -97,7 +98,7 @@ class EventsDataset(TimeSeriesDataset):
     and the particular quirks of storing and querying them.
     """

-    def __init__(self) -> None:
+    def __init__(self, clickhouse_connection_config: ClickhouseConnectionConfig) -> None:
What's the reasoning for passing the config during instantiation instead of letting the dataset code pick up the right config from settings? This seems confusing because:
- whoever needs a dataset instance would have to pass a ClickHouse config, even if they only need to deal with the data model and never run a query;
- the config passed in is not really honored, since the dataset instance is cached at first access and never built again (that's why we ensure get_dataset takes no input parameters other than the dataset name).
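The caching behaviour described in the second bullet can be demonstrated with a cached factory. This is a generic sketch, not snuba's actual `get_dataset`:

```python
from functools import lru_cache


class Dataset:
    def __init__(self, name: str) -> None:
        self.name = name


@lru_cache(maxsize=None)
def get_dataset(name: str) -> Dataset:
    # Cached on first access: if this took extra constructor arguments,
    # only the very first caller's values would ever take effect, which
    # is why the factory accepts nothing but the dataset name.
    return Dataset(name)
```

Any connection config threaded through such a factory would silently apply only to the first call, so config belongs in settings lookup rather than in the constructor signature.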
It was an easy way to facilitate overriding the values from the CLI commands. That'll probably still need to end up as an option here, but yes, it can be None most of the time and pick up default values from settings.
Going to overhaul this to make the hosts based on storage instead of dataset.
Closing because this is the wrong approach.