feat: Add ability to set custom host per dataset #844

Closed
wants to merge 6 commits into master from custom-dataset-host

Conversation

lynnagara
Member

The use case for this is the querylog dataset which will have only one
host.

@lynnagara lynnagara requested a review from a team as a code owner March 18, 2020 20:47
@lynnagara lynnagara force-pushed the custom-dataset-host branch from b418b4e to b904492 Compare March 18, 2020 20:54
Contributor

@fpacifici fpacifici left a comment

I think there are a couple of issues to address

Comment on lines 27 to 31
# Overrides the default values for the specified dataset
CLICKHOUSE_HOST_BY_DATASET: Mapping[str, str] = {}
CLICKHOUSE_PORT_BY_DATASET: Mapping[str, int] = {}
CLICKHOUSE_HTTP_PORT_BY_DATASET: Mapping[str, int] = {}

Contributor

Could we please have a Mapping[str, NamedTuple] like:

CLICKHOUSE_DATASET_CONNECTIONS = {
    "events": ClickhouseConnectionConfig(host="asdasd", port=1234),
}
CLICKHOUSE_DEFAULT_CONNECTION = ClickhouseConnectionConfig(....)

This will make it structurally impossible to have an invalid configuration where the dataset is present and the port is missing. At the same time it will make it easier to add new parameters if needed.
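
For illustration, a minimal sketch of what that suggestion could look like (the field names and example values here are assumptions, not code from this PR):

from typing import Mapping, NamedTuple


class ClickhouseConnectionConfig(NamedTuple):
    # Bundling the parameters means a dataset entry is either fully
    # specified or absent: a host can never appear without its ports.
    host: str
    port: int
    http_port: int


# Per-dataset overrides; any dataset not listed uses the default below.
CLICKHOUSE_DATASET_CONNECTIONS: Mapping[str, ClickhouseConnectionConfig] = {
    "querylog": ClickhouseConnectionConfig(
        host="clickhouse-querylog", port=9000, http_port=8123
    ),
}
CLICKHOUSE_DEFAULT_CONNECTION = ClickhouseConnectionConfig(
    host="localhost", port=9000, http_port=8123
)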


reader: Reader[ClickhouseQuery] = NativeDriverReader(clickhouse_ro)
Contributor

Making the reader reusable was intentional. This change reverts that.
There is some context here.
#780
I think you can decide whether to create all of them upfront or to lazily initialize the reader and writer objects, but they should be reusable and not instantiated every time.
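
As a sketch of the lazy option (the use of cached_property, which requires Python 3.8+, and the import path are assumptions):

from functools import cached_property

# Import path assumed; it depends on where these classes live after
# this PR's moves.
from snuba.clickhouse.native import ClickhousePool, NativeDriverReader


class SomeDataset:
    def __init__(self, pool: ClickhousePool) -> None:
        self.__pool = pool

    @cached_property
    def reader(self) -> NativeDriverReader:
        # Built on first access and reused for every later query,
        # rather than being instantiated per request.
        return NativeDriverReader(self.__pool)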

Member Author

I'd like to make this reader a property of dataset, but need to figure out how to do this and avoid the circular dependency NativeDriverReader <- Dataset <- ClickhouseQuery <- NativeDriverReader

Contributor

I haven't looked closely enough at this to be able to give any specific feedback/direction on how to get there, but if it's useful here, my thinking was that the reader and writer would eventually be associated with a storage instead of a dataset. I don't know how this plays into the broader context of joins, though.

Contributor

@lynnagara you are right, I forgot about that. ClickhouseQuery still depends on dataset until the AST is done because, guess what? column_expr.
https://github.com/getsentry/snuba/blob/master/snuba/clickhouse/query.py#L51
It is not there to stay but I don't think you can remove that dependency just yet.
Sorry about the confusion.

Member Author

No worries, I "solved" this by evicting the DictQuery to a separate file.

Contributor

@tkaemming tkaemming Mar 25, 2020

> I haven't looked closely enough at this to be able to give any specific feedback/direction on how to get there, but if it's useful here, my thinking was that the reader and writer would eventually be associated with a storage instead of a dataset. I don't know how this plays into the broader context of joins, though.

After more thought, I think I was wrong about this — the reader has to be able to reference multiple storages to support joins, so associating the reader with the storage is an oversimplification.

If we introduce clusters (or databases, or whatever a better noun is) to the data model like this (highlighted section is new), it would probably make sense to associate the reader with the cluster instead to allow joins:

[Screenshot: data model diagram with the proposed cluster layer highlighted]

Contributor

This makes sense (the cluster idea). It is critical to figure out sooner rather than later how that fits into the query execution strategy. The code that loads the reader today does not know about storages (this decision predates storages) https://github.com/getsentry/snuba/blob/master/snuba/web/query.py#L23.

This is something that has to be solved, joins or not, since the last data structure that knows about the storages is the StorageQueryPlanBuilder, so there is a gap as soon as we have multiple clusters. The StorageQueryPlanBuilder, though, picks the execution strategy, so it is meant to influence a lot of the parameters of query execution. I would start from there to see how to let the query execution code pick the right cluster.
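
Purely to illustrate that gap, a sketch in which every class is hypothetical scaffolding rather than snuba code: if the plan builder resolves the cluster, the code in snuba/web/query.py no longer needs to know which host to hit.

from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    host: str
    port: int


@dataclass(frozen=True)
class Storage:
    cluster: Cluster


@dataclass(frozen=True)
class QueryPlan:
    # The plan carries the cluster, so downstream execution code can
    # open the right connection without knowing about storages.
    cluster: Cluster


class StorageQueryPlanBuilder:
    # The plan builder is the last component that knows the storages,
    # so it is the natural place to pick the cluster.
    def __init__(self, storage: Storage) -> None:
        self.__storage = storage

    def build_plan(self) -> QueryPlan:
        return QueryPlan(cluster=self.__storage.cluster)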

Comment on lines 28 to 34
def get_bulk_loader(self, source: BulkLoadSource, dest_table: str, clickhouse_client: ClickhousePool):
    return SingleTableBulkLoader(
        source=source,
        source_table=self.__postgres_table,
        dest_table=dest_table,
        clickhouse_client=clickhouse_client,
        row_processor=lambda row: GroupAssigneeRow.from_bulk(row).to_clickhouse(),
    )
Contributor

This does not sound right. If the clickhouse pool is now a property of the dataset (indirectly, but still conceptually a property of the dataset), why would the TableWriter need to be fed the right clickhouse client?
At the very least we should initialize the TableWriter with the right clickhouse_client, which is basically defined there and should not change.
More broadly, since with this change you are effectively making the clickhouse client a property of the dataset, I would advise you try to make the Storage class (which is now committed) provide the reader, and the WritableStorage provide the writer, so that both can be properly initialized inside.
I believe we should have access to the storage and to the dataset in every place where we now need this, but I may be wrong.
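
A rough sketch of the shape being suggested (the method names are assumptions; Reader and TableWriter stand in for the existing snuba classes):

from abc import ABC, abstractmethod


class Storage(ABC):
    @abstractmethod
    def get_reader(self) -> "Reader":
        # The storage owns its reader and initializes it internally
        # with the right connection, instead of having a clickhouse
        # client threaded through from the outside.
        ...


class WritableStorage(Storage):
    @abstractmethod
    def get_table_writer(self) -> "TableWriter":
        ...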

@tkaemming
Contributor

A few thoughts from my perspective, in no particular order:

  1. The intention of the environment module was to store global state for the Snuba installation. It should not know about the concept of datasets if at all possible. (This is not documented and is probably not particularly clear at a glance, sorry about that.)
  2. Ideally, datasets that reference the same cluster (specifically ClickHouse in this context, but this could also be Redis, Kafka, etc. as a pattern moving forward) would share the same client if possible. This could be done by assigning a name to the cluster, or based on matching addresses (see the sketch after this list).
  3. cleanup and optimize take --clickhouse-{host,port} arguments that override the defaults as well, and that doesn't seem to be accounted for here. It'd also be helpful (in my opinion at least) if we could get the cluster topology definition for cleanup out of the ops repository and into this project's configuration instead (mostly just to simplify the lives of on-premise/single-tenant users).
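
As a sketch of point 2 (the helper and cache are hypothetical; the import path depends on where ClickhousePool lives after this PR's moves):

from typing import Dict, Tuple

from snuba.clickhouse.native import ClickhousePool  # path assumed

# Clients keyed by address, so two datasets pointing at the same
# ClickHouse cluster share one pool instead of each opening its own.
_clients: Dict[Tuple[str, int], ClickhousePool] = {}


def get_client(host: str, port: int) -> ClickhousePool:
    key = (host, port)
    if key not in _clients:
        _clients[key] = ClickhousePool(host, port)
    return _clients[key]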

@lynnagara
Member Author

I've moved the connection/reader into the dataset now; they are created on initialisation and reused thereafter. I'm reluctant to push it into Storage right now since I think that will need a large refactor, and Filippo has ongoing work on this topic. They can still be overridden in the optimize/cleanup commands (I assume that will still be useful and serve a purpose that cannot just come from settings).

Contributor

@fpacifici fpacifici left a comment

Not sure whether you are planning to move the reader/writer into the storage; the comments should apply either way.

logger = logging.getLogger("snuba.clickhouse")


class ClickhousePool(object):
Contributor

Was this move also motivated by a circular dependency to break? If yes, which one?

Member Author

The same one, since it was in the same module as NativeDriverReader before.

Comment on lines +182 to +183
def get_clickhouse_rw(self) -> ClickhousePool:
return self.__clickhouse_rw
Contributor

Whether or not you move the writers/readers into the storage abstraction in this PR, I think the writer at least should come from the TableWriter (either directly or through the stream_loader and replacer; I suspect there will be some debate there). Adding it to the dataset at this point reinstates a direct dependency from the consumer/replacer to the dataset object, while we are almost done making the two depend on the dataset only to retrieve the writer and replacer.

    clickhouse_connection_config.port,
    client_settings={"readonly": True}
)
self.__reader = NativeDriverReader(self.__clickhouse_ro)
Contributor

I think this is still going to be a problem with the HTTP reader (which is stateful and thus was supposed to be shared), for two reasons:

  • Now you would not have a single instance but one per dataset.
  • There is no structural guarantee that datasets are singletons. They are supposed to be fully stateless and not make this assumption under the current architecture; they just happen to be singletons, but this should not be taken for granted. Nothing guarantees it. It is a guarantee we could consider adding if useful, but it is not there now.

So an easy way out is to instantiate the reader singleton elsewhere and reference it from the dataset. That preserves the current behavior, and the number of instances of a dataset no longer matters.
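
For example, a module-level singleton along these lines (a sketch; the module layout and the constructor arguments are assumptions):

from typing import Optional

# Import path assumed.
from snuba.clickhouse.native import ClickhousePool, NativeDriverReader

_reader: Optional[NativeDriverReader] = None


def get_reader() -> NativeDriverReader:
    # One shared instance regardless of how many dataset objects
    # exist; datasets reference this instead of building their own.
    global _reader
    if _reader is None:
        _reader = NativeDriverReader(
            ClickhousePool(client_settings={"readonly": True})
        )
    return _reader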

Member Author

I would move this into the Storage, but I suspect one per storage is still unacceptable. In that case, we can have a mapping between storages and clusters and keep one per cluster.

@@ -97,7 +98,7 @@ class EventsDataset(TimeSeriesDataset):
and the particular quirks of storing and querying them.
"""

def __init__(self) -> None:
def __init__(self, clickhouse_connection_config: ClickhouseConnectionConfig) -> None:
Contributor

What's the reasoning for passing the config during instantiation instead of letting the dataset code pick up the right config from settings? This seems confusing because:

  1. whoever needs a dataset instance would have to pass a clickhouse config, even if they only need to deal with the data model and never run a query.
  2. the config passed in is not really honored, since the dataset instance is cached on first access and never built again; that's why we ensure get_dataset takes no input parameters other than the dataset name (see the sketch below).
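
For context, a simplified sketch of the caching behavior described in point 2 (the real factory lives in snuba/datasets/factory.py; the names here are approximations):

from typing import Callable, MutableMapping

# Maps dataset names to zero-argument constructors (approximation).
DATASET_CONSTRUCTORS: MutableMapping[str, Callable[[], "Dataset"]] = {}
_cache: MutableMapping[str, "Dataset"] = {}


def get_dataset(name: str) -> "Dataset":
    if name not in _cache:
        _cache[name] = DATASET_CONSTRUCTORS[name]()
    # A config passed on a later call would be silently ignored: the
    # instance built on first access is always returned, which is why
    # get_dataset takes only the dataset name.
    return _cache[name]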

Member Author

It was an easy way to facilitate overriding the values from the CLI commands. That'll probably still need to end up as an option here, but yes, it can be None most of the time and pick up default values from settings.

@lynnagara
Member Author

Going to overhaul this to make the hosts based on storage instead of dataset.

lynnagara added a commit that referenced this pull request Apr 8, 2020
This simply moves the DictClickhouseQuery into a separate file from
ClickhouseQuery.

This has come up before in #844 and #860, and causes circular
dependencies since DictClickhouseQuery (which will be deprecated)
currently depends on Dataset.
@lynnagara
Member Author

Closing because this is the wrong approach.

@lynnagara lynnagara closed this Apr 16, 2020
@lynnagara lynnagara deleted the custom-dataset-host branch April 16, 2020 18:28