feat: Add ability to set custom host per dataset #844

Closed
wants to merge 6 commits into master from custom-dataset-host

Conversation

lynnagara
Member

The use case for this is the querylog dataset which will have only one
host.

@lynnagara lynnagara requested a review from a team as a code owner March 18, 2020 20:47
@lynnagara lynnagara force-pushed the custom-dataset-host branch from b418b4e to b904492 Compare March 18, 2020 20:54
Contributor

@fpacifici fpacifici left a comment

I think there are a couple of issues to address

Comment on lines 27 to 31
# Overrides the default values for the specified dataset
CLICKHOUSE_HOST_BY_DATASET: Mapping[str, str] = {}
CLICKHOUSE_PORT_BY_DATASET: Mapping[str, int] = {}
CLICKHOUSE_HTTP_PORT_BY_DATASET: Mapping[str, int] = {}

Contributor

Could we please have a Mapping[str, NamedTuple] like:

CLICKHOUSE_DATASET_CONNECTIONS = {
    "events": ClickhouseConnectionConfig(host="asdasd", port=1234),
}
CLICKHOUSE_DEFAULT_CONNECTION = ClickhouseConnectionConfig(....)

This will make it structurally impossible to have an invalid configuration where the dataset is present and the port is missing. At the same time it will make it easier to add new parameters if needed.
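
For illustration, a minimal sketch of what that suggestion could look like (the field names and example values here are assumptions, not code from this PR):

from typing import Mapping, NamedTuple


class ClickhouseConnectionConfig(NamedTuple):
    # Bundling the parameters means a dataset entry is either fully
    # specified or absent: a host can never appear without its ports.
    host: str
    port: int
    http_port: int


# Per-dataset overrides; any dataset not listed uses the default below.
CLICKHOUSE_DATASET_CONNECTIONS: Mapping[str, ClickhouseConnectionConfig] = {
    "querylog": ClickhouseConnectionConfig(
        host="clickhouse-querylog", port=9000, http_port=8123
    ),
}
CLICKHOUSE_DEFAULT_CONNECTION = ClickhouseConnectionConfig(
    host="localhost", port=9000, http_port=8123
)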


reader: Reader[ClickhouseQuery] = NativeDriverReader(clickhouse_ro)
Contributor

Making the reader reusable was intentional. This change reverts that.
There is some context here.
#780
I think you can decide whether to create all of them upfront or to lazily initialize the reader and writer objects, but they should be reusable and not instantiated every time.
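
As a sketch of the lazy option (the use of cached_property, which requires Python 3.8+, and the import path are assumptions):

from functools import cached_property

# Import path assumed; it depends on where these classes live after
# this PR's moves.
from snuba.clickhouse.native import ClickhousePool, NativeDriverReader


class SomeDataset:
    def __init__(self, pool: ClickhousePool) -> None:
        self.__pool = pool

    @cached_property
    def reader(self) -> NativeDriverReader:
        # Built on first access and reused for every later query,
        # rather than being instantiated per request.
        return NativeDriverReader(self.__pool)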

Member Author

I'd like to make this reader a property of dataset, but need to figure out how to do this and avoid the circular dependency NativeDriverReader <- Dataset <- ClickhouseQuery <- NativeDriverReader

Contributor

I haven't looked closely enough at this to be able to give any specific feedback/direction on how to get there, but if it's useful here, my thinking was that the reader and writer would eventually be associated with a storage instead of a dataset. I don't know how this plays into the broader context of joins, though.

Contributor

@lynnagara you are right, I forgot about that. ClickhouseQuery still depends on dataset until the AST is done because, guess what? column_expr.
https://github.com/getsentry/snuba/blob/master/snuba/clickhouse/query.py#L51
It is not there to stay but I don't think you can remove that dependency just yet.
Sorry about the confusion.

Member Author

No worries, I "solved" this by evicting the DictQuery to a separate file.

Contributor

@tkaemming tkaemming Mar 25, 2020

> I haven't looked closely enough at this to be able to give any specific feedback/direction on how to get there, but if it's useful here, my thinking was that the reader and writer would eventually be associated with a storage instead of a dataset. I don't know how this plays into the broader context of joins, though.

After more thought, I think I was wrong about this — the reader has to be able to reference multiple storages to support joins, so associating the reader with the storage is an oversimplification.

If we introduce clusters (or databases, or whatever a better noun is) to the data model like this (highlighted section is new), it would probably make sense to associate the reader with the cluster instead to allow joins:

[Screenshot: data model diagram with the proposed cluster layer highlighted]

Contributor

This makes sense (the cluster idea). It is critical to figure out sooner rather than later how that fits into the query execution strategy. The code that loads the reader today does not know about storages (this decision predates storages) https://github.com/getsentry/snuba/blob/master/snuba/web/query.py#L23.

This is something that has to be solved, joins or not, since the last data structure that knows about the storages is the StorageQueryPlanBuilder, so there is a gap as soon as we have multiple clusters. The StorageQueryPlanBuilder, though, picks the execution strategy, so it is meant to influence a lot of the parameters of query execution. I would start from there to see how to let the query execution code pick the right cluster.
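
Purely to illustrate that gap, a sketch in which every class is hypothetical scaffolding rather than snuba code: if the plan builder resolves the cluster, the code in snuba/web/query.py no longer needs to know which host to hit.

from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    host: str
    port: int


@dataclass(frozen=True)
class Storage:
    cluster: Cluster


@dataclass(frozen=True)
class QueryPlan:
    # The plan carries the cluster, so downstream execution code can
    # open the right connection without knowing about storages.
    cluster: Cluster


class StorageQueryPlanBuilder:
    # The plan builder is the last component that knows the storages,
    # so it is the natural place to pick the cluster.
    def __init__(self, storage: Storage) -> None:
        self.__storage = storage

    def build_plan(self) -> QueryPlan:
        return QueryPlan(cluster=self.__storage.cluster)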

Comment on lines 28 to 34
def get_bulk_loader(self, source: BulkLoadSource, dest_table: str, clickhouse_client: ClickhousePool):
    return SingleTableBulkLoader(
        source=source,
        source_table=self.__postgres_table,
        dest_table=dest_table,
        clickhouse_client=clickhouse_client,
        row_processor=lambda row: GroupAssigneeRow.from_bulk(row).to_clickhouse(),
    )
Contributor

This does not sound right. If the clickhouse pool is now a property of the dataset (indirectly, but still conceptually a property of the dataset), why would the TableWriter need to be fed the right clickhouse client?
At the very least we should initialize the TableWriter with the right clickhouse_client, which is basically defined there and should not change.
More broadly, since with this change you are effectively making the clickhouse client a property of the dataset, I would advise you try to make the Storage class (which is now committed) provide the reader, and the WritableStorage provide the writer, so that both can be properly initialized inside.
I believe we should have access to the storage and to the dataset in every place where we now need this, but I may be wrong.
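
A rough sketch of the shape being suggested (the method names are assumptions; Reader and TableWriter stand in for the existing snuba classes):

from abc import ABC, abstractmethod


class Storage(ABC):
    @abstractmethod
    def get_reader(self) -> "Reader":
        # The storage owns its reader and initializes it internally
        # with the right connection, instead of having a clickhouse
        # client threaded through from the outside.
        ...


class WritableStorage(Storage):
    @abstractmethod
    def get_table_writer(self) -> "TableWriter":
        ...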

@tkaemming
Contributor

A few thoughts from my perspective, in no particular order:

  1. The intention of the environment module was to store global state for the Snuba installation. It should not know about the concept of datasets if at all possible. (This is not documented and is probably not particularly clear at a glance, sorry about that.)
  2. Ideally, datasets that reference the same cluster (specifically ClickHouse in this context, but this could also be Redis, Kafka, etc. as a pattern moving forward) would share the same client if possible. This could be done by assigning a name to the cluster, or based on matching addresses (see the sketch after this list).
  3. cleanup and optimize take --clickhouse-{host,port} arguments that override the defaults as well, and that doesn't seem to be accounted for here. It'd also be helpful (in my opinion at least) if we could get the cluster topology definition for cleanup out of the ops repository and into this project's configuration instead (mostly just to simplify the lives of on-premise/single-tenant users).
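
As a sketch of point 2 (the helper and cache are hypothetical; the import path depends on where ClickhousePool lives after this PR's moves):

from typing import Dict, Tuple

from snuba.clickhouse.native import ClickhousePool  # path assumed

# Clients keyed by address, so two datasets pointing at the same
# ClickHouse cluster share one pool instead of each opening its own.
_clients: Dict[Tuple[str, int], ClickhousePool] = {}


def get_client(host: str, port: int) -> ClickhousePool:
    key = (host, port)
    if key not in _clients:
        _clients[key] = ClickhousePool(host, port)
    return _clients[key]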

@lynnagara
Member Author

I've moved the connection/reader into the dataset now; they are created on initialisation and reused thereafter. I'm reluctant to push it into Storage right now since I think that will need a large refactor, and Filippo has ongoing work on this topic. They can still be overridden in the optimize/cleanup commands (I assume that will still be useful and serve a purpose that cannot just come from settings).

Contributor

@fpacifici fpacifici left a comment

Not sure whether you are planning to move the reader/writer into the storage; the comments should apply either way.

logger = logging.getLogger("snuba.clickhouse")


class ClickhousePool(object):
Contributor

Was this move also motivated by a circular dependency to break? If yes, which one?

Member Author

The same one, since it was in the same module as NativeDriverReader before.

Comment on lines +182 to +183
def get_clickhouse_rw(self) -> ClickhousePool:
return self.__clickhouse_rw
Contributor

Whether or not you move the writers/readers into the storage abstraction in this PR, I think the writer at least should come from the TableWriter (either directly or through the stream_loader and replacer; I suspect there will be some debate there). Adding it to the dataset at this point reinstates a direct dependency from the consumer/replacer to the dataset object, while we are almost done making the two depend on the dataset only to retrieve the writer and replacer.

    clickhouse_connection_config.port,
    client_settings={"readonly": True}
)
self.__reader = NativeDriverReader(self.__clickhouse_ro)
Contributor

I think this is still going to be a problem with the HTTP reader (which is stateful and thus was supposed to be shared), for two reasons:

  • Now you would not have a single instance but one per dataset.
  • There is no structural guarantee that datasets are singletons. They are supposed to be fully stateless and not make this assumption under the current architecture; they just happen to be singletons, but this should not be taken for granted. Nothing guarantees it. It is a guarantee we could consider adding if useful, but it is not there now.

So an easy way out is to instantiate the reader singleton elsewhere and reference it from the dataset. That preserves the current behavior, and the number of instances of a dataset no longer matters.
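
For example, a module-level singleton along these lines (a sketch; the module layout and the constructor arguments are assumptions):

from typing import Optional

# Import path assumed.
from snuba.clickhouse.native import ClickhousePool, NativeDriverReader

_reader: Optional[NativeDriverReader] = None


def get_reader() -> NativeDriverReader:
    # One shared instance regardless of how many dataset objects
    # exist; datasets reference this instead of building their own.
    global _reader
    if _reader is None:
        _reader = NativeDriverReader(
            ClickhousePool(client_settings={"readonly": True})
        )
    return _reader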

Member Author

I would move this into the Storage, but I suspect one per storage is still unacceptable. In that case, we can have a mapping between storages and clusters and keep one per cluster.

@@ -97,7 +98,7 @@ class EventsDataset(TimeSeriesDataset):
and the particular quirks of storing and querying them.
"""

def __init__(self) -> None:
def __init__(self, clickhouse_connection_config: ClickhouseConnectionConfig) -> None:
Contributor

What's the reasoning for passing the config during instantiation instead of letting the dataset code pick up the right config from settings? This seems confusing because:

  1. whoever needs a dataset instance would have to pass a clickhouse config, even if they only need to deal with the data model and never run a query.
  2. the config passed in is not really honored, since the dataset instance is cached on first access and never built again; that's why we ensure get_dataset takes no input parameters other than the dataset name (see the sketch below).
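
For context, a simplified sketch of the caching behavior described in point 2 (the real factory lives in snuba/datasets/factory.py; the names here are approximations):

from typing import Callable, MutableMapping

# Maps dataset names to zero-argument constructors (approximation).
DATASET_CONSTRUCTORS: MutableMapping[str, Callable[[], "Dataset"]] = {}
_cache: MutableMapping[str, "Dataset"] = {}


def get_dataset(name: str) -> "Dataset":
    if name not in _cache:
        _cache[name] = DATASET_CONSTRUCTORS[name]()
    # A config passed on a later call would be silently ignored: the
    # instance built on first access is always returned, which is why
    # get_dataset takes only the dataset name.
    return _cache[name]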

Member Author

It was an easy way to facilitate overriding the values from the CLI commands. That'll probably still need to end up as an option here, but yes, it can be None most of the time and pick up default values from settings.

@lynnagara
Member Author

Going to overhaul this to make the hosts based on storage instead of dataset.

lynnagara added a commit that referenced this pull request Apr 8, 2020
This simply moves the DictClickhouseQuery into a separate file from
ClickhouseQuery.

This has come up before in #844 and #860, and causes circular
dependencies since DictClickhouseQuery (which will be deprecated)
currently depends on Dataset.
@lynnagara
Member Author

Closing because this is the wrong approach.

@lynnagara lynnagara closed this Apr 16, 2020
@lynnagara lynnagara deleted the custom-dataset-host branch April 16, 2020 18:28