Option to aggregate channel, queue and connection metrics #28
`prometheus.enable_metric_aggregation = true` rabbitmq-prometheus#26
Picking this one up now.
Signed-off-by: Gerhard Lazu <[email protected]>
Given 1k queues on rmq2, scrape duration is 2-3s. The RabbitMQ-Overview dashboard is not affected by these changes. I will test RabbitMQ-Quorum-Queues-Raft tomorrow; I expect a few changes will be needed there. I will then increase the number of queues all the way to 80k and see if this still holds. The last phase is to increase the number of connections & channels to 80k each and see if this optimisation holds.
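For reference, a minimal sketch of how a scrape duration like the one above could be measured by hand. The URL is an assumption: the hostname is only a placeholder based on the node name in this comment, and the port and path are what I understand to be the plugin defaults. `http_fetch` and `time_scrape` are hypothetical helpers, not part of this PR.

```python
import time
import urllib.request
from typing import Callable, Tuple

# Assumed endpoint: placeholder hostname, port and path assumed to be the plugin defaults.
METRICS_URL = "http://rmq2:15692/metrics"


def http_fetch(url: str = METRICS_URL, timeout: float = 60.0) -> bytes:
    """Fetch the raw metrics payload over HTTP (the 60s timeout is an assumption)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def time_scrape(fetch: Callable[[], bytes]) -> Tuple[float, int]:
    """Return (elapsed seconds, payload size in bytes) for a single scrape."""
    started = time.perf_counter()
    payload = fetch()
    elapsed = time.perf_counter() - started
    return elapsed, len(payload)
```

Calling `time_scrape(http_fetch)` against a running node returns the elapsed time and payload size. Taking the fetch function as a parameter keeps the timing logic testable without a broker.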
#24 (comment) Before we can test the effectiveness of the fix in #26 against a replica of the environment in which this was initially reported, we are missing the load app deployment that would generate all the connections and queues. It would be helpful to know whether https://github.com/coreos/kube-prometheus was used for the Prometheus & Grafana deployment. Signed-off-by: Gerhard Lazu <[email protected]>
I am picking this one up again, deploying 50k queues, 50k connections & 50k channels.
Tested on:
with 50k queues with & without metric aggregation enabled:
When I had 50k connections on top of the 50k queues, the metrics would time out after 60s:
With metric aggregation enabled & then with
Option to aggregate channel, queue and connection metrics (cherry picked from commit 82fafae)
It behaves as a gauge when metrics are not aggregated, which is not what we want. It's either a histogram, or it's a gauge, but it doesn't change type based on whether we are aggregating metrics or not. To be honest, I rushed #28 acceptance and didn't check this metric type properly. Hoping to pair up with @dcorbacho on this and get the histogram back. For now, let's keep it a gauge and go forward with the metric aggregation back-port into v3.8.x. Because the change from millis to micros was a breaking change in rabbitmq/ra#160, it was reverted, so we had to fix (missed the undefined in one of the merges - whoops) and revert to millis. Signed-off-by: Gerhard Lazu <[email protected]>
@dcorbacho can we pair up on this tomorrow? 5caa419
We want to keep the same metric type regardless of whether we aggregate or not. If we had used a histogram type, considering the ~12 buckets that we added, it would have meant 12 extra metrics per queue, which would have resulted in an explosion of metrics. Keeping the gauge type and aggregating latencies across all members. re #28 Signed-off-by: Gerhard Lazu <[email protected]>
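To put a rough number on that explosion, here is a back-of-the-envelope sketch using the 50k queues from the test run earlier in this thread. `extra_series` is a hypothetical helper; it counts only the ~12 bucket series mentioned above and ignores the `_sum` and `_count` series that a Prometheus histogram also exposes, so the real figure would be somewhat higher.

```python
def extra_series(num_queues: int, buckets_per_histogram: int = 12) -> int:
    """Extra time series if every queue exposed one series per histogram bucket."""
    return num_queues * buckets_per_histogram


def gauge_series(num_queues: int, aggregated: bool) -> int:
    """Series for one gauge: a single series when aggregated, else one per queue."""
    return 1 if aggregated else num_queues


# 50k queues with ~12 buckets each, versus a single aggregated gauge
histogram_total = extra_series(50_000)
aggregated_total = gauge_series(50_000, aggregated=True)
```

Under these assumptions, the per-queue histogram comes to 600,000 extra series, while the aggregated gauge stays at one.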
We want to keep the same metric type regardless of whether we aggregate or not. If we had used a histogram type, considering the ~12 buckets that we added, it would have meant 12 extra metrics per queue, which would have resulted in an explosion of metrics. Keeping the gauge type and aggregating latencies across all members. re #28 Signed-off-by: Gerhard Lazu <[email protected]> (cherry picked from commit 3a24c4a) Signed-off-by: Gerhard Lazu <[email protected]>
It behaves as a gauge when metrics are not aggregated, which is not what we want. It's either a histogram, or it's a gauge, but it doesn't change type based on whether we are aggregating metrics or not. To be honest, I rushed rabbitmq/rabbitmq-prometheus#28 acceptance and didn't check this metric type properly. Hoping to pair up with @dcorbacho on this and get the histogram back. For now, let's keep it a gauge and go forward with the metric aggregation back-port into v3.8.x. Because the change from millis to micros was a breaking change in rabbitmq/ra#160, it was reverted, so we had to fix (missed the undefined in one of the merges - whoops) and revert to millis. Signed-off-by: Gerhard Lazu <[email protected]>
We want to keep the same metric type regardless of whether we aggregate or not. If we had used a histogram type, considering the ~12 buckets that we added, it would have meant 12 extra metrics per queue, which would have resulted in an explosion of metrics. Keeping the gauge type and aggregating latencies across all members. re rabbitmq/rabbitmq-prometheus#28 Signed-off-by: Gerhard Lazu <[email protected]> (cherry picked from commit 3a24c4a7b44e3cb4c1b60918c85052bf667a053e) Signed-off-by: Gerhard Lazu <[email protected]>
prometheus.return_per_object_metrics = false
Closes #26, see #24 and #25 for the background.
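As a conceptual illustration of what turning off per-object metrics means for the scrape output, here is a minimal sketch. It assumes aggregation amounts to summing each metric across all objects, which is a simplification and not the plugin's actual code. The `Sample` tuple, the `aggregate` helper and the metric and queue names are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (metric name, object name, value); names are illustrative only
Sample = Tuple[str, str, float]


def aggregate(samples: Iterable[Sample]) -> Dict[str, float]:
    """Collapse per-object samples into one total per metric name."""
    totals: Dict[str, float] = defaultdict(float)
    for metric, _object_name, value in samples:
        totals[metric] += value
    return dict(totals)


per_object = [
    ("queue_messages_ready", "q1", 5.0),
    ("queue_messages_ready", "q2", 7.0),
    ("queue_consumers", "q1", 1.0),
    ("queue_consumers", "q2", 2.0),
]
aggregated = aggregate(per_object)
```

With per-object output the number of series grows with the number of queues, connections and channels. After aggregation it depends only on the number of distinct metric names.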