
Clustered Alertmanager instances not joining via gossip #1449

Closed

povilasv opened this issue Jul 1, 2018 · 3 comments


povilasv commented Jul 1, 2018

What did you do?

Started a clustered Alertmanager (v0.15) in Kubernetes and restarted all of the nodes.

Ref #1428

What did you expect to see?

The Alertmanager instances eventually joining each other via gossip.

What did you see instead? Under which circumstances?

The problem is that when you restart all of the Alertmanager instances in an environment like Kubernetes, DNS may still contain the IPs of the old instances but, at startup (when Join() happens), none of the new instance IPs. Because DNS is not empty at that point, resolvePeers (called with waitIfEmpty=true) returns immediately, and "islands" of single Alertmanager instances form.
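For illustration, here is a minimal Go sketch of that startup behaviour (not the actual cluster.go code; the lookupPeers helper and the retry interval are made up). With waitIfEmpty=true the loop is satisfied by whatever stale records DNS still returns, and the subsequent Join() against those dead IPs just times out:

package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// lookupPeers resolves the peer DNS name (the headless Service) to addresses.
func lookupPeers(ctx context.Context, name string) ([]net.IPAddr, error) {
	return net.DefaultResolver.LookupIPAddr(ctx, name)
}

// resolvePeersSketch mimics the behaviour described above: with
// waitIfEmpty=true it only keeps retrying while DNS returns *nothing*.
// Records pointing at already-terminated pods therefore pass the check,
// and the later memberlist Join() against those IPs times out, leaving
// an "island" of one instance.
func resolvePeersSketch(ctx context.Context, name string, waitIfEmpty bool) ([]net.IPAddr, error) {
	for {
		addrs, err := lookupPeers(ctx, name)
		if err == nil && (len(addrs) > 0 || !waitIfEmpty) {
			return addrs, nil // stale IPs are accepted here
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(time.Second):
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	addrs, err := resolvePeersSketch(ctx, "alertmanager-peers.sys-mon", true)
	fmt.Println(addrs, err)
}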

  • Alertmanager version:

alertmanager:v0.15.0-rc.1

  • Alertmanager configuration file:
apiVersion: v1
kind: Service
metadata:
  labels:
    name: alertmanager-peers
  name: alertmanager-peers
  namespace: sys-mon
spec:
  clusterIP: None
  ports:
  - name: cluster
    protocol: TCP
    port: 8001
    targetPort: cluster
  selector:
    app: alertmanager
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: sys-mon
spec:
  replicas: 3
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      name: alertmanager
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: prom/alertmanager:v0.15.0-rc.1
        args:
          - --config.file=/etc/alertmanager/config.yml
          - --web.listen-address=0.0.0.0:9093
          - --storage.path=/alertmanager
          - --web.external-url=https://alertmanager.dev.uw.systems
          - --cluster.listen-address=0.0.0.0:8001
          - --cluster.peer=alertmanager-peers.sys-mon:8001
  • Logs:

logs of alertmanager1:

level=info ts=2018-06-21T14:35:44.824688253Z caller=main.go:141 build_context="(go=go1.10, user=root@f278953f13ef, date=20180323-13:05:10)"
level=warn ts=2018-06-21T14:36:04.840918631Z caller=cluster.go:129 component=cluster msg="failed to join cluster" err="2 errors occurred:\n\n* Failed to join 10.2.19.164: dial tcp 10.2.19.164:8001: i/o timeout\n* Failed to join 10.2.43.52: dial tcp 10.2.43.52:8001: i/o timeout"
level=info ts=2018-06-21T14:36:04.841778884Z caller=cluster.go:249 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2018-06-21T14:36:04.84242773Z caller=main.go:270 msg="Loading configuration file" file=/etc/alertmanager/config.yml
level=info ts=2018-06-21T14:36:04.849504109Z caller=main.go:346 msg=Listening address=0.0.0.0:9093
level=info ts=2018-06-21T14:36:06.84218093Z caller=cluster.go:274 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000165572s
level=info ts=2018-06-21T14:36:14.843072999Z caller=cluster.go:266 component=cluster msg="gossip settled; proceeding" elapsed=10.001056858s
level=info ts=2018-06-21T14:51:04.842216499Z caller=nflog.go:313 component=nflog msg="Running maintenance"
level=info ts=2018-06-21T14:51:04.842784642Z caller=silence.go:252 component=silences msg="Running maintenance"
level=info ts=2018-06-21T14:51:04.84482936Z caller=silence.go:254 component=silences msg="Maintenance done" duration=2.046436ms size=0
level=info ts=2018-06-21T14:51:04.844844053Z caller=nflog.go:315 component=nflog msg="Maintenance done" duration=2.631928ms size=6305

alertmanager2:

level=info ts=2018-06-21T14:35:44.824589916Z caller=main.go:140 msg="Starting Alertmanager" version="(version=0.15.0-rc.1, branch=HEAD, revision=acb111e812530bec1ac6d908bc14725793e07cf3)"
level=info ts=2018-06-21T14:35:44.824688253Z caller=main.go:141 build_context="(go=go1.10, user=root@f278953f13ef, date=20180323-13:05:10)"
level=warn ts=2018-06-21T14:36:04.840918631Z caller=cluster.go:129 component=cluster msg="failed to join cluster" err="2 errors occurred:\n\n* Failed to join 10.2.19.164: dial tcp 10.2.19.164:8001: i/o timeout\n* Failed to join 10.2.43.52: dial tcp 10.2.43.52:8001: i/o timeout"
level=info ts=2018-06-21T14:36:04.841778884Z caller=cluster.go:249 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2018-06-21T14:36:04.84242773Z caller=main.go:270 msg="Loading configuration file" file=/etc/alertmanager/config.yml
level=info ts=2018-06-21T14:36:04.849504109Z caller=main.go:346 msg=Listening address=0.0.0.0:9093
level=info ts=2018-06-21T14:36:06.84218093Z caller=cluster.go:274 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000165572s
level=info ts=2018-06-21T14:36:14.843072999Z caller=cluster.go:266 component=cluster msg="gossip settled; proceeding" elapsed=10.001056858s
level=info ts=2018-06-21T14:51:04.842216499Z caller=nflog.go:313 component=nflog msg="Running maintenance"
level=info ts=2018-06-21T14:51:04.842784642Z caller=silence.go:252 component=silences msg="Running maintenance"
level=info ts=2018-06-21T14:51:04.84482936Z caller=silence.go:254 component=silences msg="Maintenance done" duration=2.046436ms size=0
level=info ts=2018-06-21T14:51:04.844844053Z caller=nflog.go:315 component=nflog msg="Maintenance done" duration=2.631928ms size=6305

Metrics

All Alertmanager metrics endpoints show: alertmanager_cluster_members 1
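For reference, the split can be confirmed with a quick check like the one below, which scrapes /metrics from each instance and prints the alertmanager_cluster_members sample (the pod addresses are placeholders):

package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Placeholder addresses; substitute the actual pod IPs of the three replicas.
	instances := []string{
		"http://10.2.0.10:9093",
		"http://10.2.0.11:9093",
		"http://10.2.0.12:9093",
	}
	for _, base := range instances {
		resp, err := http.Get(base + "/metrics")
		if err != nil {
			fmt.Println(base, "error:", err)
			continue
		}
		scanner := bufio.NewScanner(resp.Body)
		for scanner.Scan() {
			line := scanner.Text()
			if strings.HasPrefix(line, "alertmanager_cluster_members") {
				fmt.Println(base, line)
			}
		}
		resp.Body.Close()
	}
}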

@stuartnelson3 (Contributor)

Thanks for reporting this!

@therealgambo

+1 just experienced this now whilst redeploying our cluster.

@povilasv (Contributor, Author)

Closing as #1428 got merged
