CoreOS cluster restarted all containers due to fleet or etcd errors #1725
Description
Hello
We just saw a pretty severe issue on our production CoreOS setup. Details are:
- 3 CoreOS nodes running in AWS EC2 us-east-1
- m3.2xlarge instance types
- CoreOS versions - two nodes are at DISTRIB_RELEASE=1068.2.0 and one is at DISTRIB_RELEASE=1081.5.0
- etcd version 0.4.9
- we have auto-update disabled on CoreOS
- Around 21:56 UTC on Jan 17 we saw all our containers go down, and the logs suggest an issue with etcd:
Jan 17 21:56:22 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
Jan 17 21:56:22 ip-10-26-31-100.ec2.internal fleetd[999]: INFO server.go:157: Establishing etcd connectivity
Jan 17 21:56:22 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR engine.go:179: Engine leadership acquisition failed: context deadline exceeded
Jan 17 21:59:41 ip-10-26-31-100.ec2.internal fleetd[999]: INFO server.go:168: Starting server components
Jan 17 21:59:42 ip-10-26-31-100.ec2.internal fleetd[999]: INFO engine.go:185: Engine leadership acquired
Jan 17 21:59:43 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR engine.go:254: Failed unscheduling Unit(kafka-broker-1.service) from Machine(6ca65ead2f164b2682c0d941c8a75d9b): context deadline exceeded
Jan 17 21:59:43 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR reconciler.go:62: Failed resolving task: task={Type: UnscheduleUnit, JobName: kafka-broker-1.service, MachineID: 6ca65ead2f164b2682c0d941c
Jan 17 21:59:44 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR engine.go:254: Failed unscheduling Unit(newNewApps.service) from Machine(6ca65ead2f164b2682c0d941c8a75d9b): context deadline exceeded
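To quantify the outage, a quick scan of the journal lines above shows fleetd lost etcd connectivity at 21:56:22 and only re-acquired engine leadership at 21:59:42, roughly 3.5 minutes later. A small sketch of how we extracted that window (sample lines inlined here for illustration; in practice pipe in the attached fleet-*.txt exports instead):

```shell
# Write a couple of the journal lines above to a scratch file; with the
# real exports, skip this step and point the greps at fleet-*.txt.
cat <<'EOF' > /tmp/fleet-sample.log
Jan 17 21:56:22 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
Jan 17 21:56:22 ip-10-26-31-100.ec2.internal fleetd[999]: INFO server.go:157: Establishing etcd connectivity
Jan 17 21:59:42 ip-10-26-31-100.ec2.internal fleetd[999]: INFO engine.go:185: Engine leadership acquired
EOF

# First heartbeat failure and last leadership re-acquisition bound the
# outage window; field 3 of the journal line is the timestamp.
first=$(grep -m1 'Monitor timed out' /tmp/fleet-sample.log | awk '{print $3}')
last=$(grep 'leadership acquired' /tmp/fleet-sample.log | tail -n1 | awk '{print $3}')
echo "outage window: $first -> $last"
```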
- We checked the CPU and disk I/O for all 3 instances; AWS CloudWatch shows NO indication of any CPU spike
- The etcd config is as below:
core@ip-10-26-33-251 ~ $ sudo systemctl cat etcd
/usr/lib64/systemd/system/etcd.service
[Unit]
Description=etcd
Conflicts=etcd2.service
[Service]
User=etcd
PermissionsStartOnly=true
Environment=ETCD_DATA_DIR=/var/lib/etcd
Environment=ETCD_NAME=%m
ExecStart=/usr/bin/etcd
Restart=always
RestartSec=10s
LimitNOFILE=40000
/run/systemd/system/etcd.service.d/10-oem.conf
[Service]
Environment=ETCD_PEER_ELECTION_TIMEOUT=1200
/run/systemd/system/etcd.service.d/20-cloudinit.conf
[Service]
Environment="ETCD_ADDR=10.26.33.251:4001"
Environment="ETCD_CERT_FILE=/home/etcd/certs/cert.crt"
Environment="ETCD_DISCOVERY=https://discovery.etcd.io/"
Environment="ETCD_KEY_FILE=/home/etcd/certs/key.pem"
Environment="ETCD_PEER_ADDR=10.26.33.251:7001"
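Since the logs point at missed heartbeats rather than CPU, one mitigation we are considering is a further drop-in that lengthens the peer timeouts (the OEM drop-in above already raises the election timeout to 1200 ms from etcd 0.4's default of 200 ms). The heartbeat variable name is our reading of how etcd 0.4 maps its -peer-heartbeat-timeout flag to the environment, so treat the names and values below as untested guesses:

/run/systemd/system/etcd.service.d/30-timeouts.conf (hypothetical drop-in)
[Service]
# etcd 0.4 flag -peer-election-timeout, in milliseconds
Environment=ETCD_PEER_ELECTION_TIMEOUT=2000
# etcd 0.4 flag -peer-heartbeat-timeout, in milliseconds (assumed mapping)
Environment=ETCD_PEER_HEARTBEAT_TIMEOUT=300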
- Attached are the fleet & etcd logs from all nodes:
etcd-10-26-31-100.txt
etcd-10-26-32-94.txt
etcd-10-26-33-251.txt
fleet-10-26-31-100.txt
fleet-10-26-32-94.txt
fleet-10-26-33-251.txt
- AWS status dashboard does not show any errors or issues on their end
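On the fleet side, fleet.conf exposes etcd_request_timeout and agent_ttl; raising them should make fleetd more tolerant of slow etcd responses before it tears units down and unschedules them, as happened above. The values below are guesses we have not yet tested:

/etc/fleet/fleet.conf (hypothetical values)
# seconds fleetd waits on an etcd request before treating it as failed (default 1.0)
etcd_request_timeout=3.0
# TTL on the agent's presence key; a longer TTL survives longer etcd blips (default "30s")
agent_ttl="60s"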
We'd appreciate it if someone could take a look at the above and give us pointers on what to investigate and how to mitigate this.
I originally opened this against etcd - etcd-io/etcd#7177 - and was redirected here.
Thx
Maulik