# all-things-deployment
w
hi all - we had our GMS service go down hard a few days ago for reasons that aren't clear. when it started back up it logged that it was listening to the upgrade topic(s), but then just hung. I re-ran the kafka setup and upgrade tasks, and it eventually came back up. what can cause GMS to spontaneously die and require a re-run of the kafka setup and upgrade containers to bring it back online, when nothing was deployed or changed in the environment other than containers occasionally restarting when they get moved around?
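One quick sanity check here is whether the upgrade-history topic that GMS waits on at startup actually contains a message. A minimal sketch using confluent-kafka, assuming the default topic name `DataHubUpgradeHistory_v1`, brokers at `localhost:9092`, and no MSK auth (all of which would need adjusting for a real cluster):

```python
# Sketch only: checks whether the upgrade-history topic GMS listens to at startup
# has any messages. Topic name and broker address are assumptions; MSK auth
# settings would need to be added to the consumer config.
from confluent_kafka import Consumer, TopicPartition

BROKERS = "localhost:9092"            # assumption: replace with your MSK bootstrap servers
TOPIC = "DataHubUpgradeHistory_v1"    # assumption: default upgrade-history topic name

consumer = Consumer({
    "bootstrap.servers": BROKERS,
    "group.id": "upgrade-topic-check",  # throwaway group; we only read watermarks
})

# Look up the topic's partitions, then compare low/high watermarks per partition.
metadata = consumer.list_topics(TOPIC, timeout=10)
if TOPIC not in metadata.topics:
    print(f"topic {TOPIC} not found")
else:
    total = 0
    for p in metadata.topics[TOPIC].partitions:
        low, high = consumer.get_watermark_offsets(TopicPartition(TOPIC, p), timeout=10)
        total += high - low
        print(f"partition {p}: low={low} high={high}")
    print("messages available on upgrade-history topic:", total)

consumer.close()
```

If every partition shows low == high, the topic is empty, which would be consistent with GMS sitting at "listening to upgrade topic" until the upgrade job produces something for it to read.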
d
Hi - not sure what’s happening here. Do you have a scheduled ingestion around the time GMS died, or did you see anything weird in the CPU/memory metrics, etc.? (cc @orange-night-91387 - would love your help here)
w
we ingest hourly, so maybe, but I'm not sure. I'll pull some more info today to drop here, including (hopefully) the logs from the time it happened. the inability to bring it back online without re-running the setup/upgrade tasks is super weird to me. we're using MSK and don't have any non-default retention settings configured, and I see that the kafka setup task configures retention to -1 anyway, so it would be very weird to me if it was a message expiring
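To double-check that, a minimal sketch that reads `retention.ms` off the upgrade topic with the Kafka AdminClient, again assuming the default topic name and local brokers:

```python
# Sketch only: verifies retention.ms on the upgrade-history topic is really -1
# (infinite). Topic name and broker address are assumptions; MSK auth settings
# would need to be added to the client config.
from confluent_kafka.admin import AdminClient, ConfigResource

BROKERS = "localhost:9092"            # assumption
TOPIC = "DataHubUpgradeHistory_v1"    # assumption

admin = AdminClient({"bootstrap.servers": BROKERS})
resource = ConfigResource(ConfigResource.Type.TOPIC, TOPIC)

# describe_configs returns a dict of resource -> future; the future resolves to
# a dict of config name -> ConfigEntry.
for res, future in admin.describe_configs([resource]).items():
    configs = future.result()
    retention = configs["retention.ms"].value
    print(f"{TOPIC} retention.ms = {retention}")  # expect "-1" if kafka-setup applied infinite retention
```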
a
How are you deploying? If you're using k8s, your deployments should be self-healing. Not sure why it would spontaneously shut down unless the underlying resources did. Do you store persistent logs that could give clues into what happened as it was dying? It is definitely weird that you would have to run the upgrade again. Have you checked the upgrade topic to verify it has the expected configuration (esp. the infinite retention)?
w
we deploy via ECS, not Kubernetes. the deploys tried to self-heal, but the new tasks just hung after listening to the upgrade topic and never completed startup
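Since the original task died without an obvious cause, it may be worth pulling the `stoppedReason` and container exit codes that ECS keeps (briefly) for stopped tasks. A minimal sketch with boto3, using hypothetical cluster/service names `datahub` / `datahub-gms`:

```python
# Sketch only: lists recently stopped tasks for the GMS service and prints why
# ECS says they stopped. Cluster and service names are assumptions; swap in the
# real ones. Note ECS only retains stopped tasks for a short window.
import boto3

CLUSTER = "datahub"        # assumption
SERVICE = "datahub-gms"    # assumption

ecs = boto3.client("ecs")

stopped = ecs.list_tasks(cluster=CLUSTER, serviceName=SERVICE, desiredStatus="STOPPED")
if not stopped["taskArns"]:
    print("no stopped tasks returned (ECS discards stopped-task records fairly quickly)")
else:
    details = ecs.describe_tasks(cluster=CLUSTER, tasks=stopped["taskArns"])
    for task in details["tasks"]:
        print(task["taskArn"])
        print("  stoppedReason:", task.get("stoppedReason"))
        for container in task.get("containers", []):
            # per-container exit codes often pin down OOM kills, failed health checks, etc.
            print("  container:", container["name"],
                  "exitCode:", container.get("exitCode"),
                  "reason:", container.get("reason"))
```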