# all-things-deployment
w
hi all, we're continuing to see `OSError: [Errno 24] Too many open files` errors in the `actions` container once it has been online for a while ingesting from a glue data source. there seems to be a connection/file reference leak somewhere, any thoughts?
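One way to narrow down whether a leak like this is made of sockets (connections) or regular files is to group a process's open descriptors by what they point to. The following is a minimal sketch, not part of DataHub; it assumes a Linux container where `/proc/<pid>/fd` is readable, and the function name `classify_fds` is hypothetical.

```python
import os
from collections import Counter


def classify_fds(pid="self"):
    """Group a process's open descriptors by kind (Linux /proc only)."""
    kinds = Counter()
    fd_dir = f"/proc/{pid}/fd"
    for name in os.listdir(fd_dir):
        try:
            # Each entry is a symlink such as "socket:[1234]" or "/tmp/x.log".
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        if target.startswith("socket:"):
            kinds["socket"] += 1
        elif target.startswith("pipe:"):
            kinds["pipe"] += 1
        elif target.startswith("anon_inode:"):
            kinds["anon_inode"] += 1
        else:
            kinds["file"] += 1
    return kinds
```

Calling `classify_fds()` after each ingestion run and comparing the counts would show which kind keeps growing; passing the pid of the actions process (as a string) inspects that process instead of the caller.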
d
w
hi @delightful-ram-75848! we're not doing any profiling, so i don't think that's going to change much. we're using the Acryl-built Docker image, so increasing the max number of open files would be something y'all would need to do in the docker image (unless we start doing a custom build on top of it). even then, increasing the max open files will only delay the issue i believe, as it'll just allow more leaked file handles to build up before causing a problem in the process.
ideally there would be two fixes i believe, one easier than the other:
1. when actions encounters an error like this, the process should exit rather than staying online but unable to actually do anything else.
1a. actions could expose an http health check endpoint that 500s when something isn't working properly, and we could then use a health check to kill the process.
2. the leaky open file handle in the glue ingestion (i believe) would be found and removed, though this is probably hard to find.
the primary pain is that right now we don't have a way to find/catch these errors and restart the actions container, so the only way to catch this is to notice our last seen date is stale and then manually stop the container so it's replaced.
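The health check idea in 1a could look roughly like the sketch below. It is not an existing datahub-actions feature: the names `open_fd_count`, `fd_health`, `HealthHandler`, and `serve`, the port 8080, and the 80% threshold are all assumptions made for illustration. It assumes Linux (`/proc`) and that the checked process shares the same soft `RLIMIT_NOFILE` as the process running the check.

```python
import os
import resource
from http.server import BaseHTTPRequestHandler, HTTPServer


def open_fd_count(pid="self"):
    """Approximate number of open descriptors for a process (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_health(pid="self", limit=None, threshold=0.8):
    """Return (http_status, body): 500 once usage reaches threshold * limit."""
    if limit is None:
        # Assumption: the target process has the same soft limit as this one.
        soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        limit = None if soft == resource.RLIM_INFINITY else soft
    used = open_fd_count(pid)
    ratio = used / limit if limit else 0.0
    status = 200 if ratio < threshold else 500
    return status, f"open_fds={used} limit={limit} ratio={ratio:.2f}\n"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # By default this reports on the process serving the request.
        status, body = fd_health()
        payload = body.encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Blocking helper; run it in a thread inside the process being watched."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Because descriptor counts are per process, the handler only says something useful about the actions process if it runs inside that process, or if `fd_health` is called with that process's pid. A container health check could then poll the endpoint and replace the container on a 500.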
h
Hey @witty-motorcycle-52108, while the source of the leaky connections is unknown, could you please help with the answers below so we can get closer to it?
1. What is your DataHub version and `cli_version` as logged in the ingestion report?
2. Are there any other ingestions running in your actions container, or only the glue ingestion? How frequently do these ingestions run? Are these ingestions using the default (rest) sink? Are you using any transformers in your recipe? How often do these ingestions fail?
3. How frequently are you seeing these 'Too many open files' errors? In other words, once you restart the actions container, how long does it take for this error to start appearing?
4. Can you share the entire stack trace of this error from your ingestion log?
w
version is `0.10.1`, this has been happening for a while now though. logs say `"cli_version": "0.10.1"`. only the glue ingestion runs, and it runs hourly right now. it used to run daily, and we saw it happening then as well. default rest sink is being used, it's a fairly "vanilla" ingestion configuration. no transforms. it's not super frequent, the last one happened on may 15th and had been fine since being deployed on april 5th-ish.
here are some sanitized logs. i included the last successful run before errors began, and then a few runs that errored.
h
Thanks. I'll check and get back as soon as I can.
w
thank you!
h
Hey Tim! A quick inspection of the glue connector code did not reveal any suspects for the leak. I suspect that the glue connector is not at fault here and that the culprit is somewhere else (probably the managed ingestion/actions framework). While we continue to debug this, I would suggest restarting the datahub-actions container once every day (if using an hourly ingestion schedule) or once every week (if using a daily ingestion schedule) to avoid this issue.
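The suggested restart cadence can be expressed as a small check that a scheduler calls periodically. This is only a sketch under assumptions: the container name `datahub-actions`, the helper names `restart_due` and `restart_container`, and the use of the `docker restart` CLI are illustrative, and a deployment on another orchestrator would replace the restart call with its own mechanism.

```python
import subprocess
from datetime import datetime, timedelta

# Cadence from the suggestion above: hourly ingestion -> restart daily,
# daily ingestion -> restart weekly.
RESTART_AFTER = {
    "hourly": timedelta(days=1),
    "daily": timedelta(weeks=1),
}


def restart_due(started_at, ingestion_schedule, now=None):
    """True once the container has been up for the full restart interval."""
    now = now or datetime.now()
    return now - started_at >= RESTART_AFTER[ingestion_schedule]


def restart_container(name="datahub-actions"):
    """Restart via the docker CLI; the container name is an assumption."""
    subprocess.run(["docker", "restart", name], check=True)
```

A cron job or similar trigger could call `restart_due(...)` with the container's start time and invoke `restart_container()` when it returns `True`.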
w
interesting, thanks for looking into it! i'll look into setting up some sort of trigger to automatically reboot the actions container regularly.