# general
r
I wonder if anyone can help shed light on a problem that we're seeing. We have a scheduled-event-triggered Lambda that goes off to a 3rd party API and retrieves some data. This request takes about 30s. We then process this data and write it to DynamoDB using individual calls to PutItem, usually around 150 writes. The Lambda isn't in a VPC.

What we're seeing is the Lambda timing out after our limit of 270s. Looking at the traces, we see that the DynamoDB writes are happening in parallel but appear to be taking 240+ seconds, which causes the timeout. However, if I look at the CloudWatch metrics, it tells me the writes are only taking 5ms. What could be going on here?

Maybe related, maybe not, is that we occasionally see an error: Inaccessible host: `dynamodb.us-west-1.amazonaws.com' at port `443'. This doesn't happen very often but is certainly weird; I've raised that with AWS support. Flummoxed!
s
Yeah, that's confusing
I've seen similar instances where poor performance with Lambda + DDB was traced back to the amount of memory the Lambda has available
since a Lambda's compute resources increase linearly with memory allocation
but 240s...something definitely isn't right here
r
Yeah, it's got 2GB of memory and I've even bumped it to 8GB to see if there was an effect, but there wasn't. In terms of usage, it's getting nowhere near that
s
Are you being throttled on the DDB side?
r
No throttling events in the metrics at all
Curious thing, it's almost every other invocation that times out
s
hmm, and it looks like you are processing a similar number of items in the table each invocation
(if I understand this UI correctly)
and the 3rd party API isn't causing the Lambda timeout by taking longer to respond on some requests?
r
That's right, it's pretty much the same data each time (the API is poor) but it's consistent in its timing, always 30s or so
f
Hey @Ross Coundon I might’ve mentioned this before. We had similar issues, but after putting in this fix, we never got it again https://seed.run/blog/how-to-fix-dynamodb-timeouts-in-serverless-application.html
r
Ah, yes, I'd forgotten about that. Thanks Frank, I'll try that
In fact, looking at it, we have implemented that, but maybe the values are too high
I'm just experimenting with serialising the calls in case this is a node environment issue
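A minimal sketch of that serialising experiment, assuming an aws-sdk v2 DocumentClient; dynamoDb, tableName and items are placeholder names, not the real code:

```js
// Write the ~150 items one at a time instead of firing all PutItems via Promise.all.
// dynamoDb, tableName and items are placeholders for the real code.
async function writeItemsSerially(dynamoDb, tableName, items) {
  for (const item of items) {
    await dynamoDb.put({ TableName: tableName, Item: item }).promise();
  }
}
```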
f
You can even try disabling retries. Then you should see more DynamoDB calls fail, and the error should reflect the root cause (i.e. throttled, capacity reached, etc., but not timeouts).
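A minimal sketch of what that looks like with the aws-sdk v2 DocumentClient, combining the timeout settings the seed.run post describes with retries disabled; the timeout values here are illustrative, not recommendations:

```js
const AWS = require("aws-sdk");

// Tight HTTP timeouts (per the seed.run post) plus retries disabled, so the
// underlying error (throttling, socket hang up, ...) surfaces instead of a Lambda timeout.
const dynamoDb = new AWS.DynamoDB.DocumentClient({
  httpOptions: {
    connectTimeout: 1000, // ms allowed to establish the TCP connection
    timeout: 5000,        // ms allowed to wait for a response
  },
  maxRetries: 0,          // disable SDK retries while debugging
});
```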
r
Cool, good shout
And as if by magic, an actual error msg appears.
socket hang up
Now to work out what that means in the context of DDB...
r
Thanks, I landed on the same thing. Unfortunately it isn't helping; we're only setting that config in one place already.

What's baffling me is that the socket hang up is server-side. Why is a service that's designed to support massive scale hanging up on me when I throw a hundred or so requests at it? I seem to have the situation where a cold start works fine but then the warm start that follows 5 mins later fails. Then the next warm start works, the next fails. It's like something isn't being cleaned up in the Lambda environment.

I've seen an issue with the v3 SDK where this was happening, but the workaround before it was fixed was to use v2, which is what we're using here. However, that's a client-side retry kind of deal; it still doesn't explain why the DDB server is hanging up.
There's a very definite pattern of every other invocation timing out. It seems the tracing was probably confusing things. I've changed this to make all the calls to both DDB and the 3rd party API serially. Now I see almost every one return really quickly.

However, what seems to be happening is that every other invocation times out, and that timeout is related to a call to the external API. The curious thing is that I'm using axios for this and setting the timeout to 5s. However, the next time the Lambda runs, all works fine. Then the next time it times out.

None of the tracing solutions I've tried (Epsagon, Thundra or Lumigo) can shed any light on this, as they truncate the number of external requests shown in the trace, which is very frustrating. I only know this is what's happening because, very occasionally, the timeout on the 3rd party API call is one of the first 20 or so to be made. It also seems like cold starts always work, so it feels to me like there's some kind of network connection, socket caching, or file handle weirdness going on.
s
Hi Ross, the problem has caught my interest. I will try to improve our (Thundra) truncation logic to give more info about ongoing requests in case of a timeout
On the other hand, I couldn't reproduce your case with just DynamoDB calls. It feels to me that it's somehow related to the 3rd party API call
r
Hi Serkan - if you can help us work out what this is, it'd be amazing
I agree, I think it's to do with the 3rd party API calls
s
Currently, are you able to see the ongoing requests at the time of the timeout by using any serverless monitoring tool?
r
No, we're blind unless the one that times out is one of the first which is rare
s
Ross, what is your axios version?
r
0.24.0
s
I think that, because of a connection timeout, the event loop is blocked
r
Interesting, we've set a 2000ms timeout on the Axios client
But it doesn't seem to have an effect on this
s
Does playing with the keep alive flag (true or false) help?
r
which one?
On axios?
httpsAgent: new https.Agent({ keepAlive: true }),
?
s
Yes
r
trying now
This looks promising!
Thank you for the suggestion
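For reference, a sketch of the keep-alive change on the axios side; the client name and base URL are placeholders, and 2000ms is the timeout mentioned earlier:

```js
const https = require("https");
const axios = require("axios");

// Reuse TCP connections across requests (and warm invocations) instead of
// opening a new socket per call; keepAlive is off by default in Node's agent.
const apiClient = axios.create({
  baseURL: "https://third-party-api.example.com", // placeholder
  timeout: 2000,
  httpsAgent: new https.Agent({ keepAlive: true }),
});
```

(On the aws-sdk v2 side, connection reuse can similarly be switched on, e.g. with the AWS_NODEJS_CONNECTION_REUSE_ENABLED=1 environment variable.)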
s
Good, so did you set keep-alive to true or false? Which one worked in your case?
By the way, I am also working on improving our trimming algo to give you more info even in case of high number of external requests and timeout
r
Setting it to true; it's off by default (which was a surprise to me). We also need to come up with a robust way to rate-limit our calls to DynamoDB, since we start seeing
Inaccessible host: `dynamodb.us-west-1.amazonaws.com' at port `443'
When we run lots of requests in parallel
We typically use the bottleneck package for this, but the non-native support for promises in v2 of the aws-sdk means chaining a promise() method call, which doesn't seem to play nicely. Maybe it's time to contemplate an upgrade to aws-sdk v3
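A minimal sketch of the bottleneck approach being described here, assuming an aws-sdk v2 DocumentClient; the concurrency limit and names are placeholders, not a recommendation:

```js
const Bottleneck = require("bottleneck");

// Cap the number of in-flight PutItem calls rather than firing all ~150 at once.
const limiter = new Bottleneck({ maxConcurrent: 10 }); // placeholder limit

function writeItems(dynamoDb, tableName, items) {
  return Promise.all(
    items.map((item) =>
      // schedule() expects a promise-returning function, hence the v2 .promise() chaining
      limiter.schedule(() =>
        dynamoDb.put({ TableName: tableName, Item: item }).promise()
      )
    )
  );
}
```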
s
Yes, aws-sdk v3 might be better
We also support aws-sdk v3 BTW
How many concurrent 3rd party API requests are you sending?
r
Currently 2 but ideally more
Although not a lot more