# general
r
I wonder if anyone can help shed light on a problem that we're seeing. We have a scheduled-event-triggered Lambda that goes off to a 3rd party API and retrieves some data. This request takes about 30s. We then process this data and write it to DynamoDB using individual calls to PutItem, usually around 150 writes. The Lambda isn't in a VPC.

What we're seeing is the Lambda timing out after our limit of 270s. Looking at the traces, we see that the DynamoDB writes are happening in parallel but appear to be taking 240+ seconds, which causes the timeout. However, if I look at the CloudWatch metrics, it tells me the writes are only taking 5ms. What could be going on here?

Maybe related, maybe not, is that we occasionally see an error: Inaccessible host: `dynamodb.us-west-1.amazonaws.com' at port `443'. This doesn't happen very often but is certainly weird; I've raised that with AWS support. Flummoxed!
s
Yeah, that's confusing
I've seen similar instances where poor performance with Lambda + DDB was traced back to the amount of memory the Lambda has available
since a Lambda's compute resources increase linearly with memory allocation
but 240s...something definitely isn't right here
r
Yeah, it's got 2GB of memory and I've even bumped it to 8GB to see if there was an effect, but there wasn't. In terms of usage, it's getting nowhere near that
s
Are you being throttled on the DDB side?
r
No throttling events in the metrics at all
Curious thing, it's almost every other invocation that times out
s
hmm, and it looks like you are processing a similar number of items in the table each invocation
(if I understand this UI correctly)
and the 3rd party API isn't causing the Lambda timeout by taking longer to respond on some requests?
r
That's right, it's pretty much the same data each time (the API is poor) but it's consistent in its timing, always 30s or so
f
Hey @Ross Coundon I might’ve mentioned this before. We had similar issues, but after putting in this fix, we never got it again https://seed.run/blog/how-to-fix-dynamodb-timeouts-in-serverless-application.html
r
Ah, yes, I'd forgotten about that. Thanks Frank, I'll try that
In fact, looking at it, we have implemented that, but maybe the values are too high
I'm just experimenting with serialising the calls in case this is a node environment issue
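A minimal sketch of that serialising experiment, assuming an aws-sdk v2 DocumentClient; dynamoDb, tableName and items are placeholder names, not the real code:

```js
// Write the ~150 items one at a time instead of firing all PutItems via Promise.all.
// dynamoDb, tableName and items are placeholders for the real code.
async function writeItemsSerially(dynamoDb, tableName, items) {
  for (const item of items) {
    await dynamoDb.put({ TableName: tableName, Item: item }).promise();
  }
}
```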
f
You can even try disabling retries. Then you should see more DynamoDB calls fail, and the error should reflect the root cause (i.e. throttled, capacity reached, etc., but not timeouts).
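A minimal sketch of what that looks like with the aws-sdk v2 DocumentClient, combining the timeout settings the seed.run post describes with retries disabled; the timeout values here are illustrative, not recommendations:

```js
const AWS = require("aws-sdk");

// Tight HTTP timeouts (per the seed.run post) plus retries disabled, so the
// underlying error (throttling, socket hang up, ...) surfaces instead of a Lambda timeout.
const dynamoDb = new AWS.DynamoDB.DocumentClient({
  httpOptions: {
    connectTimeout: 1000, // ms allowed to establish the TCP connection
    timeout: 5000,        // ms allowed to wait for a response
  },
  maxRetries: 0,          // disable SDK retries while debugging
});
```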
r
Cool, good shout
And as if by magic, an actual error msg appears.
socket hang up
Now to work out what that means in the context of DDB...
r
Thanks, I landed on the same thing. Unfortunately it isn't helping; we're only setting that config in one place already.

What's baffling me is that the socket hang up is server-side. Why is a service that's designed to support massive scale hanging up on me when I throw a hundred or so requests at it? I seem to have the situation where a cold start works fine but then the warm start that follows 5 mins later fails. Then the next warm start works, the next fails. It's like something isn't being cleaned up in the Lambda environment.

I've seen an issue with the v3 SDK where this was happening, but the workaround before it was fixed was to use v2, which is what we're using here. However, that's a client-side retry kind of deal; it still doesn't explain why the DDB server is hanging up.
There's a very definite pattern of every other invocation timing out. It seems the tracing was probably confusing things. I've changed this to make all the calls to both DDB and the 3rd party API serially. Now I see almost every one return really quickly.

However, what seems to be happening is that every other invocation times out, and that timeout is related to a call to the external API. The curious thing is that I'm using axios for this and setting the timeout to 5s. However, the next time the Lambda runs, all works fine. Then the next time it times out.

None of the tracing solutions I've tried (Epsagon, Thundra or Lumigo) can shed any light on this, as they truncate the number of external requests shown in the trace, which is very frustrating. I only know this is what's happening because, very occasionally, the timeout on the 3rd party API call is one of the first 20 or so to be made. It also seems like cold starts always work, so it feels to me like there's some kind of network connection, socket caching, or file handle weirdness going on.
s
Hi Ross, the problem has caught my interest. I will try to improve our (Thundra) truncation logic to give more info about ongoing requests in case of a timeout
On the other hand, I couldn't reproduce your case with just DynamoDB calls. It feels to me that it's somehow related to the 3rd party API call
r
Hi Serkan - if you can help us work out what this is, it'd be amazing
I agree, I think it's to do with the 3rd party API calls
s
Currently, are you able to see the ongoing requests at the time of the timeout by using any serverless monitoring tool?
r
No, we're blind unless the one that times out is one of the first which is rare
s
Ross, what is your axios version?
r
0.24.0
s
I think that, because of a connection timeout, the event loop is blocked
r
Interesting, we've set a 2000ms timeout on the Axios client
But it doesn't seem to have an effect on this
s
Does playing with the keep alive flag (true or false) help?
r
which one?
On axios?
httpsAgent: new https.Agent({ keepAlive: true }),
?
s
Yes
r
trying now
This looks promising!
Thank you for the suggestion
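For reference, a sketch of the keep-alive change on the axios side; the client name and base URL are placeholders, and 2000ms is the timeout mentioned earlier:

```js
const https = require("https");
const axios = require("axios");

// Reuse TCP connections across requests (and warm invocations) instead of
// opening a new socket per call; keepAlive is off by default in Node's agent.
const apiClient = axios.create({
  baseURL: "https://third-party-api.example.com", // placeholder
  timeout: 2000,
  httpsAgent: new https.Agent({ keepAlive: true }),
});
```

(On the aws-sdk v2 side, connection reuse can similarly be switched on, e.g. with the AWS_NODEJS_CONNECTION_REUSE_ENABLED=1 environment variable.)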
s
Good, so did you set keep-alive to true or false? Which one worked in your case?
By the way, I am also working on improving our trimming algo to give you more info even in case of high number of external requests and timeout
r
Setting it to true; it's off by default (which was a surprise to me). We also need to come up with a robust way to rate-limit our calls to DynamoDB, since we start seeing
Inaccessible host: `dynamodb.us-west-1.amazonaws.com' at port `443'
When we run lots of requests in parallel
We typically use the bottleneck package for this, but the non-native support for promises in v2 of the aws-sdk means chaining a promise() method call, which doesn't seem to play nicely. Maybe it's time to contemplate an upgrade to aws-sdk v3
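A minimal sketch of the bottleneck approach being described here, assuming an aws-sdk v2 DocumentClient; the concurrency limit and names are placeholders, not a recommendation:

```js
const Bottleneck = require("bottleneck");

// Cap the number of in-flight PutItem calls rather than firing all ~150 at once.
const limiter = new Bottleneck({ maxConcurrent: 10 }); // placeholder limit

function writeItems(dynamoDb, tableName, items) {
  return Promise.all(
    items.map((item) =>
      // schedule() expects a promise-returning function, hence the v2 .promise() chaining
      limiter.schedule(() =>
        dynamoDb.put({ TableName: tableName, Item: item }).promise()
      )
    )
  );
}
```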
s
Yes, aws-sdk v3 might be better
We also support aws-sdk v3 BTW
How many concurrent 3rd party API requests are you sending?
r
Currently 2 but ideally more
Although not a lot more