# random
r
Some of you may remember from last week I was hunting down a peculiar issue with timeouts and DynamoDB. I raised a case with AWS on the subject and this is what they said:
When you use Lambda outside a VPC, it means a request to DynamoDB will go over the internet, which might traverse multiple ISPs. This leaves room for requests to get lost along the path; this is especially true if the issue is intermittent and the error shows timeout exceptions.

It is actually better from a DynamoDB perspective to have your Lambda inside a VPC, because the network path becomes predictable and thus more stable. When you have your Lambda inside the VPC, you can make use of DynamoDB's VPC endpoint to access DynamoDB (Lambda -> Lambda ENI -> DDB VPC endpoint -> DDB service). Furthermore, this allows better monitoring, because you will be able to use VPC flow logs to see network traffic and determine if messages are being dropped. Since DNS queries use UDP, the packets can get lost along the path (Lambda probably has a retry mechanism for DNS queries).
One of the main reasons we moved to DynamoDB was to remove the need for the complexities of VPCs!
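(For anyone who hasn't set one up, the gateway VPC endpoint that support is describing looks roughly like this in CDK with JavaScript. This is an illustrative sketch, not code from the thread; the stack name, construct IDs and VPC settings are made up. A gateway endpoint just adds route-table entries so that traffic to DynamoDB from inside the VPC stays on AWS's network.)

```js
// Sketch only: a VPC plus a DynamoDB gateway endpoint, as suggested by AWS support.
const cdk = require('aws-cdk-lib');
const ec2 = require('aws-cdk-lib/aws-ec2');

class VpcEndpointStack extends cdk.Stack {
  constructor(scope, id, props) {
    super(scope, id, props);

    // Illustrative VPC; real subnet/NAT choices depend on what else the Lambda calls.
    const vpc = new ec2.Vpc(this, 'AppVpc', { maxAzs: 2 });

    // Gateway endpoint: DynamoDB requests from this VPC are routed via the
    // endpoint rather than over the public internet.
    vpc.addGatewayEndpoint('DynamoDbEndpoint', {
      service: ec2.GatewayVpcEndpointAwsService.DYNAMODB,
    });
  }
}

const app = new cdk.App();
new VpcEndpointStack(app, 'VpcEndpointStack');
```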
t
Where did you get this info from?
If you don't use a VPC, Lambdas still run in an Amazon-controlled VPC
r
AWS Developer Support
t
I'm not sure if this info is accurate 😬
r
😬 indeed
c
Haha, this also doesn't sound legit to me... your connections were dropping on almost every second request, far too frequent and predictable to be "random internet path issues"
s
Yeah, seems like a best guess so someone could close a ticket 🤷
ö
I mean, it sounds logical though, no?
s
I mean, I guess internet traffic has something to do with ISPs? 🙂
ö
yes
m
If I have a Lambda function in us-east-1 calling a table in us-east-1, why does my traffic leave us-east-1?
ö
I’m not sure if the message implies so
t
It does seem to imply that. Traffic between lambda and dynamo should stay inside AWS's network
m
Maybe something else happens with global tables? But this is the first time I've heard this perspective.
t
let me ping some people on this
r
Cool, thanks @thdxr, would be interesting to find out what others think is going on. @Matt Morgan that would make a bit more sense, but in our case the tables are entirely region-specific
m
I'd be a little disappointed if global tables worked like that, to be honest, but I don't know one way or the other.
s
I also feel like this would be called out somewhere in the docs?
m
To me the promise of the tech stack is that AWS has managed the networking for us in a good way. If my DDB traffic is being strewn across multiple ISPs and accepting whatever latency they might be adding - well, that just can't be right.
t
IIRC, that's accurate (that a lambda not in a vpc is hitting ddb over the internet), but I also think I understood that AWS still optimizes the network traffic (internet or not) between two services like that (regardless of region); amazon global backbone yada yada.
r
Still fighting with this; rather tired of it now. Now, after a bunch of writes, we're seeing intermittent socket hang ups. Since DynamoDB was built for scale, this just doesn't make sense to me. In this case we're talking about doing 250 writes to a table; it's not exactly Netflix scale. Sticking it in a VPC as per the recommendation from support is mad, because we'll need one, possibly two (for redundancy), always-on NAT Gateways and a much more complicated architecture. 😞
And if it's a case of tuning the http agent, which it could be, there's almost no documentation on what the various timeout values should be
ö
I’m really curious about the answer to this problem
m
This isn't something that's solved with http-keepalive, is it?
r
Sadly not, we have that enabled. I actually just saw a socket timeout for a single DynamoDB query when the system was doing nothing else. I can't find any recommended http agent settings for retry, connection timeout and operation timeout. I feel like I must be missing something, since so many people are using this and not seeing (or noticing) this problem. Although maybe there are two problems: a socket timeout problem and a long operation problem
m
Are your items large? Do you have many GSIs/LSIs? Could you have a hot partition? Is your throughput very high?
r
There's only one additional index, a GSI. Some of the items that are selected are up to 1.6 KB. PKs are unique for each, so no hot partitions. Doesn't seem crazy, does it?
m
What kind of throughput by operation? Dynamo should handle it but just curious.
r
What do you mean by 'throughput by operation', Matt?
m
If you are very write-heavy and the GSI is potentially hot, that could be it. There's no way your problem is "traffic over internet".
I mean how many puts/updates/gets per second.
Do you fail more on reads or writes?
r
Well, the pattern is that there are a lot of GetItems that happen as a result of retrieving a payload from a 3rd-party API. Originally we were trying to do about 500 in parallel and got a tonne of failures, so to work around this we've rate-limited our calls. Then, for any records that we don't find, we'll perform a write; again, we've rate-limited this to stop the failures. So in terms of operations per second we're probably looking at 100, maybe. However, I've seen a socket timeout today with a single GetItem and nothing else going on. Then again, we've turned off retries to see what effect that would have, so maybe we only saw that because it couldn't retry. Our setup of AWS/DDB looks like this:
const https = require('https');
const AWS = require('aws-sdk');

// Reuse connections and allow plenty of sockets for parallel requests.
const agent = new https.Agent({
  maxSockets: 1000,
  keepAlive: true,
});

AWS.config.update({
  httpOptions: {
    timeout: 10_000,      // socket inactivity timeout, in ms
    connectTimeout: 5000, // time allowed to establish the connection, in ms
    agent,
  },
  maxRetries: 0, // retries disabled while we diagnose
});
but we've tried all kinds of different values and permutations
m
That's a single GetItem using the partition key? Doesn't make sense at all to me. I hope you can get your case escalated.
r
That's right, and it's totally bizarre. AWS closed the case because I hadn't responded (I was trying to make sense of it all); I reopened it on Friday.
m
I have never had to mess around with the http client other than setting keepalive and we've hit lambda concurrency limits without DDB throttling/failure.
r
which method do you use to set keepalive?
m
Env var - originally set them explicitly, later by means of various CDK abstractions.
r
AWS_NODEJS_CONNECTION_REUSE_ENABLED?
m
Yes, that's the one but it should be just the same as what you've done with the client.
Also failing to reuse your connection couldn't cause a single request to time out.
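(For anyone following along, the two approaches being compared are roughly these. An illustrative sketch rather than the project's actual code; the client options shown are the standard SDK v2 ones.)

```js
// Option 1: enable connection reuse via the Lambda environment variable
// (no code change): AWS_NODEJS_CONNECTION_REUSE_ENABLED=1
//
// Option 2: pass a keep-alive agent to the SDK v2 client explicitly:
const https = require('https');
const AWS = require('aws-sdk');

const documentClient = new AWS.DynamoDB.DocumentClient({
  httpOptions: { agent: new https.Agent({ keepAlive: true }) },
});
```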
r
These are our writes too, taken from the Thundra traces. The time for the operation is so variable, and so high
All these writes are identical
t
have you tried implementing this function in another language
wonder if it's on the dynamo side or the caller side
m
Are those individual function calls or is one function doing all that?
r
So, after further analysis, CloudWatch metrics show that the operations on DDB are happening in <20ms, so the times shown above are what the function experiences, and they show there's some latency being added outside of the actual DDB read or write. I've tried bumping maxSockets up as far as 1000 with keepAlive on, but that makes no difference. @Matt Morgan - that's one lambda function call. @thdxr - I've wondered about rewriting in a different language, but we're in the middle of a project to change the architecture to use a different 3rd-party API, which gives us the things that have changed rather than an entire data set every time, and which we'll handle via a fan-out using SNS or similar. I've also thought about taking that approach here, but I'm not sure the investment of time is worth it given we'll replace it reasonably soon anyway.
I agree, it has started to feel like a Node.js 'thing' though
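(One way to confirm that split between DynamoDB-side time and function-side time is to compare the SDK's own request log with a manual timer around the call. A hedged sketch; the table and key names are made up.)

```js
const AWS = require('aws-sdk');

// SDK v2 logs each call with its own timing, e.g.
// "[AWS dynamodb 200 0.02s 0 retries] getItem(...)"
AWS.config.logger = console;

const documentClient = new AWS.DynamoDB.DocumentClient();

async function timedGet(id) {
  const start = Date.now();
  const result = await documentClient
    .get({ TableName: 'MyTable', Key: { pk: id } })
    .promise();
  // If this is much larger than the SDK/CloudWatch latency, the time is being
  // spent in the function (event loop, DNS, connection setup), not in DynamoDB.
  console.log(`client-observed latency: ${Date.now() - start}ms`);
  return result.Item;
}
```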
m
So how many requests does the one function make? Do you await each one or is it a Promise.all? How many total bytes?
r
At present, we're limiting this to 20 concurrent DynamoDB operations awaited using Promise.all
Less than 30kb of data across 20 operations
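(The pattern being described, roughly, for reference: fixed-size batches of concurrent calls awaited together. An illustrative sketch only; the table name and batch size are placeholders.)

```js
const AWS = require('aws-sdk');
const documentClient = new AWS.DynamoDB.DocumentClient();

// Issue GetItem calls in batches of `batchSize` concurrent requests
// rather than firing all of them at once.
async function getInBatches(keys, batchSize = 20) {
  const items = [];
  for (let i = 0; i < keys.length; i += batchSize) {
    const results = await Promise.all(
      keys
        .slice(i, i + batchSize)
        .map((key) => documentClient.get({ TableName: 'MyTable', Key: key }).promise())
    );
    items.push(...results.map((r) => r.Item));
  }
  return items;
}
```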
m
Still doesn't seem like something that should be a problem. Did you try awaiting each one to see the difference?
Ross, are you using batchWrite? This might be of interest to you: https://github.com/elthrasher/cdk-dynamo-lambda-loader. I wanted to see if I could write a million items in less than a minute (and I could). That reminds me, are you using provisioned or on-demand capacity?
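(For context, the batchWrite approach being suggested looks something like this. A rough sketch, not the repo's code; the table name is a placeholder, and a production version would add backoff around the unprocessed-item retries.)

```js
const AWS = require('aws-sdk');
const documentClient = new AWS.DynamoDB.DocumentClient();

// BatchWriteItem accepts up to 25 put/delete requests per call and may return
// UnprocessedItems, which need to be resubmitted.
async function batchWriteAll(tableName, items) {
  for (let i = 0; i < items.length; i += 25) {
    let requestItems = {
      [tableName]: items.slice(i, i + 25).map((item) => ({ PutRequest: { Item: item } })),
    };
    while (requestItems && Object.keys(requestItems).length > 0) {
      const res = await documentClient.batchWrite({ RequestItems: requestItems }).promise();
      requestItems = res.UnprocessedItems;
    }
  }
}
```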
r
It's On Demand. We haven't tried awaiting each one; it's a good idea, I'll see if we can get that in to try.
Thanks for the link to the repo, definitely taking a look at that. We've thought about batchWrite & batchGet, but it adds a fair amount of complexity to this process, so we've shied away up until now
s
I saw you pinged Alex Debrie on the topic. Interesting to see whether sending a DM/tweet to Rick Houlihan would be worthwhile. Something funky is going on here, and you can't be the first person to come across this
r
AWS support came back with this recommendation (bearing in mind they know this is a Lambda) - is it as wide of the mark as I think it is? "However, I found this post [1], in which someone explains that they were able to bypass the issue by switching to Google's DNS [2] and rebooting the host, so I would urge you to try doing that."
t
what is going on with AWS support lol
s
Yeah, I don't know what to make of that advice
m
holy crap
r
That's what I thought. Then I started doubting myself, thinking: this is 'AWS Premium Support', they must know what they're talking about. The other info they provided was about the operation latency, which I'd already explained was double-digit ms and not the issue.