# random
r
Some of you may remember from last week I was hunting down a peculiar issue with timeouts and DynamoDB. I raised a case with AWS on the subject and this is what they said:
When you use Lambda outside a VPC, it means a request to DynamoDB will go over the internet, which might traverse multiple ISPs. This leaves room for requests to get lost along the path; this is especially true if the issue is intermittent and the error shows timeout exceptions.

It is actually better from a DynamoDB perspective to have your Lambda inside a VPC, because the network path becomes predictable and thus more stable. When you have your Lambda inside the VPC, you can make use of DynamoDB's VPC endpoint to access DynamoDB (Lambda -> Lambda ENI -> DDB VPC endpoint -> DDB service). Furthermore, this allows better monitoring, because you will be able to use VPC flow logs to see network traffic and determine if messages are being dropped. Since DNS queries use UDP, the packets can get lost along the path (Lambda probably has a retry mechanism for DNS queries).
One of the main reasons we moved to DynamoDB was to remove the need for the complexities of VPCs!
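(For anyone who hasn't set one up, the gateway VPC endpoint that support is describing looks roughly like this in CDK with JavaScript. This is an illustrative sketch, not code from the thread; the stack name, construct IDs and VPC settings are made up. A gateway endpoint just adds route-table entries so that traffic to DynamoDB from inside the VPC stays on AWS's network.)

```js
// Sketch only: a VPC plus a DynamoDB gateway endpoint, as suggested by AWS support.
const cdk = require('aws-cdk-lib');
const ec2 = require('aws-cdk-lib/aws-ec2');

class VpcEndpointStack extends cdk.Stack {
  constructor(scope, id, props) {
    super(scope, id, props);

    // Illustrative VPC; real subnet/NAT choices depend on what else the Lambda calls.
    const vpc = new ec2.Vpc(this, 'AppVpc', { maxAzs: 2 });

    // Gateway endpoint: DynamoDB requests from this VPC are routed via the
    // endpoint rather than over the public internet.
    vpc.addGatewayEndpoint('DynamoDbEndpoint', {
      service: ec2.GatewayVpcEndpointAwsService.DYNAMODB,
    });
  }
}

const app = new cdk.App();
new VpcEndpointStack(app, 'VpcEndpointStack');
```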
t
Where did you get this info from?
If you don't use a VPC, Lambdas still run in an Amazon-controlled VPC
r
AWS Developer Support
t
I'm not sure if this info is accurate 😬
r
😬 indeed
c
Haha, this also doesn't sound legit to me... your connections were dropping on almost every second request, far too frequent and predictable to be "random internet path issues"
s
Yeah, seems like a best guess so someone could close a ticket 🤷
ö
I mean, it sounds logical though, no?
s
I mean, I guess internet traffic has something to do with ISPs? 🙂
ö
yes
m
If I have a Lambda function in us-east-1 calling a table in us-east-1, why does my traffic leave us-east-1?
ö
I’m not sure if the message implies so
t
It does seem to imply that. Traffic between lambda and dynamo should stay inside AWS's network
m
Maybe something else happens with global tables? But this is the first time I've heard this perspective.
t
let me ping some people on this
r
Cool, thanks @thdxr, would be interesting to find out what others think is going on. @Matt Morgan that would make a bit more sense, but in our case the tables are entirely region-specific
m
I'd be a little disappointed if global tables worked like that, to be honest, but I don't know one way or the other.
s
I also feel like this would be called out somewhere in the docs?
m
To me the promise of the tech stack is that AWS has managed the networking for us in a good way. If my DDB traffic is being strewn across multiple ISPs and accepting whatever latency they might be adding - well, that just can't be right.
t
IIRC, that's accurate (that a lambda not in a vpc is hitting ddb over the internet), but I also think I understood that AWS still optimizes the network traffic (internet or not) between two services like that (regardless of region); amazon global backbone yada yada.
r
Still fighting with this; rather tired of it now. Now, after a bunch of writes, we're seeing intermittent socket hang ups. Since DynamoDB was built for scale, this just doesn't make sense to me. In this case we're talking about doing 250 writes to a table; it's not exactly Netflix scale. Sticking it in a VPC as per the recommendation from support is mad, because we'll need one, possibly two (for redundancy), always-on NAT Gateways and a much more complicated architecture. 😞
And if it's a case of tuning the http agent, which it could be, there's almost no documentation on what the various timeout values should be
ö
I’m really curious about the answer to this problem
m
This isn't something that's solved with http-keepalive, is it?
r
Sadly not, we have that enabled. I actually just saw a socket timeout for a single DynamoDB query when the system was doing nothing else. I can't find any recommended http agent settings for retry, connection timeout and operation timeout. I feel like I must be missing something, since so many people are using this and not seeing (or noticing) this problem. Although maybe there are two problems: a socket timeout problem and a long operation problem
m
Are your items large? Do you have many GSIs/LSIs? Could you have a hot partition? Is your throughput very high?
r
There's only one additional index, a GSI. Some of the items that are selected are up to 1.6 KB. PKs are unique for each, so no hot partitions. Doesn't seem crazy, does it?
m
What kind of throughput by operation? Dynamo should handle it but just curious.
r
What do you mean by 'throughput by operation', Matt?
m
If you are very write-heavy and the GSI is potentially hot, that could be it. There's no way your problem is "traffic over internet".
I mean how many puts/updates/gets per second.
Do you fail more on reads or writes?
r
Well, the pattern is that there are a lot of GetItems that happen as a result of retrieving a payload from a 3rd-party API. Originally we were trying to do about 500 in parallel and got a tonne of failures, so to work around this we've rate-limited our calls. Then, for any records that we don't find, we'll perform a write; again, we've rate-limited this to stop the failures. So in terms of operations per second we're probably looking at 100, maybe. However, I've seen a socket timeout today with a single GetItem and nothing else going on. Then again, we've turned off retries to see what effect that would have, so maybe we only saw that because it couldn't retry. Our setup of AWS/DDB looks like this:
const https = require('https');
const AWS = require('aws-sdk');

// Reuse connections and allow plenty of sockets for parallel requests.
const agent = new https.Agent({
  maxSockets: 1000,
  keepAlive: true,
});

AWS.config.update({
  httpOptions: {
    timeout: 10_000,      // socket inactivity timeout, in ms
    connectTimeout: 5000, // time allowed to establish the connection, in ms
    agent,
  },
  maxRetries: 0, // retries disabled while we diagnose
});
but we've tried all kinds of different values and permutations
m
That's a single GetItem using the partition key? Doesn't make sense at all to me. I hope you can get your case escalated.
r
That's right, and it's totally bizarre. AWS closed the case because I hadn't responded (I was trying to make sense of it all); I reopened it on Friday.
m
I have never had to mess around with the http client other than setting keepalive and we've hit lambda concurrency limits without DDB throttling/failure.
r
which method do you use to set keepalive?
m
Env var - originally set them explicitly, later by means of various CDK abstractions.
r
AWS_NODEJS_CONNECTION_REUSE_ENABLED?
m
Yes, that's the one but it should be just the same as what you've done with the client.
Also failing to reuse your connection couldn't cause a single request to time out.
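(For anyone following along, the two approaches being compared are roughly these. An illustrative sketch rather than the project's actual code; the client options shown are the standard SDK v2 ones.)

```js
// Option 1: enable connection reuse via the Lambda environment variable
// (no code change): AWS_NODEJS_CONNECTION_REUSE_ENABLED=1
//
// Option 2: pass a keep-alive agent to the SDK v2 client explicitly:
const https = require('https');
const AWS = require('aws-sdk');

const documentClient = new AWS.DynamoDB.DocumentClient({
  httpOptions: { agent: new https.Agent({ keepAlive: true }) },
});
```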
r
These are our writes too, taken from the Thundra traces. The time for the operation is so variable, and so high
All these writes are identical
t
have you tried implementing this function in another language
wonder if it's on the dynamo side or the caller side
m
Are those individual function calls or is one function doing all that?
r
So, after further analysis, CloudWatch metrics show that the operations on DDB are happening in <20ms, so the times shown above are what the function experiences, and they show there's some latency being added outside of the actual DDB read or write. I've tried bumping maxSockets up as far as 1000 with keepAlive on, but that makes no difference. @Matt Morgan - that's one lambda function call. @thdxr - I've wondered about rewriting in a different language, but we're in the middle of a project to change the architecture to use a different 3rd-party API, which gives us the things that have changed rather than an entire data set every time, and which we'll handle via a fan-out using SNS or similar. I've also thought about taking that approach here, but I'm not sure the investment of time is worth it given we'll replace it reasonably soon anyway.
I agree, it has started to feel like a Node.js 'thing' though
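(One way to confirm that split between DynamoDB-side time and function-side time is to compare the SDK's own request log with a manual timer around the call. A hedged sketch; the table and key names are made up.)

```js
const AWS = require('aws-sdk');

// SDK v2 logs each call with its own timing, e.g.
// "[AWS dynamodb 200 0.02s 0 retries] getItem(...)"
AWS.config.logger = console;

const documentClient = new AWS.DynamoDB.DocumentClient();

async function timedGet(id) {
  const start = Date.now();
  const result = await documentClient
    .get({ TableName: 'MyTable', Key: { pk: id } })
    .promise();
  // If this is much larger than the SDK/CloudWatch latency, the time is being
  // spent in the function (event loop, DNS, connection setup), not in DynamoDB.
  console.log(`client-observed latency: ${Date.now() - start}ms`);
  return result.Item;
}
```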
m
So how many requests does the one function make? Do you await each one or is it a Promise.all? How many total bytes?
r
At present, we're limiting this to 20 concurrent DynamoDB operations awaited using Promise.all
Less than 30kb of data across 20 operations
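(The pattern being described, roughly, for reference: fixed-size batches of concurrent calls awaited together. An illustrative sketch only; the table name and batch size are placeholders.)

```js
const AWS = require('aws-sdk');
const documentClient = new AWS.DynamoDB.DocumentClient();

// Issue GetItem calls in batches of `batchSize` concurrent requests
// rather than firing all of them at once.
async function getInBatches(keys, batchSize = 20) {
  const items = [];
  for (let i = 0; i < keys.length; i += batchSize) {
    const results = await Promise.all(
      keys
        .slice(i, i + batchSize)
        .map((key) => documentClient.get({ TableName: 'MyTable', Key: key }).promise())
    );
    items.push(...results.map((r) => r.Item));
  }
  return items;
}
```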
m
Still doesn't seem like something that should be a problem. Did you try awaiting each one to see the difference?
Ross, are you using batchWrite? This might be of interest to you: https://github.com/elthrasher/cdk-dynamo-lambda-loader. I wanted to see if I could write a million items in less than a minute (and I could). That reminds me, are you using provisioned or on-demand capacity?
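(For context, the batchWrite approach being suggested looks something like this. A rough sketch, not the repo's code; the table name is a placeholder, and a production version would add backoff around the unprocessed-item retries.)

```js
const AWS = require('aws-sdk');
const documentClient = new AWS.DynamoDB.DocumentClient();

// BatchWriteItem accepts up to 25 put/delete requests per call and may return
// UnprocessedItems, which need to be resubmitted.
async function batchWriteAll(tableName, items) {
  for (let i = 0; i < items.length; i += 25) {
    let requestItems = {
      [tableName]: items.slice(i, i + 25).map((item) => ({ PutRequest: { Item: item } })),
    };
    while (requestItems && Object.keys(requestItems).length > 0) {
      const res = await documentClient.batchWrite({ RequestItems: requestItems }).promise();
      requestItems = res.UnprocessedItems;
    }
  }
}
```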
r
It's On Demand. We haven't tried awaiting each one; it's a good idea, I'll see if we can get that in to try.
Thanks for the link to the repo, definitely taking a look at that. We've thought about batchWrite & batchGet, but it adds a fair amount of complexity to this process, so we've shied away up until now
s
I saw you pinged Alex Debrie on the topic. Interesting to see whether sending a DM/tweet to Rick Houlihan would be worthwhile. Something funky is going on here, and you can't be the first person to come across this
r
AWS support came back with this recommendation (bearing in mind they know this is a Lambda) - is it as wide of the mark as I think it is? "However, I found this post [1], in which someone explains that they were able to bypass the issue by switching to Google's DNS [2] and rebooting the host, so I would urge you to try doing that."
t
what is going on with AWS support lol
s
Yeah, I don't know what to make of that advice
m
holy crap
r
That's what I thought. Then I started doubting myself, thinking: this is 'AWS Premium Support', they must know what they're talking about. The other info they provided was about the operation latency, which I'd already explained was double-digit ms and not the issue.