# help
a
I’m building a feature that will write events to SQS from a Lambda. The incoming throughput can go very high (~2000 req/s), and the events would be processed in real time. Should I be using Lambda and SQS for this, or something else? Anyone processing data at such scale, how are you approaching it? Should I be looking into MQTT or Kafka? I’m worried about back-pressure and throttling.
t
What does this come in from - apig?
a
yep, apigv2.
t
apigv2 -> lambda -> sqs -> lambda?
a
yep.
t
If your message sizes are small and you only have a single consumer (not pub/sub), then your setup is probably what's best afaik
The other option is Kinesis which is more of a pain but allows things like having different consumers process the same messages / replay
Otherwise your sqs + lambda setup is simple and should scale very nicely - make sure you set your queue consumer options so you're saving money
invoking lambdas in batches of 1000 instead of 1 for example
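(A minimal sketch of what a batched consumer can look like, not anyone's actual handler. It assumes the event source mapping has partial batch responses enabled, so that only the message IDs returned in `batchItemFailures` are retried. The `EventPayload` shape and the `partitionBatch` helper are hypothetical names made up for illustration.)

```typescript
// Minimal local types so the sketch is self-contained; in a real project
// these would normally come from a typings package instead.
interface SqsRecord {
  messageId: string;
  body: string;
}

// Hypothetical payload shape, invented for this example.
interface EventPayload {
  id: string;
  type: string;
}

interface BatchResult {
  valid: EventPayload[];
  // Same shape as the partial batch response Lambda expects from an SQS consumer.
  batchItemFailures: { itemIdentifier: string }[];
}

function isEventPayload(value: unknown): value is EventPayload {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.id === "string" && typeof v.type === "string";
}

// Split one invocation's records into payloads that can be written in bulk
// and message IDs that should be reported back as failed.
function partitionBatch(records: SqsRecord[]): BatchResult {
  const result: BatchResult = { valid: [], batchItemFailures: [] };
  for (const record of records) {
    try {
      const parsed: unknown = JSON.parse(record.body);
      if (!isEventPayload(parsed)) throw new Error("unexpected payload shape");
      result.valid.push(parsed);
    } catch {
      result.batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }
  return result;
}
```

One bad message then fails on its own instead of forcing the whole batch of 1000 to be redelivered. Reporting invalid payloads as failures only makes sense if the queue has a dead-letter queue configured, otherwise they keep coming back until they expire.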
a
yep, single consumer. the consumer lambda just validates the event payload and writes to dynamodb. How do I decide my batchsize, supposing that the throughput is low, say 100 req / s but the events need to be processed immediately and my batchsize is set for 1000?
this would put a 10 second wait, right?
t
no aws should invoke your lambda in parallel
there's 2 parameters on the consumer one sec
a
oh okay.
t
so it's maxBatchingWindow: time and batchSize
d
the default is to poll for 20 seconds in an attempt to reach the batchSize, and if not, process whatever there is.
t
Basically aws will trigger a lambda when either max batching window is hit or batchSize
d
max batching window adds polls to this, for instance if 40 seconds, it would poll twice
t
So if you set it to 10 seconds + 1000 items, worst case your events will be processed with a 10 second delay
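(The arithmetic behind that worst case, as a rough sketch. It assumes a steady arrival rate and only models the two triggers mentioned here, batch size and batching window; other limits such as payload size are ignored. `worstCaseWaitSeconds` is a made-up helper name, not an AWS API.)

```typescript
interface BatchSettings {
  batchSize: number;
  maxBatchingWindowSeconds: number;
}

// How long the first message of a batch can wait before the consumer is
// invoked: either the batch fills up or the window runs out, whichever
// happens first.
function worstCaseWaitSeconds(ratePerSecond: number, settings: BatchSettings): number {
  if (ratePerSecond <= 0) {
    // Nothing else arrives, so only the window can trigger the invoke.
    return settings.maxBatchingWindowSeconds;
  }
  const secondsToFillBatch = settings.batchSize / ratePerSecond;
  return Math.min(secondsToFillBatch, settings.maxBatchingWindowSeconds);
}
```

At 100 req/s with a batch size of 1000, the batch takes 10 seconds to fill, so a 10 second window gives a 10 second worst case and a 2 second window caps it at 2. At 2000 req/s the batch fills in half a second and the window never matters.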
a
Ah! good catch, perfect, this solves my dilemma. Great. 🙏
t
once you have this going you'll probably feel like "wow this is magic" because it'll perfectly scale
a
thank you @thdxr and @Derek Kershner. I appreciate it.
Yeah, another testimonial is coming lol in the next 2 - 3 weeks. 😂😂😂
t
A company I advise made a mistake last week and accidentally 100x'd the events they were receiving. Nothing was affected, sqs scaled, lambdas scaled and dynamo scaled
cost an extra $35
for an otherwise system destroying mistake
a
My company still doesn’t believe that we serve ~ 10 M API requests in $85. 😂
d
just in case it's important, the 20 second poll time setting is receiveMessageWaitTime: Duration.seconds(20), and it is on the Queue itself (Dax’s are on the event source)
to achieve 10 seconds, you would need to lower this as well
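(Roughly where those three settings live in CDK v2, as a config sketch rather than a tested stack. It assumes a standard queue, since FIFO queues have much tighter batch limits. The construct IDs, the `dist/consumer` asset path, and the timeout values are placeholders picked for illustration.)

```typescript
import { Duration, Stack, StackProps } from "aws-cdk-lib";
import * as lambda from "aws-cdk-lib/aws-lambda";
import { SqsEventSource } from "aws-cdk-lib/aws-lambda-event-sources";
import * as sqs from "aws-cdk-lib/aws-sqs";
import { Construct } from "constructs";

export class EventIngestStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Queue-level setting: how long a receive call waits for messages.
    const queue = new sqs.Queue(this, "EventQueue", {
      receiveMessageWaitTime: Duration.seconds(10),
      // Kept well above the function timeout so in-flight batches are not redelivered early.
      visibilityTimeout: Duration.seconds(90),
    });

    const consumer = new lambda.Function(this, "Consumer", {
      runtime: lambda.Runtime.NODEJS_20_X,
      handler: "index.handler",
      code: lambda.Code.fromAsset("dist/consumer"), // placeholder path
      timeout: Duration.seconds(15),
    });

    // Event-source-level settings: invoke when either limit is reached.
    consumer.addEventSource(
      new SqsEventSource(queue, {
        batchSize: 1000,
        maxBatchingWindow: Duration.seconds(10),
        reportBatchItemFailures: true,
      }),
    );
  }
}
```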
a
oh, good eye!
I should keep this queue as FIFO, right? Any idea how FIFO guarantees ordering?
d
on ordering, it just orders based on time received to my knowledge. deduplication is the stronger use case, and they provide an id param.
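(A toy in-memory model of that deduplication behaviour, to show what the id param buys you. It assumes the documented 5 minute deduplication interval for FIFO queues. `DedupWindow` is a hypothetical class written for this example, it is not how SQS is implemented and not part of any SDK.)

```typescript
// SQS FIFO queues treat a repeated deduplication ID as a duplicate for 5 minutes.
const DEDUP_WINDOW_MS = 5 * 60 * 1000;

class DedupWindow {
  // Deduplication ID -> timestamp (ms) at which it was last accepted.
  private readonly seen = new Map<string, number>();

  // Returns true if the message would be delivered, false if it would be
  // dropped as a duplicate of one accepted less than 5 minutes earlier.
  accept(deduplicationId: string, nowMs: number): boolean {
    const acceptedAt = this.seen.get(deduplicationId);
    if (acceptedAt !== undefined && nowMs - acceptedAt < DEDUP_WINDOW_MS) {
      return false;
    }
    this.seen.set(deduplicationId, nowMs);
    return true;
  }
}
```

So a client retry that reuses the same ID a few seconds later is swallowed, but the same ID sent after the window has passed goes through again. A consumer that must never double-write still needs to be idempotent on its own.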
a
Ah! de-duplication would be lovely, thanks I’ll venture into more depth on the fifo based features. Thank you once again for all your assistance @thdxr and @Derek Kershner. 😁🙏
d
FIFO is 20% more expensive, if you can make it idempotent, you should
a
If fifo saves me time, I’ll prefer fifo as the feature will make me more money than what fifo might cost. I really don’t want my frontend logic to become convoluted to ensure idempotency. I’ll think on it, thank you for the heads-up.
d
makes sense to me
I just looked it up, I was thinking of FIFO SNS, FIFO SQS ain't so bad (20% more).
a
Yeah, on-demand is dirt cheap. 😂
a
Lower bandwidth on FIFO queues too, so watch out for that. The place I'd use Kinesis for is very high bandwidth and sharding data to help (but not guarantee) order. It's popular for IoT data ingestion, sharding on the device ID so that the same Lambda processes all messages from the same device.
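(A sketch of why sharding on the device ID keeps one device's messages together. Kinesis hashes the partition key with MD5 to pick a shard; this example swaps in a small FNV-1a hash and a plain modulo as a stand-in so it needs no imports. `fnv1a` and `shardFor` are illustrative names only.)

```typescript
// 32-bit FNV-1a over the UTF-16 code units of the key. A stand-in for the
// MD5-based mapping Kinesis really uses, chosen only to keep this short.
function fnv1a(key: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Deterministically map a partition key (here: a device ID) to a shard index.
function shardFor(partitionKey: string, shardCount: number): number {
  return fnv1a(partitionKey) % shardCount;
}
```

Because the mapping depends only on the key, every message with the same device ID lands on the same shard and is read in order by that shard's consumer. Different devices may still share a shard, and resharding changes the mapping, which is part of why ordering is helped but not guaranteed.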
a
I am going to take a day and do some in-depth reading but from what I understand AWS MSK is managed Kafka as a service and almost all large scale enterprises that rely on realtime event processing recommend it over Kinesis / SQS as it allows you to customise how you handle incoming throughput. For async event processing, EventBridge is the preferred choice as it can integrate with almost all AWS services and handle processing as you wish.
d
EventBridge is an outlier in the comparison, it is primarily about fanout, not throttling/batching, and it itself has pretty low throughput compared to this group. SNS is a closer comparable.
Otherwise, you have a solid understanding and it jibes with what I know. Just know that Kafka is going to be quite a bit more initial setup than any of the native AWS stuff, and likely more expensive.
a
Yep, I agree but when async processing comes into play, throughput is usually not the concern. async processing (in AWS context) usually tries to use other AWS services which EventBridge integrates with very well.
d
Makes sense, I was only mentioning due to your first message:
"The incoming throughput can go very high ~ 2000 req/s"
a
I am really not worried about the cost. Regarding the setup, I believe AWS MSK will definitely make life easier. I got in touch with a couple of people at an analytics company who process at a scale of billions, and they say MSK is a breeze compared to a K8s Kafka operator. Oh right, I need realtime processing and so EventBridge is a no go.
Thank you @Derek Kershner. 😁🙏
d
"MSK is a breeze compared to a K8s Kafka operator."
With this, I can fully agree.
"I believe AWS MSK will definitely make life easier."
With this, I think you need to add words: "because our use case is very complex, and we are thinking about the long run."
a
My use cases would never outgrow those of an analytics company. We’re currently using third-party services for analytics, ads, marketing, etc. My use case is to make something custom in-house so that I can save cost. They’re very pricey, like 10 to 50x my API costs, and I am unable to audit their stats because when I aggregate CloudWatch logs data the numbers differ by 30 - 35%, and that’s huge. A custom solution could also integrate better and rely on events directly from the backend and make my frontend apps lighter. So, that’s what I am planning to do.
d
I see, with that information, I think you are on the right track, but I wouldn’t necessarily discount Kinesis just because some folks choose MSK. I’ve never been in your particular shoes (using Kafka for throughput reasons), but I was part of a decision on centralized eventing (where we chose EventBridge over MSK for simplicity and serverless pricing). Your decision is quite different, though, and good luck!
I would add Firehose to your decision set as well. It uses Kinesis, but has a different pricing model (pure consumption), analytics related transformation capabilities, and is specifically about throughput (extremely high).
a
Thank you, yep, can’t hurry with such impacting decisions, I am going to take my time and ingest TBs worth of info and feedback before I choose my way ahead. Obviously, I’ll keep this community informed or even write a blog post. 😅
d
Sounds good to me, please personally @ me if you do either of those last ones.
a
Sure, I will.
a
Kafka is really powerful, but overkill for point to point queuing. Kafka is an event bus. You can replicate it with AWS core services by combining SNS with a combination of SQS queues and Kinesis data streams in various configurations, which is more complex than just using Kafka. That complexity can be managed by making your own CDK constructs. Your original description was just buffering requests from one lambda to another though. For that, SQS.
a
Yes, for the feature that I plan to implement in the coming weeks I’m going to go with SQS. I believe the FIFO feature would even reduce the duplicate events that might come in and save me a lot of money. Kafka comes into play for near-realtime, non-user-impacting services where the throughput would be determined not by the user but by my configuration. That’s where I might resort to MSK instead of EventBridge. There are many services which actually don’t need to be realtime, but I know my stakeholders and their sporadic feature requests, so I’d rather make them near-realtime from the beginning. Also, thank you for your assistance @Adam Fanello. I appreciate it. 😁🙏