Serverless Stack #help

Ashishkumar Pandey

02/15/2022, 5:02 PM

I’m building a feature which is going to be writing events to SQS from a lambda. The incoming throughput can go very high ~ 2000 req/s and these would be processed in realtime. Should I be using lambda and SQS for this, or something else? Anyone processing data at such scale, how are you approaching it? Should I be looking into MQTT or Kafka? I’m worried about back-pressure and throttling.

thdxr

02/15/2022, 5:03 PM

What does this come in from - apig?

Ashishkumar Pandey

02/15/2022, 5:04 PM

yep, apigv2.

thdxr

02/15/2022, 5:05 PM

apigv2 -> lambda -> sqs -> lambda?

Ashishkumar Pandey

02/15/2022, 5:05 PM

yep.

thdxr

02/15/2022, 5:05 PM

If your message sizes are small and you only have a single consumer (not pub sub) then your setup is probably what's best afaik

thdxr

02/15/2022, 5:05 PM

The other option is Kinesis which is more of a pain but allows things like having different consumers process the same messages / replay

thdxr

02/15/2022, 5:06 PM

Otherwise your sqs + lambda setup is simple and should scale very nicely - make sure you set your queue consumer options so you're saving money

thdxr

02/15/2022, 5:07 PM

invoking lambdas in batches of 1000 instead of 1 for example

Ashishkumar Pandey

02/15/2022, 5:08 PM

yep, single consumer. the consumer lambda just validates the event payload and writes to dynamodb. How do I decide my batchsize, supposing that the throughput is low, say 100 req / s but the events need to be processed immediately and my batchsize is set for 1000?

Ashishkumar Pandey

02/15/2022, 5:09 PM

this would put a 10 second wait, right?

thdxr

02/15/2022, 5:09 PM

no aws should invoke your lambda in parallel

thdxr

02/15/2022, 5:09 PM

there's 2 parameters on the consumer one sec

Ashishkumar Pandey

02/15/2022, 5:09 PM

oh okay.

thdxr

02/15/2022, 5:10 PM

so it's maxBatchingWindow: time and batchSize

Derek Kershner

02/15/2022, 5:10 PM

the default is to poll for 20 seconds in an attempt to reach the batchSize, and if not, process whatever there is.

thdxr

02/15/2022, 5:10 PM

Basically aws will trigger a lambda when either max batching window is hit or batchSize

Derek Kershner

02/15/2022, 5:11 PM

max batching window adds polls to this, for instance if 40 seconds, it would poll twice

thdxr

02/15/2022, 5:11 PM

So if you set it to 10 seconds + 1000 items, worst case your events will be processed with a 10 second delay
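
To make the trigger rule concrete, here is a minimal TypeScript sketch of the arithmetic. It assumes a single poller and a steady arrival rate, which is a simplification (as noted above, AWS invokes consumers in parallel). `BatchingConfig` and `worstCaseWaitSec` are hypothetical names for illustration, not part of any AWS API.

```typescript
// Simplified model of the rule described above: a batch is handed to the
// consumer when it reaches batchSize OR when the batching window expires,
// whichever comes first. This is an illustration, not AWS's real scheduler.
interface BatchingConfig {
  batchSize: number;            // max messages per invocation
  maxBatchingWindowSec: number; // max seconds to wait while filling a batch
}

// Worst-case seconds the first message of a batch waits before the consumer
// is invoked, assuming messages arrive at a steady ratePerSec.
function worstCaseWaitSec(ratePerSec: number, cfg: BatchingConfig): number {
  if (ratePerSec <= 0) {
    // Nothing else arrives, so only the window can trigger the invocation.
    return cfg.maxBatchingWindowSec;
  }
  const secondsToFillBatch = cfg.batchSize / ratePerSec;
  return Math.min(secondsToFillBatch, cfg.maxBatchingWindowSec);
}

const cfg: BatchingConfig = { batchSize: 1000, maxBatchingWindowSec: 10 };

console.log(worstCaseWaitSec(100, cfg));  // 10  (window and batch fill coincide)
console.log(worstCaseWaitSec(2000, cfg)); // 0.5 (batch fills long before the window)
```

Under this model, shrinking the window is what bounds latency at low traffic, while the batch size mostly matters for cost at high traffic.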

Ashishkumar Pandey

02/15/2022, 5:11 PM

Ah! good catch, perfect, this solves my dilemma. Great. 🙏

thdxr

02/15/2022, 5:12 PM

once you have this going you'll probably feel like "wow this is magic" because it'll perfectly scale

Ashishkumar Pandey

02/15/2022, 5:12 PM

thank you @thdxr and @Derek Kershner. I appreciate it.

Ashishkumar Pandey

02/15/2022, 5:12 PM

Yeah, another testimonial is coming lol in the next 2 - 3 weeks. 😂😂😂

thdxr

02/15/2022, 5:12 PM

A company I advise made a mistake last week and accidentally 100x'd the events they were receiving. Nothing was affected, sqs scaled, lambdas scaled and dynamo scaled

thdxr

02/15/2022, 5:12 PM

cost an extra $35

thdxr

02/15/2022, 5:12 PM

for an otherwise system destroying mistake

Ashishkumar Pandey

02/15/2022, 5:13 PM

My company still doesn’t believe that we serve ~ 10 M API requests in $85. 😂

Derek Kershner

02/15/2022, 5:14 PM

just in case it's important, the 20 second poll time setting is receiveMessageWaitTime: Duration.seconds(20), and it is on the Queue itself (Dax’s are on the event source)

Derek Kershner

02/15/2022, 5:15 PM

to achieve 10 seconds, you would need to lower this as well
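
For reference, a hedged AWS CDK v2 sketch of where each of the three settings lives. Only `receiveMessageWaitTime`, `batchSize`, and `maxBatchingWindow` come from the discussion; the construct IDs, the inline handler, the runtime, and the timeout values are assumptions for illustration.

```typescript
import { App, Duration, Stack } from "aws-cdk-lib";
import * as lambda from "aws-cdk-lib/aws-lambda";
import { SqsEventSource } from "aws-cdk-lib/aws-lambda-event-sources";
import * as sqs from "aws-cdk-lib/aws-sqs";

const app = new App();
const stack = new Stack(app, "EventsStack"); // hypothetical stack name

// Queue-level setting: the long-poll wait time, lowered from 20s to 10s
// as suggested in the thread.
const queue = new sqs.Queue(stack, "EventsQueue", {
  receiveMessageWaitTime: Duration.seconds(10),
  // Assumed value; it should be at least as long as the consumer timeout.
  visibilityTimeout: Duration.seconds(60),
});

// Placeholder consumer; a real one would validate and write to DynamoDB.
const consumer = new lambda.Function(stack, "Consumer", {
  runtime: lambda.Runtime.NODEJS_20_X, // assumed runtime
  handler: "index.handler",
  code: lambda.Code.fromInline(
    "exports.handler = async (event) => { console.log(event.Records.length); };"
  ),
  timeout: Duration.seconds(10),
});

// Event-source-level settings: invoke when 1000 messages are collected
// or after 10 seconds, whichever comes first.
consumer.addEventSource(
  new SqsEventSource(queue, {
    batchSize: 1000,
    maxBatchingWindow: Duration.seconds(10),
  })
);

app.synth();
```

Note that a batch size above 10 only applies to standard queues and requires a batching window to be set; FIFO queues are limited to batches of 10.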

Ashishkumar Pandey

02/15/2022, 5:15 PM

oh, good eye!

Ashishkumar Pandey

02/15/2022, 5:17 PM

I should keep this queue as FIFO, right? Any idea how FIFO guarantees ordering?

Derek Kershner

02/15/2022, 5:18 PM

on ordering, it just orders based on time received to my knowledge. de-duplication is the stronger use case, and they provide an id param.

Ashishkumar Pandey

02/15/2022, 5:20 PM

Ah! de-duplication would be lovely, thanks I’ll venture into more depth on the fifo based features. Thank you once again for all your assistance @thdxr and @Derek Kershner. 😁🙏

Derek Kershner

02/15/2022, 5:20 PM

FIFO is 20% more expensive, if you can make it idempotent, you should
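
One way to get that idempotency on the consumer side, rather than in the frontend, is to make the write itself conditional. The sketch below assumes each event carries a unique `eventId` (the thread does not describe the payload, so this is an assumption) and uses an in-memory `Map` as a stand-in for the table. With a real DynamoDB table, the rough equivalent would be a conditional put using `attribute_not_exists` on the key.

```typescript
// Hypothetical event shape: eventId is assumed to be unique per logical event.
interface EventRecord {
  eventId: string;
  payload: string;
}

// In-memory stand-in for a table that supports "insert only if absent".
class EventStore {
  private readonly items = new Map<string, string>();

  // Returns true if the event was stored, false if the ID was already present.
  putIfAbsent(record: EventRecord): boolean {
    if (this.items.has(record.eventId)) {
      return false;
    }
    this.items.set(record.eventId, record.payload);
    return true;
  }

  get size(): number {
    return this.items.size;
  }
}

// Processes a batch and returns how many events were actually written.
// Duplicates (within the batch or from a redelivery) are skipped.
function handleBatch(store: EventStore, batch: EventRecord[]): number {
  let written = 0;
  for (const record of batch) {
    if (store.putIfAbsent(record)) {
      written++;
    }
  }
  return written;
}

const store = new EventStore();
const batch: EventRecord[] = [
  { eventId: "a", payload: "click" },
  { eventId: "b", payload: "view" },
  { eventId: "a", payload: "click" }, // duplicate delivery of event "a"
];

const firstPass = handleBatch(store, batch);  // 2 events written
const redelivery = handleBatch(store, batch); // 0, everything already seen
console.log(firstPass, redelivery, store.size);
```

With a consumer like this, a standard queue's at-least-once delivery no longer produces duplicate rows, which is the property FIFO de-duplication would otherwise be paying for.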

Ashishkumar Pandey

02/15/2022, 5:23 PM

If fifo saves me time, I’ll prefer fifo as the feature will make me more money than what fifo might cost. I really don’t want my frontend logic to become convoluted to ensure idempotency. I’ll think on it, thank you for the heads-up.

Derek Kershner

02/15/2022, 5:24 PM

makes sense to me

Derek Kershner

02/15/2022, 5:25 PM

I just looked it up, I was thinking of FIFO SNS, FIFO SQS ain't so bad (20% more).

Ashishkumar Pandey

02/15/2022, 5:26 PM

Yeah, on-demand is dirt cheap. 😂

Adam Fanello

02/15/2022, 5:26 PM

Lower bandwidth on FIFO queues too, so watch out for that. The place I'd use Kinesis for is very high bandwidth and sharding data to help (but not guarantee) order. It's popular for IoT data ingestion, sharding on the device ID so that the same Lambda processes all messages from the same device.
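
The sharding idea can be sketched in a few lines of TypeScript. Kinesis itself hashes the partition key with MD5 and maps the result onto per-shard hash ranges; the version below swaps in a 32-bit FNV-1a hash and equal-sized ranges to stay dependency-free, so treat it as an illustration of the routing principle only. `Reading`, `shardFor`, and `groupByShard` are hypothetical names.

```typescript
// 32-bit FNV-1a hash, used here only as a simple deterministic stand-in
// for the MD5-based hashing that Kinesis performs on partition keys.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Maps a partition key onto one of shardCount equal-sized hash ranges.
function shardFor(partitionKey: string, shardCount: number): number {
  const fraction = fnv1a32(partitionKey) / 0x100000000; // in [0, 1)
  return Math.floor(fraction * shardCount);
}

interface Reading {
  deviceId: string;
  value: number;
}

// Groups readings by shard, using the device ID as the partition key, so
// every reading from one device lands in the same shard in arrival order.
function groupByShard(readings: Reading[], shardCount: number): Reading[][] {
  const shards: Reading[][] = [];
  for (let i = 0; i < shardCount; i++) {
    shards.push([]);
  }
  for (const reading of readings) {
    shards[shardFor(reading.deviceId, shardCount)].push(reading);
  }
  return shards;
}

const readings: Reading[] = [
  { deviceId: "device-a", value: 1 },
  { deviceId: "device-b", value: 7 },
  { deviceId: "device-a", value: 2 },
  { deviceId: "device-c", value: 4 },
  { deviceId: "device-a", value: 3 },
];

const shards = groupByShard(readings, 4);
shards.forEach((items, index) => console.log(index, items));
```

Because the shard is a pure function of the device ID, per-device ordering is kept within a shard, while different devices can be processed in parallel.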

Ashishkumar Pandey

02/15/2022, 5:30 PM

I am going to take a day and do some in-depth reading but from what I understand AWS MSK is managed Kafka as a service and almost all large scale enterprises that rely on realtime event processing recommend it over Kinesis / SQS as it allows you to customise how you handle incoming throughput. For async event processing, EventBridge is the preferred choice as it can integrate with almost all AWS services and handle processing as you wish.

Derek Kershner

02/15/2022, 5:31 PM

Eventbridge is an outlier in the comparison, it is primarily about fanout, not throttling/batching, and it itself has pretty low throughput compared to this group. SNS is a closer comparable.

Derek Kershner

02/15/2022, 5:33 PM

Otherwise, you have a solid understanding and it jibes with what I know. Just know that Kafka is going to be quite a bit more initial setup than any of the native AWS stuff, and likely more expensive.

Ashishkumar Pandey

02/15/2022, 5:34 PM

Yep, I agree but when async processing comes into play, throughput is usually not the concern. async processing (in AWS context) usually tries to use other AWS services which EventBridge integrates with very well.

Derek Kershner

02/15/2022, 5:36 PM

Makes sense, I was only mentioning due to your first message:

The incoming throughput can go very high ~ 2000 req/s

Derek Kershner

02/15/2022, 5:37 PM

Amazon EventBridge quotas - Amazon EventBridge

Ashishkumar Pandey

02/15/2022, 5:37 PM

I am really not worried about the cost. Regarding the setup, I believe AWS MSK will definitely make life easier. I got in touch with a couple of people at an analytics company who process at a scale of billions, and they say MSK is a breeze compared to a K8s Kafka operator. Oh right, I need realtime processing, so EventBridge is a no go.

Ashishkumar Pandey

02/15/2022, 5:40 PM

Thank you @Derek Kershner. 😁🙏

Derek Kershner

02/15/2022, 5:40 PM

"MSK is a breeze compared to a K8s Kafka operator."

With this, I can fully agree.

"I believe AWS MSK will definitely make life easier."

With this, I think you need to add words:

"because our use case is very complex, and we are thinking about the long run."

Ashishkumar Pandey

02/15/2022, 5:45 PM

My use cases would never outgrow those of an analytics company. We’re currently using third-party services for analytics, ads, marketing, etc. My use case is to make something custom in-house so that I can save cost. They’re very pricey, like 10 to 50x my API costs, plus I am unable to audit their stats, because when I aggregate cloudwatch logs data the numbers differ by 30 - 35% and that’s huge. A custom solution could also integrate better and rely on events directly from the backend and make my frontend apps lighter. So, that’s what I am planning to do.

Derek Kershner

02/15/2022, 5:54 PM

I see, with that information, I think you are on the right track, but I wouldn’t necessarily discount kinesis just because some folks choose MSK. I’ve never been in your particular shoes (using Kafka for throughput reasons), but I was part of a decision on centralized eventing (where we chose Eventbridge over MSK for simplicity and serverless pricing). Your decision is quite different, though, and good luck!

Derek Kershner

02/15/2022, 5:56 PM

I would add Firehose to your decision set as well. It uses Kinesis, but has a different pricing model (pure consumption), analytics-related transformation capabilities, and is specifically about throughput (extremely high).

Ashishkumar Pandey

02/15/2022, 5:58 PM

Thank you, yep, can’t hurry with such impacting decisions, I am going to take my time and ingest TBs worth of info and feedback before I choose my way ahead. Obviously, I’ll keep this community informed or even write a blog post. 😅

Derek Kershner

02/15/2022, 5:59 PM

Sounds good to me, please personally @ me if you do either of those last ones.

Ashishkumar Pandey

02/15/2022, 5:59 PM

Sure, I will.

Adam Fanello

02/15/2022, 5:59 PM

Kafka is really powerful, but overkill for point to point queuing. Kafka is an event bus. You can replicate it with AWS core services by combining SNS with a combination of SQS queues and Kinesis data streams in various configurations, which is more complex than just using Kafka. That complexity can be managed by making your own CDK constructs. Your original description was just buffering requests from one lambda to another though. For that, SQS.

Ashishkumar Pandey

02/15/2022, 6:05 PM

Yes, for the feature that I plan to implement in the coming weeks I’m going to go with SQS. I believe the FIFO feature would even reduce the duplicate events that might come in and save me a lot of money. Kafka comes into play for near-realtime, non-user-impacting services where the throughput would depend not on the user but on my configuration. That’s where I might resort to MSK instead of EventBridge. There are many services which actually don’t need to be realtime, but I know my stakeholders and their sporadic feature requests, so I’d rather make them near-realtime from the beginning. Also, thank you for your assistance @Adam Fanello. I appreciate it. 😁🙏
