# help
o
About to start work on a feature that will probably use AWS Step Functions (SFN), and wanted to get the community’s read on serverless orchestration:
• How do folks feel about SFN in production?
• How’s latency at scale? Observability? Debugging?
• Is it hard to write complex workflows?
Been keeping an eye on some alternatives, but none seemed better enough to forgo SFN:
• Temporal: I like their approach of defining workflows as code that gets executed instead of a DSL. Seems easier to reason about, for the same reason that I like CDK over CF’s JSON/YAML. Alas, it has no support for Lambda and executes its tasks in-house, but they have a cloud version.
• Orkes Conductor: the original engineers who built Netflix Conductor just started this company around it. Always thought Conductor wasn’t suited for serverless, but they talk about Lambda support and a cloud version in their docs, so I’ll be chatting with them soon.
g
Been using Step Functions in prod for a few months now and have had 0 problems with it. Latency is not bad at all. We do a lot of timer-based waits, and even with a 7-day wait period on one of the tasks it was only off by 0.004s, which is insane (that's for the one I just pulled up; will try to find the one off by the most for comparison).
Looks like the most it has been off, going by the 20 most recent executions, is 0.018s, so no complaints at all from me there.
And it looks like initial startup time on the step functions is about 50-100ms before it starts actually executing what you define.
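For anyone who hasn't used timer-based waits, here is a minimal sketch of what a 7-day wait looks like in Amazon States Language (ASL), built as a plain Python dict. The state names (`WaitSevenDays`, `FollowUp`) and the helper `wait_state` are made up for illustration, not taken from the workflow described above.

```python
import json

SECONDS_PER_DAY = 24 * 60 * 60


def wait_state(days: int, next_state: str) -> dict:
    """Build an ASL Wait state that pauses for a fixed number of days."""
    return {
        "Type": "Wait",
        "Seconds": days * SECONDS_PER_DAY,
        "Next": next_state,
    }


# Hypothetical state machine: wait 7 days, then continue to a placeholder step.
definition = {
    "StartAt": "WaitSevenDays",
    "States": {
        "WaitSevenDays": wait_state(7, "FollowUp"),
        "FollowUp": {"Type": "Pass", "End": True},
    },
}

print(json.dumps(definition, indent=2))
```

A Wait state can also take `Timestamp`, `SecondsPath`, or `TimestampPath` instead of a fixed `Seconds` value if the wait depends on the input.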
f
I mostly used SFN for async jobs, so I never looked into the latency part. @Garret Harp’s 50-100ms on initial start up sounds similar to what we are seeing as well.
Writing complex workflows is relatively easy. You can write modular workflows and then have one step function invoke another.
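A rough sketch of the "one step function invokes another" pattern, again as an ASL definition built in Python. It uses the `states:startExecution.sync:2` service integration, which makes the parent wait for the child to finish. The child ARN, the state names, and the `payload` input key are placeholders, not anything from a real setup.

```python
import json
from typing import Optional

# Placeholder ARN for illustration only.
CHILD_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:ChildWorkflow"


def start_child_state(child_arn: str, next_state: Optional[str] = None) -> dict:
    """Build a Task state that runs a child state machine and waits for it."""
    state = {
        "Type": "Task",
        "Resource": "arn:aws:states:::states:startExecution.sync:2",
        "Parameters": {
            "StateMachineArn": child_arn,
            "Input": {
                # Pass the parent's current state input through to the child.
                "payload.$": "$",
                # Links the child execution to the parent in the console.
                "AWS_STEP_FUNCTIONS_STARTED_BY_EXECUTION_ID.$": "$$.Execution.Id",
            },
        },
    }
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state


parent_definition = {
    "StartAt": "RunChild",
    "States": {"RunChild": start_child_state(CHILD_ARN)},
}

print(json.dumps(parent_definition, indent=2))
```

Dropping the `.sync:2` suffix turns this into fire-and-forget: the parent starts the child and moves on without waiting for its result.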
Are you thinking of using a Standard or an Express workflow?
The former has better out-of-the-box tooling in the SFN console.
a
A couple of comments to add to the above:
• Similar to @Frank, I have only used this for async workflows and not user/client-initiated requests, so I can’t speak to latency.
• Defining complex workflows is a way better experience using CDK than ASL or other IaC tools, in my opinion.
• Some of the console tooling for monitoring the progress of workflows breaks down with complicated state machines (i.e. multiple nested Map states).
• There are some annoying limits on payload sizes and especially on execution history, which is what pushes you into splitting into multiple state machines like @Frank mentioned. Knowing these up front and designing accordingly will save a lot of time.
• As for aggregating logs: in a recent project we took the top-most state machine’s execution ID, propagated it to all functions / child workflows, and added it as context to logs, which made it easy to tie logs to an overall workflow run.
• Overall I really like working with SFN. It is very powerful and effective once you get used to the quirks/limits!
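The log-aggregation idea above could look roughly like this. It is only a sketch under a few assumptions: the Task state uses the Lambda function ARN directly as its `Resource` (so `Parameters` becomes the Lambda event), and the field name `rootExecutionId`, the helper names, and the log messages are all made up for illustration.

```python
import json


def task_parameters() -> dict:
    """Parameters for a Task state in the top-most state machine.

    "$$.Execution.Id" is the context-object path to the current execution ID.
    Child workflows would receive rootExecutionId in their input and pass it
    along as-is instead of reading their own execution ID.
    """
    return {
        "rootExecutionId.$": "$$.Execution.Id",
        "input.$": "$",
    }


def log_line(root_execution_id: str, message: str, **fields) -> str:
    """Format a structured log line tagged with the overall workflow run."""
    record = {"rootExecutionId": root_execution_id, "message": message, **fields}
    return json.dumps(record, sort_keys=True)


def handler(event: dict, context=None) -> dict:
    """Hypothetical Lambda handler that tags every log line with the root ID."""
    root_id = event.get("rootExecutionId", "unknown")
    print(log_line(root_id, "task started"))
    # ... the actual work would go here ...
    print(log_line(root_id, "task finished", status="ok"))
    # Return the ID so later states can keep propagating it.
    return {"rootExecutionId": root_id, "status": "ok"}
```

With every function and child workflow logging the same `rootExecutionId`, a single filter on that field in the log tool pulls up the whole run.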
o
Thanks so much folks, this has been helpful. I’ll be mostly using Standard workflows because I’ll be processing payments that have up to a 60-day processing period, though some get done in minutes. I’m also considering just going direct to the DB; that will depend on how complex the logic ends up being. I may end up with the entire process using a few state machines, with the overall state tracked in Dynamo.