Any tips to debug a long running stack creation/de...
# help
s
Any tips to debug a long running stack creation/deletion? I am using the sst.Script construct to manage DB migrations. I've written a simple lambda that creates a database
onCreate
and removes a database
onDelete
. I've tested by putting the lambda behind the sst.API construct and it works fine. However, when I define the sst.Script construct with the same lambda,
sst start
and
sst remove
take very long (20+ minutes). Without the Script construct, these actions take around 1-2 minutes to deploy from scratch. Any tips to debug this, or is this the expected behavior with custom resources?
t
if you make the script do nothing does it still take 20min?
s
Funny you should ask that. I tried that the other day and it was faster. I will try again shortly (after the stack finally removes) to get a concrete example
t
I remember seeing this when my script errored out and cloudformation would get stuck
and there's some 20min timeout somewhere
f
hmm.. when it took 20+ minutes to run, did the script fail in the end? Or did it successfully run? @Seth Geoghegan
m
Sounds like a VPC-bound function. Sometimes stack removal can hang on deleting ENIs for ~20 minutes.
And sometimes that doesn't happen. It's a Cfn thing, not CDK/SST.
More info if interested: https://forum.serverless.com/t/very-long-delay-when-doing-sls-remove-of-lambda-in-a-vpc/2535/10 (from sls forums but the issue similarly isn't related to sls)
s
@thdxr I commented out the lambda to a bare-bones implementation and it still took 21 minutes to remove. However, I think @Matt Morgan is onto something with his comment, as I did put this lambda function inside of a VPC. I assume this is necessary because it's running migrations against an RDS database.
t
ahh
s
Another strike against RDS in my book 🙂
m
Yup, pretty much. If I can't get the best devexp, I'm not that interested.
s
Agreed. Perhaps I'll create a script to run migrations locally or trigger the lambdas some other way. Although I prefer the custom resource approach,
yarn run db:migrate
or similar doesn't seem awful.
@Frank To answer your question, it did not fail. It completed, but just took a looong time to do so 🙂
m
Why do you remove the function after running it? Just to ensure it doesn't run again?
s
Oh, I'm just testing the entire workflow to ensure it operates properly. I'm creating a long-lived RDS "dev" instance for a team of developers. I want each team member to check out the app, run
sst start
and have a database created for them within the development RDS instance. Using this approach, I hope to eliminate the need to run a local DB in docker (what they do today) and push all development for this particular service to the cloud. I'm mostly trying to ensure that
sst start
and
sst remove
deliver a good experience. I suppose it's not the end of the world if
sst remove
is slow, since it shouldn't happen terribly often. Just exploring options 🙂
To be clear, I'm not specifically removing the function itself. I'm just kicking the tires on a typical developer workflow to make sure it feels good
m
Makes sense. I know there are some workarounds/hacks for this problem. My org has just sucked it up when necessary and generally avoided VPCs if we can.
s
Yeah, this could very well be a "suck it up" moment.
@Frank Spoke too soon. Seeing timeouts on
sst start
as well. I doubt there is anything SST specific to do here, just providing this for insight:
sgeoghegan-my-ts-sst-app-kas-migration | CREATE_FAILED | Custom::SSTScript | AfterDeployScriptResource204FEBCF | Received response status [FAILED] from custom resource. Message returned: connect ETIMEDOUT 54.161.169.125:443

Logs: /aws/lambda/sgeoghegan-my-ts-sst-app--AfterDeployonCreateFunct-SjMUZ26yDHhS

    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1159:16) (RequestId: a81d3c7b-a1f9-46c0-86b9-fcda6e880ab8)
Here's a GitHub issue thread on Serverless Framework discussing the issue.
f
Interesting.. I think we could manually remove the ENI similar to what this plugin does mentioned in the issue above https://github.com/medikoo/serverless-plugin-vpc-eni-cleanup
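A rough sketch of the selection step such a cleanup might use: given network interface records (shaped like a subset of what EC2's DescribeNetworkInterfaces returns), pick the ones that look like leftover Lambda ENIs in a given VPC. The `NetworkInterfaceInfo` type and `findOrphanedLambdaEnis` function are hypothetical names for illustration, and the "AWS Lambda VPC ENI" description prefix is an assumption about how Lambda labels its ENIs, so verify it against your own account before deleting anything.

```typescript
// Minimal subset of the fields we care about from an ENI description.
// (Hypothetical type for this sketch, not an SDK type.)
interface NetworkInterfaceInfo {
  NetworkInterfaceId: string;
  Description: string;
  Status: string; // e.g. "available" (detached) or "in-use"
  VpcId: string;
}

// Assumption: ENIs created by Lambda have a description starting with this prefix.
const LAMBDA_ENI_PREFIX = "AWS Lambda VPC ENI";

// Returns the IDs of ENIs in the given VPC that look like Lambda ENIs
// and are no longer attached, i.e. candidates for manual deletion.
function findOrphanedLambdaEnis(
  enis: NetworkInterfaceInfo[],
  vpcId: string
): string[] {
  return enis
    .filter((eni) => eni.VpcId === vpcId)
    .filter((eni) => eni.Description.indexOf(LAMBDA_ENI_PREFIX) === 0)
    .filter((eni) => eni.Status === "available")
    .map((eni) => eni.NetworkInterfaceId);
}
```

Only detached ("available") interfaces are selected here, since an ENI that is still in use can't be deleted anyway.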
@Seth Geoghegan is the
TCPConnectWrap.afterConnect
timeout on
sst start
related to the
sst remove
timeout?
s
@Frank That is my assumption, since it only happened when I introduced the SST Script construct. Removing the Script construct eliminates the timeout. However, I have not seen a specific error that confirms the two are related. If it helps, this is how I'm defining my Script
new sst.Script(this, "DbMigrate", {
  defaultFunctionProps: {
    environment: {
      SECRET_ARN: props.databaseSecretArn,
      DB_NAME: scope.stage,
      DEFAULT_DB_NAME: props.defaultDatabaseName
    },
    vpc: Vpc.fromLookup(this, 'VPC', { vpcName: props.vpcName }),
    permissions: ["secretsManager"]
  },
  onCreate: "src/script.dbMigrate",
  onDelete: "src/script.dbMigrate",
});
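Since `DB_NAME` comes straight from the stage name, here is a minimal sketch of how the migration lambda might turn a stage into a safe per-developer database name. `databaseNameForStage` and the `dev` prefix are hypothetical, and the 63-character cap is an assumption based on PostgreSQL's identifier limit; adjust it for whichever engine the RDS instance runs.

```typescript
// Assumption: 63 characters, matching PostgreSQL's identifier limit.
const MAX_DB_NAME_LENGTH = 63;

// Builds a database name from an SST stage name (hypothetical helper).
// Lowercases the stage, replaces anything outside [a-z0-9_] with "_",
// adds a prefix so the name never starts with a digit, and truncates.
function databaseNameForStage(stage: string, prefix: string = "dev"): string {
  const cleaned = stage.toLowerCase().replace(/[^a-z0-9_]/g, "_");
  const name = prefix + "_" + cleaned;
  return name.slice(0, MAX_DB_NAME_LENGTH);
}
```

For example, a stage called `Seth-G` would map to `dev_seth_g`, so stage names containing hyphens don't end up in SQL identifiers.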
f
I see. And the error only happens sometimes?
s
I think a few things are happening here. 1) If my sst.Script has an error (e.g. timeout while trying to connect to RDS), I can get a
TCPConnectWrap.afterConnect
timeout. This timeout is a result of my crappy code 🙂 2) Setting up and tearing down the stack can be slow, particularly tearing down the stack. I'm consistently seeing 21 minutes for
sst remove
to completely remove a fairly basic REST API. I haven't seen this slowness create a timeout yet. The tricky part is this is painful to diagnose since each create/remove takes a loooong time 🙂
f
@Seth Geoghegan sorry for the late follow-up. Without the
sst.Script
construct, does it take ~20min to
sst remove
?