Any tips to debug a long running stack creation/deletion? | Serverless Stack #help

Seth Geoghegan

01/03/2022, 5:59 PM

Any tips to debug a long running stack creation/deletion? I am using the sst.Script construct to manage DB migrations. I've written a simple lambda that creates a database `onCreate` and removes a database `onDelete`. I've tested by putting the lambda behind the sst.API construct and it works fine. However, when I define the sst.Script construct with the same lambda, `sst start` and `sst remove` take very long (20+ minutes). Without the Script construct, these actions take around 1-2 minutes to deploy from scratch. Any tips to debug this, or is this the expected behavior with custom resources?

thdxr

01/03/2022, 5:59 PM

if you make the script do nothing does it still take 20min?

Seth Geoghegan

01/03/2022, 6:00 PM

Funny you should ask that. I tried that the other day and it was faster. I will try again shortly (after the stack finally removes) to get a concrete example

thdxr

01/03/2022, 6:01 PM

I remember seeing this when my script errored out and cloudformation would get stuck

thdxr

01/03/2022, 6:01 PM

and there's some 20min timeout somewhere

Frank

01/03/2022, 7:02 PM

hmm.. when it took 20+ minutes to run, did the script fail in the end? Or did it successfully run? @Seth Geoghegan

Matt Morgan

01/03/2022, 7:04 PM

Sounds like a VPC-bound function. Sometimes stack removal can hang on deleting ENIs for ~20 minutes.

Matt Morgan

01/03/2022, 7:04 PM

And sometimes that doesn't happen. It's a Cfn thing, not CDK/SST.

Matt Morgan

01/03/2022, 7:05 PM

More info if interested: https://forum.serverless.com/t/very-long-delay-when-doing-sls-remove-of-lambda-in-a-vpc/2535/10 (from sls forums but the issue similarly isn't related to sls)

Seth Geoghegan

01/03/2022, 7:08 PM

@thdxr I stripped the lambda down to a bare-bones implementation and it still took 21 minutes to remove. However, I think @Matt Morgan is onto something with his comment, as I did put this lambda function inside of a VPC. I assume this is necessary because it's running migrations against an RDS database.

thdxr

01/03/2022, 7:08 PM

ahh

Seth Geoghegan

01/03/2022, 7:09 PM

Another strike against RDS in my book 🙂

Matt Morgan

01/03/2022, 7:11 PM

Yup, pretty much. If I can't get the best devexp, I'm not that interested.

Seth Geoghegan

01/03/2022, 7:17 PM

Agreed. Perhaps I'll create a script to run migrations locally or trigger the lambdas some other way. Although I prefer the custom resource approach, `yarn run db:migrate` or similar doesn't seem awful.

Seth Geoghegan

01/03/2022, 7:17 PM

@Frank To answer your question, it did not fail. It completed, but just took a looong time to do so 🙂

Matt Morgan

01/03/2022, 7:17 PM

Why do you remove the function after running it? Just to ensure it doesn't run again?

Seth Geoghegan

01/03/2022, 7:29 PM

Oh, I'm just testing the entire workflow to ensure it operates properly. I'm creating a long-lived RDS "dev" instance for a team of developers. I want each team member to check out the app, run `sst start` and have a database created for them within the development RDS instance. Using this approach, I hope to eliminate the need to run a local DB in docker (what they do today) and push all development for this particular service to the cloud. I'm mostly trying to ensure that `sst start` and `sst remove` deliver a good experience. I suppose it's not the end of the world if `sst remove` is slow, since it shouldn't happen terribly often. Just exploring options 🙂

Seth Geoghegan

01/03/2022, 7:30 PM

To be clear, I'm not specifically removing the function itself. I'm just kicking the tires on a typical developer workflow to make sure it feels good

Matt Morgan

01/03/2022, 7:31 PM

Makes sense. I know there are some workarounds/hacks for this problem. My org has just sucked it up when necessary and generally avoided VPCs if we can.

Seth Geoghegan

01/03/2022, 7:34 PM

Yeah, this could very well be a "suck it up" moment.

Seth Geoghegan

01/03/2022, 8:56 PM

@Frank Spoke too soon. Seeing timeouts on `sst start` as well. I doubt there is anything SST specific to do here, just providing this for insight:

sgeoghegan-my-ts-sst-app-kas-migration | CREATE_FAILED | Custom::SSTScript | AfterDeployScriptResource204FEBCF | Received response status [FAILED] from custom resource. Message returned: connect ETIMEDOUT 54.161.169.125:443

Logs: /aws/lambda/sgeoghegan-my-ts-sst-app--AfterDeployonCreateFunct-SjMUZ26yDHhS

    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1159:16) (RequestId: a81d3c7b-a1f9-46c0-86b9-fcda6e880ab8)

Seth Geoghegan

01/03/2022, 8:58 PM

Here's a GitHub issue thread on Serverless Framework discussing the issue.

Frank

01/04/2022, 9:32 AM

Interesting.. I think we could manually remove the ENI similar to what this plugin does mentioned in the issue above https://github.com/medikoo/serverless-plugin-vpc-eni-cleanup
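A minimal sketch of the first half of that idea: picking out which network interfaces would be candidates for manual cleanup. It only filters a list that has already been fetched (for example from EC2's DescribeNetworkInterfaces); the record shape, the function name, and the description prefix are assumptions of this sketch, and it does not claim to mirror how the linked plugin selects or deletes interfaces.

```typescript
// Minimal record shape for this sketch. The field names are the sketch's own,
// loosely modelled on what EC2 DescribeNetworkInterfaces returns.
interface EniSummary {
  id: string;
  description: string;
  status: string; // e.g. "in-use" or "available"
}

// Assumption: ENIs created for VPC-bound Lambda functions carry a description
// that begins with this prefix. Verify against your own account before relying on it.
const LAMBDA_ENI_PREFIX = "AWS Lambda VPC ENI";

// Hypothetical helper: keep only Lambda-created ENIs that are no longer attached.
// Attached ("in-use") interfaces are skipped because they cannot simply be deleted.
function findDetachedLambdaEnis(enis: EniSummary[]): EniSummary[] {
  return enis.filter(
    (eni) =>
      eni.description.indexOf(LAMBDA_ENI_PREFIX) === 0 &&
      eni.status === "available"
  );
}
```

Deleting the returned interfaces would be a separate step, and it is only safe once no function still depends on them.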

Frank

01/04/2022, 9:33 AM

@Seth Geoghegan is the `TCPConnectWrap.afterConnect` timeout on `sst start` related to the `sst remove` timeout?

Seth Geoghegan

01/04/2022, 1:39 PM

@Frank That is my assumption, since it only happened when I introduced the SST Script construct. Removing the Script construct eliminates the timeout. However, I have not seen a specific error that confirms the two are related. If it helps, this is how I'm defining my Script:

new sst.Script(this, "DbMigrate", {
  defaultFunctionProps: {
    environment: {
      SECRET_ARN: props.databaseSecretArn,
      DB_NAME: scope.stage,
      DEFAULT_DB_NAME: props.defaultDatabaseName,
    },
    vpc: Vpc.fromLookup(this, 'VPC', { vpcName: props.vpcName }),
    permissions: ["secretsManager"],
  },
  onCreate: "src/script.dbMigrate",
  onDelete: "src/script.dbMigrate",
});

Frank

01/04/2022, 5:52 PM

I see. And the error only happens sometimes?

Seth Geoghegan

01/04/2022, 5:59 PM

I think a few things are happening here. 1) If my sst.Script has an error (e.g. timeout while trying to connect to RDS), I can get a `TCPConnectWrap.afterConnect` timeout. This timeout is a result of my crappy code 🙂 2) Setting up and tearing down the stack can be slow, particularly tearing down the stack. I'm consistently seeing 21 minutes for `sst remove` to completely remove a fairly basic REST API. I haven't seen this slowness create a timeout yet. The tricky part is this is painful to diagnose since each create/remove takes a loooong time 🙂
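For case 1, one way to shorten the feedback loop might be to give the connection step its own deadline so the script function fails quickly with a clear message. This is a sketch only: `withTimeout` is a hypothetical helper, not part of SST, and it assumes (as the CREATE_FAILED output earlier in the thread suggests) that an error thrown by the script function is reported back to CloudFormation.

```typescript
// Hypothetical helper, not part of SST: reject if `work` has not settled within `ms`.
// Note that this does not cancel the underlying work; it only stops waiting for it.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  // A promise that can only reject, once the deadline passes.
  const deadline = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms
    );
  });

  // Whichever settles first wins; clear the timer either way so it cannot
  // keep the process alive after the work has finished.
  return Promise.race([work, deadline]).then(
    (value) => {
      clearTimeout(timer);
      return value;
    },
    (err) => {
      clearTimeout(timer);
      throw err;
    }
  );
}
```

Inside the handler this could wrap the database connection call, e.g. `await withTimeout(connect(), 10000, "db connect")`, where `connect` stands in for whatever the migration code actually uses.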

Frank

01/25/2022, 12:52 AM

@Seth Geoghegan sorry for the late follow up. Without the `sst.Construct` does it take ~20min to `sst remove`?
