# durable-objects
  • m

    MichaelM

    09/12/2021, 11:56 PM
    Hmm on the phone atm so can’t verify. If I recall my experiments used default cache and request objects not string keys. Also try putting a .com or something in the URL 🤷‍♂️
  • d

    dmitry.centro

    09/13/2021, 8:39 AM
    Hi all! Something happened with build upload format "modules" between Aug 16 and Sep 6. We made no changes in our code, but the deploy is not working anymore.
    Uncaught Error: No such module "handlers/api.mjs".
    It looks like it's impossible to import any file outside the root directory. In our case:

    wrangler.toml:
    ```toml
    [build.upload]
    format = "modules"
    dir = "src"
    main = "./index.mjs"
    rules = [{type = "Data", globs = ["**/*.html"]}]
    ```

    src/index.mjs:
    ```js
    import HandlerAPI from './handlers/api.mjs';
    ```

    src/handlers/api.mjs:
    ```js
    class HandlerAPI { ... }
    export default HandlerAPI;
    ```
  • d

    dmitry.centro

    09/13/2021, 8:43 AM
    Moving api.mjs to the root directory (src) + changing the import path to './api.mjs' helps, but... keeping the whole project in the root directory is a bit strange. Our code worked before; something happened between Aug 16 and Sep 6 on the CF side.
  • d

    Deleted User

    09/13/2021, 11:48 AM
    Is there any way to browse the DO objects (like browsing the database of users) ?
  • b

    brett

    09/13/2021, 1:32 PM
    We have a yet-to-be documented API that lets you list object IDs in a namespace (which you can then use to hit each object to do whatever you want). I have to fix a bug in it, but it appears to work for my namespaces. I can give you the details if you DM me.
  • m

    matt

    09/13/2021, 2:10 PM
    We identified a regression in module path handling on module uploads Friday afternoon -- I believe the team is looking into it this morning.
  • d

    dmitry.centro

    09/13/2021, 2:22 PM
    Great, thanks!
  • i

    ItsWendell

    09/13/2021, 7:41 PM
    Is the error `Durable Object is overloaded. Too many requests queued.` always because of incoming connections? If so, what are those limits? Or is it possible that the durable object is doing too many cache / storage / fetch requests within the DO?
  • m

    matt

    09/13/2021, 10:09 PM
    It's not an arbitrary limit imposed by us; rather, it's a notification that you are sending significantly more requests per second than the DO is capable of handling. Fundamentally, if a given DO takes e.g. 10ms of CPU time (i.e. not including time waiting for cache/storage/fetch calls) to service each request, it can only serve about 1000/10 = 100 requests per second before it starts to "fall behind" and the queue of unhandled requests starts to grow.

    Once this queue passes a certain size, we return an error on further requests for two reasons: so your application knows to stop sending more requests, and so the memory used to store incoming requests doesn't grow unbounded. (In fact, a similar limit exists based on the number of bytes taken up by queued requests instead of the number of queued requests.) These limits are high enough that any reasonable application shouldn't hit them during normal operation -- if we raised them higher, you'd instead have requests appearing to "hang" for long periods of time before completing.

    Normal workers don't hit this because they're stateless; we just make another one and send the request there. We can't do that for DOs, since we guarantee there's only one instance of a given DO at any time.
  • m

    matt

    09/13/2021, 10:10 PM
    If you see these, there are a few things you can do. If your load comes in bursts, you can try exponential backoff in response to receiving this exception. Or, try to split up your problem more so that each DO receives less load.
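A minimal sketch of the backoff idea from the calling side, assuming the overload condition surfaces as a rejected `stub.fetch()` whose error message mentions "overloaded"; the helper name, delays, and attempt count are illustrative, not an official API:

```js
// Sketch only: retry a DO call with exponential backoff when it reports overload.
// `makeRequest` builds a fresh Request for each attempt to avoid reusing a consumed body.
async function fetchWithBackoff(stub, makeRequest, maxAttempts = 5) {
  let delayMs = 100; // illustrative starting delay
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await stub.fetch(makeRequest());
    } catch (err) {
      const overloaded = String(err).includes("overloaded"); // assumption about the error text
      if (!overloaded || attempt === maxAttempts) throw err;
      // Wait with a doubled delay plus a little jitter before retrying.
      await new Promise((resolve) => setTimeout(resolve, delayMs + Math.random() * delayMs));
      delayMs *= 2;
    }
  }
}
```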
  • m

    matt

    09/13/2021, 10:18 PM
    If you're doing something CPU-intensive on each request, you can try to take it off the request path or optimize it. Another strategy that's helpful is to do as much work as you can in the stateless worker before you enter the DO, since stateless workers can scale to multiple instances in response to load. Parsing large request bodies, filtering in the worker instead of the DO, etc. can all be helpful here!
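A rough sketch of pushing that work into the stateless Worker before entering the DO; the `env.COUNTER` binding, the `/ingest` route, and the event filter are all made up for illustration:

```js
// Sketch only: parse/filter the large request body in the stateless Worker
// (which scales out under load), then forward just the small, relevant
// payload to the single DO instance.
export default {
  async fetch(request, env) {
    const events = await request.json(); // heavy parsing happens out here
    const relevant = events.filter((e) => e && e.type === "purchase");
    if (relevant.length === 0) {
      return new Response("nothing to do");
    }

    // Hypothetical DO binding; only the pre-filtered data crosses into the DO.
    const stub = env.COUNTER.get(env.COUNTER.idFromName("global"));
    return stub.fetch("https://do/ingest", {
      method: "POST",
      body: JSON.stringify(relevant),
    });
  },
};
```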
  • i

    ItsWendell

    09/13/2021, 10:30 PM
    @matt awesome that's a super detailed response and super helpful.
  • e

    Erwin

    09/13/2021, 11:58 PM
    One technique I am about to experiment with is to off-load longer-duration work to a new Durable Object that is created with `newUniqueId()`. So I have a Coordinator DO that does the coordinating, but then hands things off to a Processing DO to do the work async.
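A hedged sketch of that hand-off pattern: a Coordinator DO mints a fresh Processing DO with `newUniqueId()` and sends the job there (the `PROCESSOR` binding name and `/process` route are invented for illustration):

```js
// Sketch of a Coordinator DO that offloads work to a brand-new Processing DO.
export class Coordinator {
  constructor(state, env) {
    this.state = state;
    this.env = env;
  }

  async fetch(request) {
    const job = await request.json();

    // Create a unique Processing DO for this job and hand the work to it.
    const id = this.env.PROCESSOR.newUniqueId();
    const stub = this.env.PROCESSOR.get(id);
    await stub.fetch("https://do/process", {
      method: "POST",
      body: JSON.stringify(job),
    });

    // Return the id so the caller can check on the Processing DO later.
    return new Response(JSON.stringify({ processorId: id.toString() }));
  }
}
```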
  • i

    ItsWendell

    09/14/2021, 7:10 AM
    Also had a similar idea! I'm basically attempting to scale websocket connections across DOs connected in a graph (tree), where there is a sub-tree root node coordinator in each COLO. If a worker wants to connect, it asks the root node in the COLO which node it can connect to. Each node reports its state to the root node in the COLO. Each node has a max number of connections, so once the root node in the COLO notices a node is full, it gives you a new Node ID.
  • i

    ItsWendell

    09/14/2021, 7:11 AM
    There's a coordinator node that connects to all the root nodes in each COLO, which only holds a simplified state of the number of connections in each COLO.
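One possible shape for the per-COLO root node described above, purely as a sketch: it keeps a connection count per child node in storage and hands out the id of a node with spare capacity, minting a new one when everything is full. The `NODE` binding, the routes, and the 500-connection cap are assumptions, not part of the original design.

```js
// Sketch of a per-colo root node DO that assigns clients to child nodes.
const MAX_CONNECTIONS_PER_NODE = 500; // illustrative cap

export class RootNode {
  constructor(state, env) {
    this.state = state;
    this.env = env;
  }

  async fetch(request) {
    const url = new URL(request.url);

    if (url.pathname === "/assign") {
      // Load (or initialize) the per-node connection counts.
      const counts = (await this.state.storage.get("counts")) || {};

      // Prefer an existing node with spare capacity; otherwise mint a new one.
      let nodeId = Object.keys(counts).find(
        (id) => counts[id] < MAX_CONNECTIONS_PER_NODE
      );
      if (!nodeId) {
        nodeId = this.env.NODE.newUniqueId().toString();
        counts[nodeId] = 0;
      }

      counts[nodeId] += 1;
      await this.state.storage.put("counts", counts);
      return new Response(JSON.stringify({ nodeId }));
    }

    if (url.pathname === "/report") {
      // Child nodes report their current connection count back to the root.
      const { nodeId, connections } = await request.json();
      const counts = (await this.state.storage.get("counts")) || {};
      counts[nodeId] = connections;
      await this.state.storage.put("counts", counts);
      return new Response("ok");
    }

    return new Response("not found", { status: 404 });
  }
}
```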
  • i

    ItsWendell

    09/14/2021, 7:12 AM
    I think the main problem I have is caching / invalidating the latest valid Node ID. I'll re-read that Waiting Room article to see what I can improve; any tips would be great!
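One way to cache that latest assignment, sketched under the assumption that a few seconds of staleness in the colo-local cache is acceptable; `env.ROOT`, the cache-key URL, and the 5-second TTL are all invented for illustration:

```js
// Sketch only: cache the node assignment in the colo-local cache so every new
// websocket doesn't hit the root DO; the short TTL bounds how stale a
// filled-up node's id can get before the worker asks the root again.
async function getNodeId(env, colo) {
  const cache = caches.default;
  const cacheKey = new Request(`https://internal/node-assignment/${colo}`);

  const cached = await cache.match(cacheKey);
  if (cached) return cached.text();

  // Cache miss: ask the per-colo root DO for a node with spare capacity.
  const rootStub = env.ROOT.get(env.ROOT.idFromName(colo));
  const res = await rootStub.fetch("https://do/assign");
  const { nodeId } = await res.json();

  // Cache for a few seconds only, so an over-full node ages out quickly.
  await cache.put(
    cacheKey,
    new Response(nodeId, { headers: { "Cache-Control": "max-age=5" } })
  );
  return nodeId;
}
```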
  • v

    vans163

    09/14/2021, 8:26 AM
    Any progress on unbound? So websockets don't drop after reaching their CPU quota (which is so small).
  • m

    Murray

    09/14/2021, 10:34 AM
    Re scaling/load balancing traffic for DOs: as I understand it, multiple DO instances will often be loaded into the same isolate -- could be 10s or 100s depending on the memory size of each object. So in @User's example of 100 requests/second, I assume that budget is actually shared by all DOs in the same isolate, which all share the isolate's single thread? I understand there is a load-balancing decision for new requests to spin up new non-active DO instances, potentially in a new isolate if the existing isolate is overloaded -- currently CPU-based only (not memory aware?). But are there any smarts being worked on re load balancing or moving existing DO instances between isolates (or just a hard reset)? It seems probable for an isolate (and the multiple DOs within it) to experience CPU or memory starvation if workloads change from the initial allocation (which was based on request order at the time), if the original DOs remain in active use.
  • d

    dmitry.centro

    09/14/2021, 10:36 AM
    Hi, @User! Any updates about module path handling on module uploads? We can't deploy without it 😦
  • l

    lux

    09/14/2021, 4:04 PM
    Is this accurate that several DO instances in the same isolate will share one CPU/memory budget? If so, that could mean much lower limits on what you can do per DO, since the requests/second a DO can handle wouldn't be 1000/n but rather 1000/(n*instances). Our use case has quite a few messages/second on a single DO and we're also balancing things by offloading additional connections to sub-DOs (sounds a lot like what @User described), but it would be great to get a better sense of the baseline performance of each DO. Currently that's feeling a little unclear/unpredictable in terms of how things will scale. Any clarification around how to assess how well individual DOs will scale would be super helpful 😃
  • c

    ckoeninger

    09/14/2021, 8:43 PM
    Yes it's accurate. We've made recent changes that make it less likely a durable object gets scheduled on an isolate that already has a durable object from the same script. But fundamentally if there are 100 servers in a colo and you create 101 durable objects in that colo, you're guaranteed to have some sharing of memory and threads. I empathize with the feeling that's a little unpredictable. We've discussed potential options for improvement, but we're not likely to be able to pursue those options soon.
  • l

    lux

    09/14/2021, 8:56 PM
    Hmm... I wonder if creating multiple versions of the same script and coordinating which clients connect to which endpoints would be a way around that 🤔 Knowing how many colos and isolates per colo might be useful too. Is there anywhere we can query that, or is that info listed somewhere?
  • c

    ckoeninger

    09/14/2021, 10:01 PM
    I wouldn't suggest doing anything based on how many colos or servers run durable objects since that is subject to change.
  • m

    matt

    09/14/2021, 10:23 PM
    No updates yet. Did you update wrangler and start experiencing the issue, or does it happen on any wrangler version w/ module support?
  • i

    ItsWendell

    09/14/2021, 10:58 PM
    Even though it's subject to change, it'd be useful information. Does this mean there's a limit to scaling the number of concurrent DOs we can have per COLO? E.g. I'm experimenting with writing a pubsub-ish system on Workers / Durable Objects that can scale topics horizontally in a tree/graph of nodes (DOs). If there are 100 servers in a COLO, and each node is often at pretty much full capacity (e.g. 500 active connections), does that mean each COLO can only have 500*100 = 50,000 concurrent connections? Since at that point it becomes likely a DO will spawn on the same server/isolate, causing it to overload. And how many isolates are there per server?
  • i

    ItsWendell

    09/14/2021, 11:08 PM
    I'd mainly like to know the theoretical limits of concurrent DOs across the network.
  • i

    ItsWendell

    09/14/2021, 11:11 PM
    Ah yeah, interesting, though that sounds like a dirty workaround. If each deployment does get a different isolate and the theoretical limits are per deployment, it might be possible. Possibly even automating new deployments of your scripts.
  • e

    Erwin

    09/15/2021, 1:16 AM
    @User, I am at the limit of my DO internals/isolate knowledge here, but I think the sharing of an isolate might not be that bad from a CPU perspective. AFAIK they would all be running their own JavaScript event loop and thus could be run on different cores. Isolates can run on multiple cores to do things like I/O off the main thread, and I can't imagine they would pin all the event loops to the same core. Doesn't seem to make a lot of sense.
  • j

    john.spurlock

    09/15/2021, 3:21 AM
    Since they share the same isolate (the same JS memory space), I'd be shocked if it was not the same event loop -- otherwise the entire JS memory model breaks down and you'd have undefined behavior. Requests all compete for the same event loop, excepting the usual async I/O etc. I'm pretty sure CPU can be tracked per request, however (a CF team member mentioned this a while back, and it's presumably one of the reasons one request cannot do things like promises or storage calls or timeouts on behalf of another at the moment), but memory is obviously a problem since it's a shared resource among all of the instances crammed into the same isolate.
  • e

    Erwin

    09/15/2021, 3:54 AM
    Memory is absolutely an issue when Durable Objects share the same isolate. And like I said, I am not sure whether they would share the same event loop or not. But yeah, you could be very right that they use the same event loop, in which case it could also be an issue.