Cloudflare #durable-objects

lux

09/15/2021, 4:45 AM

Hopefully not necessary, and I guess it might also butt up against the 30 script limit, but I could see a simple coordinator deciding which script to deploy different rooms to and directing users accordingly. I should test whether that’s even necessary but I’m kind of tempted to build a proof of concept anyway haha.

dmitry.centro

09/15/2021, 7:56 AM

According to our tests, the wrangler version doesn't matter. The problem appeared when we were deploying with the same wrangler version in our CI (as usual). We tried updating wrangler and got the same result. Something is wrong on the other side.

ItsWendell

09/15/2021, 9:08 AM

@User Thanks for getting back on this. I think CPU usage is not the problem for me. I just send messages up and down a graph. I have a feeling WebSocket connections require quite some memory though. If I can't scale in memory at some point because all isolates already contain a DO, that means I can only have X amount of nodes, and so a limited number of concurrent connections, per COLO, which I'd love to know beforehand!

ItsWendell

09/15/2021, 9:09 AM

If someone else has an answer to this, e.g. how many isolates exist per server / colo and how that scales for concurrent DOs (on a single isolate), I would love to know.

kenton

09/15/2021, 1:52 PM

A single isolate can only execute on one thread at a time. For practical purposes you can think of an isolate as single-threaded with a single event loop. (The actual implementation details are... quite a bit more complicated, but it doesn't matter from the application developer's point of view.)

john.spurlock

09/15/2021, 1:57 PM

Exactly! Good to hear - is it still the case that any cpu limits (e.g. for bundled) are tracked per request, and not per isolate, in the cases where multiple DO instances share the same isolate?

kenton

09/15/2021, 2:00 PM

They are actually tracked per-object. Each time a new request arrives, the object's CPU limit is reset to the max (currently 500ms, will be 30s when we start billing). All requests to that object draw from the same limit.
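
To illustrate the per-object accounting described above, here is a toy model in TypeScript. It is only a sketch of the behaviour as stated in this message, not Cloudflare's implementation; the class and method names are hypothetical, and the 500ms figure is the limit quoted above.

```typescript
// Toy model only: names are hypothetical and this is not how the runtime is built.
const CPU_LIMIT_MS = 500; // the per-object limit quoted in the message above

class ObjectCpuBudget {
  private remainingMs = CPU_LIMIT_MS;

  // A new request arriving resets the object's budget to the maximum.
  onRequestArrived(): void {
    this.remainingMs = CPU_LIMIT_MS;
  }

  // Every request running on the object draws from the same shared budget.
  // Returns false once the budget has been exceeded.
  consume(cpuMs: number): boolean {
    this.remainingMs -= cpuMs;
    return this.remainingMs >= 0;
  }

  get remaining(): number {
    return Math.max(0, this.remainingMs);
  }
}

const budget = new ObjectCpuBudget();
budget.onRequestArrived(); // request A arrives
budget.consume(300); // A burns 300ms of CPU
budget.onRequestArrived(); // request B arrives: budget is back to 500ms
budget.consume(200); // B burns 200ms
console.log(budget.remaining); // 300
```

The point of the model is that the budget belongs to the object, not to any single request: an arriving request refills it for everything currently running on that object.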

john.spurlock

09/15/2021, 2:02 PM

Ah ok - is there any difference between bundled and unbound DOs in this respect?

kenton

09/15/2021, 2:04 PM

There's actually no such thing as "bundled DO". Technically DOs are neither bundled nor unbound, they are their own separate thing, but the pricing matches unbound.

john.spurlock

09/15/2021, 2:06 PM

hmm, if they are priced like unbound, will they always be charged by wall time then? if that's the case, why the cpu limit?

kenton

09/15/2021, 2:08 PM

The CPU limit is mainly to prevent accidents, like an unintentional (or, perhaps, maliciously intentional) infinite loop that would otherwise just run forever with no way to stop it.

ItsWendell

09/15/2021, 2:09 PM

@User could you shine some light on the limits of concurrent DOs running on a single isolate per account / deployment, as mentioned a little above. Mainly regarding DOs close to their (potential) memory limit.

kenton

09/15/2021, 2:13 PM

Can you restate the question? (Also, sorry, I'm about to head off for a meeting...)

ItsWendell

09/15/2021, 2:16 PM

@User I'm working on a pubsub-ish system that scales websocket connections vertically across nodes (DOs) in a graph (tree). Once a node reaches its limit, a new one will spawn and connect to the graph. This way we're not limited to a single DO to handle connections, but have a theoretically infinitely scaling graph. Sub-graphs are drawn in each COLO, and each has a node which is responsible for tracking the graphs in that COLO. Now if there are only 100 isolates available in one COLO, that means that I could only have 100 nodes per COLO, right? Since these nodes are mostly close to their possible memory limit. The question is, how do these isolates work / scale? How many isolates are there per COLO / server?
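
A minimal sketch of the "spawn a new node once one is full" idea, in plain TypeScript with no Workers APIs. All names (`GraphNode`, `NodeGraph`, `assign`) and the capacity numbers are assumptions for illustration; the real system described above is not shown in this conversation, and a real node would also have to account for the connection to its parent.

```typescript
// Hypothetical names and capacities; not the actual implementation discussed above.
interface GraphNode {
  id: string;
  connections: number;
  children: GraphNode[];
}

class NodeGraph {
  readonly root: GraphNode;
  private nextId = 1;

  constructor(
    private readonly maxConnections: number,
    private readonly maxChildren: number,
  ) {
    if (maxConnections < 1 || maxChildren < 1) {
      throw new Error("capacities must be at least 1");
    }
    this.root = this.createNode();
  }

  private createNode(): GraphNode {
    return { id: `node-${this.nextId++}`, connections: 0, children: [] };
  }

  // Breadth-first search: place the client on the first node with spare
  // capacity; if every node is full, spawn a new node under the first
  // node that still has room for another child.
  assign(): GraphNode {
    const queue: GraphNode[] = [this.root];
    let parentWithRoom: GraphNode | undefined;
    for (let i = 0; i < queue.length; i++) {
      const node = queue[i];
      if (node.connections < this.maxConnections) {
        node.connections++;
        return node;
      }
      if (!parentWithRoom && node.children.length < this.maxChildren) {
        parentWithRoom = node;
      }
      queue.push(...node.children);
    }
    if (!parentWithRoom) {
      throw new Error("no node can accept a child"); // unreachable in a finite tree
    }
    const fresh = this.createNode();
    fresh.connections = 1;
    parentWithRoom.children.push(fresh);
    return fresh;
  }
}

const graph = new NodeGraph(2, 2);
const placed: string[] = [];
for (let i = 0; i < 5; i++) placed.push(graph.assign().id);
console.log(placed.join(", ")); // node-1, node-1, node-2, node-2, node-3
```

If each node is a DO that sits close to its memory limit, the number of nodes that can usefully run in one COLO is what the isolate question above is really about.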

lux

09/15/2021, 2:36 PM

We're doing something pretty similar where a "host" connects to a central DO and "listeners" connect to DOs that act as relays which connect to the host and pass on messages to their set of listeners. The host may be broadcasting several hundred messages per second, depending on the stream, which is why I'm also very interested in the same scaling questions. In our case, we don't really care about storage at all and the little state we store is cached at each layer but well within the limits to keep in memory. If the host or any relay DO is reset, the state gets recreated automatically when they reconnect. Listeners do send data as well, but at a rate of one message per second and each relay DO then tallies those and sends one message per second to the host DO which does the same, so the host user only gets one aggregate message per second from all listeners. Our median CPU time is 2.2ms and ranges from about 1.8 to 2.5ms per execution, and things seem to be working pretty well, but we're struggling to understand where we're going to run into the limits: how many relay connections the host DO can support, how many listeners each relay can support, whether we need a second tier of relays, etc.
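
The tally-and-forward step described above could look roughly like this. It is a sketch under assumptions: `TallyBuffer`, the message keys, and the idea of flushing from a one-second timer are all hypothetical, and the WebSocket plumbing is left out.

```typescript
// Hypothetical sketch of the per-second aggregation; not the actual relay code.
type Tally = Record<string, number>;

class TallyBuffer {
  private counts: Tally = {};

  // Record one message from a listener, keyed by whatever is being counted.
  record(key: string): void {
    this.counts[key] = (this.counts[key] ?? 0) + 1;
  }

  // Fold in an aggregate received from a relay (used at the host layer).
  merge(aggregate: Tally): void {
    for (const key of Object.keys(aggregate)) {
      this.counts[key] = (this.counts[key] ?? 0) + aggregate[key];
    }
  }

  // Return everything accumulated since the last flush and start over.
  // A relay would call this about once per second and send the result upstream.
  flush(): Tally {
    const snapshot = this.counts;
    this.counts = {};
    return snapshot;
  }
}

const relayA = new TallyBuffer();
const relayB = new TallyBuffer();
const host = new TallyBuffer();

relayA.record("clap");
relayA.record("clap");
relayB.record("clap");
relayB.record("wave");

// Each relay forwards one aggregate; the host merges them into one message.
host.merge(relayA.flush());
host.merge(relayB.flush());
console.log(host.flush()); // { clap: 3, wave: 1 }
```

With this shape the host sees one message per relay per second regardless of listener count, so the host's load grows with the number of relays rather than the number of listeners.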

ItsWendell

09/15/2021, 2:46 PM

Interesting and nice! I do save some storage in the orchestrator DO, mainly the nodes, and their respective amount of connected clients / child nodes. But sounds super similar indeed! I've been using K6, a load testing tool, to experiment with scale. I still need to optimize some bottlenecks with fast ramp-ups of users. I'll run a slow ramp-up, big test of like 20k / 50k connections soon to see how it performs now.

lux

09/15/2021, 2:52 PM

Just googled K6, looks super useful. Will be giving that a try asap. Thanks!

ItsWendell

09/15/2021, 2:55 PM

@lux My local machine was struggling quite a bit with a large number of connections (1k+) though. Right now I'm running it on Heroku as a one-off Dyno (the largest one), since you pay per second.

ItsWendell

09/15/2021, 2:55 PM

The cloud option is quite expensive from K6, especially if you need a lot of virtual users

lux

09/15/2021, 2:57 PM

Yeah, their pricing does seem quite high for a performance testing service. Going to see what settings I can tweak to get more connections locally and go from there 🙂

ItsWendell

09/15/2021, 3:02 PM

Yeah there's a lot in the docs explaining how to optimize your OS. For me the main problem was / is my internet connection 😅

brett

09/15/2021, 3:30 PM

As of right now it's 1 isolate per server/account/DO-namespace/script combo. So if there are 100 servers in a colo, your 1 DO-namespace/script can use 100 CPU cores. 100 V8 cores can do a lot of work, but there is a ceiling. That is of course going to be improved down the line. If you really want to light up a ton of cores and pay for them, we'll want to spawn you more isolates. But for now other work has taken priority.

ItsWendell

09/15/2021, 3:43 PM

@brett Thanks, is there a way to know the number of servers per COLO? @lux Now I'm thinking, to detect whether you've hit the limit of DOs per COLO, you can keep track of whether a DO spawns in an isolate with an existing DO. You can do that, I think, by using the global scope (since it is shared within the same isolate). Then you can communicate that to your root DO, and if that happens too often in a row, you can mark your room / channel as 'full' to prevent active DOs running in the same isolates (if that makes sense).
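
A rough sketch of that detection idea, assuming module-level state really is shared by every DO instance in the same isolate as suggested above. `ColocationProbe`, `SharingTracker`, and the threshold are hypothetical names and values. In a real DO the id might come from `state.id.toString()`, and since there is no reliable hook for eviction, `release()` is best-effort and the set can overcount.

```typescript
// Module-level state: shared by everything that runs in this isolate.
const liveInstances = new Set<string>();

// Hypothetical helper a DO would create in its constructor.
class ColocationProbe {
  // True if another instance was already registered in this isolate.
  readonly sharedAtStartup: boolean;

  constructor(private readonly instanceId: string) {
    this.sharedAtStartup = liveInstances.size > 0;
    liveInstances.add(instanceId);
  }

  // Number of other instances currently registered in this isolate.
  neighbours(): number {
    return liveInstances.size - 1;
  }

  // Best-effort cleanup; an evicted object may never get to call this.
  release(): void {
    liveInstances.delete(this.instanceId);
  }
}

// Hypothetical root-side tracker: flags "full" after N shared spawns in a row.
class SharingTracker {
  private consecutive = 0;

  constructor(private readonly threshold: number) {}

  report(shared: boolean): boolean {
    this.consecutive = shared ? this.consecutive + 1 : 0;
    return this.consecutive >= this.threshold;
  }
}

const first = new ColocationProbe("object-a");
const second = new ColocationProbe("object-b");
console.log(first.sharedAtStartup, second.sharedAtStartup); // false true
```

Each new node would report `sharedAtStartup` to the root, and the root would stop spawning nodes for that room once `report` returns true.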

brett

09/15/2021, 4:10 PM

No, I think the number of servers per colo is a secret 😦 If you plan to go really wild on scaling up in the short term you could use separate DO namespaces/scripts to spread load some. :/
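
One way to spread rooms over several DO namespaces/scripts, as suggested above, is to hash the room name onto a fixed list of bindings. This is only a sketch: the binding names are made up, and the hash is an FNV-1a style function over UTF-16 code units, chosen here purely because it is short and deterministic.

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units (deterministic, not cryptographic).
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Pick the namespace binding that should own a given room.
// The same room name always maps to the same binding.
function pickNamespace(roomName: string, bindings: readonly string[]): string {
  if (bindings.length === 0) {
    throw new Error("no namespace bindings configured");
  }
  return bindings[fnv1a(roomName) % bindings.length];
}

// Hypothetical binding names; in a Worker these would be keys on `env`,
// and the chosen namespace would then be used via idFromName(roomName) and get(id).
const NAMESPACE_BINDINGS = ["ROOMS_A", "ROOMS_B", "ROOMS_C"];
console.log(pickNamespace("lobby", NAMESPACE_BINDINGS));
```

Note that changing the number of bindings remaps most rooms, so the list would need to stay fixed (or the mapping stored somewhere) while rooms are live.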

ItsWendell

09/15/2021, 4:14 PM

No worries, thanks for the insights anyway, it's been helpful in understanding the possible limits of concurrent DOs per COLO! Would be nice to have this a little more documented!

ItsWendell

09/15/2021, 4:24 PM

Another question, do workers share the same isolates as durable objects?

john.spurlock

09/15/2021, 4:37 PM

According to this, yes (if I'm interpreting the conversation right) https://discord.com/channels/595317990191398933/773219443911819284/873970203590021131

john.spurlock

09/15/2021, 4:38 PM

Not sure how that squares with "server/account/DO-namespace/script" combo (DO-namespace is n/a for entry-point workers?)

brett

09/15/2021, 4:56 PM

If it's the same exact script they can share an isolate yeah

john.spurlock

09/15/2021, 5:14 PM

I'm curious how eviction works - I think I've heard a single DO instance can be evicted without taking down everything in the same isolate. Does the same mechanism that instantiates it also take care of GC-ing anything allocated in that JS instance? Is it just normal JS GC for a no-longer-referenced object instance? What if something else in the global scope has a strong handle to it?