# durable-objects
  • kenton

    07/24/2021, 6:57 PM
    the hang is a bug, and I think I might know what it is. But I don't know why your object is being reset.
  • john.spurlock

    07/24/2021, 7:04 PM
    Hmm, ok - it would be great if there was a runtime api I could call to get the current memory usage, then I could a/b different approaches and see the effect, although the current state seems too constrained for an advertised 128mb limit. Anyway, happy to retest under any platform changes whenever, since the object-resetting-without-error issue is 100% reproducible now. Instances just below the 12.5mb estimated memory usage work fine, and the ones just above live only for one request (so I can't work on anything until it's fixed).
  • kenton

    07/24/2021, 7:10 PM
    again, if the serialized JSON size is 12.5MB, the in-memory size is probably much larger. But... probably not 10x larger, I guess. I wonder, though, if the hanging requests are also leading to multiple list() operations being performed at the same time, possibly creating multiple copies of the data in memory?
  • john.spurlock

    07/24/2021, 7:25 PM
    No hanging requests now that I've chopped up the `list` into multiple sequential calls as a workaround with a `limit` of 512 (nine sequential calls to be exact), so that's probably not a factor with the memory issue. I'd like to bump this limit back up once the hanging issue is addressed, since this takes far longer than the single `list` did prior to last week. It was so fast!
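A minimal sketch of that chunked-read workaround, assuming the documented Durable Object storage `list()` options (`start`, `limit`) and the `DurableObjectStorage` type from `@cloudflare/workers-types`; the 512 limit and the `Record<string, string>` value type come from this conversation, the function name is made up:

```ts
// Read all entries in fixed-size chunks instead of one big list().
// `start` is inclusive, so follow-up pages ask for one extra entry and
// drop the key that was already seen on the previous page.
async function listAllInChunks(
  storage: DurableObjectStorage,
  limit = 512,
): Promise<Map<string, Record<string, string>>> {
  const all = new Map<string, Record<string, string>>();
  let lastKey: string | undefined;
  while (true) {
    const chunk = await storage.list<Record<string, string>>(
      lastKey === undefined ? { limit } : { start: lastKey, limit: limit + 1 },
    );
    let added = 0;
    for (const [key, value] of chunk) {
      if (key === lastKey) continue; // skip the inclusive start key
      all.set(key, value);
      added++;
    }
    if (added < limit) break; // short page means there is no more data
    lastKey = [...chunk.keys()].pop();
  }
  return all;
}
```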
  • kenton

    07/24/2021, 7:55 PM
    I've identified the bug that causes hangs, and it appears to happen if you `list()` more than 16MB worth of data at once -- counting the serialized size of the data. So if you find you have to split your list 9 ways to avoid it, that seems to imply the total serialized data size is 144MB...
  • john.spurlock

    07/24/2021, 9:09 PM
    thanks, that's something! what does "serialized size of the data" mean just there? Is it something I could calculate in JS land? IIRC the 32k advertised limit of a DO storage value roughly (but not exactly) matched the json serialized length of the stored js object: I put some slightly larger and smaller ones to test when DO first came out and observed which ones succeeded and failed. Eh I don't think my total is 144mb, the total serialized size of all values (to json) was ~12.5mb, that would be quite some overhead. I'm using a low limit to be conservative, probably don't need nine calls, it was just unknown to me if it was even deterministic : )
  • kenton

    07/24/2021, 9:21 PM
    I'd expect V8 serialization to be more efficient than JSON. I'm not sure what to say here... it seems like your data is bigger than what you say. Maybe JSON serialization is actually dropping some of the content? JSON can't serialize as many types as V8 serialization can, and it may be inserting empty objects in place of things it doesn't know what to do with. For example, if you try to JSON-serialize a `Map`, it'll always come out as `{}` -- the actual content is totally omitted.
  • john.spurlock

    07/24/2021, 9:31 PM
    For these object instances in question, there are 4096 entries in DO storage, each key is a fixed string of length 12, each value is a single-level json object (`Record<string, string>` in ts lingo) so should be the simplest thing to serialize. Average `JSON.stringify(value).length` is ~3.1kb. I'm pretty sure something changed, since these particular objects did not hit memory limit resets before last week, and no data/code has been modified since then.
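For reference, a rough way to compute that JSON-based size estimate in JS land (only an approximation; the runtime's internal V8 serialization format will not match JSON lengths exactly):

```ts
// Rough JS-side estimate of total serialized size for the data shape described
// above (12-char keys, flat Record<string, string> values). JSON length is only
// a proxy for the runtime's V8 serialization size.
function estimateSerializedBytes(entries: Map<string, Record<string, string>>): number {
  let total = 0;
  for (const [key, value] of entries) {
    total += key.length + JSON.stringify(value).length;
  }
  return total;
}

// e.g. log it in megabytes after a full read:
// console.log(`~${(estimateSerializedBytes(all) / (1024 * 1024)).toFixed(1)} MB (JSON proxy)`);
```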
  • kenton

    07/24/2021, 11:25 PM
    It's still entirely possible that the memory limit is not the problem at all. It'd help if we could get an actual error message. Here's something to try: Create an endpoint that always hangs, e.g. does `await new Promise(() => {})`. Start one hanging request first. Then, try to trigger the object reset. When the object is reset, the hanging request should be canceled and should throw an exception from `stub.fetch()` in the stateless worker that called it. If you could catch that exception, it should tell you what actually went wrong...
  • john.spurlock

    07/24/2021, 11:37 PM
    hehe interesting - I'll try that. I figured since the request succeeds, it was not erroring at all, simply getting reset by a similar mechanism that kicks in when a DO has not had an incoming request in a while, since it's exactly the same behavior, except it happens right away
  • john.spurlock

    07/25/2021, 12:10 AM
    `Error: Promise will never complete.`
  • kenton

    07/25/2021, 12:14 AM
    oh... I guess it's a little too clever. The resolver was garbage-collected so the workers runtime figured out the promise would never finish. You might have to do `new Promise(resolve => setTimeout(resolve, 1000000000))`, i.e. an impossibly long timeout
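Putting the two suggestions above together, a sketch of the diagnostic; the `/hang` route, the URL, and the function names are hypothetical, and the types are from `@cloudflare/workers-types`:

```ts
// Inside the Durable Object: a /hang endpoint whose request stays pending until
// the object is reset. A bare `new Promise(() => {})` gets detected as
// never-resolving, hence the absurdly long timeout instead.
export class DebugObject {
  async fetch(request: Request): Promise<Response> {
    const url = new URL(request.url);
    if (url.pathname === '/hang') {
      await new Promise(resolve => setTimeout(resolve, 1000000000));
    }
    return new Response('ok');
  }
}

// In the stateless worker: start the hanging request, then trigger the reset some
// other way. When the object resets, the hanging stub.fetch() should reject with
// the real reason, which is the error message we're after.
export async function probeReset(ns: DurableObjectNamespace, name: string): Promise<void> {
  const stub = ns.get(ns.idFromName(name));
  try {
    await stub.fetch('https://do/hang'); // hypothetical URL; only the path matters here
  } catch (e) {
    console.log('hanging request was cancelled:', e);
  }
}
```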
  • john.spurlock

    07/25/2021, 12:20 AM
    ok with that new promise, the endpoint hangs, but never errors out. I'm able to do data requests to the same DO by name, but it still gives me a new instance every time. I thought the hang endpoint might keep the instance alive while the http request is alive, but it appears that two instances of the same DO are running concurrently? Doesn't that break the contract?
  • kenton

    07/25/2021, 12:21 AM
    Yeah that is not supposed to happen. At this point I may need to ask you to put together a minimal app reproducing the issue that we can debug...
  • john.spurlock

    07/25/2021, 12:28 AM
    Ok sure, I can do that tomorrow. Since it is data triggered, I suppose it will need a data loading endpoint for setup as well. Is a worker script js ok? I don't use wrangler.
  • kenton

    07/25/2021, 12:29 AM
    Yeah we should be able to work with just the JS. And yeah I suppose it'll need some code to generate dummy data...
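A sketch of what that dummy-data setup code might look like, shaped like the data described earlier in the thread (4096 entries, 12-character keys, ~3kb flat string records); the function name, field names, and batch sizing are all made up:

```ts
// Hypothetical repro setup: fill Durable Object storage with dummy data shaped
// like the real data set. put() accepts an object of entries (up to 128 keys per
// call), so the 4096 entries are written in 32 batches of 128.
async function loadDummyData(storage: DurableObjectStorage): Promise<void> {
  for (let batch = 0; batch < 32; batch++) {
    const entries: Record<string, Record<string, string>> = {};
    for (let i = 0; i < 128; i++) {
      const n = batch * 128 + i;
      const key = `k${String(n).padStart(11, '0')}`; // fixed 12-character key
      const value: Record<string, string> = {};
      for (let f = 0; f < 30; f++) {
        value[`field${f}`] = 'x'.repeat(100); // roughly 3kb of string data per value
      }
      entries[key] = value;
    }
    await storage.put(entries);
  }
}
```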
  • kenton

    07/25/2021, 12:30 AM
    Oh, but it might be worth waiting until after this coming week's release to see if my bug fix somehow makes the problem go away...
  • john.spurlock

    07/25/2021, 12:33 AM
    Six days? That's an eternity : ) I don't mind putting together a repro if it will help get this figured out.
  • kenton

    07/25/2021, 12:34 AM
    ok, sounds good.
  • john.spurlock

    07/26/2021, 12:04 AM
    Started a project here: https://github.com/johnspurlock/workers-do-memory-issue Doesn't repro yet though, so it's probably too minimal, I'll let you know when it does. Would code size have any effect? I've noticed that the initial load from storage is 10x faster too (even with 9 calls), which is strange, since that's essentially the same code as the real one. Maybe my real worker DO is cursed with old, slow storage.
  • Wallacy

    07/26/2021, 5:06 AM
    I'm not saying that's the problem, but two months ago I ran into an issue that I solved by removing everything (using `--delete-class`) and uploading again. It was probably just a coincidence though...
  • brett

    07/26/2021, 2:06 PM
    It's all the same storage, but I have a hunch that your previous DO had failed over to a different colo -- we have work to do to get them to move back to their original/primary location more quickly. If you DM me the object ID and account I could verify that
  • john.spurlock

    07/26/2021, 3:15 PM
    When a DO moves, does the storage move too? or can a particular instance get into the situation where all of the (uncached) storage calls go to a different colo?
  • brett

    07/26/2021, 3:25 PM
    Well, storage is replicated, but they can be in a state where they might have to read from another colo, which I think is what is likely happening in your case. That's something we will improve/fix though
  • john.spurlock

    07/26/2021, 3:30 PM
    Anyway, I think I figured it out. In my real worker, the client worker actually makes 16 parallel calls, to 16 different DO instances (16 different names passed to idFromName). I noticed that when I did a single call to the problematic instances, they behaved just fine (speedy list calls, and did not reset on subsequent calls). Actually when I made anywhere from one to about 12 parallel calls, they did not reset after the first call.
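For context, a sketch of that fan-out pattern (the binding, route, and shard names are placeholders):

```ts
// The calling worker fans out to N distinct Durable Objects by name and waits
// for all of the responses in parallel.
async function fanOut(ns: DurableObjectNamespace, names: string[]): Promise<Response[]> {
  return Promise.all(
    names.map(name => {
      const stub = ns.get(ns.idFromName(name));
      return stub.fetch('https://do/data'); // hypothetical data endpoint
    }),
  );
}

// e.g. 16 parallel calls to 16 different instances:
// await fanOut(env.MY_DO, Array.from({ length: 16 }, (_, i) => `shard-${i}`));
```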
  • john.spurlock

    07/26/2021, 3:31 PM
    Once I got up to about 14 to 16 (the full data set), I noticed the issue again. It appears that many of the DO instances are actually being instantiated in the same isolate (!!!)
  • john.spurlock

    07/26/2021, 3:35 PM
    So when a bunch of these guys are crammed into the same isolate, not only are they presumably subject to a fixed memory limit that is now disproportionate, they're also competing with each other for async storage api access (since they are now in the same isolate). Both of these would explain why they slow down dramatically in this case (but don't fail), and are then likely restarted after the pending requests complete.
  • john.spurlock

    07/26/2021, 4:20 PM
    This seems like a platform bug - each DO instance should have its own memory limit, right? And what about any 1st or 3rd party code that sets a global callback or something? That would break if two different DO instances shared the same isolate, not to mention that the two DOs could talk to each other directly when this happens. Seems like it breaks the conceptual model.
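A hypothetical illustration of the concern: module-level state belongs to the isolate, not the instance, so two objects scheduled into the same isolate see the same globals:

```ts
// Module-level state lives in the isolate, not in the Durable Object instance.
// If two instances of MyObject land in the same isolate, they both see (and
// mutate) this same counter, which is the surprise being described above.
let isolateWideCount = 0;

export class MyObject {
  async fetch(request: Request): Promise<Response> {
    isolateWideCount++; // shared across every object in this isolate
    return new Response(`isolate-wide count: ${isolateWideCount}`);
  }
}
```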
  • brett

    07/26/2021, 4:47 PM
    Instance-specific state should be kept inside the class. It's true that if objects are scheduled in the same process they may share global state, which is assumed to be useful for optimizations (caching things that can be used by any object).
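A short sketch of that pattern: instance-specific state on the class (backed by that object's own storage), with globals reserved for things any co-located object could safely reuse; all names here are illustrative:

```ts
// Global scope: only for caches that any object landing in this isolate may
// safely share, e.g. parsed static configuration.
const sharedConfigCache = new Map<string, unknown>();

export class Counter {
  // Instance-specific state stays on the class and is backed by this
  // object's own transactional storage.
  private value: number | undefined;

  constructor(private readonly state: DurableObjectState) {}

  async fetch(request: Request): Promise<Response> {
    this.value ??= (await this.state.storage.get<number>('value')) ?? 0;
    this.value++;
    await this.state.storage.put('value', this.value);
    return new Response(String(this.value));
  }
}
```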
  • brett

    07/26/2021, 4:48 PM
    That isn't to say that it's not confusing. Maybe something we need to revisit, or at least better document