# adobe
d
@priyank_adobe @Mark Takata (Adobe) I've talked about this before, but haven't addressed it directly to Adobe, and now I am. This is very frustrating, and hurts both CF's credibility and mine with the powers that be. We're experiencing recurring periods where a lot of nonsense errors are thrown. In the most recent case, nothing was promoted up to production in the past week, so zero code changes, and today mid-day we got a blast of these:
• Element EXECUTIONMODE is undefined in THISTAG
• Could not find the included template, referring to a template that DOES exist and is used on lots of pages
These are the same errors we've seen before. Clearing the template cache fixed it, as it has in the past, but typically it comes back some days later. Does anyone there have any idea what could be causing this? Any idea how to troubleshoot it, or better yet, prevent it? I've filed an issue about it in the tracker, but without steps to reproduce, I'm not very hopeful. But reproducible steps or not, this IS happening, and when it does, it more or less takes the site down, until we manually clear the template cache. This is a mission critical 24x7x365 app used for critical client care. Not good.
p
Mission critical alone should imply need for redundant fail over boxes in a load balanced situation to prevent outages like this.
s
@Dave Merrill Is your app built with a specific framework? Does it do any on-the-fly code-generation? Does it do any runtime manipulation of templates, or any dynamic admin operations?
m
Dave, I've passed on your note & my concerns to some members of our internal team personally. I will msg you when I have any responses or next steps.
d
Thanks all. @Patrick The app doesn't stop responding, it just throws a lot of errors that make no sense, and which stop after a template cache clear. Don't see how a load balancer could help in that case. I've considered clearing template cache programmatically when there are certain specific errors, or a lot of them, but a) I don't know how to do that, or if that's possible, and b) I'd rather fix the problem. @seancorfield No framework, big app, built over many years by a lot of people. Yay I know. To the best of my knowledge, and I think I probably would know, there's no runtime generation or modification of templates, absolutely no fiddling with the specific file that's reported as missing or erroring out, and no relevant programmatic admin stuff. @Mark Takata (Adobe) Thanks Mark, I appreciate it. I get that this is super hard to chase, because I don't have steps to repro, or an environment other than production where the problem happens. That said, it absolutely IS happening, and I'd really like to get out in front of it, so any and all help is 100% appreciated. Side note, I've never seen this happen on any of our dev or test servers, only in production.
One more data point. According to the stack trace we recorded, the "Element EXECUTIONMODE is undefined in THISTAG" error is being thrown in the same template that a bit later it says it can't find.
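The idea above of reacting automatically when "certain specific errors, or a lot of them" show up can be sketched as a sliding-window counter. This is only an illustration in Python, not ColdFusion code: the class name, the thresholds, and the `on_burst` callback are all hypothetical, and the callback stands in for whatever mechanism would actually clear the template cache, which is not shown here.

```python
from collections import deque
import time


class ErrorBurstDetector:
    """Fires a callback when enough matching errors arrive within a time window."""

    def __init__(self, patterns, threshold, window_seconds, on_burst,
                 clock=time.monotonic):
        self.patterns = [p.lower() for p in patterns]
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.on_burst = on_burst      # hypothetical hook, e.g. "clear the cache"
        self.clock = clock            # injectable so the logic can be tested
        self._hits = deque()

    def record(self, message):
        """Record one error message; return True if it triggered the callback."""
        text = message.lower()
        if not any(p in text for p in self.patterns):
            return False
        now = self.clock()
        self._hits.append(now)
        # Drop hits that have aged out of the window.
        while self._hits and now - self._hits[0] > self.window_seconds:
            self._hits.popleft()
        if len(self._hits) >= self.threshold:
            self._hits.clear()        # reset so one burst triggers one action
            self.on_burst()
            return True
        return False
```

The application's existing error handler would call `record()` with each error message. This only automates the workaround; it says nothing about the root cause.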
p
I was really just stating that if it is so “mission critical” it should be handled in a stronger environment, such as multiple prod failover servers, so that when a bad instance occurs, that box can be pulled automatically by a health check of the app while the other instances keep the site online with zero downtime. I wasn't really addressing the CF issue specifically at hand, sorry.
d
Let me ask a silly question: is there enough HDD space on the server? I have found that when space becomes 'limited', strange things happen. For example, on a Windows machine the system has a page space (usually in the root of the drive) and this may (or may not) be causing you some issues. I have found that with Windows servers, a 'defrag' is required every once in a while to ensure that the space on the drive is allocated correctly. This can be done on a maintenance schedule. For Unix/Linux, this doesn't apply. However, on Linux you need to make sure that wherever the ColdFusion server is located (including any mounted drives) has enough space. Sometimes you need to look outside of the ColdFusion server for the reason. This is one area I would look at, even if you just cross it off the list of possible issues.
d
Free space:
• C (system): 57 G
• D (FusionReactor): 19 G
• E (ColdFusion): 16 G
• F (app, mostly): 112 G
I'll ask our systems folks about defrag strategy, but they're generally on top of stuff. Also, if it was a system global issue like that, why would it always affect the same template, and nothing else (that we know of)?
Does anyone have an idea how much of a performance gain you get from having Save class files turned on? At a prior employer, there was a presumption that having that on wasn't a good idea, but I honestly don't remember what problems they thought it caused. I'm wondering if turning it off might possibly alleviate this. Unfortunately, if I do that, I won't know for a while or at any specific time if it's actually helping, since it's not like I can cause the error in some known way.
s
Save class files should be OFF for large applications. I'm on my phone right now but happy to explain why later when I'm at work.
d
Cheers Sean. I found @Adam Cameron's old post about the performance effects of this, which linked to Nando Breiter's post about template not found errors apparently due to turning this on. Those are both really old, and I don't see any hits about this 2019 or later, but I'm going to shut it off and see what happens. Or hopefully, doesn't.
a
Is it sooo wrong that I just chuckled at one of my own jokes:
I did mention this to the Adobe bods, but they looked at me like I was speaking [some language they didn't understand... CF perhaps],
s
Tom Jordahl @ Allaire/Macromedia was the one who originally told me Save class files should be OFF for large systems BTW. I'll read those posts and see if the explanation I planned to write would have added anything over those two...
d
We had been following Adobe's recommendations (right in the UI!), to have it on in production and off in dev. I've preemptively shut it off in production on the app that's having this problem. For now, I didn't change the other app, since it's not having the problem, but I will if I don't see any problems on the one I just did.
s
OK, so I can say this problem does affect *nix systems. And we noticed it back at Macromedia on systems that generated more than 15k+ class files (we ran Sun SPARC servers, if that matters, as I recall, at least in the early days I was there). I also observed the problem at various clients over the years since I left Adobe, on various platforms, and on various versions of CF at least up to CF9 -- it was certainly worse on Windows. The performance issue is purely around what happens after a (CF) restart -- and it is, as Adam suspected, because the file system overhead of finding/loading files from one truly massive directory seems to outweigh the cost of recompiling from scratch into memory. Railo/Lucee don't suffer from this because they store class files in a tree structure (to match the source structure) -- but it's so much faster than ACF that it never mattered anyway. I can't say whether it's improved since ACF9. I never got to run large scale tests with CF10 onward but, if it still saves class files into a single directory I would expect that to still be the same. I also can't say whether I ever correlated Save class files being ON with those intermittent template errors -- mostly because I always turned it off, after the performance problems we saw at Macromedia! I'll be interested to hear if your systems stay more reliable after turning that OFF @Dave Merrill (and I will suggest in the next maintenance window, when CF is down, you empty that `cfclasses` folder manually, so you can see whether anything does reappear).
d
Thanks for the readout Sean. FWIW, there are 6,293 files in the cfclasses directory on the cf2018 server where this is a problem, way less than 15k. In the cf9 app where it ISN'T a problem, there are 6,302 files there, not radically different, but actually more. (Before I get laughed out of town, new cf2021 servers go live in a couple weeks 🙂 ). The operating systems are different versions of Windows, no idea if that's why the different behavior. We're going to see how the problem app does with save class files shut off. Hopefully that'll settle it down.
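For anyone wanting to collect the same numbers (how many files sit in the class directory, and when the newest one was written), here is a minimal Python sketch. The function name is made up for illustration and the path passed in is whatever the `cfclasses` directory is on a given install; nothing here is a ColdFusion API.

```python
from datetime import datetime
from pathlib import Path


def summarize_class_dir(path):
    """Return (file_count, newest_modified) for the files directly inside path."""
    count = 0
    newest = None
    for entry in Path(path).iterdir():
        if not entry.is_file():
            continue                  # ignore subdirectories
        count += 1
        mtime = entry.stat().st_mtime
        if newest is None or mtime > newest:
            newest = mtime
    newest_dt = datetime.fromtimestamp(newest) if newest is not None else None
    return count, newest_dt
```

Running it before and after a settings change shows whether new class files are still being written, without relying on eyeballing a directory listing.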
a
Please do report back with how it all goes.
s
@Dave Merrill To be clear, the number of class files only affects performance AFAIK. The template errors you're seeing could well be triggered by this feature independently. Like I say, I hadn't made any connection between this setting being ON and those errors because once I saw the performance issues at startup, I made sure the setting was OFF on all servers -- and I don't recall seeing those template errors ever on servers with it set OFF 🙂 So my comments about the template errors are purely anecdotal but Nando's blog suggests there is a connection. I hope that turning this OFF will also prevent the template errors for you! Good luck!
d
Thanks @seancorfield, I hope it fixes this for us too 🙂 Interestingly, nobody I saw talked about "Element EXECUTIONMODE is undefined in THISTAG" in connection with this. Gut level WAG, maybe the class file was used while in the process of being written? But in any case, if the code hasn't changed for a week, why is anything messing with the class files in the first place? I'll settle for not seeing this trainwreck again, whatever the explanation. And if it does fix it, Adobe should stop recommending that setting in the UI!
s
Perhaps after some time, the template cache is "full" and that template gets purged and is then recompiled to disk while it is also being loaded from disk? Some sort of race condition? Could be the JVM forcing some cleanup to reclaim memory? If only ACF was open source and we could look at how their template cache is written... 🙂
d
Something like that race condition must be happening, you're right. Maybe it has some limit to the number of cache files it can have, to avoid filling up the disk, maybe, or having "too many" files in that one directory? If you were building something like that, you'd have to be pretty careful not to let some flavor of compile/write/read conflict or race condition happen.
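To make the "be pretty careful" part concrete: one common way to avoid a reader seeing a half-written file is to write to a temporary file in the same directory and then rename it into place. The Python sketch below illustrates that general technique only; it is not a claim about how ColdFusion writes class files, and the function name is hypothetical.

```python
import os
import tempfile


def publish_atomically(target_path, data):
    """Write data to a temp file in the same directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(target_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())    # make sure the bytes are on disk first
        # On POSIX the rename is atomic, so a reader opening target_path sees
        # either the old complete file or the new complete file, never a mix.
        os.replace(tmp_path, target_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)       # do not leave partial temp files behind
        raise
```

On Windows, `os.replace` can fail if another process holds the target open, so a real implementation would also need a retry or locking strategy.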
s
@seancorfield - interesting... i suspect that's exactly what it is. @Dave Merrill & @Mark Takata (Adobe) - i can confirm that we do see this issue very sporadically: "Could not find the included template, referring to a template that DOES exist and is used on lots of pages". it might happen a dozen times during a short timeframe, and then not again for weeks. (that's on code that hasn't been changed during that time period.) we're on CF2021 standard on Windows (AWS).
h
@SteveJ We've seen the "Could not find the included template" message as well, usually for an included footer file that definitely exists and is used on other pages. I narrowed it down to the client's browser timing out while downloading a large file (Android .apk file) over a slow connection.
d
So, we turned off Save Class Files a bit over a week ago, as discussed earlier, and suddenly today we get 12 of these within 2 minutes:
Element DSN is undefined in THIS
Nothing promoted since last Wednesday (it's Monday now). I cleared the template cache and they stopped, but I'm not 100% sure they hadn't stopped already. Grrr.
s
And did you check that the `cfclasses` folder is empty @Dave Merrill?
d
Not empty, 6k+ files. Server hasn't had a restart since I last talked about this. There aren't any since 8/12, so it's not making more, but it might have expired an existing one I guess. I'll get that cleanup into the next maintenance cycle, good idea. I shouldn't just manually delete them while the site's in use, that's just asking for trouble, right?
s
I can't remember whether it's safe to delete them while it is running -- it's been so long since I've had to manage an ACF server.
d
Meanwhile, back at the ranch... Today we had another burst of "Element EXECUTIONMODE is undefined in THISTAG" and "Could not find the included template" errors for a widely used common include that DOES exist, and which is also the file that throws the EXECUTIONMODE error. Clearing template cache appeared to fix it, possibly a coincidence, but probably not. We deleted everything in cfclasses during the last maintenance window, Save Class Files is off, and that dir is still empty, as expected. The only checked items in the 3rd section of the cf admin Caching tab are "Cache template in request" and "Component cache". Component cache probably isn't relevant, since the errors are always in or about cfms, not cfcs. Cache template in request seems harmless, since this happens without any files being changed on the server. "Maximum number of cached templates" is 0, which I'd guess means no limit, but I'm not clear if that's even in play with Trusted cache and Save class files unchecked. I'm not clear on exactly what gets cached in the "Server wide cache engine". Is that queries, only, not templates or any code artifacts? It's currently set to the default, EHCache, always has been. @Mark Takata (Adobe) did you ever hear anything from your team about these sorts of errors? They're still happening after applying the various possible solutions presented, and others have seen similar issues. I do not like this.
m
Dave, I don't think I've heard of any other folks having issues like this. I definitely don't hear about support dealing with these kinds of issues with other customers. It is really frustrating from our side because obviously we don't want you having to experience these errors, but SOMETHING is making them happen. Literally the closest thing I can think of to this happening was something like 20 years ago, on a Windows 2000 server hosting C# code, we had intermittent issues kind of like what you're having. I was working for a company that had a server team component & they were (sometimes literally) tearing their hair out. Obviously, different server, different language, different architecture, but a similar pattern where we'd have bursts of errors in files that worked the day before, and a reboot would fix it (but that meant rebooting production). In that case (not saying this is the case here, but to show you how nuts the debugging was) the Western Digital HDs our servers were running had caches that were faulty. Whenever these templates ended up in those caches to be served, the servers would blow up. But a reboot zeroed the caches out... until eventually they filled back up and boom. Again. We replaced every single WD drive with Fujitsu (I think, it's fuzzy, might have gone with Seagate) and the problem never reared its head again. Again, not saying that's the issue here. But if this was a widespread issue, we'd hear way more about it. It isn't, so there's something here uniquely acting to create the problem. I think we have to just keep on it bullishly until it gives up its secrets.
d
Thanks for getting back Mark. Looking back through this thread, @SteveJ and @hemi345 both said they've seen similar issues, though like me they don't have steps to reproduce it. In our case, the same template has been the problem, either not being found, or throwing that EXECUTIONMODE error, after not having been modified in a long time. It does seemingly point away from a raw disk issue I'd think, since it shouldn't change on disk if it hasn't been modified. (sorry for the premature send, will continue in a sec, wanted to get this out.)
Can you or your team shed any light on the caching settings I don't know about?
• Re "Maximum number of cached templates"
◦ Does that apply given the other caching settings I mentioned?
◦ Is that cache cleared by Clear Template Cache?
◦ Is there a way to disable that cache completely, to see if it changes anything?
• Re "Server wide cache engine"
◦ What gets cached there?
◦ Is it queries only, code, or both?
◦ Is that cache cleared by Clear Template Cache?
◦ Is there a way to disable that cache completely, to see if it changes anything?
Are there any other debugging steps you can suggest?
p
Have you replicated this on a brand new CF server instance to verify if it is the CF Server vs your code?
m
@priyank_adobe do we have any detailed docs on those settings someplace?
s
You want template caching enabled or else you'll have horrible performance, with pages being repeatedly compiled from disk (right, @Mark Takata (Adobe)?).
m
I mean, yes, technically you want that in a production environment. I think this was an attempt to get to the root of why they were encountering these templates acting as though they were not being found (which could I guess be explained if the template cache was corrupted in some way). Optimally you do want to cache everything in production, but performance optimization in this space is much more @carehart's domain than mine.
s
When I was freelance, I had clients who encountered that error (on older versions of CF, obvs, since I stopped doing freelance CF a long time ago). So it's def. more widespread than you think. But it's one of those very weird, very rare bugs and restarting CF usually clears it for another couple of months. I don't know of anyone who has managed to repro it. I suspect it's some sort of race condition deep inside either the compiler or the template cache (or a combination thereof), which means it will be super hard to repro and debug 😞
Folks who are doing fairly regular deployments typically don't see it because they're restarting CF more often. I suspect it only happens on "large" codebases and only under certain load conditions too. All making it even harder to repro/debug.
c
Wow, that was a long thread (going back to the start). And now I'm literally just about to jump on a meeting in 4 mins. So quickly, @Dave Merrill, one thing you said that is NOT at all normal is that you set the template cache size (in the CF Admin caching page) to 0, wondering if that may mean "unlimited". I have never found clarification (or done testing) to confirm if that means "unlimited" or may well mean "none"--which would be the bad thing Sean just mentioned. To be clear, the default is 1024. And what it SHOULD be is generally driven (in my opinion) by the size of that cfclasses folder (that's populated if "save class files" is enabled), as it's an indication of how many templates (CFM or CFC or their methods/functions) have been compiled over time. If you turn it off, then it's purely a guessing game what size it should be. Or sure, one could argue an "unlimited" size might seem preferable. Let's see if Adobe can confirm what 0 means there. As for the rest, yes, problems related to all this COULD be at the root of your odd errors. They are odd, not common. Yes, some get them, but not most. And as Mark said, maybe something about your environment is causing it...like perhaps this setting to 0. Finally, as for that other "server wide" caching info on the page, that's about all the other caches that CF supports (query caches, page output caches, etc.) and where those are stored. By default, it's ehcache, or those other options offered depending on version and edition. Gotta run now.
d
@seancorfield which "template caching" settings do you mean exactly? As I said, we have only "Cache template in request" and "Component cache" checked in the 3rd section of the cf admin Caching tab. And as Mark implied, if some settings change keeps this error condition from coming up, I'd take it, then worry about completely optimal performance later. That would at least mean we have something of a handle on what layer is going off the rails. @Patrick This is a relatively new server, about a year old, built from scratch. No I haven't replicated everything about it on another server exactly. The corresponding dev server is pretty similar, though not exactly the same of course, and much lower load, and has literally never had this issue. There's another dev/prod pair running another big app on CF2021, higher load, which has also never had this issue. (The one that has it is 2018.)
s
@Dave Merrill I'm not familiar with recent versions of CF so I don't know what the Admin looks like any more. To @carehart’s comment about sizing the template cache -- since "Save Class Files" is OFF, there won't be any files in `cfclasses`, so you can't use that as a guide (although I believe you have some numbers from before you turned that off and flushed that folder?). I can't say I ever remember clients (or myself) changing the default setting for the template cache, so I can't comment on the `0` value either, I'm afraid.
d
@seancorfield said: "I suspect it's some sort of race condition deep inside either the compiler or the template cache (or a combination thereof)"
That's my suspicion too, and what led me to shut off Trusted Cache and Save Class Files. The fact that clearing the template cache has fixed it every time so far is super suspicious. But if that's the root of it, this not-very-cached config isn't sufficient to interrupt whatever chain of events causes this malfunction, hence my questions about exactly what the other caching settings mean, and if they're relevant given the other related config. Also, if Maximum number of cached templates effectively restricts the number of files in cfclasses, then the number of files in there reflects that setting, more than it shows the "natural" number of templates the engine would keep around if given carte blanche. Whatever it means or doesn't, there were about 6k files there before we shut off saving class files and cleared the directory, and this problem was still happening. I'm not positive whether the max number of templates had some other value previously. It's a production server, so I have to tread a bit gently, but I'm open to other ideas to consider trying, ideally accompanied by a hypothesis of what's causing the problem, and how the proposed fix would remediate it, or give us useful info.
For instance, does it make any sense to try changing to the JCS, Redis, or Memcached engines, on the theory that it's something in caching-land that's having a problem? I know next to nothing about all of those, and about EHCache, so I'd need to do some research to even find out what I need to know about them to use them effectively. Are folks just using the default engine? Any major gotchas or benefits to be aware of with any of them? Any sense of whether this is a path worth going down at all?
c
@seancorfield, you replied to my comment noting that "_since "Save Class Files" is OFF, there won't be any files in cfclasses, so you can't use that as a guide_". Just to be clear, that's indeed what I'd said, with a bit more: "_If you turn it off, then it's purely a guessing game what size it should be. Or sure, one could argue an "unlimited" size might seem preferable. Let's see if Adobe can confirm what 0 means there._" @Dave Merrill no, it is NOT that the setting for "_Maximum number of cached templates effectively restricts the number of files in cfclasses_". The template cache holds the compiled java classes (from your CFML templates) in memory (in heap). The number limits only how many are in memory (and if too small, that could cause cache thrashing). Those compiled classes are either loaded FROM the saved class files, or from compilation of the source. I'll add that the "trusted cache" setting (on the same CF Admin caching page) tells CF not to bother checking on each request to see if the CFML source has changed. We haven't talked about that, but it has its value. I don't think it would affect your situation, but it could. What's yours set to? All this is just one of those topics about which there are SO many balls to juggle--and lots of old info, some of which is just as useful today as not, while other old info no longer applies. It got even more complicated when Adobe also started using the word "template cache" in ANOTHER way, talking about it as a facet of the "new" caching added in CF9 (using the phrase to now also refer to cached page content. Talk about confusing people!) This is entirely unrelated to what we are discussing, of course. To your last question (about changing WHERE to store cached content), I will say first that I doubt it would affect your problem. 
More than that, I'm not 100% sure that the template caching we're talking about (saving of compiled classes into memory) is even AFFECTED by those settings changing WHERE the other caches are stored. I suspect it may NOT be, and that the template cache is a fixed area in the heap, controlled by CF, and having NOTHING to do with ehcache or whatever caching engine you may use. (For those following along, CF2018 added new alternatives to ehcache, and CF2021 added still more. For CF Standard, only ehcache and JCS are offered. Redis and Memcached are offered also in Enterprise.)
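The "cache thrashing" mentioned above can be illustrated with a toy bounded cache. This Python sketch assumes a least-recently-used eviction policy purely for illustration; how ColdFusion's template cache actually evicts entries is not stated in this thread, and every name below is hypothetical.

```python
from collections import OrderedDict


class BoundedTemplateCache:
    """Toy LRU cache mapping a template path to its 'compiled' form."""

    def __init__(self, max_entries, compile_fn):
        self.max_entries = max_entries
        self.compile_fn = compile_fn  # stand-in for the expensive compile step
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, path):
        if path in self._entries:
            self._entries.move_to_end(path)    # mark as most recently used
            self.hits += 1
            return self._entries[path]
        self.misses += 1
        compiled = self.compile_fn(path)
        self._entries[path] = compiled
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used
        return compiled


def replay(cache, paths, rounds):
    """Request every path in order, `rounds` times; return (hits, misses)."""
    for _ in range(rounds):
        for path in paths:
            cache.get(path)
    return cache.hits, cache.misses
```

With four templates requested in a cycle, a cache limited to three entries never gets a hit, because each request evicts the entry that will be needed next, while a limit of four misses only on the first pass. That is the sense in which a limit slightly below the working set can be far worse than it looks.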
d
@carehart Thanks for jumping in. Trusted cache, Save class files, Cache web server paths, and Use internal cache to store queries (not relevant I'm sure) are off. Only Cache template in request and Component cache (not relevant, it's paths only it says) are on. In principle I agree that the choice of caching engine shouldn't matter, it should be transparent to callers. But if the semi-random hypothesis is right that some sort of race condition or subsystem conflict is what's causing this, I thought maybe it might. And bottom line, it'd be great to have some sort of handle on what causes this, and/or how to avoid it.
s
My feeling on Trusted Cache (thank you for the clarification, Charlie) is that if you are not changing files on disk at all, it won't contribute one way or the other to the race condition, purely to performance (since Trusted Cache OFF means more disk checks). Now, it is possible that CF could somehow see a file on disk as changed (and needing recompilation) even when it hasn't, if something else changed the modified date on the file (I've seen that happen where a background process is scanning files periodically or where the files are on a network share -- with the latter being a pretty terrible idea for performance reasons anyway). So my overall recommendation there is to keep Trusted Cache ON to avoid disk checks and the possibility that CF might think a file has changed and needs recompilation.
The only guidance I can find online about "Maximum number of cached templates" is to set it high enough to cover the total number of files you have, as long as your JVM heap is large enough to hold that cached data. I can't find anything indicating 0 might mean unlimited (but I would expect that to be the meaning, given it is the meaning for the Maximum number of cached queries -- although that is explicitly mentioned in Adobe's docs so perhaps the lack of an explicit meaning for 0 cached templates means Adobe hasn't thought about it so 🤷🏻‍♂️ ).
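On the earlier point that something other than a deployment might be changing a file's modified date: one way to check is to snapshot the timestamps of the source files and compare snapshots taken some time apart. The Python sketch below assumes the source lives under a single root directory and that `.cfm` and `.cfc` are the relevant extensions; both function names are made up for illustration.

```python
import os


def snapshot_mtimes(root, extensions=(".cfm", ".cfc")):
    """Map each matching file under root to its last-modified timestamp (ns)."""
    snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(extensions):
                full = os.path.join(dirpath, name)
                snapshot[full] = os.stat(full).st_mtime_ns
    return snapshot


def changed_files(before, after):
    """Return sorted paths that were added, removed, or whose timestamp differs."""
    paths = set(before) | set(after)
    return sorted(p for p in paths if before.get(p) != after.get(p))
```

If a snapshot taken right after an error burst differs from one taken before it, with no deployment in between, that would point at some background process touching the files. An empty diff would rule that theory out.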
c
Yep on all that, Sean. I was torn about elaborating further on that, but indeed worthwhile for Dave (and perhaps others) to consider more carefully
d
Thanks for your thoughts @seancorfield and @carehart. Seems to me that what's missing here is any kind of visibility into the compilation and caching processes and their interaction. We can't verify that the problem is, or isn't, a conflict between them, a defect in one of them, or something else. We also can't reason very well about possible triggering circumstances, or potential mitigations. Absent that sort of information, we're left with "jiggle the handle" troubleshooting -- "let's change this possibly-related config, and if it's stable for x days/weeks, maybe that fixed it" -- not ideal. @Mark Takata (Adobe) or @priyank_adobe, do you have any monitoring tools on that level you could share with us? How does Adobe refine and verify the logic and sequencing of those processes? @mflewittintergral, does FusionReactor provide anything that could help us understand the failing mechanism here? Note that these errors don't show up in FR at all, because our standard error handling catches them, logs them, and sends devs email, but that's not what I'm asking about. It's the precursors to the actual "Element EXECUTIONMODE is undefined in THISTAG" and "Could not find the included template" errors I'm looking for. Does FR know anything about the caching and/or compilation processes in ColdFusion, for instance template cache hits and misses?