Getting watchdog up and running
# help
i
Hey @floitsch and @kasperl, I'm trying to run the watchdog implementation and get the following issues (file is following). I guess there is some sort of race condition where the watchdog service is not yet running when my containers try to open it. Should I use
with_timeout
to open the connection or is there a different approach?
with_timeout
makes it worse. It somehow cannot find the service
Copy code
******************************************************************************
Decoding by `jag`, device has version <2.0.0-alpha.118>
******************************************************************************
EXCEPTION error. 
Cannot find service
  0: ServiceClient.open.<block> <sdk>\system\services.toit:167:49
  1: ServiceClient.open        <sdk>\system\services.toit:176:40
  2: ServiceClient.open        <sdk>\system\services.toit:167:12
  3: main.<block>              C:\Users\Mirko\AppData\Local\Temp\artemis-7ca7d9eb-1271-4aa9-a13a-d976cb1b8302\clone\src\services\mqtt\mqtt.toit:58:20
  4: Task_.with-deadline_.<block> <sdk>\core\task.toit:203:16
  5: Task_.with-deadline_      <sdk>\core\task.toit:197:3
  6: with-timeout              <sdk>\core\utils.toit:181:24
  7: with-timeout              <sdk>\core\utils.toit:173:10
  8: main                      C:\Users\Mirko\AppData\Local\Temp\artemis-7ca7d9eb-1271-4aa9-a13a-d976cb1b8302\clone\src\services\mqtt\mqtt.toit:56:3
******************************************************************************
I install the provider in my ethernet container (it's the simplest one) like:
Copy code
main:
  (provider.WatchdogServiceProvider).install
  logger.debug "Watchdog provider installed"

  watchdogclient := WatchdogServiceClient
  watchdogclient.open
  dog := watchdogclient.create "mqtt-dog"
  dog.start --s=60
  logger.debug "Watchdog started"
  dog.feed
  dog.stop
  dog.close
not sure if I need to call it once, but I wanted to give it a try
f
Hmm. This looks like the watchdog provider failed to feed the system watchdog.
It should have a second to do that.
i
do I need to update anything else?
f
I'm testing your second program now.
i
Copy code
with-timeout --ms=2000:
    watchdogclient := WatchdogServiceClient
    watchdogclient.open
    dog = watchdogclient.create "mqtt-dog"
    dog.start --s=60
I am calling it like so
and in my ble container and my mqtt container I do have a while-true loop which simply sleeps 5 or 10s, and there I am calling dog.feed (but I guess we never reach that part so far)
f
I did just find a bug.
Don't yet see how it could lead to the seen issue, though.
k
The default timeout for client.open is 100ms. You can specify a higher one like this: `watchdogclient.open --timeout=(Duration --s=1)`. Not sure if that is necessary.
f
Please update the package. From your descriptions it doesn't fix every issue, but there is at least one less bug in it now.
i
jaguar is lying to me
Copy code
$ jag pkg install watchdog
Info: Package 'github.com/toitware/toit-watchdog@1.1.0' installed with name 'watchdog'
in the package yaml and in the folder I still only see 1.0.1
k
jag pkg update
?
(or uninstall first)
i
yea, never mind, I was running it in a completely wrong folder 🙄
k
Even better 🙂
i
I removed the with_timeout part but this seems to be the same issue
I did not change the timeout
Copy code
[eth] DEBUG: Watchdog provider installed

******************************************************************************
Decoding by `jag`, device has version <2.0.0-alpha.118>
******************************************************************************
EXCEPTION error. 
Cannot find service
  0: ServiceClient.open.<block> <sdk>\system\services.toit:167:49
  1: ServiceClient.open        <sdk>\system\services.toit:176:40
  2: ServiceClient.open        <sdk>\system\services.toit:167:12
  3: main                      C:\Users\Mirko\AppData\Local\Temp\artemis-cdea2632-796b-42c6-8473-9813f0c442d2\clone\src\ble.toit:35:18
******************************************************************************

[eth] DEBUG: Watchdog started
not sure at which part exactly it is crashing but it might be the provider install
f
Does it work to run the provider in the same container?
This code works for me:
Copy code
import watchdog
import watchdog.provider

main:
  provider.main
  print "installed"

  client := watchdog.WatchdogServiceClient
  client.open
  dog := client.create "foo"
  dog.start --s=60
  print "started"
  dog.feed
  dog.stop
  print "stopped"
  dog.close
i
this is the client part - I think the provider install does not work already
f
This has the provider install in it.
(`provider.main` does an `install`)
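Roughly this (a sketch, not the actual package source):
Copy code
import watchdog.provider

main:
  // Sketch: what provider.main presumably boils down to (assumption, not verified).
  (provider.WatchdogServiceProvider).install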
i
ah ok
Copy code
  dogprovider.main
  logger.debug "Watchdog provider installed"

  watchdogclient := WatchdogServiceClient
  watchdogclient.open
  dog := watchdogclient.create "mqtt-dog"
  dog.start --s=60
  logger.debug "Watchdog started"
  dog.feed
  dog.stop
  dog.close
this is working for me now - but I had to disable the clients in the other containers
if I try to open the watchdog service connection with the client in one of the other containers it crashes
f
interesting. Let me try that. Clearly I didn't test it enough... 😦
For me things are working now. I'm installing the provider in one container. Then use the clients in other containers to get their dogs.
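Roughly like this (a sketch; the container layout and the dog name are made up):
Copy code
// Container 1 - installs the watchdog provider once per device.
import watchdog.provider

main:
  provider.main

// Container 2 (a separate program) - opens a client and gets its own dog.
import watchdog

main:
  client := watchdog.WatchdogServiceClient
  client.open
  dog := client.create "worker-dog"
  dog.start --s=60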
What kind of crashes do you get?
k
(just a thought: is there a risk that the ethernet container is doing other work for more than 1s thus starving the watchdog provider?)
f
I would hope that it yields at least once every two seconds.
Let me create a new version with more logger entries. That should help.
i
Copy code
******************************************************************************
Decoding by `jag`, device has version <2.0.0-alpha.118>
******************************************************************************
EXCEPTION error. 
Cannot find service
  0: ServiceClient.open.<block> <sdk>\system\services.toit:167:49
  1: ServiceClient.open        <sdk>\system\services.toit:176:40
  2: ServiceClient.open        <sdk>\system\services.toit:167:12
  3: main                      C:\Users\Mirko\AppData\Local\Temp\artemis-616970cc-d0f6-46c7-88c7-f1db1182f3eb\clone\src\services\mqtt\mqtt.toit:57:18
******************************************************************************
I receive this still - but I think this is in the mqtt container
I increased the timeout to 2s
with that it was not crashing - or at least it crashed later, with the watchdog triggering.. but I am not sure why tho
I am running this at the beginning of the main
Copy code
  watchdogclient := WatchdogServiceClient
  watchdogclient.open --timeout=(Duration --s=2)
  dog = watchdogclient.create "ble-dog"
  dog.start --s=60
and then a bit later in my while loop:
Copy code
while true:
    dog.feed
    sleep --ms=5000
that should actually not trigger the watchdog
f
Not sure it's relevant, but there is no support for uninstalling the watchdog provider. If you kill that container (or install another one on top), then you will likely get a watchdog-trigger, since the original system-watchdog-loop isn't running anymore.
However that should only be an issue for one watchdog-reset. After that only the provider you installed should run.
This looks as if the watchdog timer was started, but then failed to run.
It has 2 seconds to do so. I don't see how a process would be delayed by that much.
i
it feeds the dog right after the debug line:
[ble] DEBUG: Advertising: 2d66 with name sbceade1c9c
no delay - after that I see more or less 2s of nothing and then it crashes
f
Try to upgrade to v1.2.0, and then use the following container to install the provider:
Copy code
import log
import watchdog
import watchdog.provider

main:
  provider := provider.WatchdogServiceProvider
      --logger=((log.default.with-name "watchdog").with-level log.DEBUG-LEVEL)
  provider.install
  print "installed"
i
I marked it - it does say no connection available tho
that seem to work now
should the ethernet container also get the watchdog? it basically only installs the provider so I think it doesn't really need it, right
f
I don't think so.
If you bump the log-level (to `INFO`) you would drop the `feeding system watchdog` messages, while still keeping some watchdog logging.
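Based on the snippet above, something like this (untested sketch; assumes the log package exposes INFO-LEVEL analogous to DEBUG-LEVEL):
Copy code
import log
import watchdog.provider

main:
  provider := provider.WatchdogServiceProvider
      --logger=((log.default.with-name "watchdog").with-level log.INFO-LEVEL)
  provider.install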
i
have you tried it with a firmware update actually?
🥲 the watchdog triggers of course if I try to update the device
f
You would need to stop all the timers.
We will integrate the watchdog timers with Artemis a bit more:
1. If a container doesn't work, we don't need to reset immediately but can try to just restart the container.
2. Depending on the settings, a container might allow disabling the watchdog automatically on fw updates. Critical containers should still have their watchdogs run.
3. Artemis will probably have the ability to disable watchdogs so it can make progress (in case one just keeps getting in the way).
We will have to think a bit more about how this should work best.
i
for the sake of testing right now it doesn't matter, I mostly flash anyway
f
Oh. And the watchdog-provider container should probably be marked as critical.
i
I would not know when to close the watchdog for a firmware update tbh
f
For now I would probably mark the watchdog-provider container as critical, and use very big timeouts for the watchdogs.
Just to make sure something is happening within 10 minutes or so.
At the very least the devices should never get fully stuck this way.
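For illustration (a sketch only; the exact pod-spec field names and layout should be checked against the Artemis docs):
Copy code
{
  "containers": {
    "watchdog": {
      "entrypoint": "src/watchdog-provider.toit",
      "critical": true
    }
  }
}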
i
you mean the feeding should happen only once every few minutes?
f
The timeout should be set to 10 minutes. The feeding could still be more frequent.
In the current setup.
i
ah so the timeout is the time I set when creating the watchdog
f
This would give Artemis a window of 10 minutes to do its thing.
Correct. At `start` you tell the watchdog the max interval between feedings.
i
dog.start --s=60 <---- this
okay
f
You have 1 minute to feed it. Yes.
If you make this 10 minutes, and feed every 30 seconds, then Artemis has 9:30 to do its thing.
(when it shuts down your container for a firmware update).
Again: in the future we should improve this.
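Roughly (untested sketch, reusing the client code from earlier in the thread):
Copy code
import watchdog

main:
  client := watchdog.WatchdogServiceClient
  client.open --timeout=(Duration --s=2)
  dog := client.create "mqtt-dog"
  dog.start --s=600  // Max 10 minutes between feedings.
  while true:
    dog.feed
    sleep --ms=30_000  // Feed every 30s; Artemis keeps a ~9:30 window.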
i
yes yes
f
That said: I'm actually not 100% sure this will work.
I think a `deep-sleep 0` with a watchdog running triggers a watchdog error.
Not sure if we then try to update.
I will test that.
i
I am fairly sure that this will not catch my issue with the device disappearing, but I really hope so
okay the watchdog is running on the device and writing the log into a file
I only watch ble and mqtt; I guess the others do not really need a watchdog
f
Looks good. I was able to do a fw update with the watchdogs active.
i
pretty long log but at some point there was a 502 and then the synchronise job stopped printing its debug synchronise log! https://cdn.discordapp.com/attachments/1171046614731862088/1171168381639065640/message.txt?ex=655bb28d&is=65493d8d&hm=be7ea56f9403db3d90c329dd3c9699606bae9f923ec75b60e3db0364d77eaf52&
and a few minutes later the whole device stopped working
k
But the watchdog still didn't force a reboot?
i
nope the log stopped completely as well
I really wonder what that could be
f
So the synchronize stopped but the watchdogs were still fed?
It looks like we need to add a watchdog for our own Artemis code to avoid this.
k
As I understand it, the watchdog feeding code - BLE + MQTT containers - also stopped, but nothing rebooted.
i
exactly
k
Does the device have any LEDs?
i
yes
k
Anything we can blink on regular intervals?
i
let me check
it has a bunch of LEDs to show Rx and Tx of the ethernet
but I will check if it has some debug LED on it
no debug LED but I could maybe add one
the way I check if the device is still alive is, I check for the BLE device on my phone. if it is gone, the container is gone
k
You also get no serial output, right?
i
exactly, nothing is printed there
only if I hard reset the device using the reset button
k
So we don't see any prints from the watchdog provider, so we assume that it actually isn't resetting the low-level watchdog, which should cause the system to reboot. It's pretty weird.
It would be great to rule out that we're just not getting serial output anymore. Are you hooking prints at the Toit level in any way?
i
it is also not the device, it happens on other devices as well
> It would be great to rule out that we're just not getting serial output anymore
this is ruled out by the bluetooth device being gone. The container advertises and then stays alive. If the ble device is gone from scanning, the container must be gone
with the watchdog I added a log entry each time it feeds the dog, every 30 sec
k
I get that, but at the same time, I'd like to know if we just stopped processing some Toit code. If you hook prints at the Toit level, then you need (more) Toit code to run to get anything printed on serial.
i
it also disappears from my router as an internet device, I think (I was checking it but I cannot remember right now anymore)
> If you hook prints at the Toit level, then you need (more) Toit code to run to get anything printed on serial.
I don't really get that Toit level part
what do you mean by that
k
I think my real question is: Is there any way the watchdog provider's Toit code continues to run while everything else seems to stall?
f
Your board is an olimex board. Right? Could you maybe send us the code so we can try to reproduce? It's probably enough to get the snapshot/image for the code you don't want to share.
i
no
I can share all of it if you want
but to make it similar you might need an MQTT broker which it can connect to?
that's actually all
f
We can find one.
i
do you know any other ESP32-WROVER board?
f
We have a few more of those as well. Not with Ethernet, though.
i
I somehow have the suspicion it is related to hardware but the code only runs on the WROVER variant due to its size (somehow)
k
@Informatic0re Do you think this can be reproduced without the BLE container?
i
I can not really tell, could be
k
It feels like what you have now is pretty reproducible, which is great. I just wonder if we can make the repro case smaller.
i
the thing is I don't know when it happens. I started the device yesterday and within a few hours it stopped. then I restarted and it is still running
I can send you the code incl. the BLE part, that's not an issue. But because we don't know what exactly causes it, I would also run it incl. the MQTT part which wants to connect to a broker. We could also not connect it, I think, but then it might behave differently already. not sure
or do you want to check if BLE is causing it and not run it?
k
I believe we have a device like yours (I could be wrong) in our test lab that just runs Artemis via ethernet and it syncs with the cloud every 20s for days and days.
i
I also installed the MQTT and BLE container onto another ESP-DevKit using Jaguar (not artemis) and it was running forever
k
We'll have to double check that it is actually a WROVER.
i
in this case it was a WROOM
I just don't have another WROVER board I could run the artemis version on, just to check if this also happens on other hardware and rule out the olimex board
k
Would it be super annoying to run the Jaguar variant on your WROVER board?
i
no that might work as well
I can install the containers just as they are in artemis then
k
I'm just hoping we can start ruling some things out.
i
okay I will do that then
I will run all containers: watchdog, ethernet, mqtt and ble via jag
k
Thanks! It would also be interesting to try to run Artemis with no containers (except ethernet) and max-offline 0s and see if that also stops sync'ing at some point.
i
tbh I can not recall for sure but it might be related to the IDF changes
k
In the meantime, we'll check our boards and see if we have a WROVER among the ethernet ones.
i
I think you ordered one as well
k
You mean the upgrade to ESP-IDF v5.x?
i
yes
but of course many things have changed since then
ok the device is up and running using jag, flashed with esp32-eth-clk-out0-spiram, and all containers are there: watchdog, eth, mqtt and ble
every 30s there is a log in the BLE container which prints
[ble] DEBUG: Feed that dog.. omnomnomnom
k
Great. I've looked around a bit and I started getting worried about internal FreeRTOS stack sizes. https://esp32.com/viewtopic.php?t=30700
I'll dig a bit deeper, but we could try to build a variant with larger stacks. We currently run with 2KB stacks on the ESP32 for the tasks that run Toit code (edit: turns out that is wrong, it really is 8KB).
i
they also say that without WiFi being initialised it does not happen, could we try that as well somehow?
I have another WROVER PoE board where I could run this
then we can see if it is maybe that, sounds pretty much like the issue
k
Looks like it is still an open issue (the tech lead of the ESP-IDF added a comment to it in September, 2023).
Just found this in our configs: `CONFIG_FREERTOS_ISR_STACKSIZE=2096`. That's a weird number, but it's probably not problematic. It looks like someone tried to change it from 4KB to 2KB, but got it wrong 😉
Actually we're using 8KB for the stacks that run Toit code.
i
so as far as I understand it, the ESP gets stuck after some panics or overflows in a function of the espressif IDF running `SOC_HAL_STALL_OTHER_CORES()`. They also say that the RTC is still able to reset the device, not sure if that helps in any way. Could the RTC reset the device in such moments if some watchdog is not triggered in time?
k
Not sure, but maybe.
I think our best bet is trying to figure out if we get panics/overflows and then avoid those.
i
true
that might be the core issue
k
That feels very actionable, but it starts with us being able to understand if that is actually happening.
i
but might be hard to figure out who is causing them if the device does not tell us
it's the right way 😉 finding the root cause, not avoiding the symptoms
do they mean that the stack is overflowing? is that caused by allocations or what might cause this?
k
It is caused by the C/C++ code that needs stack space for local variables, etc. So depending on what the code does and when interrupts fire, you'll need more or less space for your stacks.
The stacks are allocated (essentially) at startup and they have a fixed size.
If some code changed and now uses a bit more recursion or more local variables, then we might occasionally need more stack space than we have.
It is a super unsatisfying setup. At the Toit level, we grow stacks on demand. It isn't your Toit code that contributes to the low-level stack space consumption.
i
okay - the jag version is now running on my Raspberry Pi writing the serial output into a file, let's see if it happens there as well
k
Thanks!
If it is a low-level stack overflow issue of sorts, then we should expect Jaguar/Artemis to behave the same -- modulo the fact that Artemis does a network request every 20s and Jaguar just waits for http clients to connect.
Just found that we're running with lower than default stack size for the lwIP task (2560 vs the default of 3072).
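For reference, the sdkconfig knobs in question look like this (a sketch; key names and defaults should be double-checked against the esp-idf version in use):
Copy code
# lwIP task stack: the esp-idf default is 3072; we were running with 2560.
CONFIG_LWIP_TCPIP_TASK_STACK_SIZE=3072
# The odd value mentioned above; the esp-idf default is believed to be 1536.
CONFIG_FREERTOS_ISR_STACKSIZE=1536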
f
The watchdog is hardware based (I think). I don't see how the stack overflow could prevent the device from rebooting.
k
@floitsch Clearly you didn't read the bug report 🙂
Copy code
To answer your question, the WDT does not kick in ever, we have kept the esp32 ON in that state for more than 8 hours.
According to our observation the panic_handler function stop at line

SOC_HAL_STALL_OTHER_CORES();
f
Didn't go deep enough. You are right.
k
@floitsch What would it take for us to produce a variant with a small esp-idf patch that comments out that line? Does the envelopes repository support patching that?
f
I don't think the envelope repo is already prepared for it, but I think it should be feasible.
k
I suggest we drop the non-default stack sizes as a starting point. I'm running tests on that right now.
I'm a little bit concerned that we may need more space for the BLE stack than the default allows for, but that is completely unproven at this point.
SDK v2.0.0-alpha.119 comes with adjusted native stack sizes.
Trying to get a version of Artemis with support for that ready.
i
My device stopped again - this time it was running entirely using jaguar
pretty boring log
k
We have no real indications that this is going to fix it, but Artemis v0.13.2 is out with support for SDK v2.0.0-alpha.119, which comes with slightly more stack space for the lwIP task.
@Informatic0re Good to know about Jaguar!
i
I will give it a spin later - does that also apply for jaguar?
k
I can get a Jaguar build out in a few hours.
(maybe a bit before that)
You'll need Jaguar v1.19.0 (unreleased for now).
i
ok
I am fine with artemis as well
ah right the winget situation, I remember 😄
k
You should be able to download Jaguar from here: https://github.com/toitlang/jaguar/releases/tag/v1.19.0 (once the assets are built).
i
just as an idea: if this is related to some memory leakage (if that's even a possibility), an increase in the stack would simply just move the issue to a later point, right?
f
We don't move C stacks, and I'm guessing they are allocated at start.
k
Yes, fixed allocations.
Jaguar v1.19.0 is out.
k
Appears complete, so winget should give you Jaguar v1.19.0.
i
it is running now on my pi using this jaguar version
k
@Informatic0re Thanks. I remain a bit sceptical about this being a fix for the issue, but I'm curious to hear what you find.
i
yea me too
@floitsch my device stopped working overnight again - how is yours doing?
f
Still running strong.
k
@Informatic0re Any updates from running Jaguar v1.19.0 on your device?
i
I had to reboot yesterday evening for a test but so far it is still running
k
Interesting.
@floitsch pushed an update to the MQTT package with a bug fix, so at some point you may want to upgrade to `mqtt@v2.5.0`.
I guess the frequency of the hang is low enough that we will not be able to conclude anything positive before early next week.
i
usually it happens overnight, but let's see. I hope the changes did not just "hide" the problems and make them appear at a later stage
k
So if we've solved it, it is most likely due to the extra stack space allocated to the lwIP task. Given the frequency of the issue, it would make sense if the old setting was almost enough (~2.5K), but that the ethernet code would sometimes use a tiny bit too much stack space (compared to wifi, perhaps). The theory is that a detected stack overflow would lead to a panic that would hang both cores at a very low level due to a bug in the esp-idf.
Going to the default lwIP task stack size seems like a good idea (2.5K -> 3K) and the esp-idf probably increased the default size for a good reason. It was increased back in 2018, but we didn't pay enough attention to that: https://github.com/espressif/esp-idf/commit/2ff3f8b0c8b14dc3e9b581d3031689e13c5530a6. The commit message also wasn't super helpful 😋
i
okay.. I mean I am happy if that's the fix! that would be amazing
I will keep it running and see if it stays
f
I'm still running two devices at work with your setup.
i
but with old jag?
f
yes
i
and you haven't seen the hang yet?
f
I'm working from home today, so will only see if it "worked" tomorrow or (maybe even only) next week
i
ah ok
thats fine
f
but until yesterday evening it didn't reproduce.
k
This means that it hadn't reproduced when you checked yesterday evening, right? Not that it reproduced yesterday evening.
f
Correct. It hadn't reproduced when I checked at that time.
k
@Informatic0re Let us know the status of your tests on alpha.119. Is it possible for you to run the tests over the weekend too so we get more data?
i
the device is still running (second night). I will keep it running over the weekend. if it still runs on Monday it might be fixed
k
It is pretty crazy to think about. You reported the hang and we found this old bug report that matched the description pretty well. We checked our stack limits and noticed they were slightly off from the defaults. I assume we believe this is the issue @nlsrchtr reported on October 16?
i
yes exactly - it feels a bit strange and "too easy"
I will switch to Artemis on Monday again and keep it running with that
@floitsch are you running the code with jag or artemis right now?
k
I believe Florian is using Jaguar.
f
I'm using Jaguar.
i
you can not check if it is still running, right? because it's in the office? but on Monday then, I guess?
f
I ended up working from home. I will check on Monday.
i
the jaguar device is still running
no hang
I will replace it now with the artemis version
k
You should be able to use Artemis v0.13.3 with SDK v2.0.0-alpha.120.
i
that's what I am running
the device from @nlsrchtr got an update as well on Friday with Artemis and is also still running today
devices are still running - I think we can almost count this bug as fixed
but @floitsch you have not been able to reproduce it, right?
f
Both boards are still doing fine (as of yesterday evening)
k
@Informatic0re There's a chance that the timings involved in talking to the servers play a role in this.
Florian's setup is different in this respect, so maybe the lwIP stack is exercised in a different way there.
f
My two devices are still running without issues (after a week?). So I agree with Kasper. There is a chance that my setup is just slightly different and we never hit the overflow (assuming that was the reason).
i
my device here has also been running with Artemis since Monday
f
Mine is still running the old Jaguar.