Getting watchdog up and running
# help
i
Hey @floitsch and @kasperl, I'm trying to run the watchdog implementation and get the following issues (file is following). I guess there is some sort of race condition where the watchdog service is not yet running when my containers try to open it. Should I use
with_timeout
to open the connection or is there a different approach?
with_timeout
makes it worse. It somehow cannot find the service
Copy code
******************************************************************************
Decoding by `jag`, device has version <2.0.0-alpha.118>
******************************************************************************
EXCEPTION error. 
Cannot find service
  0: ServiceClient.open.<block> <sdk>\system\services.toit:167:49
  1: ServiceClient.open        <sdk>\system\services.toit:176:40
  2: ServiceClient.open        <sdk>\system\services.toit:167:12
  3: main.<block>              C:\Users\Mirko\AppData\Local\Temp\artemis-7ca7d9eb-1271-4aa9-a13a-d976cb1b8302\clone\src\services\mqtt\mqtt.toit:58:20
  4: Task_.with-deadline_.<block> <sdk>\core\task.toit:203:16
  5: Task_.with-deadline_      <sdk>\core\task.toit:197:3
  6: with-timeout              <sdk>\core\utils.toit:181:24
  7: with-timeout              <sdk>\core\utils.toit:173:10
  8: main                      C:\Users\Mirko\AppData\Local\Temp\artemis-7ca7d9eb-1271-4aa9-a13a-d976cb1b8302\clone\src\services\mqtt\mqtt.toit:56:3
******************************************************************************
I install the provider in my ethernet container (it's the simplest one) like:
Copy code
main:
  (provider.WatchdogServiceProvider).install
  logger.debug "Watchdog provider installed"

  watchdogclient := WatchdogServiceClient
  watchdogclient.open
  dog := watchdogclient.create "mqtt-dog"
  dog.start --s=60
  logger.debug "Watchdog started"
  dog.feed
  dog.stop
  dog.close
not sure if I need to call it once, but I wanted to give it a try
f
Hmm. This looks like the watchdog provider failed to feed the system watchdog.
It should have a second to do that.
i
do I need to update anything else?
f
I'm testing your second program now.
i
Copy code
with-timeout --ms=2000:
    watchdogclient := WatchdogServiceClient
    watchdogclient.open
    dog = watchdogclient.create "mqtt-dog"
    dog.start --s=60
I am calling it like so
and in my ble container and my mqtt container I do have a while-true loop which simply sleeps 5 or 10s, and there I am calling dog.feed (but I guess we never reach that part so far)
f
I did just find a bug.
Don't yet see how it could lead to the seen issue, though.
k
The default timeout for client.open is 100ms. You can specify a higher one like this: `watchdogclient.open --timeout=(Duration --s=1)`. Not sure if that is necessary.
f
Please update the package. From your descriptions it doesn't fix every issue, but there is at least one less bug in it now.
i
jaguar is lying to me
Copy code
$ jag pkg install watchdog
Info: Package 'github.com/toitware/toit-watchdog@1.1.0' installed with name 'watchdog'
in the package yaml and in the folder I still only see 1.0.1
k
jag pkg update
?
(or uninstall first)
i
yea, never mind, I was running it in a completely wrong folder 🙄
k
Even better 🙂
i
I removed the with_timeout part but this seems to be the same issue
I did not change the timeout
Copy code
[eth] DEBUG: Watchdog provider installed

******************************************************************************
Decoding by `jag`, device has version <2.0.0-alpha.118>
******************************************************************************
EXCEPTION error. 
Cannot find service
  0: ServiceClient.open.<block> <sdk>\system\services.toit:167:49
  1: ServiceClient.open        <sdk>\system\services.toit:176:40
  2: ServiceClient.open        <sdk>\system\services.toit:167:12
  3: main                      C:\Users\Mirko\AppData\Local\Temp\artemis-cdea2632-796b-42c6-8473-9813f0c442d2\clone\src\ble.toit:35:18
******************************************************************************

[eth] DEBUG: Watchdog started
not sure at which part exactly it is crashing but it might be the provider install
f
Does it work to run the provider in the same container?
This code works for me:
Copy code
import watchdog
import watchdog.provider

main:
  provider.main
  print "installed"

  client := watchdog.WatchdogServiceClient
  client.open
  dog := client.create "foo"
  dog.start --s=60
  print "started"
  dog.feed
  dog.stop
  print "stopped"
  dog.close
i
this is the client part - I think the provider install does not work already
f
This has the provider install in it.
(`provider.main` does an `install`)
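Roughly this (a sketch, not the actual package source):
Copy code
import watchdog.provider

main:
  // Sketch: what provider.main presumably boils down to (assumption, not verified).
  (provider.WatchdogServiceProvider).install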
i
ah ok
Copy code
  dogprovider.main
  logger.debug "Watchdog provider installed"

  watchdogclient := WatchdogServiceClient
  watchdogclient.open
  dog := watchdogclient.create "mqtt-dog"
  dog.start --s=60
  logger.debug "Watchdog started"
  dog.feed
  dog.stop
  dog.close
this is working for me now - but I had to disable the clients in the other containers
if I try to open the watchdog service connection with the client in one of the other containers it crashes
f
interesting. Let me try that. Clearly I didn't test it enough... 😦
For me things are working now. I'm installing the provider in one container. Then use the clients in other containers to get their dogs.
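Roughly like this (a sketch; the container layout and the dog name are made up):
Copy code
// Container 1 - installs the watchdog provider once per device.
import watchdog.provider

main:
  provider.main

// Container 2 (a separate program) - opens a client and gets its own dog.
import watchdog

main:
  client := watchdog.WatchdogServiceClient
  client.open
  dog := client.create "worker-dog"
  dog.start --s=60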
What kind of crashes do you get?
k
(just a thought: is there a risk that the ethernet container is doing other work for more than 1s thus starving the watchdog provider?)
f
I would hope that it yields at least once every two seconds.
Let me create a new version with more logger entries. That should help.
i
Copy code
******************************************************************************
Decoding by `jag`, device has version <2.0.0-alpha.118>
******************************************************************************
EXCEPTION error. 
Cannot find service
  0: ServiceClient.open.<block> <sdk>\system\services.toit:167:49
  1: ServiceClient.open        <sdk>\system\services.toit:176:40
  2: ServiceClient.open        <sdk>\system\services.toit:167:12
  3: main                      C:\Users\Mirko\AppData\Local\Temp\artemis-616970cc-d0f6-46c7-88c7-f1db1182f3eb\clone\src\services\mqtt\mqtt.toit:57:18
******************************************************************************
I receive this still - but I think this is in the mqtt container
I increased the timeout to 2s
with that it was not crashing - or at least it crashed later, with the watchdog triggering.. but I am not sure why tho
I am running this at the beginning of the main
Copy code
  watchdogclient := WatchdogServiceClient
  watchdogclient.open --timeout=(Duration --s=2)
  dog = watchdogclient.create "ble-dog"
  dog.start --s=60
and then a bit later in my while loop:
Copy code
while true:
    dog.feed
    sleep --ms=5000
that should actually not trigger the watchdog
f
Not sure it's relevant, but there is no support for uninstalling the watchdog provider. If you kill that container (or install another one on top), then you will likely get a watchdog-trigger, since the original system-watchdog-loop isn't running anymore.
However that should only be an issue for one watchdog-reset. After that only the provider you installed should run.
This looks as if the watchdog timer was started, but then failed to run.
It has 2 seconds to do so. I don't see how a process would be delayed by that much.
i
it feeds the dog right after the debug line:
[ble] DEBUG: Advertising: 2d66 with name sbceade1c9c
no delay - after that I see more or less 2s of nothing and then it crashes
f
Try to upgrade to v1.2.0, and then use the following container to install the provider:
Copy code
import log
import watchdog
import watchdog.provider

main:
  provider := provider.WatchdogServiceProvider
      --logger=((log.default.with-name "watchdog").with-level log.DEBUG-LEVEL)
  provider.install
  print "installed"
i
I marked it - it does say no connection available tho
that seem to work now
should the ethernet container also get the watchdog? it basically only installs the provider so I think it doesn't really need it, right
f
I don't think so.
If you bump the log-level (to `INFO`) you would drop the `feeding system watchdog` messages, while still keeping some watchdog logging.
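Based on the snippet above, something like this (untested sketch; assumes the log package exposes INFO-LEVEL analogous to DEBUG-LEVEL):
Copy code
import log
import watchdog.provider

main:
  provider := provider.WatchdogServiceProvider
      --logger=((log.default.with-name "watchdog").with-level log.INFO-LEVEL)
  provider.install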
i
have you tried it with a firmware update actually?
🥲 the watchdog triggers of course if I try to update the device
f
You would need to stop all the timers.
We will integrate the watchdog timers with Artemis a bit more:
1. If a container doesn't work, we don't need to reset immediately but can try to just restart the container.
2. Depending on the settings, a container might allow disabling the watchdog automatically on fw updates. Critical containers should still have their watchdogs run.
3. Artemis will probably have the ability to disable watchdogs so it can make progress (in case one just keeps getting in the way).
We will have to think a bit more about how this should work best.
i
for the sake of testing right now it doesn't matter, I mostly flash anyway
f
Oh. And the watchdog-provider container should probably be marked as critical.
i
I would not know when to close the watchdog for a firmware update tbh
f
For now I would probably mark the watchdog-provider container as critical, and use very big timeouts for the watchdogs.
Just to make sure something is happening within 10 minutes or so.
At the very least the devices should never get fully stuck this way.
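For illustration (a sketch only; the exact pod-spec field names and layout should be checked against the Artemis docs):
Copy code
{
  "containers": {
    "watchdog": {
      "entrypoint": "src/watchdog-provider.toit",
      "critical": true
    }
  }
}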
i
you mean the feeding should happen only once every few minutes?
f
The timeout should be set to 10 minutes. The feeding could still be more frequent.
In the current setup.
i
ah so the timeout is the time I set when creating the watchdog
f
This would give Artemis a window of 10 minutes to do its thing.
Correct. At `start` you tell the watchdog the max interval between feedings.
i
dog.start --s=60 <---- this
okay
f
You have 1 minute to feed it. Yes.
If you make this 10 minutes, and feed every 30 seconds, then Artemis has 9:30 to do its thing.
(when it shuts down your container for a firmware update).
Again: in the future we should improve this.
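Roughly (untested sketch, reusing the client code from earlier in the thread):
Copy code
import watchdog

main:
  client := watchdog.WatchdogServiceClient
  client.open --timeout=(Duration --s=2)
  dog := client.create "mqtt-dog"
  dog.start --s=600  // Max 10 minutes between feedings.
  while true:
    dog.feed
    sleep --ms=30_000  // Feed every 30s; Artemis keeps a ~9:30 window.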
i
yes yes
f
That said: I'm actually not 100% sure this will work.
I think a `deep-sleep 0` with a watchdog running triggers a watchdog error.
Not sure if we then try to update.
I will test that.
i
I am fairly sure that this will not catch my issue with the device disappearing, but I really hope so
okay the watchdog is running on the device and writing the log into a file
I only watch ble and mqtt; I guess the others do not really need a watchdog
f
Looks good. I was able to do a fw update with the watchdogs active.
i
pretty long log but at some point there was a 502 and then the synchronise job stopped printing its debug synchronise log! https://cdn.discordapp.com/attachments/1171046614731862088/1171168381639065640/message.txt?ex=655bb28d&is=65493d8d&hm=be7ea56f9403db3d90c329dd3c9699606bae9f923ec75b60e3db0364d77eaf52&
and a few minutes later the whole device stopped working
k
But the watchdog still didn't force a reboot?
i
nope the log stopped completely as well
I really wonder what that could be
f
So the synchronize stopped but the watchdogs were still fed?
It looks like we need to add a watchdog for our own Artemis code to avoid this.
k
As I understand it, the watchdog feeding code - BLE + MQTT containers - also stopped, but nothing rebooted.
i
exactly
k
Does the device have any LEDs?
i
yes
k
Anything we can blink on regular intervals?
i
let me check
it has a bunch of LEDs to show Rx and Tx of the ethernet
but I will check if it has some debug LED on it
no debug LED but I could maybe add one
the way I check if the device is still alive is, I check for the BLE device on my phone. if it is gone, the container is gone
k
You also get no serial output, right?
i
exactly, nothing is printed there
only if I hard reset the device using the reset button
k
So we don't see any prints from the watchdog provider, so we assume that it actually isn't resetting the low-level watchdog, which should cause the system to reboot. It's pretty weird.
It would be great to rule out that we're just not getting serial output anymore. Are you hooking prints at the Toit level in any way?
i
it is also not the device, it happens on other devices as well
> It would be great to rule out that we're just not getting serial output anymore
this is ruled out by the bluetooth device being gone. The container advertises and then stays alive. If the ble device is gone from scanning, the container must be gone
with the watchdog I added a log entry each time it feeds the dog, every 30 sec
k
I get that, but at the same time, I'd like to know if we just stopped processing some Toit code. If you hook prints at the Toit level, then you need (more) Toit code to run to get anything printed on serial.
i
it also disappears from my router as an internet device, I think (I was checking it but I cannot remember right now anymore)
> If you hook prints at the Toit level, then you need (more) Toit code to run to get anything printed on serial.
I don't really get that Toit level part
what do you mean by that
k
I think my real question is: Is there any way the watchdog provider's Toit code continues to run while everything else seems to stall?
f
Your board is an olimex board. Right? Could you maybe send us the code so we can try to reproduce? It's probably enough to get the snapshot/image for the code you don't want to share.
i
no
I can share all of it if you want
but to make it similar you might need an MQTT broker which it can connect to?
that's actually all
f
We can find one.
i
do you know any other ESP32-WROVER board?
f
We have a few more of those as well. Not with Ethernet, though.
i
I somehow have the suspicion it is related to hardware but the code only runs on the WROVER variant due to its size (somehow)
k
@Informatic0re Do you think this can be reproduced without the BLE container?
i
I can not really tell, could be
k
It feels like what you have now is pretty reproducible, which is great. I just wonder if we can make the repro case smaller.
i
the thing is I don't know when it happens. I started the device yesterday and within a few hours it stopped. then I restarted and it is still running
I can send you the code incl. the BLE part, that's not an issue. But because we don't know what exactly causes it, I would also run it incl. the MQTT part which wants to connect to a broker. We could also not connect it, I think, but then it might behave differently already. not sure
or do you want to check if BLE is causing it and not run it?
k
I believe we have a device like yours (I could be wrong) in our test lab that just runs Artemis via ethernet and it syncs with the cloud every 20s for days and days.
i
I also installed the MQTT and BLE container onto another ESP-DevKit using Jaguar (not artemis) and it was running forever
k
We'll have to double check that it is actually a WROVER.
i
in this case it was a WROOM
I just don't have another WROVER board I could run the artemis version on, just to check if this also happens on other hardware and rule out the olimex board
k
Would it be super annoying to run the Jaguar variant on your WROVER board?
i
no that might work as well
I can install the containers just as they are in artemis then
k
I'm just hoping we can start ruling some things out.
i
okay I will do that then
I will run all containers: watchdog, ethernet, mqtt and ble via jag
k
Thanks! It would also be interesting to try to run Artemis with no containers (except ethernet) and max-offline 0s and see if that also stops sync'ing at some point.
i
tbh I can not recall for sure but it might be related to the IDF changes
k
In the meantime, we'll check our boards and see if we have a WROVER among the ethernet ones.
i
I think you ordered one as well
k
You mean the upgrade to ESP-IDF v5.x?
i
yes
but of course many things have changed since then
ok the device is up and running using jag, flashed with esp32-eth-clk-out0-spiram, and all containers are there: watchdog, eth, mqtt and ble
every 30s there is a log in the BLE container which prints
[ble] DEBUG: Feed that dog.. omnomnomnom
k
Great. I've looked around a bit and I started getting worried about internal FreeRTOS stack sizes. https://esp32.com/viewtopic.php?t=30700
I'll dig a bit deeper, but we could try to build a variant with larger stacks. We currently run with 2KB stacks on the ESP32 for the tasks that run Toit code (edit: turns out that is wrong, it really is 8KB).
i
they also say that without WiFi being initialised it does not happen, could we try that as well somehow?
I have another WROVER PoE board where I could run this
then we can see if it is maybe that, sounds pretty much like the issue
k
Looks like it is still an open issue (the tech lead of the ESP-IDF added a comment to it in September, 2023).
Just found this in our configs: `CONFIG_FREERTOS_ISR_STACKSIZE=2096`. That's a weird number, but it's probably not problematic. It looks like someone tried to change it from 4KB to 2KB, but got it wrong 😉
Actually we're using 8KB for the stacks that run Toit code.
i
so as far as I understand it, the ESP gets stuck after some panics or overflows in a function of the espressif IDF running `SOC_HAL_STALL_OTHER_CORES()`. They also say that the RTC is still able to reset the device, not sure if that helps in any way. Could the RTC reset the device in such moments if some watchdog is not triggered in time?
k
Not sure, but maybe.
I think our best bet is trying to figure out if we get panics/overflows and then avoid those.
i
true
that might be the core issue
k
That feels very actionable, but it starts with us being able to understand if that is actually happening.
i
but might be hard to figure out who is causing them if the device does not tell us
it's the right way 😉 finding the root cause, not avoiding the symptoms
do they mean that the stack is overflowing? is that caused by allocations or what might cause this?
k
It is caused by the C/C++ code that needs stack space for local variables, etc. So depending on what the code does and when interrupts fire, you'll need more or less space for your stacks.
The stacks are allocated (essentially) at startup and they have a fixed size.
If some code changed and now uses a bit more recursion or more local variables, then we might occasionally need more stack space than we have.
It is a super unsatisfying setup. At the Toit level, we grow stacks on demand. It isn't your Toit code that contributes to the low-level stack space consumption.
i
okay - the jag version is now running on my Raspberry Pi writing the serial output into a file, let's see if it happens there as well
k
Thanks!
If it is a low-level stack overflow issue of sorts, then we should expect Jaguar/Artemis to behave the same -- modulo the fact that Artemis does a network request every 20s and Jaguar just waits for http clients to connect.
Just found that we're running with lower than default stack size for the lwIP task (2560 vs the default of 3072).
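For reference, the sdkconfig knobs in question look like this (a sketch; key names and defaults should be double-checked against the esp-idf version in use):
Copy code
# lwIP task stack: the esp-idf default is 3072; we were running with 2560.
CONFIG_LWIP_TCPIP_TASK_STACK_SIZE=3072
# The odd value mentioned above; the esp-idf default is believed to be 1536.
CONFIG_FREERTOS_ISR_STACKSIZE=1536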
f
The watchdog is hardware based (I think). I don't see how the stack overflow could prevent the device from rebooting.
k
@floitsch Clearly you didn't read the bug report 🙂
Copy code
To answer your question, the WDT does not kick in ever, we have kept the esp32 ON in that state for more than 8 hours.
According to our observation the panic_handler function stop at line

SOC_HAL_STALL_OTHER_CORES();
f
Didn't go deep enough. You are right.
k
@floitsch What would it take for us to produce a variant with a small esp-idf patch that comments out that line? Does the envelopes repository support patching that?
f
I don't think the envelope repo is already prepared for it, but I think it should be feasible.
k
I suggest we drop the non-default stack sizes as a starting point. I'm running tests on that right now.
I'm a little bit concerned that we may need more space for the BLE stack than the default allows for, but that is completely unproven at this point.
SDK v2.0.0-alpha.119 comes with adjusted native stack sizes.
Trying to get a version of Artemis with support for that ready.
i
My device stopped again - this time it was running entirely using jaguar
pretty boring log
k
We have no real indications that this is going to fix it, but Artemis v0.13.2 is out with support for SDK v2.0.0-alpha.119, which comes with slightly more stack space for the lwIP task.
@Informatic0re Good to know about Jaguar!
i
I will give it a spin later - does that also apply for jaguar?
k
I can get a Jaguar build out in a few hours.
(maybe a bit before that)
You'll need Jaguar v1.19.0 (unreleased for now).
i
ok
I am fine with artemis as well
ah right the winget situation, I remember 😄
k
You should be able to download Jaguar from here: https://github.com/toitlang/jaguar/releases/tag/v1.19.0 (once the assets are built).
i
just as an idea: if this is related to some memory leakage (if that's even a possibility), an increase in the stack would simply just move the issue to a later point, right?
f
We don't move C stacks, and I'm guessing they are allocated at start.
k
Yes, fixed allocations.
Jaguar v1.19.0 is out.
k
Appears complete, so winget should give you Jaguar v1.19.0.
i
it is running now on my pi using this jaguar version
k
@Informatic0re Thanks. I remain a bit sceptical about this being a fix for the issue, but I'm curious to hear what you find.
i
yea me too
@floitsch my device stopped working overnight again - how is yours doing?
f
Still running strong.
k
@Informatic0re Any updates from running Jaguar v1.19.0 on your device?
i
I had to reboot yesterday evening for a test but so far it is still running
k
Interesting.
@floitsch pushed an update to the MQTT package with a bug fix, so at some point you may want to upgrade to `mqtt@v2.5.0`.
I guess the frequency of the hang is low enough that we will not be able to conclude anything positive before early next week.
i
usually it happens overnight, but let's see. I hope the changes did not just "hide" the problems and make them appear at a later stage
k
So if we've solved it, it is most likely due to the extra stack space allocated to the lwIP task. Given the frequency of the issue, it would make sense if the old setting was almost enough (~2.5K), but that the ethernet code would sometimes use a tiny bit too much stack space (compared to wifi, perhaps). The theory is that a detected stack overflow would lead to a panic that would hang both cores at a very low level due to a bug in the esp-idf.
Going to the default lwIP task stack size seems like a good idea (2.5K -> 3K) and the esp-idf probably increased the default size for a good reason. It was increased back in 2018, but we didn't pay enough attention to that: https://github.com/espressif/esp-idf/commit/2ff3f8b0c8b14dc3e9b581d3031689e13c5530a6. The commit message also wasn't super helpful 😋
i
okay.. I mean I am happy if that's the fix! that would be amazing
I will keep it running and see if it stays
f
I'm still running two devices at work with your setup.
i
but with old jag?
f
yes
i
and you haven't seen the hang yet?
f
I'm working from home today, so will only see if it "worked" tomorrow or (maybe even only) next week
i
ah ok
thats fine
f
but until yesterday evening it didn't reproduce.
k
This means that it hadn't reproduced when you checked yesterday evening, right? Not that it reproduced yesterday evening.
f
Correct. It hadn't reproduced when I checked at that time.
k
@Informatic0re Let us know the status of your tests on alpha.119. Is it possible for you to run the tests over the weekend too so we get more data?
i
the device is still running (second night). I will keep it running over the weekend. if it still runs on Monday it might be fixed
k
It is pretty crazy to think about. You reported the hang and we found this old bug report that matched the description pretty well. We checked our stack limits and noticed they were slightly off from the defaults. I assume we believe this is the issue @nlsrchtr reported on October 16?
i
yes exactly - it feels a bit strange and "too easy"
I will switch to Artemis on Monday again and keep it running with that
@floitsch are you running the code with jag or artemis right now?
k
I believe Florian is using Jaguar.
f
I'm using Jaguar.
i
you can not check if it is still running, right? because it's in the office? but on Monday then, I guess?
f
I ended up working from home. I will check on Monday.
i
the jaguar device is still running
no hang
I will replace it now with the artemis version
k
You should be able to use Artemis v0.13.3 with SDK v2.0.0-alpha.120.
i
that's what I am running
the device from @nlsrchtr got an update as well on Friday with Artemis and is also still running today
devices are still running - I think we can almost count this bug as fixed
but @floitsch you have not been able to reproduce it, right?
f
Both boards are still doing fine (as of yesterday evening)
k
@Informatic0re There's a chance that the timings involved in talking to the servers play a role in this.
Florian's setup is different in this respect, so maybe the lwIP stack is exercised in a different way there.
f
My two devices are still running without issues (after a week?). So I agree with Kasper. There is a chance that my setup is just slightly different and we never hit the overflow (assuming that was the reason).
i
my device here has also been running with Artemis since Monday
f
Mine is still running the old Jaguar.