jag run crashes the Toit VM on ESP32 and ESP32-S3
# help
m
When I try to run the attached program with jag run, the VM sometimes crashes when I run jag run again and again. It is not consistent, and it happens less often on the ESP32 than on the ESP32-S3. This is on the latest master branch of Toit and Jaguar.
k
I wonder if this is reproducible with 1fc17422b7448fdf0f605648d6ae5552e58ae71a?
m
So far, I haven't been able to reproduce it on that commit.
And now it happened on that commit.
```
Guru Meditation Error: Core  1 panic'ed (LoadProhibited). Exception was unhandled.

Core  1 register dump:
PC      : 0x403754ea  PS      : 0x00060030  A0      : 0x82024bda  A1      : 0x3fce5950  
0x403754ea: toit::Interpreter::run() at ./toolchains/esp32s3/build/./src/interpreter_run.cc:262

A2      : 0x00000000  A3      : 0x3fce1018  A4      : 0x3c0e1ec8  A5      : 0x00000002  
A6      : 0x00000000  A7      : 0x3fcee958  A8      : 0x803754e5  A9      : 0x3fce5930  
A10     : 0x3009ff14  A11     : 0x3fca0474  A12     : 0x3fca0478  A13     : 0x0000abab  
A14     : 0x00060423  A15     : 0x00060420  SAR     : 0x00000003  EXCCAUSE: 0x0000001c  
EXCVADDR: 0x00000000  LBEG    : 0x420ce0ec  LEND    : 0x420ce0f2  LCOUNT  : 0x00000002  
0x420ce0ec: toit::Scheduler::has_ready_processes(toit::Locker&) at ./toolchains/esp32s3/build/./src/scheduler.cc:938

0x420ce0f2: toit::Scheduler::has_ready_processes(toit::Locker&) at ./toolchains/esp32s3/build/./src/scheduler.cc:941



Backtrace: 0x403754e7:0x3fce5950 0x42024bd7:0x3fce59d0 0x42024e66:0x3fce5a20 0x42024e91:0x3fce5a50 0x4201760e:0x3fce5a70 0x42017635:0x3fce5a90
0x403754e7: toit::Interpreter::run() at ./toolchains/esp32s3/build/./src/interpreter_run.cc:262

0x42024bd7: toit::Scheduler::run_process(toit::Locker&, toit::Process*, toit::SchedulerThread*) at ./toolchains/esp32s3/build/./src/scheduler.cc:627

0x42024e66: toit::Scheduler::run(toit::SchedulerThread*) at ./toolchains/esp32s3/build/./src/scheduler.cc:404

0x42024e91: toit::SchedulerThread::entry() at ./toolchains/esp32s3/build/./src/scheduler.cc:38

0x4201760e: toit::Thread::_boot() at ./toolchains/esp32s3/build/./src/os_esp32.cc:302

0x42017635: toit::thread_start(void*) at ./toolchains/esp32s3/build/./src/os_esp32.cc:289
 (inlined by) esp_thread_start at ./toolchains/esp32s3/build/./src/os_esp32.cc:294
```
FYI: I am using this script to run an extended test.
```bash
#!/bin/bash
# Repeatedly run the test program against the device; the VM crash
# shows up after some number of iterations.
while :
do
  /Users/mikkel/proj/jaguar/build/jag run service_perf_test.toit -d 192.168.1.129
  sleep 2
done
```
Also reproduced on 2a299286d3c1cda55afeb5a9a9fe19e1279a39f9, after quite a while.
The log from a roughly 10-minute run (on 2a299286d3c1cda55afeb5a9a9fe19e1279a39f9):
I don't think a recent commit is the culprit. I went back to a commit from a month ago and could still reproduce it.
k
Good to know!
m
At least we have a test case that can reproduce it within a few minutes. It is fairly infrequent, but it does happen. Anything else I can do to help trace this one down?
So, more testing suggests that the frequency of crashes has increased with the latest build.
k
It's faster 🥴
m
My thoughts: it happens under high load; it happens when TCP/IP/process unloading is initiated; it seems to happen on both WiFi and Ethernet boards; and it looks like Toit heap corruption (class tag, stack pointer).
Whenever I print it out, the class tag is 12...
And by the way, this also seems to happen on our product from time to time, not only in this contrived test case. (So it might not have much to do with process unloading, and more to do with high load?)
FREERTOS_UNICORE=y makes it less frequent but does not eliminate the crash.
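One way to picture the corruption pattern reported above (a hypothetical layout sketch in C++, not Toit's actual object model: the arena, indices, tag values, and the assumption of a downward-growing execution stack are all invented for illustration):

```cpp
#include <cstdint>
#include <cstdio>

// One contiguous arena standing in for part of the heap:
// [ ... previous object | stack object header | stack slots ... ]
// Assuming the execution stack grows toward lower indices, an overflow
// runs through the stack object's own header (the class tag) and into
// the tail of the previous object.
int main() {
  uint32_t heap[12] = {};
  const int PREV_TAIL = 3;     // last word of the previous object
  const int STACK_HEADER = 4;  // header word holding the class tag
  const int STACK_TOP = 11;    // highest slot of the stack object

  heap[PREV_TAIL] = 0xaaaa;
  heap[STACK_HEADER] = 12;     // class tag 12, as observed in the prints

  // Push frames downward; two pushes too many for the 7 available slots.
  int sp = STACK_TOP;
  for (int i = 0; i < 9; i++) heap[sp--] = 0xdeadbeef;

  // The header and the previous object are now clobbered, so the next
  // time the VM inspects this object it reads a garbage class tag.
  printf("class tag: 0x%x, prev tail: 0x%x\n",
         heap[STACK_HEADER], heap[PREV_TAIL]);
}
```

Once the VM dereferences through such a garbage tag, it faults somewhere unrelated to the actual bug, which would fit the LoadProhibited crash in the dump above.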
k
I think I can reproduce this now. Looking into it.
Found one small issue. Fixed on master.
Still needs more work.
m
Happy hunting
k
I can confirm that this is reproducible without process unloading and without flash writing. So far, it looks like network load is the only thing required to make this happen.
@mikkel.damsgaard I may have found the cause of the crashes (finally). If we preempt a process right before the current task in that process needs to grow the Toit execution stack, we forget to grow the stack when we resume the task. The task then overflows the too-small stack, corrupting memory: it destroys the stack object's header and maybe a bit of the previous object.
I expect to land a fix tomorrow.
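To make the suspected race concrete, here is a minimal C++ sketch. None of it is actual Toit VM source: the `Task`/`Stack` structs and every function name are invented purely to show how a dropped stack-growth request turns into the overflow described above.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the VM's task and execution-stack state.
struct Stack {
  size_t capacity = 8;  // slots the stack object can currently hold
  size_t used = 0;
};

struct Task {
  Stack stack;
  bool needs_stack_growth = false;  // set when the next frame won't fit
};

// The interpreter notices the next call frame will not fit and bails
// out, expecting the runtime to grow the stack before it resumes.
void request_stack_growth(Task& task) {
  task.needs_stack_growth = true;
}

void grow_stack(Task& task) {
  task.stack.capacity *= 2;  // stand-in for reallocating the stack object
}

// Buggy path: a preemption (cross-process GC, time sharing) lands right
// after the growth request. On resume the request is forgotten, the
// interpreter re-enters with the too-small stack, and the next frame
// push writes out of bounds -- clobbering the stack object's header.
void resume_after_preemption_buggy(Task& task) {
  task.needs_stack_growth = false;  // BUG: pending request dropped
  // ... re-enter interpreter with unchanged capacity ...
}

// Fixed path: honor any pending growth request before re-entering.
void resume_after_preemption_fixed(Task& task) {
  if (task.needs_stack_growth) {
    grow_stack(task);
    task.needs_stack_growth = false;
  }
  // ... re-enter interpreter ...
}
```

This would also explain the earlier observations: anything that raises the preemption rate (heavy network load, cross-process GCs, running on both cores) widens the window, which is why FREERTOS_UNICORE=y made the crash rarer but not impossible.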
m
Yay!
k
It gets much more likely under high load with frequent preemptions for cross-process GCs and time sharing.
m
That finding definitely fits the observations.
k
Fix out for review.
Good progress on eliminating the crashes (the first significant fix landed today). Still looking into another case where it looks like we may be slightly underestimating the stack space used by a method.
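For that second case, a plausible illustration (again hypothetical: Toit's real analysis is not shown here, and the opcodes and their stack effects are invented): the maximum stack height of a method is typically computed statically from per-opcode stack effects, so a single under-reported effect reserves too few slots.

```cpp
#include <cstdint>
#include <vector>
#include <algorithm>

// Invented opcodes with invented stack effects, for illustration only.
enum Opcode : uint8_t { PUSH, POP, ADD, CALL };

struct Effect {
  int pops;
  int pushes;
};

// Per-opcode effect table. If any entry here under-reports how many
// slots an opcode really touches at runtime (e.g. CALL also needing
// scratch space for the callee's frame setup), the reservation below
// comes out slightly too small.
Effect effect_of(Opcode op) {
  switch (op) {
    case PUSH: return {0, 1};
    case POP:  return {1, 0};
    case ADD:  return {2, 1};
    case CALL: return {1, 1};  // under-reporting here is the failure mode
  }
  return {0, 0};
}

// Walk the method's bytecodes and track the peak stack height.
int max_stack_height(const std::vector<Opcode>& code) {
  int height = 0;
  int peak = 0;
  for (Opcode op : code) {
    Effect e = effect_of(op);
    height += e.pushes - e.pops;
    peak = std::max(peak, height);
  }
  return peak;  // slots to reserve for this method
}
```

An estimate even one slot too low lets the interpreter write just past the reserved area, the same kind of out-of-bounds write as in the preemption case.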