jag run crashes the Toit VM on ESP32 and ESP32-S3
# help
m
When I try to run the attached program with jag run, the VM sometimes crashes when I run jag run again and again. It is not consistent, and it happens less often on the ESP32 than on the ESP32-S3. This is on the latest master branch of Toit and Jaguar.
k
I wonder if this is reproducible with 1fc17422b7448fdf0f605648d6ae5552e58ae71a?
m
So far, I haven't been able to reproduce it on that commit.
And now it happened on that commit.
```
Guru Meditation Error: Core  1 panic'ed (LoadProhibited). Exception was unhandled.

Core  1 register dump:
PC      : 0x403754ea  PS      : 0x00060030  A0      : 0x82024bda  A1      : 0x3fce5950  
0x403754ea: toit::Interpreter::run() at ./toolchains/esp32s3/build/./src/interpreter_run.cc:262

A2      : 0x00000000  A3      : 0x3fce1018  A4      : 0x3c0e1ec8  A5      : 0x00000002  
A6      : 0x00000000  A7      : 0x3fcee958  A8      : 0x803754e5  A9      : 0x3fce5930  
A10     : 0x3009ff14  A11     : 0x3fca0474  A12     : 0x3fca0478  A13     : 0x0000abab  
A14     : 0x00060423  A15     : 0x00060420  SAR     : 0x00000003  EXCCAUSE: 0x0000001c  
EXCVADDR: 0x00000000  LBEG    : 0x420ce0ec  LEND    : 0x420ce0f2  LCOUNT  : 0x00000002  
0x420ce0ec: toit::Scheduler::has_ready_processes(toit::Locker&) at ./toolchains/esp32s3/build/./src/scheduler.cc:938

0x420ce0f2: toit::Scheduler::has_ready_processes(toit::Locker&) at ./toolchains/esp32s3/build/./src/scheduler.cc:941



Backtrace: 0x403754e7:0x3fce5950 0x42024bd7:0x3fce59d0 0x42024e66:0x3fce5a20 0x42024e91:0x3fce5a50 0x4201760e:0x3fce5a70 0x42017635:0x3fce5a90
0x403754e7: toit::Interpreter::run() at ./toolchains/esp32s3/build/./src/interpreter_run.cc:262

0x42024bd7: toit::Scheduler::run_process(toit::Locker&, toit::Process*, toit::SchedulerThread*) at ./toolchains/esp32s3/build/./src/scheduler.cc:627

0x42024e66: toit::Scheduler::run(toit::SchedulerThread*) at ./toolchains/esp32s3/build/./src/scheduler.cc:404

0x42024e91: toit::SchedulerThread::entry() at ./toolchains/esp32s3/build/./src/scheduler.cc:38

0x4201760e: toit::Thread::_boot() at ./toolchains/esp32s3/build/./src/os_esp32.cc:302

0x42017635: toit::thread_start(void*) at ./toolchains/esp32s3/build/./src/os_esp32.cc:289
 (inlined by) esp_thread_start at ./toolchains/esp32s3/build/./src/os_esp32.cc:294
```
FYI: I am using this script to run an extended test.
```bash
#!/bin/bash
# Repeatedly run the test program against the device; the VM crash
# shows up after some number of iterations.
while :
do
  /Users/mikkel/proj/jaguar/build/jag run service_perf_test.toit -d 192.168.1.129
  sleep 2
done
```
Also reproduced on 2a299286d3c1cda55afeb5a9a9fe19e1279a39f9, after quite a while.
The log from a roughly 10-minute run (on 2a299286d3c1cda55afeb5a9a9fe19e1279a39f9):
I don't think a recent commit is the culprit. I went back to a commit from a month ago and could still reproduce it.
k
Good to know!
m
At least we have a test case that can reproduce it within a few minutes. It is fairly infrequent, but it does happen. Anything else I can do to help trace this one down?
So, more testing suggests that the frequency of crashes has increased with the latest build.
k
It's faster 🥴
m
My thoughts: it happens under high load; it happens when TCP/IP/process unloading is initiated; it seems to happen on both WiFi and Ethernet boards; and it looks like Toit heap corruption (class tag, stack pointer).
Whenever I print it out, the class tag is 12...
And by the way, this also seems to happen on our product from time to time, not only in this contrived test case. (So it might not have much to do with process unloading, and more to do with high load?)
FREERTOS_UNICORE=y makes it less frequent but does not eliminate the crash.
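One way to picture the corruption pattern reported above (a hypothetical layout sketch in C++, not Toit's actual object model: the arena, indices, tag values, and the assumption of a downward-growing execution stack are all invented for illustration):

```cpp
#include <cstdint>
#include <cstdio>

// One contiguous arena standing in for part of the heap:
// [ ... previous object | stack object header | stack slots ... ]
// Assuming the execution stack grows toward lower indices, an overflow
// runs through the stack object's own header (the class tag) and into
// the tail of the previous object.
int main() {
  uint32_t heap[12] = {};
  const int PREV_TAIL = 3;     // last word of the previous object
  const int STACK_HEADER = 4;  // header word holding the class tag
  const int STACK_TOP = 11;    // highest slot of the stack object

  heap[PREV_TAIL] = 0xaaaa;
  heap[STACK_HEADER] = 12;     // class tag 12, as observed in the prints

  // Push frames downward; two pushes too many for the 7 available slots.
  int sp = STACK_TOP;
  for (int i = 0; i < 9; i++) heap[sp--] = 0xdeadbeef;

  // The header and the previous object are now clobbered, so the next
  // time the VM inspects this object it reads a garbage class tag.
  printf("class tag: 0x%x, prev tail: 0x%x\n",
         heap[STACK_HEADER], heap[PREV_TAIL]);
}
```

Once the VM dereferences through such a garbage tag, it faults somewhere unrelated to the actual bug, which would fit the LoadProhibited crash in the dump above.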
k
I think I can reproduce this now. Looking into it.
Found one small issue. Fixed on master.
Still needs more work.
m
Happy hunting
k
I can confirm that this is reproducible without process unloading and without flash writing. So far, it looks like network load is the only thing required to make this happen.
@mikkel.damsgaard I may have found the cause of the crashes (finally). If we preempt a process right before the current task in that process needs to grow the Toit execution stack, we forget to grow the stack when we resume the task. The task then overflows the too-small stack, corrupting memory: it destroys the stack object's header and maybe a bit of the previous object.
I expect to land a fix tomorrow.
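To make the suspected race concrete, here is a minimal C++ sketch. None of it is actual Toit VM source: the `Task`/`Stack` structs and every function name are invented purely to show how a dropped stack-growth request turns into the overflow described above.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the VM's task and execution-stack state.
struct Stack {
  size_t capacity = 8;  // slots the stack object can currently hold
  size_t used = 0;
};

struct Task {
  Stack stack;
  bool needs_stack_growth = false;  // set when the next frame won't fit
};

// The interpreter notices the next call frame will not fit and bails
// out, expecting the runtime to grow the stack before it resumes.
void request_stack_growth(Task& task) {
  task.needs_stack_growth = true;
}

void grow_stack(Task& task) {
  task.stack.capacity *= 2;  // stand-in for reallocating the stack object
}

// Buggy path: a preemption (cross-process GC, time sharing) lands right
// after the growth request. On resume the request is forgotten, the
// interpreter re-enters with the too-small stack, and the next frame
// push writes out of bounds -- clobbering the stack object's header.
void resume_after_preemption_buggy(Task& task) {
  task.needs_stack_growth = false;  // BUG: pending request dropped
  // ... re-enter interpreter with unchanged capacity ...
}

// Fixed path: honor any pending growth request before re-entering.
void resume_after_preemption_fixed(Task& task) {
  if (task.needs_stack_growth) {
    grow_stack(task);
    task.needs_stack_growth = false;
  }
  // ... re-enter interpreter ...
}
```

This would also explain the earlier observations: anything that raises the preemption rate (heavy network load, cross-process GCs, running on both cores) widens the window, which is why FREERTOS_UNICORE=y made the crash rarer but not impossible.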
m
Yay!
k
It gets much more likely under high load with frequent preemptions for cross-process GCs and time sharing.
m
That finding definitely fits the observations.
k
Fix out for review.
Good progress on eliminating the crashes (the first significant fix landed today). Still looking into another case where it looks like we may be slightly underestimating the stack space used by a method.
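For that second case, a plausible illustration (again hypothetical: Toit's real analysis is not shown here, and the opcodes and their stack effects are invented): the maximum stack height of a method is typically computed statically from per-opcode stack effects, so a single under-reported effect reserves too few slots.

```cpp
#include <cstdint>
#include <vector>
#include <algorithm>

// Invented opcodes with invented stack effects, for illustration only.
enum Opcode : uint8_t { PUSH, POP, ADD, CALL };

struct Effect {
  int pops;
  int pushes;
};

// Per-opcode effect table. If any entry here under-reports how many
// slots an opcode really touches at runtime (e.g. CALL also needing
// scratch space for the callee's frame setup), the reservation below
// comes out slightly too small.
Effect effect_of(Opcode op) {
  switch (op) {
    case PUSH: return {0, 1};
    case POP:  return {1, 0};
    case ADD:  return {2, 1};
    case CALL: return {1, 1};  // under-reporting here is the failure mode
  }
  return {0, 0};
}

// Walk the method's bytecodes and track the peak stack height.
int max_stack_height(const std::vector<Opcode>& code) {
  int height = 0;
  int peak = 0;
  for (Opcode op : code) {
    Effect e = effect_of(op);
    height += e.pushes - e.pops;
    peak = std::max(peak, height);
  }
  return peak;  // slots to reserve for this method
}
```

An estimate even one slot too low lets the interpreter write just past the reserved area, the same kind of out-of-bounds write as in the preemption case.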