Unable to OTA after upgrading to version 35
# help
t
Hi After upgrading to version 35, we are not able to do OTA. Our device retrieves the image via http, and the image is written in the same way as install_firmware in jaguar.toit. But after a few writes, it looks like the write method gets stuck, and the app halts without any indication. The only way to get it out of this is a hard reset. The device is already running with 35, so there shouldn't be an issue with the partitions.
I have added a 2 ms sleep after each write, and now the ota works sometimes. Sometimes after committing and resetting, I get a panic:
Firmware updated; doing chip reset ets Jun 8 2016 00:22:57 rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT) configsip: 0, SPIWP:0xee clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00 mode:DIO, clock div:2 load:0x3fff0030,len:184 load:0x40078000,len:12700 ho 0 tail 12 room 4 load:0x40080400,len:2916 entry 0x400805c4 E (633) psram: PSRAM ID read error: 0xffffffff E (633) spiram: SPI RAM enabled but initialization failed. Bailing out. [toit] INFO: starting [wifi] DEBUG: connecting Guru Meditation Error: Core 1 panic'ed (IllegalInstruction). Exception was unhandled. Memory dump at 0x4018d45c: f03df03d 00000000 ffffbf00 Core 1 register dump: PC : 0x4018d460 PS : 0x00060c30 A0 : 0x8018a7cd A1 : 0x3ffd8840 A2 : 0x00000001 A3 : 0x3ffc8d80 A4 : 0x00100000 A5 : 0x09ee0087 A6 : 0x008d2018 A7 : 0x00000003 A8 : 0x3ffc8df2 A9 : 0x00000006 A10 : 0x00000011 A11 : 0x0000003b A12 : 0x3ffc8d91 A13 : 0x3ffc07fc A14 : 0x0000003c A15 : 0x00000000 SAR : 0x00000007 EXCCAUSE: 0x00000000 EXCVADDR: 0x00000000 LBEG : 0x4018d449 LEND : 0x4018d45c LCOUNT : 0x00000000
****************************************************************************** Decoding by
jag
, device has version ****************************************************************************** WARNING: extracting a specific part is deprecated Crash in native code: Backtrace:0x4018d43d:0x3ffd88400x4018a7ca:0x3ffd8860 0x4018b23a:0x3ffd8880 0x40145659:0x3ffd88b0 0x401456b9:0x3ffd88d0 0x4016e60a:0x3ffd88f0 0x4016ed75:0x3ffd8910 0x4016d6b4:0x3ffd8930 0x401adb81:0x3ffd8950 0x4018d43d: tx_pwctrl_init + 0x31 0x4018b23a: register_chipv7_phy + 0x11e 0x40145659: esp_phy_load_cal_and_init + 0x3d 0x401456b9: esp_phy_enable + 0x51 0x4016e60a: wifi_hw_start + 0x76 0x4016ed75: wifi_start_process + 0x49 0x4016d6b4: ieee80211_ioctl_process + 0x5c 0x401adb81: ppTask + 0xb1 ****************************************************************************** ELF file SHA256: 799463178f1588c9
k
So the writes you are referring to are
FirmwareWriter.write
, right?
How much other stuff is running while you do the OTA?
t
Nevermind 😅 . It was a power issue. It works just as before 👍
I think it's safe to delete this topic.
So it seems it wasn't a power issue after all. When the mqtt traffic is low, it works fine, but with just a few messages each second, the writer stops everything after a while. Even if I spawn the firmware writing.
k
So lots of WiFi traffic kills the firmware updating?
t
It seems so.
Let me know, if I should try someting specific. I've tried different things but nothing seems to make it better. My biggest concern is that the app stops completely - I haven't found a way to avoid that.
k
Is it an option to stop the MQTT client when you're doing firmware updates or is that super unattractive?
t
It could be possible.
If I stop all other tasks, the ota succeeds most of the times. Sometimes it results in a panic, but other times it just stops executing, which is quite critical, since a hard reset is required. Any idea what happens when it stops completely?
k
I'm looking into it! There's an odd case in the firmware writer that we might need to talk about, but I doubt this fully explains what you're seeing.
t
👍 🙂
k
Found a pretty bad bug in our RPC system that we introduced lately (it happens when we're temporarily low on memory).
Introduced in alpha.34, I think.
t
Interesting. That could explain some of it...
I tried it out, but unfortunately I still have the same behaviour.
k
I'll keep looking. Did you just patch in the change locally and forced a rebuild? The PR just landed.
t
Yes, I just copied it in 🙂
I am trying to debug the problem a bit by placing ets_printf in different locations, and so far I have come to the conclusion that it gets stuck in PRIMITIVE(ota_write) in the call to esp_partition_erase_range - at least it doesn't get to the ets_printf I placed just after that.
k
That is super useful. Thanks 🙏
m
@tplux When you are doing erase_range, any code that is not in iram can not run. We modified the iram setup in 35. I wonder if that is the cause.
Although the ESP32 should disable any code that tries to run during a flash exclusive write, I have seen this fail.
(Mostly when trying to access memory in SPIRAM, this is not relevant here though)
t
Ok. That's worth looking into.
Which files should I look into to see what changed in the iram setup?
m
sdkconfig basicly. There are some IRAM config parameters.
t
I'm still strugling with this. Regarding the IRAM config, I'm wary of messing with it, since I guess there's a good reason for the current setup. I did try to set CONFIG_SPIRAM_CACHE_LIBNUMPARSER_IN_IRAM=y, but I was not able to link because there was not enough space in the iram section. Instead I'm stopping all tasks not handling the ota to minimize any unwanted activity - this seemed to work for a while. I tested by updating between 3 bin files: test1.bin updates to test2.bin test2.bin updates to test3.bin test3.bin updates to test1.bin and so on. I don't know if it's a coincidence, but it started failing again when I accidentally messed up the order, so test3.bin updated to test2.bin, and then test2.bin was not able to update to test1.bin. BTW, the "failing" is still everything stopping leaving only a hard reset to save the day - that's not viable.
k
We're here to help. Do you have time for a chat tomorrow?
t
Absolutely 🙂
k
We just had a brief discussions about this and @floitsch remembered that we ended up moving FreeRTOS out of IRAM as part of the upgrade to alpha.35. This way due to PSRAM support, so if you don't need that maybe we can find a way of going back to having FreeRTOS in IRAM by disabling the PSRAM compiler workarounds that make generated code (and thus IRAM requirements) grow.
f
Fwiw, the documentation states that it should be safe to not have them in iram:
Copy code
Provided these functions are not (incorrectly) used from ISRs, this option is safe to enable in all configurations.
I will still try to get a version with the Freertos in IRAM to work again.
t
Thanks 🙂
m
@tplux Are you trying to access the sdcard while flashing? Assuming you use the sdcard code.
f
@tplux: I wasn't able to put freertos back into iram. Wasn't just the flag for the external RAM... However. If you don't use Bluetooth, you could disable that and put freertos into iram instead.
Copy code
diff --git a/toolchains/esp32/sdkconfig.defaults b/toolchains/esp32/sdkconfig.defaults
index 3947a427..7f18132a 100644
--- a/toolchains/esp32/sdkconfig.defaults
+++ b/toolchains/esp32/sdkconfig.defaults
@@ -6,7 +6,7 @@ CONFIG_ESPTOOLPY_FLASHSIZE_4MB=y
 CONFIG_PARTITION_TABLE_CUSTOM=y
 CONFIG_COMPILER_OPTIMIZATION_SIZE=y
 CONFIG_COMPILER_OPTIMIZATION_ASSERTIONS_DISABLE=y
-CONFIG_BT_ENABLED=y
+#CONFIG_BT_ENABLED=y
 CONFIG_BTDM_CTRL_BLE_MAX_CONN=2
 CONFIG_BTDM_CTRL_PINNED_TO_CORE_1=y
 # CONFIG_BTDM_CTRL_MODEM_SLEEP is not set
@@ -57,7 +57,7 @@ CONFIG_FREERTOS_THREAD_LOCAL_STORAGE_POINTERS=3
 CONFIG_FREERTOS_IDLE_TASK_STACKSIZE=1024
 CONFIG_FREERTOS_ISR_STACKSIZE=2096
 CONFIG_FREERTOS_USE_TRACE_FACILITY=y
-CONFIG_FREERTOS_PLACE_FUNCTIONS_INTO_FLASH=y
+#CONFIG_FREERTOS_PLACE_FUNCTIONS_INTO_FLASH=y
 CONFIG_LOG_DEFAULT_LEVEL_ERROR=y
 # CONFIG_LOG_COLORS is not set
 CONFIG_LWIP_LOCAL_HOSTNAME="toit"
Then delete the
toolchains/esp32/sdkconfig
file and build the image again. You can also just run
idf.py -C toolchains/esp32 -B build/esp32 size-components
(assuming you have sourced the export.sh:
source third_party/esp-idf/export.sh
).
t
I am using the sdcard code, but to rule that out, I tried without it, and unfortunately that didn't help.
I will try this out. For now at least we don't need BT.
It seems disabling BT and putting freertos back into iram did the trick. I will keep trying to see if it fails again.
k
Very interesting. @dongheng might know why we need to keep FreeRTOS in IRAM.
m
Wow. A bit read eared here. I just checked our sdkconfig on the ESP32 with working flash, and we have indeed done now what is descibed here:
Copy code
# CONFIG_BT_ENABLED is not set
# CONFIG_FREERTOS_PLACE_FUNCTIONS_INTO_FLASH is not set
# CONFIG_FREERTOS_PLACE_SNAPSHOT_FUNS_INTO_FLASH is not set
I wonder if there is a bigger problem with OTA in toit on ESP32 now....
d
Hi You can see the information I shared in
general
channel: Flash writing/reading/erasing has much usage limitation, as I known mainly about: 1. Task stack mustn't be PSRAM, becasue the cache for PSRAM will be disable 2. Interrupt(flags=ESP_INTR_FLAG_IRAM) process functions(include calling function one by one layer) should be put into IRAM, because cache for flash will be disable 3. All CPUs will switch to special tasks(IPC), and other tasks can't run and it looks like being blocked
Some FreeRTOS functions are called in interrupt service, especially semaphore, queue and some task related.
f
@erikcorry and @kasperl might correct me, but for Toit we should have: 1. the stack is not in PSRAM. We barely started using PSRAM, as we had difficulties with the ESP32 errata (and our compile flags weren't working around the issue). 2. I don't think we wrote our own interrupt handlers. They should all be from the esp-idf, and should therefore in in iram. 3. Not sure about that one. There shouldn't be any dependency on other tasks. Also wouldn't explain why things start to work when we put freertos back into iram. @dongheng: do you know if there is an easy way to see where the issue comes from? Which function is trying to access something that it shouldn't do?
m
I think I placed panic handlers in IRAM at some point, that helped.
CONFIG_ESP_PANIC_HANDLER_IRAM+
f
@tplux if you have a few minutes at some point, it would be interesting to see what you get with the CONFIG_ESP_PANIC_HANDLER_IRAM config.
m
And ofc moving the FreeRtos back to flash and enable BT 🙂
Also, another hack I used when fiddling with this, is to just kick the toitvm out of IRAM to have plenty to work with. It is the interpreter.h line 27
t
Sure, I'll let you know how it goes.
With freertos back in flash, BT enabled and CONFIG_ESP_PANIC_HANDLER_IRAM=y the iram section overflows.
f
Try without bluetooth.
The current hypothesis is that freertos needs to be in iram to make the flashing work. According to the docs that shouldn't be the case. If we put freertos back into flash, and have the panic handler enabled, we might get an error message telling us which function tries to access a freertos function.
t
Got it. On it.
d
Sometimes the panic can show the backtrace, so we can know where casue this issue, but to see more details we should de-assembly this ELF file and check the space which is the address value of backtrace.
t
****************************************************************************** Decoding by
jag
, device has version ****************************************************************************** WARNING: extracting a specific part is deprecated Crash in native code: Backtrace:0x401910c1:0x3ffb12700x40086b03:0x3ffb1290 0x4008be95:0x3ffb1300 0x4008849a:0x3ffaf430 0x4008c3bf:0x3ffaf450 0x401910c1: prvAcquireItemNoSplit + 0x51 0x4008be95: _xt_lowint1 + 0x89 0x4008849a: spi_flash_op_block_func + 0x22 0x4008c3bf: ipc_task + 0x6f ******************************************************************************
d
Could you send the ELF and bin to me ?
t
The ELF file is too big for discord... 😬
How do I check it?
d
I see the backtrace, but it looks like not all corrent. Maybe
spi_flash_op_block_func
triggers issue, but also looks like a interrupt triggers by
_xt_lowint1
, so I can't check directly.
Could you push your branch to github ? Then I can pull and run it
t
If you can settle for the elf file, I think my colleague has nitro, so she can send bigger files...
f
@tplux it looks like Jaguar is decoding the native stack-trace the wrong way. Try to set the
JAG_TOIT_REPO_PATH
environment variable to your Toit repository. If you already did that, then use the
esp-idf monitor
instead. Let me find the command line for the
esp-idf monitor
line.
Make sure that you have sourced
third_party/esp-idf/export.sh
. Then run
idf.py -C toolchains/esp32 -B build/esp32 monitor
That should, hopefully, print the correct trace.
The reason for the mix-up, is that Jaguar assumes that the firmware that is running, is the one that Jaguar knows about. I ran into the same thing when I was looking at the ESP32-S2. We just discussed this, and will try to make this more stable.
t
Guru Meditation Error: Core 0 panic'ed (Cache disabled but cached memory region accessed). Core 0 register dump: PC : 0x401910c4 PS : 0x00060034 A0 : 0x80086b06 A1 : 0x3ffb1270 0x401910c4: prvCheckItemFitsByteBuffer at /home/tommy/remoni_repos/toit/third_party/esp-idf/components/esp_ringbuf/ringbuf.c:325 A2 : 0x3ffe0ed0 A3 : 0x3ffe0b24 A4 : 0x0000001c A5 : 0x3ffb12b8 A6 : 0x401910c4 A7 : 0x3ffe0fc4 A8 : 0x8009261d A9 : 0x3ffb1230 0x401910c4: prvCheckItemFitsByteBuffer at /home/tommy/remoni_repos/toit/third_party/esp-idf/components/esp_ringbuf/ringbuf.c:325 A10 : 0x3ffe0ed0 A11 : 0x0000001c A12 : 0x3ffb5080 A13 : 0x0000cdcd A14 : 0x00060023 A15 : 0x00060021 SAR : 0x00000000 EXCCAUSE: 0x00000007 EXCVADDR: 0x00000000 LBEG : 0x4008fc94 LEND : 0x4008fcb0 LCOUNT : 0xffffffff 0x4008fc94: memcpy at /builds/idf/crosstool-NG/.build/xtensa-esp32-elf/src/newlib/newlib/libc/machine/xtensa/memcpy.S:175 0x4008fcb0: memcpy at /builds/idf/crosstool-NG/.build/xtensa-esp32-elf/src/newlib/newlib/libc/machine/xtensa/memcpy.S:197 Backtrace:0x401910c1:0x3ffb12700x40086b03:0x3ffb1290 0x4008be95:0x3ffb1300 0x40008544:0x3ffc64a0 0x400882c1:0x3ffc64c0 0x4009c7bd:0x3ffc64e0 0x4009c5c1:0x3ffc6510 0x40087d87:0x3ffc6530 0x4010176b:0x3ffc6580 0x400ecfa3:0x3ffc65a0 0x40083c1e:0x3ffc6630 0x400f1f20:0x3ffc66c0 0x400f21c6:0x3ffc6710 0x400f21f5:0x3ffc6740 0x400e4e16:0x3ffc6760 0x400e4e3d:0x3ffc6780
0x401910c1: prvAcquireItemNoSplit at ringbuf.c:? 0x40086b03: uart_rx_intr_handler_default at /home/tommy/remoni_repos/toit/third_party/esp-idf/components/driver/uart.c:978 0x4008be95: _xt_lowint1 at /home/tommy/remoni_repos/toit/third_party/esp-idf/components/freertos/port/xtensa/xtensa_vectors.S:1111 0x400882c1: delay_us at /home/tommy/remoni_repos/toit/third_party/esp-idf/components/spi_flash/spi_flash_os_func_app.c:148 0x4009c7bd: spi_flash_chip_generic_wait_idle at /home/tommy/remoni_repos/toit/third_party/esp-idf/components/spi_flash/spi_flash_chip_generic.c:443 0x4009c5c1: spi_flash_chip_generic_erase_sector at /home/tommy/remoni_repos/toit/third_party/esp-idf/components/spi_flash/spi_flash_chip_generic.c:162 0x40087d87: esp_flash_erase_region at /home/tommy/remoni_repos/toit/third_party/esp-idf/components/spi_flash/esp_flash_api.c:616 0x4010176b: esp_partition_erase_range at /home/tommy/remoni_repos/toit/third_party/esp-idf/components/spi_flash/partition.c:545 0x400ecfa3: toit::primitive_ota_write(toit::Process*, toit::Object**) at ./build/esp32/./src/primitive_esp32.cc:162 (inlined by) primitive_ota_write at ./build/esp32/./src/primitive_esp32.cc:116 0x40083c1e: toit::Interpreter::run() at ./build/esp32/./src/interpreter_run.cc:966 0x400f1f20: toit::Scheduler::run_process(toit::Locker&, toit::Process*, toit::SchedulerThread*) at ./build/esp32/./src/scheduler.cc:627 0x400f21c6: toit::Scheduler::run(toit::SchedulerThread*) at ./build/esp32/./src/scheduler.cc:404 0x400f21f5: toit::SchedulerThread::entry() at ./build/esp32/./src/scheduler.cc:38 0x400e4e16: toit::Thread::_boot() at ./build/esp32/./src/os_esp32.cc:308 0x400e4e3d: toit::thread_start(void*) at ./build/esp32/./src/os_esp32.cc:295 (inlined by) esp_thread_start at ./build/esp32/./src/os_esp32.cc:300
d
Fisrtly we can see that:
Cache disabled but cached memory region accessed
, it means cache for external memory(flash or psram) is disable, but software try to access it,. Then
0x401910c4
show where it triggers issues, if there its ASM code, it will be better...
The process I check is: 1. erase flash -> close cache and wait for it is done 2. a UART IRQ triggers and send data to ring_buffer 3. crash in ring_buffer's code which access flash or psram
why the address is
0x401910c4
, I think this out of range of IRAM...
m
In this case there is no PSRAM attached.
ringbuf.c is probably not in iram?
d
IRAM space is from 0x40080000 to 0x400a0000.
m
Ah !
The esp_ringbuf is placed in IRAM if the FREERTOS is in IRAM.
esp-idf/components/esp_ringbug/linker.lf line 5
d
Flash space is from 0x400d0020, so 0x401910c4 may be the flash space...
m
That is the reason.
f
I think Mikkel is right. (And Kasper came to the same conclusion): While the chache is disabled, the UART rx interrupt handler is called which triggers a call to the
esp_ringbuf
. Two options: - disable uart interrupt - move esp_ringbuf into iram
t
So if I close down the uart before starting ota, the uart interrupt should be disabled?
f
@tplux do you have something connected to a serial port that is feeding data to the ESP32?
t
Yes
m
@tplux Worth a shot
t
I'll try
f
That would be an option, but fundamentally, this looks like a bug in the esp-idf.
And for now, you can just disable bluetooth and have freertos in iram.
While you don't use bluetooth, having freertos in iram is just better.
It would still be interesting to see if that solves the issue. Confirming our hypothesis.
t
Confirmed 🙂
d
@floitsch @tplux Actually this maybe not a bug of the esp-idf, it is a usage limitation of esp-idf's UART driver, and you can disable option UART_ISR_IN_IRAM( https://github.com/toitware/esp-idf/blob/wip-allocate-toit-backwards/components/driver/Kconfig#L150) , then UART interrupt will not work when erase/read/write flash.
k
@dongheng The "bug" in esp-idf is that it feels like the ringbuffer code should be forced into IRAM if UART_ISR_IN_IRAM is set -- even if all of FreeRTOS isn't put in IRAM. Alternatively, the UART interrupt should be disabled when reading/writing/erasing flash if the ringbuffer code isn't in IRAM.
d
Yes, it is. The issue is about option dependence in Kconfig.
m
I believe it is unfortunate that the documentation for the
CONFIG_FREERTOS_PLACE_FUNCTIONS_INTO_FLASH
states
Copy code
Provided these functions are not (incorrectly) used from ISRs, this option is safe to enable in all configurations.
and then the esp-idt UART interrupt driver is dependent on that flag..
For now, I believe that toit should probably kick the UART driver out of IRAM to avoid this particular nasty little side effect.
Side note: In our application, we get the OTA data on the UART, so for us there is no way around making space in IRAM for the UART and FreeRTOS, otherwise the UART will drop data packages and the OTA will fail.
t
BTW, while testing all this I sometimes experienced that after flashing a new image to the ESP, it started an old app. I assume this happens because of ota, where the bootloader at some point wants to start from the ota1 partition, but the flash tool only flashes the ota0 partition. Does that make sense?
m
How do you flash? esptool/jag?
There is an ota data partition, if that is not cleared, then the boot loader might indeed start the "wrong" app.
t
I use esptool.
k
Which version of esptool are you using, @tplux? I seem to remember that this has been fixed in newer versions.
t
v3.3.2-dev
k
Yeah, I think I got that wrong. What was fixed was the default arguments provided by the "build environment". You might be able to flash using tools/firmware instead (if you're not on Windows).
t
Interesting. I will look into that 👍
k
(it invokes tools/esptool behind the scenes)
t
Is there a how-to somewhere? Might have missed it 🙂
k
I think we may not have written much about since it doesn't yet work on Windows.
t
No worries. I'll try to deduce from the code.
m
idf.py flash will add the ota data partition by itself, so might be worth considering. Atleast the output of idf.py build will include the ota data partition piece that is needed to make esptool work.
k
~/.cache/jaguar/sdk/tools/firmware -e ~/.cache/jaguar/assets/firmware-esp32.envelope flash -p /dev/tty.usbserial-11330
(this is just an example of using this from the stuff installed with Jaguar by default)
Copy code
esptool.py v3.3.2-dev
Serial port /dev/tty.usbserial-11330
Connecting..............
Chip is ESP32-D0WD (revision 1)
Features: WiFi, BT, Dual Core, 240MHz, VRef calibration in efuse, Coding Scheme None
Crystal is 40MHz
MAC: 98💿ac:63:76:cc
Uploading stub...
Running stub...
Stub running...
Changing baud rate to 921600
Changed.
Configuring flash size...
Auto-detected Flash size: 4MB
Flash will be erased from 0x00001000 to 0x00004fff...
Flash will be erased from 0x00008000 to 0x00008fff...
Flash will be erased from 0x0000d000 to 0x0000efff...
Flash will be erased from 0x00010000 to 0x00144fff...
Compressed 15888 bytes to 11151...
Wrote 15888 bytes (11151 compressed) at 0x00001000 in 0.4 seconds (effective 319.6 kbit/s)...
Hash of data verified.
Compressed 3072 bytes to 165...
Wrote 3072 bytes (165 compressed) at 0x00008000 in 0.1 seconds (effective 328.1 kbit/s)...
Hash of data verified.
Compressed 8192 bytes to 31...
Wrote 8192 bytes (31 compressed) at 0x0000d000 in 0.1 seconds (effective 511.9 kbit/s)...
Hash of data verified.
Compressed 1263200 bytes to 812457...
Wrote 1263200 bytes (812457 compressed) at 0x00010000 in 13.9 seconds (effective 725.9 kbit/s)...
Hash of data verified.
The output shows that this is also writing the
otadata
partition.
m
Because, of course every one knows that the otadata partition starts at 0x00d000 🙂
t
Seems straight forward.
I guess adding "0x00d000 toit/build/esp32/ota_data_initial.bin" to the call to esptool should also do the trick?
k
Yup. That's the theory 😃