MQTT Client connection issue if broker is unavailable Toit #help

Informatic0re

11/15/2023, 3:35 PM

If the MQTT client connects to an IP address with no device behind it, the client crashes with "connection refused" and tries again after a few seconds. If it connects to an IP address where the device is reachable (pingable) but the broker itself is unavailable, the code crashes with the same "connection refused" exception but does not seem to pause between retries. This might lead to unwanted behavior in the other containers. In our case a separate BLE container kept running, but BLE stopped advertising for some reason (I assume a memory overflow or something similar). Any idea why that is?

kasperl

11/15/2023, 3:40 PM

Summoning @floitsch ...

floitsch

11/15/2023, 3:41 PM

Hmm. It should always pause.

floitsch

11/15/2023, 3:41 PM

Could you have a look with logging enabled?

floitsch

11/15/2023, 3:42 PM

Would also be good to know whether it really was an OOM.

Informatic0re

11/15/2023, 3:45 PM

it never showed an OOM

Informatic0re

11/15/2023, 3:45 PM

I'm trying to reproduce it right now

Informatic0re

11/15/2023, 3:46 PM

So even though it does not pause right now, it also does not crash anything else; it just retries in a loop.

Informatic0re

11/15/2023, 3:47 PM

but it crashes here:

client = mqtt.Client 
    --host=host
    --routes=routes

when I create the client. So somewhere in the code there must be a difference depending on whether the server device is reachable or not.

Informatic0re

11/15/2023, 3:48 PM

Let's call it "fast looping": it only happens if the server is reachable but the broker is deactivated (I killed the mosquitto broker service).

Informatic0re

11/15/2023, 3:49 PM

If the server is unreachable, so that even a ping would not reach it, the loop seems reasonable.

Informatic0re

11/15/2023, 3:52 PM

Okay, interesting: now BLE stopped advertising! No crash or anything on the BLE side. The MQTT container crashed repeatedly with

******************************************************************************
Decoding by `jag`, device has version <2.0.0-alpha.120>
******************************************************************************
EXCEPTION error. 
Connection refused
  0: TcpSocket.connect         <sdk>\net\modules\tcp.toit:151:40
  1: TcpSocket.connect         <sdk>\net\modules\tcp.toit:141:12
  2: Client.tcp-connect        <sdk>\net\net.toit:110:12
  3: Client.tcp-connect        <sdk>\net\net.toit:101:12
  4: ReconnectingTransport_.new-connection_ <pkg:mqtt>\tcp.toit:132:21
  5: ReconnectingTransport_.reconnect.<block> <pkg:mqtt>\tcp.toit:120:22
  6: Mutex.do.<monitor-block>  <sdk>\monitor.toit:28:27
  7: __Monitor__.locked_.<block> <sdk>\core\monitor_impl_.toit:123:12
  8: __Monitor__.locked_       <sdk>\core\monitor_impl_.toit:95:3
  9: Mutex.do                  <sdk>\monitor.toit:28:3
 10: ReconnectingTransport_.reconnect <pkg:mqtt>\tcp.toit:112:25
 11: ReconnectingTransport_    <pkg:mqtt>\tcp.toit:94:5
 12: TcpTransport              <pkg:mqtt>\tcp.toit:33:12
 13: Client                    <pkg:mqtt>\client.toit:54:18
 14: main                      C:\Users\Mirko\AppData\Local\Temp\artemis-464b3efa-86cf-43f1-a66e-8b58389ac9e3\clone\src\services\mqtt\mqtt.toit:71:12
******************************************************************************

many, many times, and then the crashes suddenly stopped being printed. But I see that the BLE watchdog is still running, which tells me that the BLE container is still running, so the task that keeps the advertisement going died or got stuck or something.

floitsch

11/15/2023, 4:00 PM

So from what I can see: when mqtt.Client --... can't connect, it immediately throws an exception like the one you show.

floitsch

11/15/2023, 4:00 PM

However, it is not, by itself, trying to reconnect.

floitsch

11/15/2023, 4:01 PM

If you see a loop, then that's because you (or Artemis) are trying to connect again.

floitsch

11/15/2023, 4:01 PM

I'm guessing you don't have a catch around that part of the code, so the program crashes. However, you marked the container as critical (or something similar), so the program is started again immediately and tries to start the MQTT client again.

floitsch

11/15/2023, 4:03 PM

There are now three ways to avoid this:
- Change the container's description to not be critical and to not use an interval of 0s. If it is interval 1s, it would wait 1s before starting the program again.
- Catch the exception from mqtt.Client and retry after an appropriate timeout.
- We change the mqtt library to go through the "normal" reconnection strategy even for the first connection. I think I changed it away from that because users didn't get a nice "wrong credentials, ..." error when they tried to start the client. Instead, the program seemed to hang.
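
The second option (catching the exception from mqtt.Client and retrying after a timeout) could look roughly like the sketch below. It is only an illustration: connect-with-retry is a hypothetical helper, not part of the mqtt package, and the 1s starting delay and 30s cap are assumed values. The mqtt.Client --host --routes call is taken from the snippet earlier in this thread.

```toit
import mqtt

/**
Hypothetical helper (not part of the mqtt package): retries the initial
  connection instead of letting the exception crash the container.
*/
connect-with-retry --host/string --routes -> mqtt.Client:
  delay-ms := 1000  // Assumed starting delay.
  while true:
    exception := catch:
      // Non-local return: leaves the helper as soon as the client connects.
      return mqtt.Client --host=host --routes=routes
    // Print the reason so that errors such as wrong credentials stay visible.
    print "MQTT connect failed: $exception, retrying in $(delay-ms)ms"
    sleep --ms=delay-ms
    // Double the delay, up to an assumed cap of 30s.
    delay-ms = min (delay-ms * 2) 30000
  unreachable
```

With this in place, the call in main would become something like client = connect-with-retry --host=host --routes=routes. The container then stays alive while the broker is down, so Artemis has nothing to restart in a tight loop.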

Informatic0re

11/15/2023, 4:03 PM

true - the loop comes from artemis restarting the container

Informatic0re

11/15/2023, 4:04 PM

OK, I will try removing the critical part and setting the interval to 1s.

Informatic0re

11/15/2023, 4:08 PM

I made the interval 1s and removed critical=true, uploaded, and flashed, but after the crash it is not restarting the container. It stopped, but it does not start again.

Informatic0re

11/15/2023, 4:10 PM

"mqtt": {
    ... github bla bla
    "background": true,
    "interval": "1s"
}

kasperl

11/15/2023, 4:12 PM

Is that the right syntax?

Informatic0re

11/15/2023, 4:13 PM

The interval part was there already, but with 0s.

kasperl

11/15/2023, 4:13 PM

"containers": {
  "measure": {
    "entrypoint": "measure.toit",
    "triggers": [ { "interval": "20s" } ]
  }
}

kasperl

11/15/2023, 4:13 PM

I think that's the usual example.

Informatic0re

11/15/2023, 4:14 PM

Okay, did you change that?

kasperl

11/15/2023, 4:14 PM

Maybe I am wrong and there is a shortcut.

Informatic0re

11/15/2023, 4:14 PM

not sure where I got my version from

floitsch

11/15/2023, 4:16 PM

Could be that we changed it.

Informatic0re

11/15/2023, 4:17 PM

The configs are quite old though, or at least their content is. It works now.

Informatic0re

11/15/2023, 4:17 PM

I think that's an acceptable solution

Informatic0re

11/15/2023, 4:18 PM

made it 10s now

floitsch

11/15/2023, 4:39 PM

I'm planning on creating JSON schemas for the specification files. That should make it easier to manipulate them.

kasperl

11/16/2023, 8:06 AM

@floitsch Should we complain about unrecognized entries?

floitsch

11/16/2023, 10:25 AM

I guess we should at least warn.
