MQTT Client connection issue if broker is unavaila...
# help
i
If the MQTT client connects to an IP address that has no device behind it, the client crashes with "connection refused" and tries again after a few seconds. If the MQTT client connects to an IP address where the device is available (pingable) but the broker itself is unavailable, the code crashes with the same "connection refused" exception but does not seem to pause between retries. This might lead to unwanted behavior of the containers. In our case a separate BLE container kept running, but BLE stopped advertising for some reason (I assume a memory overflow or something similar). Any idea why that is?
k
Summoning @floitsch ...
f
Hmm. It should always pause.
Could you have a look with logging enabled?
Would also be good to know whether it really was an OOM.
i
it never showed an OOM
I'm trying to reproduce it right now
so even though it does not pause right now, it also does not crash anything else, it just retries in a loop
but it crashes here:
client = mqtt.Client 
    --host=host
    --routes=routes
when I am creating the client. So somewhere in the code there must be a difference depending on whether the server device is available or not.
The "fast looping", let's call it that, only happens if the server is reachable but the broker is deactivated (I killed the mosquitto broker service).
If the server is unreachable, so that even a ping would not reach it, the loop seems reasonable.
Okay, interesting: now BLE stopped advertising! No crash or anything. The MQTT container crashed repeatedly with
******************************************************************************
Decoding by `jag`, device has version <2.0.0-alpha.120>
******************************************************************************
EXCEPTION error. 
Connection refused
  0: TcpSocket.connect         <sdk>\net\modules\tcp.toit:151:40
  1: TcpSocket.connect         <sdk>\net\modules\tcp.toit:141:12
  2: Client.tcp-connect        <sdk>\net\net.toit:110:12
  3: Client.tcp-connect        <sdk>\net\net.toit:101:12
  4: ReconnectingTransport_.new-connection_ <pkg:mqtt>\tcp.toit:132:21
  5: ReconnectingTransport_.reconnect.<block> <pkg:mqtt>\tcp.toit:120:22
  6: Mutex.do.<monitor-block>  <sdk>\monitor.toit:28:27
  7: __Monitor__.locked_.<block> <sdk>\core\monitor_impl_.toit:123:12
  8: __Monitor__.locked_       <sdk>\core\monitor_impl_.toit:95:3
  9: Mutex.do                  <sdk>\monitor.toit:28:3
 10: ReconnectingTransport_.reconnect <pkg:mqtt>\tcp.toit:112:25
 11: ReconnectingTransport_    <pkg:mqtt>\tcp.toit:94:5
 12: TcpTransport              <pkg:mqtt>\tcp.toit:33:12
 13: Client                    <pkg:mqtt>\client.toit:54:18
 14: main                      C:\Users\Mirko\AppData\Local\Temp\artemis-464b3efa-86cf-43f1-a66e-8b58389ac9e3\clone\src\services\mqtt\mqtt.toit:71:12
******************************************************************************
many, many times, and then suddenly the crashes stopped being printed. But I can see that the BLE watchdog is still running, which tells me that the BLE container is still running, so the task that keeps the advertisement going must have died or got stuck or something.
f
So from what I can see: when mqtt.Client --... can't connect, it immediately throws an exception like the one you show.
However, it is not, by itself, trying to reconnect.
If you see a loop, then that's because you (or Artemis) are trying to connect again.
I'm guessing you don't have a catch around that part of the code, so the program crashes. However, you marked the container as critical (or something similar), so the program is immediately started again and tries to start MQTT again.
There are now three ways to avoid this:
- Change the container's description so that it is not critical and does not use an interval of 0s. With an interval of 1s it would wait 1s before starting the program again.
- Catch the exception of mqtt.Client and retry after an appropriate timeout.
- We change the mqtt library to go through the "normal" reconnection strategy even for the first connection. I think I changed it away from that because users didn't get a nice "wrong credentials, ..." when they tried to start the client. Instead the program seemed to hang.
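The second option (catching the exception and retrying after a timeout) could look roughly like the sketch below. It assumes the constructor call from the stack trace above, mqtt.Client --host=... --routes=...; the helper name connect-with-retry, the 5 s delay, the host "broker.example" and the empty routes map are hypothetical and only there for illustration.

```toit
import mqtt

// Hypothetical delay between connection attempts.
RETRY-DELAY-MS ::= 5000

// Keeps trying to create the client until the broker accepts the connection.
// connect-with-retry is an illustrative helper, not part of the mqtt package.
connect-with-retry --host/string --routes/Map -> mqtt.Client:
  client/mqtt.Client? := null
  while client == null:
    // catch returns the thrown value, or null if the block completed normally.
    exception := catch:
      client = mqtt.Client --host=host --routes=routes
    if exception:
      print "MQTT connect failed: $exception, retrying in $(RETRY-DELAY-MS) ms"
      // Pause so the container does not crash and restart in a tight loop.
      sleep --ms=RETRY-DELAY-MS
  return client

main:
  // Placeholder host and empty routes; replace with the real values.
  client := connect-with-retry --host="broker.example" --routes={:}
  print "MQTT client connected"
```

With this in place the container no longer exits on "Connection refused", so the pause between attempts is controlled by the program rather than by how Artemis restarts the container.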
i
true - the loop comes from Artemis restarting the container
ok, I will try to remove the critical part and set the interval to 1s
I set the interval to 1s and removed critical=true, uploaded, flashed, but after the crash it is not restarting the container. It stopped, but it does not start again.
"mqtt": {
    ... github bla bla
    "background": true,
    "interval": "1s"
}
k
Is that the right syntax?
i
the interval part was there already but with 0s
k
"containers": {
  "measure": {
    "entrypoint": "measure.toit",
    "triggers": [ { "interval": "20s" } ]
  }
}
I think that's the usual example.
i
okay, did you change that?
k
Maybe I am wrong and there is a shortcut.
i
not sure where I got my version from
f
Could be that we changed it.
i
the configs are quite old though, or at least the content. It works now.
I think that's an acceptable solution
made it 10s now
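For reference, the working entry presumably looks something like the fragment below. This is a sketch assembled from the two snippets above, not the poster's actual file: the elided part is kept as-is, and whether "background" is still needed next to "triggers" is an assumption.

```json
"mqtt": {
    ... github bla bla
    "background": true,
    "triggers": [ { "interval": "10s" } ]
}
```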
f
I'm planning on creating JSON schemas for the specification files. That should make it easier to manipulate them.
k
@floitsch Should we complain about unrecognized entries?
f
I guess we should at least warn.