This message was deleted.
# opal
s
This message was deleted.
o
Thanks for reporting @Jake Youssefzadeh @Ro'e Katz, @Ori Shavit, @Asaf Cohen Please take a look.
BTW @Jake Youssefzadeh - I'd expect the server to close the subscription once it gets this error. Are you sure it's not the client reconnecting and then the error repeats?
j
yeah i also expect the server to close the connection. how can i verify the client is reconnecting if i force two sequential data updates both of them will have this error
o
If you set OPAL_LOG_LEVEL=DEBUG you should see the subscription id (which is the client id).
j
when these failures come the subscriber ID in the logs is not found in the opal statistics page
indicating that client is indeed not connected to the server
o
So it does seem to be reconnecting, but potentially you are experiencing a loop of such disconnects. How often does this happen?
j
its not reproducible in the sense that it happens after some trigger, but it always happens eventually and the failure log does not seem to go away until after couple hours/days i guess the connection is being expired from the server
r
It looks like a bug I’ve recently fixed - and just released a new version with the fix couple of hours ago (0.6.1) - so possibly re-pulling the image can fix that. Are you using postgres as broadcaster? (I saw you have statistics enabled which is the second condition)
j
yes, using postgres and have statistics enabled
will try out that latest image thanks!
πŸ‘ 1
FYI still seeing this issue with OPAL server version 0.7.0
would the client version also need to be updated? from the changes i thought this was a server only fix
o
Hi Jake - you should always have server and clients versions match. And I'm not sure on which side @Ro'e Katz added the fix he referenced here. He'll probably be able to answer when he wakes up in a few hours Also: @Ro'e Katz, @Asaf Cohen - maybe we should take on the naive approach here, and simply add code to forcibly remove the notifier subscription once an exception like this occurs, or occurs more than X counter.
πŸ‘ 1
j
would these issues be resolved if i switched to rabbitmq or kafka instead of postgres
r
Hi @Jake Youssefzadeh, Sorry that you’re still seeing that issue. The fix is indeed in the server, but as @Or Weis mentioned you’ll be much safer to have matching versions on both sides. Just to clarify - It is expected every once in a while for the server to get its broadcaster connection closed, which makes it also close websockets with connected clients. The fixes bug was the clients failing to reconnect after that. Are you seeing a client failing to reconnect? To be sure, can you please share logs again? (from both server and client). It’s hard to say if switching from postgres would help - we had couple of issues specifically with postgres, but those should be fixed by now.
j
oh, im just seeing in the server logs the same failed to notify unexpected message after websocket close, is that error log still expected to be there?
r
It’s ok if it happens as long as things get reconnected again. I will look into @Or Weis’s suggestion to see if we can get rid of that error altogether
h
Hi @Ro'e Katz we have just deployed opal-server and opal-client in to our aks cluster and we start seeing similar issue surface as well. We have about 8 opal-client subscribing to a topic and when opal-server trying to notify the subscriber of the data changes to the subscribe topic only about 4-5 opal-client get updated and opal server is throwing the similar exception (
Connection reset by peer
rather than
websocket.close
) :
Copy code
Failed to notify subscriber sub_id=1f5dcc7db5304c8ba96e663090af6c41 with topic=sampleTopic←[0m
←[33m←[1mTraceback (most recent call last):←[0m

  File "/usr/local/bin/gunicorn", line 33, in <module>
    sys.exit(load_entry_point('gunicorn==20.1.0', 'console_scripts', 'gunicorn')())
    β”‚   β”‚    β”” <function importlib_load_entry_point at 0x7fd4717b2e60>
    β”‚   β”” <built-in function exit>
    β”” <module 'sys' (built-in)>
  File "/usr/local/lib/python3.10/site-packages/gunicorn/app/wsgiapp.py", line 67, in run
    WSGIApplication("%(prog)s [OPTIONS] [APP_MODULE]").run()
    β”” <class 'gunicorn.app.wsgiapp.WSGIApplication'>
  File "/usr/local/lib/python3.10/site-packages/gunicorn/app/base.py", line 231, in run
    super().run()
  File "/usr/local/lib/python3.10/site-packages/gunicorn/app/base.py", line 72, in run
    Arbiter(self).run()
    β”‚       β”” <gunicorn.app.wsgiapp.WSGIApplication object at 0x7fd471793ee0>
    β”” <class 'gunicorn.arbiter.Arbiter'>
  File "/usr/local/lib/python3.10/site-packages/gunicorn/arbiter.py", line 202, in run
    self.manage_workers()
    β”‚    β”” <function Arbiter.manage_workers at 0x7fd470c93250>
    β”” <gunicorn.arbiter.Arbiter object at 0x7fd47139dff0>
  File "/usr/local/lib/python3.10/site-packages/gunicorn/arbiter.py", line 551, in manage_workers
    self.spawn_workers()
    β”‚    β”” <function Arbiter.spawn_workers at 0x7fd470c93370>
    β”” <gunicorn.arbiter.Arbiter object at 0x7fd47139dff0>
  File "/usr/local/lib/python3.10/site-packages/gunicorn/arbiter.py", line 622, in spawn_workers
    self.spawn_worker()
    β”‚    β”” <function Arbiter.spawn_worker at 0x7fd470c932e0>
    β”” <gunicorn.arbiter.Arbiter object at 0x7fd47139dff0>
  File "/usr/local/lib/python3.10/site-packages/gunicorn/arbiter.py", line 589, in spawn_worker
    worker.init_process()
    β”‚      β”” <function UvicornWorker.init_process at 0x7fd46f28a050>
    β”” <uvicorn.workers.UvicornWorker object at 0x7fd46d727940>
  File "/usr/local/lib/python3.10/site-packages/uvicorn/workers.py", line 66, in init_process
    super(UvicornWorker, self).init_process()
          β”‚              β”” <uvicorn.workers.UvicornWorker object at 0x7fd46d727940>
          β”” <class 'uvicorn.workers.UvicornWorker'>
  File "/usr/local/lib/python3.10/site-packages/gunicorn/workers/base.py", line 142, in init_process
    self.run()
    β”‚    β”” <function UvicornWorker.run at 0x7fd46f28a290>
    β”” <uvicorn.workers.UvicornWorker object at 0x7fd46d727940>
  File "/usr/local/lib/python3.10/site-packages/uvicorn/workers.py", line 98, in run
    return asyncio.run(self._serve())
           β”‚       β”‚   β”‚    β”” <function UvicornWorker._serve at 0x7fd46f28a200>
           β”‚       β”‚   β”” <uvicorn.workers.UvicornWorker object at 0x7fd46d727940>
           β”‚       β”” <function run at 0x7fd470825d80>
           β”” <module 'asyncio' from '/usr/local/lib/python3.10/asyncio/__init__.py'>
  File "/usr/local/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
           β”‚    β”‚                  β”” <coroutine object UvicornWorker._serve at 0x7fd46d4dad50>
           β”‚    β”” <method 'run_until_complete' of 'uvloop.loop.Loop' objects>
           β”” <uvloop.Loop running=True closed=False debug=False>
> File "/usr/local/lib/python3.10/site-packages/fastapi_websocket_pubsub/event_notifier.py", line 220, in callback_subscribers
    await self.trigger_callback(data, topic, subscriber_id, event)
          β”‚    β”‚                β”‚     β”‚      β”‚              β”” Subscription(id='1eb80928fa9744d2a354449ee611c35c', subscriber_id='1f5dcc7db5304c8ba96e663090af6c41', topic='sampleTopic',...
          β”‚    β”‚                β”‚     β”‚      β”” '1f5dcc7db5304c8ba96e663090af6c41'
          β”‚    β”‚                β”‚     β”” 'sampleTopic'
          β”‚    β”‚                β”” DataUpdate(id='e6fbd2a4-17cb-4e38-90ac-466b1ac34dfa', entries=[DataSourceEntry(url='<http://my-test-url>....
          β”‚    β”” <function EventNotifier.trigger_callback at 0x7fd46ed1ee60>
          β”” <fastapi_websocket_pubsub.websocket_rpc_event_notifier.WebSocketRpcEventNotifier object at 0x7fd46d4df7c0>
  File "/usr/local/lib/python3.10/site-packages/fastapi_websocket_pubsub/event_notifier.py", line 178, in trigger_callback
    await subscription.callback(subscription, data)
          β”‚            β”‚        β”‚             β”” DataUpdate(id='e6fbd2a4-17cb-4e38-90ac-466b1ac34dfa', entries=[DataSourceEntry(url='<http://my-test-url>....
          β”‚            β”‚        β”” Subscription(id='1eb80928fa9744d2a354449ee611c35c', subscriber_id='1f5dcc7db5304c8ba96e663090af6c41', topic='sampleTopic',...
          β”‚            β”” <bound method EventBroadcaster.__broadcast_notifications__ of <fastapi_websocket_pubsub.event_broadcaster.EventBroadcaster ob...
          β”” Subscription(id='1eb80928fa9744d2a354449ee611c35c', subscriber_id='1f5dcc7db5304c8ba96e663090af6c41', topic='sampleTopic',...
  File "/usr/local/lib/python3.10/site-packages/fastapi_websocket_pubsub/event_broadcaster.py", line 160, in __broadcast_notifications__
    await self._sharing_broadcast_channel.publish(self._channel, note.json())
          β”‚    β”‚                          β”‚       β”‚    β”‚         β”‚    β”” <cyfunction BaseModel.json at 0x7fd46fef3510>
          β”‚    β”‚                          β”‚       β”‚    β”‚         β”” BroadcastNotification(notifier_id='1f5dcc7db5304c8ba96e663090af6c41', topics=['sampleTopic'], data=DataUpdate(id='e6fbd2a4...
          β”‚    β”‚                          β”‚       β”‚    β”” 'EventNotifier'
          β”‚    β”‚                          β”‚       β”” <fastapi_websocket_pubsub.event_broadcaster.EventBroadcaster object at 0x7fd46d4df880>
          β”‚    β”‚                          β”” <function Broadcast.publish at 0x7fd46ed1f880>
          β”‚    β”” <broadcaster._base.Broadcast object at 0x7fd46d41c370>
          β”” <fastapi_websocket_pubsub.event_broadcaster.EventBroadcaster object at 0x7fd46d4df880>
  File "/usr/local/lib/python3.10/site-packages/broadcaster/_base.py", line 87, in publish
    await self._backend.publish(channel, message)
          β”‚    β”‚        β”‚       β”‚        β”” '{"notifier_id": "1f5dcc7db5304c8ba96e663090af6c41", "topics": ["sampleTopic"], "data": {"id": "e6fbd2a4-17cb-4e38-90ac-46...
          β”‚    β”‚        β”‚       β”” 'EventNotifier'
          β”‚    β”‚        β”” <function RedisBackend.publish at 0x7fd46c8ff7f0>
          β”‚    β”” <broadcaster._backends.redis.RedisBackend object at 0x7fd46d41cac0>
          β”” <broadcaster._base.Broadcast object at 0x7fd46d41c370>
  File "/usr/local/lib/python3.10/site-packages/broadcaster/_backends/redis.py", line 34, in publish
    await self._pub_conn.publish(channel, message)
          β”‚    β”‚                 β”‚        β”” '{"notifier_id": "1f5dcc7db5304c8ba96e663090af6c41", "topics": ["sampleTopic"], "data": {"id": "e6fbd2a4-17cb-4e38-90ac-46...
          β”‚    β”‚                 β”” 'EventNotifier'
          β”‚    β”” Connection(host='<http://dt-pgmdev-redis.redis.cache.windows.net|dt-pgmdev-redis.redis.cache.windows.net>', port=6380)
          β”” <broadcaster._backends.redis.RedisBackend object at 0x7fd46d41cac0>
  File "/usr/local/lib/python3.10/site-packages/asyncio_redis/protocol.py", line 661, in wrapper
    result = await method(protocol_self, _NoTransaction, *a, **kw)
                   β”‚      β”‚              β”‚                β”” ('EventNotifier', '{"notifier_id": "1f5dcc7db5304c8ba96e663090af6c41", "topics": ["sampleTopic"], "data": {"id": "e6fbd2a4...
                   β”‚      β”‚              β”” <asyncio_redis.protocol._NoTransactionType object at 0x7fd46ca5afb0>
                   β”‚      β”” <asyncio_redis.protocol.RedisProtocol object at 0x7fd46c94a5f0>
                   β”” <function RedisProtocol.publish at 0x7fd46ca99b40>
  File "/usr/local/lib/python3.10/site-packages/asyncio_redis/protocol.py", line 1153, in _query
    result = await self._get_answer(
                   β”‚    β”” <function RedisProtocol._get_answer at 0x7fd46ca89c60>
                   β”” <asyncio_redis.protocol.RedisProtocol object at 0x7fd46c94a5f0>
  File "/usr/local/lib/python3.10/site-packages/asyncio_redis/protocol.py", line 1080, in _get_answer
    result = await answer_f
                   β”” <Future finished exception=ConnectionLostError(ConnectionResetError(104, 'Connection reset by peer'))>
o
Hi @Heng Tan, @Ro'e Katz is still actively working on this issue, but he is AFK at the moment, please expect a reply a little later. CC: @Asaf Cohen, @Shaul Kremer
πŸ‘ 1
r
@Or Weis @Heng Tan My understanding of @Jake Youssefzadeh’s current issue is that he was mainly concerned about the ugly logs. As I said above it’s expected to sometimes have disconnections if client are reconnecting and getting all data. @Heng Tan are you describing a situation where some clients remain outdated? Are you using a postgres broadcaster, Do you have statistics enabled?
h
Hi @Ro'e Katz after look at the opal server log, its throwing lots of :
Copy code
[32m2023-06-09T12:35:30.054783+0000[0m | [34masyncio_redis.protocol                  [0m|[1m INFO  | Connection lost with exec: [Errno 104] Connection reset by peer[0m
[32m2023-06-09T12:35:30.054917+0000[0m | [34masyncio_redis.protocol                  [0m|[1m INFO  | Connection lost with exec: [Errno 104] Connection reset by peer[0m
[32m2023-06-09T12:35:30.055842+0000[0m | [34masyncio_redis.protocol                  [0m|[1m INFO  | Redis connection lost[0m
[32m2023-06-09T12:35:30.056104+0000[0m | [34masyncio_redis.protocol                  [0m|[1m INFO  | Connection lost with exec: [Errno 104] Connection reset by peer[0m
[32m2023-06-09T12:35:30.056467+0000[0m | [34masyncio_redis.connection                [0m|[1m INFO  | Connecting to redis[0m
it doesn't look like its a normal behaviour
r
@Heng Tan And are you sure your Redis is healthy and url is configured correctly?
h
i'll double check that get back to you πŸ˜‰
@Ro'e Katz i did some testing over the weekend. I notice that the following errors only happening when I turn on TLS setting for Redis server:
Copy code
2023-06-12T08:50:23.462496+0000 | 9 | asyncio_redis.protocol                  | INFO  | EOF received in RedisProtocol
2023-06-12T08:50:23.462738+0000 | 10 | asyncio_redis.protocol                  | INFO  | EOF received in RedisProtocol
2023-06-12T08:50:23.463164+0000 | 9 | asyncio_redis.protocol                  | INFO  | Redis connection lost
2023-06-12T08:50:23.463312+0000 | 10 | asyncio_redis.protocol                  | INFO  | Redis connection lost
2023-06-12T08:50:23.463787+0000 | 10 | asyncio_redis.connection                | INFO  | Connecting to redis
2023-06-12T08:50:23.463856+0000 | 9 | asyncio_redis.connection                | INFO  | Connecting to redis
2023-06-12T08:50:23.506188+0000 | 9 | asyncio_redis.protocol                  | INFO  | Redis connection made
2023-06-12T08:50:23.510307+0000 | 10 | asyncio_redis.protocol                  | INFO  | Redis connection made
2023-06-12T08:50:29.244720+0000 | 9 | asyncio_redis.protocol                  | INFO  | EOF received in RedisProtocol
2023-06-12T08:50:29.245040+0000 | 9 | asyncio_redis.protocol                  | INFO  | Redis connection lost
2023-06-12T08:50:29.245355+0000 | 9 | asyncio_redis.connection                | INFO  | Connecting to redis
2023-06-12T08:50:29.288743+0000 | 9 | asyncio_redis.protocol                  | INFO  | Redis connection made
I was using the connection string that you were suggested previously:
<rediss://password@redis-host-name:6380>
. Do I need to specify anything in the connection string to let it Redis driver know that its a TLS is enabled on Redis side? I notice that this is how its done in redis-py : https://redis-py.readthedocs.io/en/stable/examples/ssl_connection_examples.html. But i notice broadcaster is using asyncio_redis which doesn't really mentioned if its support TLS / SSL Connection
r
I believe it should be rediss://:password@redis-host-name:6380 (notice the β€œ:” which separates an empty username from the password). LMK if that works
h
I'll give that a try
I have tested with the new connection string as you suggested and the opal server log provides a more detail log on why is not working when TLS is enabled :
Copy code
2023-06-12T19:21:47.845851+0000 | 10 | fastapi_websocket_pubsub.event_notifier | INFO  | New subscription {'id': 'c74ecf060f76434eab5b459d89e6c3de', 'subscriber_id': '7f0610cc0edd404190a6f4dbc7b16e4f', 'topic': 'userPermission', 'callback': <function RpcEventServerMethods.subscribe.<locals>.callback at 0x7f7d6036a290>, 'notifier_id': None}
2023-06-12T19:21:47.877262+0000 | 10 | asyncio_redis.protocol                  | INFO  | Redis connection made
2023-06-12T19:21:47.913409+0000 | 9 | uvicorn.protocols.http.httptools_impl   | INFO  | 192.168.112.8:52514 - "GET /policy?path=. HTTP/1.1" 503
2023-06-12T19:21:47.927722+0000 | 8 | opal_server.data.api                    | INFO  | Source configuration is available at '<http://aecom-pgm-security-opaldatasync/opalsource/v1>', redirecting with token=eyJhb...lzKvg (abbrv.)
2023-06-12T19:21:47.928426+0000 | 8 | uvicorn.protocols.http.httptools_impl   | INFO  | 192.168.112.8:52518 - "GET /data/config HTTP/1.1" 307
2023-06-12T19:21:48.777394+0000 | 9 | uvicorn.protocols.http.httptools_impl   | INFO  | 192.168.112.8:52526 - "GET /policy?path=. HTTP/1.1" 503
2023-06-12T19:21:49.997220+0000 | 9 | uvicorn.protocols.http.httptools_impl   | INFO  | 192.168.112.8:52532 - "GET /policy?path=. HTTP/1.1" 503
2023-06-12T19:21:52.847305+0000 | 9 | uvicorn.protocols.http.httptools_impl   | INFO  | 192.168.112.8:52534 - "GET /policy?path=. HTTP/1.1" 503
2023-06-12T19:21:53.753826+0000 | 9 | uvicorn.protocols.http.httptools_impl   | INFO  | 192.168.112.8:52538 - "GET /policy?path=. HTTP/1.1" 503
2023-06-12T19:22:37.665634+0000 | 10 | asyncio_redis.protocol                  | INFO  | EOF received in RedisProtocol
2023-06-12T19:22:37.666739+0000 | 10 | asyncio_redis.protocol                  | INFO  | EOF received in RedisProtocol
2023-06-12T19:22:37.669608+0000 | 10 | asyncio_redis.protocol                  | INFO  | Redis connection lost
2023-06-12T19:22:37.673117+0000 | 10 | asyncio_redis.protocol                  | INFO  | Redis connection lost
2023-06-12T19:22:37.675849+0000 | 10 | asyncio.runners                         |ERROR  | Task exception was never retrieved
future: <Task finished name='Task-16' coro=<RedisProtocol.connection_made.<locals>.initialize() done, defined at /usr/local/lib/python3.10/site-packages/asyncio_redis/protocol.py:836> exception=ConnectionLostError(None)>
Traceback (most recent call last):

> File "/usr/local/lib/python3.10/site-packages/asyncio_redis/protocol.py", line 839, in initialize
    await self.auth(self.password)
          β”‚    β”‚    β”‚    β”” '<redis password>'
          β”‚    β”‚    β”” <asyncio_redis.protocol.RedisProtocol object at 0x7f7d602f6a10>
          β”‚    β”” <function RedisProtocol.auth at 0x7f7d602fae60>
          β”” <asyncio_redis.protocol.RedisProtocol object at 0x7f7d602f6a10>
  File "/usr/local/lib/python3.10/site-packages/asyncio_redis/protocol.py", line 661, in wrapper
    result = await method(protocol_self, _NoTransaction, *a, **kw)
                   β”‚      β”‚              β”‚                β”‚    β”” {}
                   β”‚      β”‚              β”‚                β”” ('omB7xbxhMDWboF8redissG1t8Mm8Gc0hkenMqz4AzCaOvvyLc=',)
                   β”‚      β”‚              β”” <asyncio_redis.protocol._NoTransactionType object at 0x7f7d602c5360>
                   β”‚      β”” <asyncio_redis.protocol.RedisProtocol object at 0x7f7d602f6a10>
                   β”” <function RedisProtocol.auth at 0x7f7d602e17e0>
  File "/usr/local/lib/python3.10/site-packages/asyncio_redis/protocol.py", line 1153, in _query
    result = await self._get_answer(
                   β”‚    β”” <function RedisProtocol._get_answer at 0x7f7d602e1630>
                   β”” <asyncio_redis.protocol.RedisProtocol object at 0x7f7d602f6a10>
  File "/usr/local/lib/python3.10/site-packages/asyncio_redis/protocol.py", line 1080, in _get_answer
    result = await answer_f
                   β”” <Future finished exception=ConnectionLostError(None)>

asyncio_redis.exceptions.ConnectionLostError: None
2023-06-12T19:22:37.694176+0000 | 10 | asyncio_redis.connection                | INFO  | Connecting to redis
2023-06-12T19:22:37.695024+0000 | 10 | asyncio.runners                         |ERROR  | Task exception was never retrieved
future: <Task finished name='Task-18' coro=<RedisProtocol.connection_made.<locals>.initialize() done, defined at /usr/local/lib/python3.10/site-packages/asyncio_redis/protocol.py:836> exception=ConnectionLostError(None)>
Traceback (most recent call last):

> File "/usr/local/lib/python3.10/site-packages/asyncio_redis/protocol.py", line 839, in initialize
    await self.auth(self.password)
          β”‚    β”‚    β”‚    β”” '<redis password>'
          β”‚    β”‚    β”” <asyncio_redis.protocol.RedisProtocol object at 0x7f7d602f5b10>
          β”‚    β”” <function RedisProtocol.auth at 0x7f7d602fae60>
          β”” <asyncio_redis.protocol.RedisProtocol object at 0x7f7d602f5b10>
  File "/usr/local/lib/python3.10/site-packages/asyncio_redis/protocol.py", line 661, in wrapper
    result = await method(protocol_self, _NoTransaction, *a, **kw)
                   β”‚      β”‚              β”‚                β”‚    β”” {}
                   β”‚      β”‚              β”‚                β”” ('omB7xbxhMDWboF8redissG1t8Mm8Gc0hkenMqz4AzCaOvvyLc=',)
                   β”‚      β”‚              β”” <asyncio_redis.protocol._NoTransactionType object at 0x7f7d602c5360>
                   β”‚      β”” <asyncio_redis.protocol.RedisProtocol object at 0x7f7d602f5b10>
                   β”” <function RedisProtocol.auth at 0x7f7d602e17e0>
  File "/usr/local/lib/python3.10/site-packages/asyncio_redis/protocol.py", line 1153, in _query
    result = await self._get_answer(
                   β”‚    β”” <function RedisProtocol._get_answer at 0x7f7d602e1630>
                   β”” <asyncio_redis.protocol.RedisProtocol object at 0x7f7d602f5b10>
  File "/usr/local/lib/python3.10/site-packages/asyncio_redis/protocol.py", line 1080, in _get_answer
    result = await answer_f
                   β”” <Future finished exception=ConnectionLostError(None)>

asyncio_redis.exceptions.ConnectionLostError: None
looks like first its showing
EOF received in RedisProtocol
then it says the
Redis connection lost
and finally thrown en exception when the asyncio-redis driver is await for the result of
_get_answer
When I turn off TLS for redis everything is working fine (but not ideal considering its not a secure channel) and I manage to get all 8 opal-clients data source syncing working correctly when sending an data config update to opal server. I also notice asyncio-redis drive is no longer maintained by the founder and haven't received any update since August 2020. 😞 I think its worth considering swapping the asyncio-redis to redis-py as its the official driver maintained by Redis.
a
@Shaul Kremer @Ro'e Katz i think as part of removing encode/broadcaster from the system, we should also consider following @Heng Tan’s advice about swapping the
asyncio-redis
library to
redis-py
as its the official driver maintained by Redis.
r
@Asaf Cohen Alright, That is a good idea. @Heng Tan - I’m a bit confused - you did see secure redis working before as a broadcaster right?
h
@Ro'e Katz i think as you pointed out in a different thread because the older version of the opal client doesn't throw any error if there's issue with the connection to the redis or opal server we assume it was working with TLS. When we actually test it in a proper AKS environment with opal version 0.7.0 with redis TLS turn on it seem to be only syncing 1 or 2 opal-client out of the 8 opal-client πŸ€”. In the end we decide to deploy the opal server with Redis (TLS turn off) communicating via private endpoint, and the data sync for all 8 opal-client seems working fine.
πŸ‘ 1
@Ro'e Katz We encountered another error yesterday on the opal-server throwing the following error and the data sync stop working again:
Copy code
[32m2023-06-20T12:14:53.105896+0000[0m | [34masyncio.runners                         [0m|[31m[1mERROR  | Task exception was never retrieved
future: <Task finished name='Task-23009' coro=<RedisProtocol.connection_made.<locals>.initialize() done, defined at /usr/local/lib/python3.10/site-packages/asyncio_redis/protocol.py:836> exception=Error('Cannot run command inside pubsub subscription.')>[0m
[33m[1mTraceback (most recent call last):[0m

> File "/usr/local/lib/python3.10/site-packages/asyncio_redis/protocol.py", line 839, in initialize
    await self.auth(self.password)
          Γâé    Γâé    Γâé    Γââ '2ZrRNoB6613awJtoiWpeYxv74SAT9c75BAzCaBcLIpc='
          Γâé    Γâé    Γââ <asyncio_redis.protocol.RedisProtocol object at 0x7fcdf2471660>
          Γâé    Γââ <function RedisProtocol.auth at 0x7fcdf246b760>
          Γââ <asyncio_redis.protocol.RedisProtocol object at 0x7fcdf2471660>
  File "/usr/local/lib/python3.10/site-packages/asyncio_redis/protocol.py", line 649, in wrapper
    raise Error("Cannot run command inside pubsub subscription.")
          Γââ <class 'asyncio_redis.exceptions.Error'>

[31m[1masyncio_redis.exceptions.Error[0m:[1m Cannot run command inside pubsub subscription.[0m
[32m2023-06-20T12:14:53.106483+0000[0m | [34masyncio_redis.protocol                  [0m|[1m INFO  | Redis connection made[0m
[32m2023-06-20T12:14:53.106706+0000[0m | [34masyncio_redis.protocol                  [0m|[1m INFO  | Redis connection made[0m
[32m2023-06-20T12:14:53.106756+0000[0m | [34masyncio.runners                         [0m|[31m[1mERROR  | Task exception was never retrieved
future: <Task finished name='Task-36557' coro=<RedisProtocol.connection_made.<locals>.initialize() done, defined at /usr/local/lib/python3.10/site-packages/asyncio_redis/protocol.py:836> exception=Error('Cannot run command inside pubsub subscription.')>[0m
[33m[1mTraceback (most recent call last):[0m

> File "/usr/local/lib/python3.10/site-packages/asyncio_redis/protocol.py", line 839, in initialize
    await self.auth(self.password)
          Γâé    Γâé    Γâé    Γââ '2ZrRNoB6613awJtoiWpeYxv74SAT9c75BAzCaBcLIpc='
          Γâé    Γâé    Γââ <asyncio_redis.protocol.RedisProtocol object at 0x7fcdf24636d0>
          Γâé    Γââ <function RedisProtocol.auth at 0x7fcdf2467400>
          Γââ <asyncio_redis.protocol.RedisProtocol object at 0x7fcdf24636d0>
  File "/usr/local/lib/python3.10/site-packages/asyncio_redis/protocol.py", line 649, in wrapper
    raise Error("Cannot run command inside pubsub subscription.")
          Γââ <class 'asyncio_redis.exceptions.Error'>

[31m[1masyncio_redis.exceptions.Error[0m:[1m Cannot run command inside pubsub subscription.[0m
[32m2023-06-20T12:14:53.107023+0000[0m | [34masyncio.runners                         [0m|[31m[1mERROR  | Task exception was never retrieved
future: <Task finished name='Task-6264' coro=<RedisProtocol.connection_made.<locals>.initialize() done, defined at /usr/local/lib/python3.10/site-packages/asyncio_redis/protocol.py:836> exception=Error('Cannot run command inside pubsub subscription.')>[0m
[33m[1mTraceback (most recent call last):[0m

> File "/usr/local/lib/python3.10/site-packages/asyncio_redis/protocol.py", line 839, in initialize
    await self.auth(self.password)
          Γâé    Γâé    Γâé    Γââ '2ZrRNoB6613awJtoiWpeYxv74SAT9c75BAzCaBcLIpc='
          Γâé    Γâé    Γââ <asyncio_redis.protocol.RedisProtocol object at 0x7fcdf245f1c0>
          Γâé    Γââ <function RedisProtocol.auth at 0x7fcdf246b640>
          Γââ <asyncio_redis.protocol.RedisProtocol object at 0x7fcdf245f1c0>
  File "/usr/local/lib/python3.10/site-packages/asyncio_redis/protocol.py", line 649, in wrapper
    raise Error("Cannot run command inside pubsub subscription.")
          Γââ <class 'asyncio_redis.exceptions.Error'>

[31m[1masyncio_redis.exceptions.Error[0m:[1m Cannot run command inside pubsub subscription.[0m
after doing some investigation its seems to be related back to the asyncio redis driver again: https://groups.google.com/g/aio-libs/c/NOZ9bkz0AvU https://github.com/jonathanslenders/asyncio-redis/issues/82 I thinks this gives more reason to replace the old asyncio redis driver. But for the moment we can't really afford opal-server data sync keep crashing, do you have a recommendation what is the most stable and reliable broadcast channel to use other than Redis? Postgres or Kafka?
r
@Heng Tan - Sorry you’re experiencing those issues. Was the data sync fixed on a restart? Do you have any tips on how I might reproduce? My first recommendation would be to try turning off the statistics feature - It seems there might be a bug there that affects the ability of OPAL to recover from broadcaster disconnections (And currently I don’t really want to invest in hunting it down because we’re working on a much better substitute for statistics in the next major of OPAL). As for replacing Redis - we internally work with Postgres, which lately seems to be very stable in our production env (After fixing some issues we had with the
broadcaster
pkg couple of months ago). Thanks for the feedback regarding
asyncio-redis
! - as @Asaf Cohen mentioned, we might have an opportunity to replace it soon.
h
@Ro'e Katz thanks for the recommendation, I have double checked that we have STATISTICS_ENABLED turn off and OPA_HEALTH_CHECK_POLICY_ENABLED turn on so not sure if would have contribute to any of the issue experience. I can confirm that restarting the opal-server seems working fine and its seems running ok so far. Many thanks for your help.
r
You welcome! I don’t think
OPA_HEALTH_CHECK_POLICY_ENABLED
has to do with these issues (and overall if you don’t need OPA to serve OPAL’s health check, there’s no need to turn it on, opal exposes its own health check. Also - that variable belongs to the client not the server).