# opal
r
Here is a set of logs from one of the containers booting up, in case it's useful. I've added a log in Broadcaster in the KafkaBackend `connect()` function, which shows Broadcaster doesn't try to make a connection until the `connecting to brokers` log appears.
(newest first, from container `opal-server-v0-develop`)
```
2023-02-20T14:32:11.730037+0000 | 13 | aiokafka.conn | INFO | Authenticated as root via SCRAM-SHA-512
2023-02-20T14:32:11.716845+0000 | 13 | aiokafka.consumer.group_coordinator | INFO | Metadata for topic has changed from {} to {'opal.notifier': 1}.
2023-02-20T14:32:11.715076+0000 | 13 | aiokafka.conn | INFO | Authenticated as root via SCRAM-SHA-512
2023-02-20T14:32:11.651928+0000 | 13 | aiokafka.consumer.consumer | INFO | Subscribed to topic(s): {'opal.notifier'}
2023-02-20T14:32:11.651471+0000 | 13 | aiokafka.consumer.subscription_state | INFO | Updating subscribed topics to: frozenset({'opal.notifier'})
2023-02-20T14:32:11.646208+0000 | 13 | aiokafka.conn | INFO | Authenticated as root via SCRAM-SHA-512
2023-02-20T14:32:11.630480+0000 | 13 | aiokafka.conn | INFO | Authenticated as root via SCRAM-SHA-512
2023-02-20T14:32:11.551847+0000 | 13 | aiokafka.conn | INFO | Authenticated as root via SCRAM-SHA-512
2023-02-20T14:32:11.532395+0000 | 13 | aiokafka.conn | INFO | Authenticated as root via SCRAM-SHA-512
2023-02-20T14:32:11.523339+0000 | 15 | uvicorn.protocols.http.httptools_impl | INFO | 127.0.0.1:33922 - "GET /data/config HTTP/1.1" 200
2023-02-20T14:32:11.520893+0000 | 15 | opal_server.data.api | INFO | Serving source configuration
2023-02-20T14:32:11.450070+0000 | 13 | fastapi_websocket_pubsub.event_notifier | INFO | New subscription {'id': '4e296245f86849cc9fda4d285fd7f846', 'subscriber_id': 'e19a3ec0d3b84da395d39c2cc31a6e39', 'topic': 'policy:.', 'callback': <function RpcEventServerMethods.subscribe.<locals>.callback at 0x7f9e11a4e830>, 'notifier_id': None}
2023-02-20T14:32:11.448352+0000 | 16 | fastapi_websocket_pubsub.event_notifier | INFO | New subscription {'id': '8c7d070d2964436ba8ac68e06552ebdf', 'subscriber_id': 'f410fc2875dd4f77a22b182b857887a6', 'topic': 'policy_data', 'callback': <function RpcEventServerMethods.subscribe.<locals>.callback at 0x7f9e126524d0>, 'notifier_id': None}
2023-02-20T14:32:11.443517+0000 | 13 | websockets.legacy.server | INFO | connection open
2023-02-20T14:32:11.441105+0000 | 13 | broadcaster._backends.kafka | INFO | connecting to brokers: [*************]
2023-02-20T14:32:11.440328+0000 | 13 | fastapi_websocket_rpc.websocket_rpc_e... | INFO | Client connected
2023-02-20T14:32:11.440701+0000 | 13 | fastapi_websocket_pubsub.event_broadc... | INFO | Starting broadcaster listener
2023-02-20T14:32:11.439263+0000 | 16 | websockets.legacy.server | INFO | connection open
2023-02-20T14:32:11.439589+0000 | 13 | fastapi_websocket_pubsub.event_notifier | INFO | New subscription {'id': '607c57b9478b415b99fa23d0d015a138', 'subscriber_id': '94b9c21614c641cdbd20979c0958726a', 'topic': '__EventNotifier_ALL_TOPICS__', 'callback': <bound method EventBroadcaster.__broadcast_notifications__ of <fastapi_websocket_pubsub.event_broadcaster.EventBroadcaster object at 0x7f9e13108bb0>>, 'notifier_id': None}
2023-02-20T14:32:11.437230+0000 | 16 | fastapi_websocket_pubsub.event_broadc... | INFO | Starting broadcaster listener
2023-02-20T14:32:11.437698+0000 | 16 | broadcaster._backends.kafka | INFO | connecting to brokers: [*************]
2023-02-20T14:32:11.436289+0000 | 16 | fastapi_websocket_pubsub.event_notifier | INFO | New subscription {'id': 'cec7a87beff34b34bfb58d091e3529c5', 'subscriber_id': 'c832e14415a743138c636989018aadf8', 'topic': '__EventNotifier_ALL_TOPICS__', 'callback': <bound method EventBroadcaster.__broadcast_notifications__ of <fastapi_websocket_pubsub.event_broadcaster.EventBroadcaster object at 0x7f9e13108df0>>, 'notifier_id': None}
2023-02-20T14:32:11.436886+0000 | 16 | fastapi_websocket_rpc.websocket_rpc_e... | INFO | Client connected
2023-02-20T14:32:11.340107+0000 | 13 | fastapi_websocket_pubsub.event_broadc... | INFO | Listening for incoming events from broadcast channel (first listener started)
2023-02-20T14:32:11.340107+0000 | 16 | fastapi_websocket_pubsub.event_broadc... | INFO | Listening for incoming events from broadcast channel (first listener started)
2023-02-20T14:28:58.180674+0000 | 13 | opal_common.sources.git_policy_source | INFO | No new commits: HEAD is at '1736af5c41b0daf5cdacf1ab04148a22db2b92e1'
2023-02-20T14:28:56.528230+0000 | 13 | opal_common.sources.base_policy_source | INFO | Launching polling task, interval: 30000 seconds
2023-02-20T14:28:56.528676+0000 | 13 | opal_common.sources.git_policy_source | INFO | Pulling changes from remote: 'origin'
2023-02-20T14:28:56.454933+0000 | 13 | opal_common.git.repo_cloner | INFO | Clone succeeded
2023-02-20T14:28:34.849331+0000 | 15 | opal_server.server | INFO | *** OPAL Server Startup ***
2023-02-20T14:28:34.831923+0000 | 16 | opal_server.server | INFO | *** OPAL Server Startup ***
2023-02-20T14:28:34.739119+0000 | 15 | opal_server.server | INFO | OPAL was not provided with JWT encryption keys, cannot verify api requests!
2023-02-20T14:28:34.739566+0000 | 15 | opal_server.pubsub | INFO | Initializing broadcaster for server<->server communication
2023-02-20T14:28:34.736093+0000 | 13 | opal_common.git.repo_cloner | INFO | Cloning repo from 'git@github.com:arup-group/ddb-microservices.git' to '/opal/regoclone/opal_repo_clone' (branch: 'main')
2023-02-20T14:28:34.722936+0000 | 13 | opal_server.policy.watcher.task | INFO | Launching policy watcher
2023-02-20T14:28:34.723125+0000 | 13 | opal_common.topics.publisher | INFO | started broadcaster keepalive task: topic is '__broadcast_session_keepalive__', interval is 3600 seconds
2023-02-20T14:28:34.723538+0000 | 13 | opal_server.data.data_update_publisher | INFO | [13] Starting Polling Updates
2023-02-20T14:28:34.721707+0000 | 13 | fastapi_websocket_pubsub.event_notifier | INFO | New subscription {'id': 'd45c16a577e64f9b8621bd8836c5bd8e', 'subscriber_id': '0453c540c2104bca97a3f7060e5894e4', 'topic': 'webhook', 'callback': functools.partial(<function trigger_repo_watcher_pull at 0x7f9e14f1acb0>, <opal_server.policy.watcher.task.PolicyWatcherTask object at 0x7f9e12604940>), 'notifier_id': None}
2023-02-20T14:28:34.714362+0000 | 13 | opal_server.server | INFO | Policy repo will be cloned to: /opal/regoclone/opal_repo_clone
2023-02-20T14:28:34.713859+0000 | 13 | opal_server.server | INFO | listening on webhook topic: 'webhook'
2023-02-20T14:28:34.649650+0000 | 14 | opal_server.server | INFO | *** OPAL Server Startup ***
2023-02-20T14:28:34.647913+0000 | 13 | opal_server.server | INFO | leadership lock acquired, leader pid: 13
2023-02-20T14:28:34.644194+0000 | 13 | opal_server.server | INFO | *** OPAL Server Startup ***
2023-02-20T14:28:34.644467+0000 | 16 | opal_server.pubsub | INFO | Initializing broadcaster for server<->server communication
2023-02-20T14:28:34.633374+0000 | 16 | opal_server.server | INFO | OPAL was not provided with JWT encryption keys, cannot verify api requests!
2023-02-20T14:28:34.313873+0000 | 13 | opal_server.pubsub | INFO | Initializing broadcaster for server<->server communication
2023-02-20T14:28:34.246043+0000 | 13 | opal_server.server | INFO | OPAL was not provided with JWT encryption keys, cannot verify api requests!
2023-02-20T14:28:34.123590+0000 | 14 | opal_server.pubsub | INFO | Initializing broadcaster for server<->server communication
2023-02-20T14:28:34.118902+0000 | 14 | opal_server.server | INFO | OPAL was not provided with JWT encryption keys, cannot verify api requests!
[2023-02-20 14:28:31 +0000] [16] [INFO] Booting worker with pid: 16
[2023-02-20 14:28:31 +0000] [15] [INFO] Booting worker with pid: 15
[2023-02-20 14:28:31 +0000] [14] [INFO] Booting worker with pid: 14
[2023-02-20 14:28:31 +0000] [13] [INFO] Booting worker with pid: 13
[2023-02-20 14:28:31 +0000] [1] [INFO] Starting gunicorn 20.1.0
[2023-02-20 14:28:31 +0000] [1] [INFO] Listening at: http://0.0.0.0:7002 (1)
[2023-02-20 14:28:31 +0000] [1] [INFO] Using worker: uvicorn.workers.UvicornWorker
```
However, I think a health check would do the trick here if there is one?
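For context, the diagnostic log referenced above is just a line at the top of the backend's `connect()` coroutine. A minimal sketch of that kind of patch, loosely modelled on broadcaster's Kafka backend (simplified; the real backend also handles subscriptions and message reading):

```python
import logging

import aiokafka

logger = logging.getLogger("broadcaster._backends.kafka")


class KafkaBackend:
    """Simplified stand-in for broadcaster's Kafka backend."""

    def __init__(self, servers: list) -> None:
        self._servers = servers  # bootstrap servers
        self._producer = None
        self._consumer = None

    async def connect(self) -> None:
        # The added diagnostic: anything earlier in the timeline happened
        # without connect() ever being invoked.
        logger.info("connecting to brokers: %s", self._servers)
        self._producer = aiokafka.AIOKafkaProducer(bootstrap_servers=self._servers)
        self._consumer = aiokafka.AIOKafkaConsumer(bootstrap_servers=self._servers)
        await self._producer.start()
        await self._consumer.start()
```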
o
Health check configuration is found here: https://docs.opal.ac/tutorials/healthcheck_policy_and_update_callbacks Such a long, multi-minute delay is extremely odd. We'll need to review the logs to share more insights.
Might be an issue in the broadcaster or the configuration with Kafka. Can you try running this with the previous version of the broadcaster, or with another backbone pub/sub (e.g. Postgres), just to narrow down the problematic area?
r
For the health check: I'd seen this one, but had read that it was the health check for the OPAL client/OPA to confirm it had been populated with data successfully, rather than for the OPAL server having connected to the backbone - but perhaps there is something I'm missing here?

For the delay: this is consistent across my fork of broadcaster and the permitio broadcaster (using the Kafka proxy to connect). The log I added at the top of the broadcaster `connect()` function shows that the broadcaster connection isn't invoked until after the 'quiet' period in the logs, and the connection then happens quickly. So I'm assuming the gap is happening at the level of fastapi_websocket_pubsub or opal-server. I could add some more logs into broadcaster, at the entry point rather than just the connection, to give some more clues. The strange thing is that the same container image connects quickly with docker compose locally.

However, I'm less concerned about the delay if there is a health check that fails when the connection to the backbone hasn't been established. Currently the connection fails silently and the server carries on on its own, when in reality we'd want the task to fail.
o
• There isn't a backbone health-check for the server; but if the client received information from the server, that tells you something about the server as well.
• There's `/healthcheck` on the server - but it's just a basic check that the server is up.
• You can add an external health-check by sending an event and listening in on the server or client for receiving it - this is actually the only sure way to check this layer is working with certainty (sketched below).
  ◦ If the broadcaster was able to connect to Kafka on a basic TCP level (which seems to be the case), there's no way for OPAL to know whether the channel is working without test messages being sent on it.
  ◦ This could potentially become a built-in mechanism in OPAL.

As we are not experiencing anything like this with the other backbone pub/subs, and no one else has reported an issue like this, I'd be hard-pressed to think this is a core OPAL issue. Not impossible - but still unlikely. I'd bet it's one of the following:
• The specific configuration of the Kafka setup, or the networking to it (most likely, given you say this works locally in docker-compose)
• The new broadcaster
• The specific configuration provided to OPAL

I'd recommend removing these changes and trying with a vanilla setup to zero in on the root cause here.
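To make the external-check bullet concrete, a probe along those lines could look roughly like this: trigger a data update via the server's REST API and confirm an update callback arrives. The endpoint and payload shape follow OPAL's data-update docs, but the receiver service, its query endpoint, the timeout, and whether an empty `entries` list exercises the full path are all assumptions to verify:

```python
import time
import uuid

import requests

OPAL_SERVER = "http://opal-server:7002"               # assumed server base URL
CALLBACK_URL = "http://probe-receiver:9000/callback"  # hypothetical service we run


def callback_received(probe_id: str) -> bool:
    # Hypothetical query endpoint on our receiver, which records callback payloads.
    resp = requests.get(f"http://probe-receiver:9000/seen/{probe_id}", timeout=2)
    return resp.status_code == 200


def backbone_roundtrip_ok(timeout_s: float = 10.0) -> bool:
    """Publish a no-op data update and verify OPAL reports completion back."""
    probe_id = str(uuid.uuid4())
    update = {
        "entries": [],  # no real data; we only care about delivery
        "reason": f"healthcheck-{probe_id}",
        "callback": {"callbacks": [CALLBACK_URL]},
    }
    # Payload shape based on OPAL's DataUpdate model - check against the docs above.
    resp = requests.post(f"{OPAL_SERVER}/data/config", json=update, timeout=5)
    resp.raise_for_status()
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if callback_received(probe_id):
            return True
        time.sleep(0.5)
    return False
```

If the callback never fires, the message either never traversed the backbone or never reached a client - exactly the silent-failure mode described above.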
r
Thanks @Or Weis - I'm fairly happy it's not the broadcaster changes as I get the same behaviour with and without. However, I'll have a play around with it to see if I can narrow it down.
o
What are your thoughts re: networking to Kafka as the culprit?
r
I think I've isolated this, and yes, it was networking on our side. If I remove the service mesh, opal-server connects without delay, so that seems to have isolated the issue. Thanks for your help 👍
o
Wonderful to hear @Robert Philp, and thanks for sharing. We're here if you need anything else
r
@Or Weis - The delay is resolved, but I've been thinking a bit more about the opal-server health check / connection to the backbone. For our Node APIs we await the connection to the Kafka producers/consumers as part of server startup, so if the connection fails for any reason the web server never starts, a simple health check fails, and the server gets torn down before any traffic is directed to it. I wonder if similar logic could be applied here? I.e. don't start the socket until the server has connected to the backbone (if the URI is specified in the env vars). It would mitigate the risk right at the start, as an unconnected server (even if unlikely) could have some fairly big knock-on effects through the system.
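A rough sketch of that fail-fast pattern, translated to the Python stack OPAL runs on (FastAPI + aiokafka). The env var name and timeout are made up, and this is an illustration of the pattern rather than how OPAL itself wires its broadcaster:

```python
import asyncio
import contextlib
import os

from aiokafka import AIOKafkaProducer
from fastapi import FastAPI

app = FastAPI()


@app.on_event("startup")
async def require_backbone() -> None:
    """Abort startup unless the Kafka backbone is actually reachable."""
    servers = os.environ.get("BROADCAST_KAFKA_SERVERS")  # hypothetical env var
    if not servers:
        return  # no backbone configured - e.g. single-worker dev mode
    probe = AIOKafkaProducer(bootstrap_servers=servers)
    try:
        # start() performs the real bootstrap connection and metadata fetch.
        await asyncio.wait_for(probe.start(), timeout=10)
    except Exception as exc:
        # Raising here fails app startup, so the worker exits and any
        # orchestrator health check sees a dead server instead of a silent one.
        raise RuntimeError(f"Kafka backbone unreachable: {exc}") from exc
    finally:
        with contextlib.suppress(Exception):
            await probe.stop()
```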
o
To make sure I understand before answering - Where are you applying enforcement? - i.e. you are using OPA to enforce access in the NodeAPI, in Kafka, or elsewhere?
Just reread your text, and now I realize you're asking about a change in OPAL and not in your deployment 😅 (Sorry - I'm so used to people only asking about their specific needs.)
Yes, I think what you're suggesting makes a lot of sense. I think this can already be achieved today with a dependency check - i.e. you run a health check on Kafka before starting the OPAL server; but I'm not against convenience features. Note: it's not (super) easy to implement - the check here is not whether you can connect to the broadcaster (that already exists), but whether the connection is lively end-to-end. This can be done by sending a message from one worker to another (sketched below), and it might make even better sense to make this a periodic keep-alive test. @Asaf Cohen, @Ro'e Katz - WDYT? An option to delay the websocket start on the actual liveness of the backbone pub/sub.
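The worker-to-worker round-trip could be sketched with the broadcaster API itself: publish a nonce on a health channel and confirm it comes back through the backbone. The channel name, interval, and how the `healthy` flag is exposed are all assumptions, and this assumes (as broadcaster's Kafka backend does) that a process receives its own published messages:

```python
import asyncio
import uuid

from broadcaster import Broadcast

HEALTH_CHANNEL = "__backbone_keepalive__"  # hypothetical channel name


class BackboneKeepalive:
    """Periodically publish a nonce and confirm it arrives back via the backbone."""

    def __init__(self, broadcast: Broadcast, interval: float = 30.0) -> None:
        self._broadcast = broadcast  # an already-connect()ed Broadcast instance
        self._interval = interval
        self._pending = set()
        self.healthy = False  # expose via a /healthcheck-style endpoint

    async def run(self) -> None:
        async with self._broadcast.subscribe(channel=HEALTH_CHANNEL) as subscriber:
            listener = asyncio.create_task(self._listen(subscriber))
            try:
                while True:
                    nonce = str(uuid.uuid4())
                    self._pending.add(nonce)
                    await self._broadcast.publish(channel=HEALTH_CHANNEL, message=nonce)
                    await asyncio.sleep(self._interval)
                    # If the nonce never came back, the channel is not lively.
                    self.healthy = nonce not in self._pending
            finally:
                listener.cancel()

    async def _listen(self, subscriber) -> None:
        async for event in subscriber:
            self._pending.discard(event.message)
```

A `/healthcheck`-style endpoint (or a k8s liveness probe) could then report `healthy`, which also covers channels that die after a successful initial connect.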
r
@Or Weis Sorry for dropping in late. I can share that we were also talking about improving OPAL's health checks to be used by k8s' liveness & readiness probes in our own cluster. I don't have a strong opinion on waiting for the broadcaster connection - I guess that makes sense. I know that having an init container that probes for the pod's dependencies is a common pattern for achieving something similar to docker-compose's `depends_on` feature (and avoiding crash-loop backoffs or other issues).
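For the init-container pattern, the probe itself can be as small as a TCP reachability loop - for example, a script the init container runs until the broker accepts connections. A minimal sketch; the host, port, and retry budget are assumptions:

```python
import socket
import sys
import time

HOST, PORT = "kafka.internal", 9092  # hypothetical broker address
DEADLINE = time.monotonic() + 120    # give up after two minutes

# Retry until Kafka accepts a TCP connection, then exit 0 so the pod's
# main containers are allowed to start.
while time.monotonic() < DEADLINE:
    try:
        with socket.create_connection((HOST, PORT), timeout=3):
            print(f"{HOST}:{PORT} is reachable")
            sys.exit(0)
    except OSError:
        time.sleep(2)

print(f"timed out waiting for {HOST}:{PORT}", file=sys.stderr)
sys.exit(1)
```

As discussed above, this only proves TCP reachability - it can't prove the pub/sub channel itself is lively end-to-end.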