# opal
o
Hi @Robert Philp Can you share a little more on the performance issues you are experiencing? You are right in general about the formula for workers to cores. `CLIENT_API_SERVER_WORKER_COUNT` only affects running as a CLI, in which case, yes, it sets the uvicorn workers. I'd recommend the same cores logic for the `fetching_worker_count`. I wouldn't reduce the number of workers for the sake of OPA, as its concurrency isn't tied to any of the other workers (aside from the fetchers, but that's only for a small portion of the write operations into OPA by the workers).
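(For illustration only: the exact workers-to-cores formula referenced above isn't spelled out in this thread; the sketch below assumes the common "2 × cores + 1" heuristic often used for uvicorn/gunicorn-style workers, applied to both settings discussed.)

```python
# Illustrative sketch, assuming the common "2 * cores + 1" heuristic for
# uvicorn-style worker counts; the same cores-based logic is applied to the
# fetching worker count, as suggested above.
import multiprocessing

cores = multiprocessing.cpu_count()
api_server_workers = 2 * cores + 1   # candidate value for CLIENT_API_SERVER_WORKER_COUNT
fetching_workers = 2 * cores + 1     # same logic for the fetching worker count
print(cores, api_server_workers, fetching_workers)
```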
r
@Or Weis - we are seeing higher than expected durations between the logs in the opal-server and the opal-client, ~200ms (typically closer to 15ms for other containers in the same network making REST calls). I set the uvicorn workers in the opal-client, but this resulted in an error message about the broadcast URI not being set. Is the intention to only ever run the client on a single worker? I didn't think it needed to run with a backbone? I can't see any difference in the logs when I change the `fetching_worker_count`. Is this limited by the number of workers? How does this work?
We're also getting a similar effect in the custom fetcher, where logs on either side of the HTTP request show a gap of ~750ms, but the trace within the service being called only shows a request duration of ~100-150ms. Just digging into this to see if I can find any more details of why.
o
Yes, I believe the client has only one worker at a time, aside from the fetchers. The fetcher worker count will only become apparent if you launch multiple concurrent data updates. Interesting. Any other differences between the containers?
Volume of data maybe?
Could also be indexing time into OPA itself
@Shaul Kremer suggests it can also be policy indexing time
r
Thanks for the suggestions @Or Weis. I've managed to replicate this in a test environment with one opal-client and one opal-server container running. The reason turned out to be a function running synchronously in one of the workers 🙄 However, the behaviour wasn't what I was expecting. If I send 4 Kafka messages (aimed at the same custom fetcher) that are then consumed by opal-server, the OPAL client picks each of these up only after the synchronous function in the previous fetcher has completed, hence an increasing delay in the OPAL client acknowledging each message in turn. I wasn't expecting the OPAL client to be blocked by a synchronous function in a fetcher, and I wasn't expecting one fetcher to block all fetchers when `FETCHING_WORKER_COUNT` is set to 9. I was assuming this behaviour would only happen once all 9 workers were blocked. Suspect I'm misunderstanding the intention here? This would suggest that `FETCHING_WORKER_COUNT` either isn't being set correctly or isn't doing what I thought it was doing. Should this be setting the number of workers for the fetchers?
o
Hi @Robert Philp - I wouldn't expect this behavior either. Which fetcher are you using? I'm guessing not one we wrote. Can you share the code for the fetcher? Perhaps, in addition to having synchronous code, the fetcher's code is also locking on a shared object or resource of some sort, forcing all workers to run in a sync fashion? Of course there could be a bug in our code, but it is a rather classic part of the project that hasn't been changed in over a year. https://github.com/permitio/opal/blob/47e87e547dae8371686fb909278d0e0a4cfb76ab/packages/opal-common/opal_common/fetcher/engine/fetch_worker.py https://github.com/permitio/opal/blob/7fbc2f4896bce0e10700e61ad749e41930edafa5/packages/opal-common/opal_common/fetcher/engine/fetching_engine.py
@Ro'e Katz could use another eye here 😉
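(Illustration of the behaviour being discussed, not OPAL's actual fetch-worker code: when all fetch workers share one asyncio event loop, a single synchronous call blocks every worker, no matter how many workers are configured.)

```python
# Minimal asyncio sketch: one fetcher making a blocking (synchronous) call
# stalls the shared event loop, so all other worker tasks are delayed too.
import asyncio
import time


async def fetcher(name: str, t0: float, blocking: bool) -> None:
    if blocking:
        time.sleep(2)            # synchronous call, e.g. a blocking auth request
    else:
        await asyncio.sleep(2)   # cooperative: yields the loop to other workers
    print(f"{name} done at t={time.monotonic() - t0:.1f}s")


async def main() -> None:
    t0 = time.monotonic()
    # Nine "workers" change nothing here: the single blocking call freezes the
    # shared event loop, so the eight well-behaved fetchers can only start
    # their 2s waits after it returns (~4s total instead of ~2s).
    await asyncio.gather(
        fetcher("blocking-fetcher", t0, blocking=True),
        *(fetcher(f"worker-{i}", t0, blocking=False) for i in range(8)),
    )


asyncio.run(main())
```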
r
Yes, it's a custom fetcher. The synchronous code was part of an authentication call in the fetcher's `__aenter__` function. I'm wondering if this explains it? And whether moving it into the `_fetch_` function would resolve it? We'd really want to be caching the token here and only refreshing it when needed. Is there any best practice advice on how to structure this in the fetchers?
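(A rough sketch of the token-caching pattern being asked about; the names below, such as `sync_authenticate` and `TokenCache`, are illustrative assumptions and not part of OPAL's fetch-provider API. The idea is to keep the event loop free by running any synchronous auth call in a thread and to refresh the cached token only when it expires.)

```python
import asyncio
import time
from typing import Optional


def sync_authenticate() -> tuple[str, float]:
    """Placeholder for the blocking auth call (e.g. a requests-based login).

    Returns a (token, expiry_timestamp) pair; the real call is an assumption.
    """
    time.sleep(1)  # stands in for the slow synchronous HTTP round-trip
    return "example-token", time.time() + 3600


class TokenCache:
    """Caches a token and refreshes it off the event loop only when needed."""

    def __init__(self) -> None:
        self._token: Optional[str] = None
        self._expires_at: float = 0.0
        self._lock = asyncio.Lock()

    async def get(self) -> str:
        async with self._lock:  # avoid concurrent refreshes from many fetch workers
            if self._token is None or time.time() >= self._expires_at - 60:
                # Run the blocking call in a worker thread so other fetchers keep running.
                self._token, self._expires_at = await asyncio.to_thread(sync_authenticate)
            return self._token


async def demo() -> None:
    cache = TokenCache()
    # Only the first call pays the auth cost; later calls reuse the cached token.
    print(await cache.get())
    print(await cache.get())


asyncio.run(demo())
```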
o
That actually sounds okay - though how you implemented it could still hide a lock on a shared resource that is causing your problem. To try and zero in on the problem, I'd suggest you maybe replace the authentication flow you have there with a fixed cookie / secret - just to test whether that part of the code is the culprit.
👍 1
c
Hey - related question on this… https://docs.opal.ac/getting-started/configuration in your docs you have `FETCHING_WORKER_COUNT` - I'm wondering if in practice this needs to be prefixed with `OPAL` (i.e. `OPAL_FETCHING_WORKER_COUNT`)?
o
Yes. All env vars need to be prefixed with `OPAL_`, without exception.
👍 1
If you missed that, yeah that definitely can be an issue 😅
r
I think we're getting to the bottom of this now. Thanks for your help! One more question on this: given the client only uses one worker, does this mean that it only ever uses one processor thread? Or is something else going on with the workers that means they can run across multiple threads? I'm just thinking about the CPU configuration for the client containers.
o
Yes. For the incoming messages (but only for them) the OPAL client is single threaded. But it's super lightweight - messages shouldn't contain data. And everything else is multithreaded.
👍 1
r
Thanks