# opal
r
That’s extremely odd. Many of our customers (and Permit.io itself) run OPAL under heavy load in production environments without any CPU issues. Can you check which process is using the CPU inside the pod? Using top or any other tool you’re familiar with.
b
Hmm the opal server container doesn't have ps or top, any suggestions? I could create a custom Dockerfile if not
Ah nvm found it from the node, let me take a look
Appears to be the opal server gunicorn process
It seems that the main gunicorn process is spawning 4 workers in a loop and they're immediately dying and being respawned
(just based on the output of
ps aux | grep gunicorn
from the k8s node - all the gunicorn PIDs are constantly changing except one of them)
There are no processes being OOM killed and the pod has sufficient memory available
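For anyone hitting the same wall: one rough way to inspect a container whose image has no ps or top is an ephemeral debug container, or watching the PIDs from the node. A sketch (the pod name is a placeholder, the busybox tag is an assumption):
```
# Ephemeral debug container sharing the opal-server container's process namespace
# (requires ephemeral containers to be enabled in the cluster)
kubectl debug -it <opal-server-pod> --image=busybox:1.36 --target=authz-opal-server -- sh
# then inside the debug shell: top, or ps

# From the k8s node: watch the gunicorn PIDs churn
watch -n1 'ps aux | grep [g]unicorn'

# Rule out OOM kills
dmesg | grep -i oom
kubectl describe pod <opal-server-pod> | grep -i -A 3 'Last State'
```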
o
Quick thoughts here:
• Are the OPAL workers throwing any exceptions?
• Are they being killed externally?
• Maybe a healthcheck misconfiguration?
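Those hypotheses can usually be checked from the kubectl side; a rough sketch (the pod name is a placeholder):
```
# Probe failures, restart counts, and the container's last exit code/reason
kubectl describe pod <opal-server-pod>

# Recent events for the pod (failed probes, kills, evictions)
kubectl get events --field-selector involvedObject.name=<opal-server-pod>

# Logs from the previous container instance, if the whole container restarted
kubectl logs <opal-server-pod> -c authz-opal-server --previous
```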
b
I think I've just figured out that it's related to having
OPAL_STATISTICS_ENABLED=true
set - the workers stop crashing without it
There's this in the logs when I have statistics enabled:
```
[2023-06-09 09:29:32 +0000] [7151] [INFO] Booting worker with pid: 7151
[2023-06-09 09:29:32 +0000] [7120] [INFO] Error while closing socket [Errno 9] Bad file descriptor
[2023-06-09 09:29:32 +0000] [7135] [INFO] Error while closing socket [Errno 9] Bad file descriptor
2023-06-09T09:29:32.990341+0000 | 7141 | opal_server.server                      | INFO  | Trigger worker graceful shutdown
```
o
Odd, can you share the logs from the server? I'd expect an error in the statistics flow. Otherwise, do you maybe have something using the statistics for a healthcheck? Can you share your configuration spec? @Ro'e Katz
b
I don't have any healthchecks configured for opal server currently
configmap envvars:
```
UVICORN_NUM_WORKERS=4
OPAL_LOG_LEVEL=DEBUG
OPAL_STATISTICS_ENABLED=true
```
pod spec:
```
spec:
  replicas: 1
  selector:
    matchLabels:
      app: authz-opal-server
  template:
    metadata:
      labels:
        app: authz-opal-server
        app.kubernetes.io/name: authz-opal-server
        app.kubernetes.io/component: server
    spec:
      containers:
        - name: authz-opal-server
          image: permitio/opal-server
          envFrom:
            - configMapRef:
                name: opal-server-env-config
            - secretRef:
                name: opal-server-env-secrets
          env:
          - name: OPAL_BROADCAST_URI
            value: postgres://postgres:postgres@authz-opal-server-broadcast-service:5432/postgres
          - name: OPAL_POLICY_REPO_URL
            value: git@localhost:/srv/git/policy      
          - name: OPAL_POLICY_REPO_MAIN_BRANCH
            value: master
          - name: OPAL_POLICY_REPO_POLLING_INTERVAL
            value: "86400"
          - name: OPAL_DATA_CONFIG_SOURCES
            value: '{"external_source_url":"<http://localhost/config.json>"}'
          - name: OPAL_LOG_FORMAT_INCLUDE_PID
            value: "true"
          ports:
            - containerPort: 7002
          resources:
            limits:
              memory: 2Gi
            requests:
              memory: 250Mi
```
The localhost git and http servers are provided by sidecar containers, shouldn't be relevant I think.
r
@Ben Wallis My guess would be that you haven’t configured the broadcaster, or that the broadcaster isn’t available. When statistics are enabled, the server immediately tries to use it and probably restarts the worker as a result (we should have better behavior there, or at least a clearer log).
💪 1
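A quick way to test that guess is to check, from inside the cluster, whether the broadcast Postgres is reachable and whether the credentials OPAL is using actually work. A sketch, assuming the postgres:15 image and the service name from the spec above (the password is a placeholder):
```
# Reachability only - pg_isready does not validate credentials
kubectl run broadcast-check --rm -it --restart=Never --image=postgres:15 -- \
  pg_isready -h authz-opal-server-broadcast-service -p 5432

# Actual login with the URI OPAL is configured with
kubectl run broadcast-login --rm -it --restart=Never --image=postgres:15 -- \
  psql "postgres://postgres:<password>@authz-opal-server-broadcast-service:5432/postgres" -c 'select 1'
```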
b
ah! yes that's it - I had it configured, but late in the process I moved the postgres secrets to a separate secrets file and forgot to move the OPAL_BROADCAST_URI envvar there too, so it was using the default "postgres" password as shown above
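A sketch of what the fix could look like, assuming OPAL_BROADCAST_URI is added as a key in the opal-server-env-secrets Secret the pod already pulls in via envFrom (explicit env entries take precedence over envFrom, so the inline OPAL_BROADCAST_URI with the default password also has to be removed from the pod spec):
```
apiVersion: v1
kind: Secret
metadata:
  name: opal-server-env-secrets   # already referenced by the pod's envFrom
type: Opaque
stringData:
  # the real password is a placeholder here
  OPAL_BROADCAST_URI: postgres://postgres:<real-password>@authz-opal-server-broadcast-service:5432/postgres
```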
Yeah a check to see if the broadcast server is available immediately on startup would be nice
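Until OPAL has such a check, a rough workaround could be an initContainer under the Deployment's template.spec that blocks startup until the broadcast Postgres answers (the image tag is an assumption, and pg_isready only checks reachability, not credentials, so it would not have caught the wrong password here):
```
initContainers:
  - name: wait-for-broadcast
    image: postgres:15
    command:
      - sh
      - -c
      - |
        until pg_isready -h authz-opal-server-broadcast-service -p 5432; do
          echo "waiting for broadcast postgres..."
          sleep 2
        done
```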
Thanks for the help 🙂
💜 1