# opal
r
Hey @Ben Wallis, I’m taking a look at this, thank you for sharing the log, it’s very useful.
b
No problem, thanks for looking into it 🙂
r
It does look like the client gets the `OPA transaction failed` error because of the connection failure to the API. That means OPAL Client couldn't load the initial data into OPA. What happens when you deploy the Client separately, while the API is ready to serve requests? Depending on the answer, you might want to change the retry count of the data updater, or add lifecycle hooks to the containers (I need to research how to implement that in this case).
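A minimal sketch of one way to gate the sidecar's start on the API, using an initContainer rather than a lifecycle hook (an initContainer blocks every container in the pod until the dependency answers). The image, URL, and `/healthz` path are placeholder assumptions, and this pattern only fits a dependency running in a different pod:

```yaml
# Hypothetical sketch: hold the whole pod (including the opal-client sidecar)
# until the upstream data API responds. Image, URL, and path are placeholders.
initContainers:
  - name: wait-for-data-api
    image: busybox:1.36
    command:
      - sh
      - -c
      - until wget -q -O /dev/null http://service-a.multi-tenant.svc.cluster.local/healthz; do sleep 2; done
```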
b
It always works if the first try succeeds. In my particular case I have 3 services, each with their own pods, and each pod has an OPAL Client sidecar container; 2 of those services have APIs referenced in `OPAL_DATA_CONFIG_SOURCES`. I can't ever guarantee which other pods will be available at the point that OPAL Client starts. According to the docs at https://docs.opal.ac/tutorials/healthcheck_policy_and_update_callbacks#-opa-healthcheck-policy, if I enable the health check policy it'll only return `ready` when the initial data sources from `OPAL_DATA_CONFIG_SOURCES` have been successfully synced. Doesn't this suggest that there should be infinite retries until this succeeds? Otherwise the pod would be stuck with a failed healthcheck indefinitely.
Sidenote: I just tried to implement the readiness probe in my k8s config, but the ready endpoint `/v1/data/system/opal/healthy` returns HTTP 200 with a body of `{"result":false}` when it's not ready. This means the endpoint can't be used as an HTTP readinessProbe target, since Kubernetes treats any HTTP 200 response as "ready".
I've worked around this with a derived Dockerfile that installs `curl`, which lets me configure a readinessProbe like so:
```yaml
readinessProbe:
  exec:
    command:
      - sh
      - -c
      - curl -s http://localhost:8181/v1/data/system/opal/healthy | grep 'true'
  failureThreshold: 10
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 5
```
This doesn't solve the problem though - it just means I can now correctly identify the application pod as unsuitable for use because the OPA data hasn't been populated.
So I think I understand the issue more now: the reason it gives up after 10 seconds is the default value of `OPAL_FETCHING_CALLBACK_TIMEOUT`. It doesn't care if the fetch succeeds after that time, which is why the OPA transaction fails. However, I think I have a bigger problem. As I mentioned earlier, some of my services with OPAL Client sidecar containers are the source of part of their own authorization data. I have the fetcher config configured with k8s service addresses, which causes a problem for an OPAL Client sidecar that needs to fetch from the service within its own pod: when doing a new deployment, the cluster service DNS still points to the previous instance of the pod because the new one isn't ready yet. I think I might need to look at using topics to allow those services to use `localhost` instead of a cluster DNS address. Perhaps I'm doing something more fundamentally wrong with this design though 😅
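Assuming `OPAL_FETCHING_CALLBACK_TIMEOUT` is set on the client sidecar like any other environment variable, raising it might look like this minimal sketch (the value 60 is arbitrary):

```yaml
# Sketch: give slow-starting data sources more time than the ~10s default
# observed above. The value is illustrative only.
containers:
  - name: opal-client
    env:
      - name: OPAL_FETCHING_CALLBACK_TIMEOUT
        value: "60"
```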
r
First of all, the `healthy` endpoint returning 200 is a problem, I'll prioritize this ASAP. The thing with your deployment can be easily solved with `localhost` like you said. There's no reason to put 2 (or more) containers within one Pod and have them communicate through the load-balancing component (service or deployment FQDN); that's bad practice when it comes to sidecar containers.
b
I found the problem. It was directly related to having the OPAL Client container within the same pod as the service container it needs to query data from: that service container had `initialDelaySeconds` set to 30, meaning it was never available before `OPAL_FETCHING_CALLBACK_TIMEOUT` expired. As for why I'm using the service FQDN: the initial configuration specified in `OPAL_DATA_CONFIG_SOURCES` comes from OPAL Server, right? There's only one instance of that configuration, and if I have Service A, Service B and Service C that all have OPAL Client sidecars needing data from Service A, then I need a URL in that data config that can be accessed by all 3 services, not just Service A, so localhost wouldn't work
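A hedged sketch of the topic-splitting idea mentioned earlier in the thread: publish Service A's data under two topics, one entry pointing at localhost for A's own sidecar and one at the cluster FQDN for everyone else. OPAL Client picks its topics via `OPAL_DATA_TOPICS`; the topic names and URLs here are hypothetical:

```yaml
# Server side (hypothetical): the same data published as two entries,
#   topic "service_a_local"  -> http://localhost:8080/internal/opal-data
#   topic "service_a_remote" -> http://service-a.multi-tenant.svc.cluster.local/internal/opal-data
# Client side: Service A's own sidecar subscribes to the local topic...
env:
  - name: OPAL_DATA_TOPICS
    value: "service_a_local"
# ...while Service B's and C's sidecars subscribe to "service_a_remote" instead.
```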
I've solved the problem with a hack: using `hostAliases` I can configure a pod to resolve its own service FQDN to 127.0.0.1:
```yaml
hostAliases:
  - ip: "127.0.0.1"
    hostnames:
      - "service-a.multi-tenant.svc.cluster.local"
```
This means that OPAL Server can give out the same service FQDN to Service A, Service B, and Service C, but Service A won't be dependent on its own service FQDN (which won't be available until that pod's OPAL Client's livenessProbe succeeds). This kind of setup doesn't seem like it would be unique to what we're doing, so I'd be interested to know if there's an obvious design issue here.

To give a bit more context: Service A is responsible for serving user roles/permissions, which are used as policy data fed into OPA via OPAL. Service B has ABAC-style policy data that is also fed into OPA via OPAL. Both of those services' public APIs have policy checks using OPA from their sidecar, and I have a cluster-internal API on both services that serves data in the correct JSON format to store into OPA via OPAL. Is this an unusual setup? It doesn't seem particularly unusual, but anyone using this kind of setup would surely hit the same issue of a service needing a sidecar that needs data from itself?
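For completeness, a sketch of where the `hostAliases` hack would sit in a full Deployment manifest; the names and images are placeholders from this thread, and it assumes the Service port matches the container port, since after the alias the sidecar's requests land on localhost:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-a
spec:
  selector:
    matchLabels:
      app: service-a
  template:
    metadata:
      labels:
        app: service-a
    spec:
      hostAliases:              # resolve the pod's own service FQDN locally
        - ip: "127.0.0.1"
          hostnames:
            - "service-a.multi-tenant.svc.cluster.local"
      containers:
        - name: service-a
          image: example/service-a      # placeholder application image
        - name: opal-client
          image: permitio/opal-client   # the OPAL Client sidecar
```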
@Raz Co do you know if the above issue of failed healthchecks returning HTTP 200 is planned to be fixed?
r
Hey Ben. Our team is putting a lot of resources right now into the next major release of OPAL, which will include a lot of changes; most of its components will actually be replaced. If this is something very urgent that might block you from deploying OPAL to production, I can try to prioritize it with the team. Please let me know your status and how critical it is for you.
Also, sorry for missing your last message. Do you still need our help with this?
b
It's not a major issue - I just came across the workaround we put in place while writing some internal documentation and wanted to see if there's an update. Is there any published information about the changes in the next release of OPAL that you mentioned? How extensive are the changes?