# opal
r
Hey @Ben Wallis, I’m taking a look at this, thank you for sharing the log, it’s very useful.
b
No problem, thanks for looking into it 🙂
r
It does look like the client gets the `OPA transaction failed` error because of the connection failure to the API. That means OPAL Client couldn't load the initial data into OPA. What happens when you deploy the Client separately, while the API is ready to serve requests? Depending on the answer, you might want to change the retry count of the data updater, or add lifecycle hooks to the containers (I need to research how to implement that in this case).
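A minimal sketch of one way to gate the sidecar's start on the API, using an initContainer rather than a lifecycle hook (an initContainer blocks every container in the pod until the dependency answers). The image, URL, and `/healthz` path are placeholder assumptions, and this pattern only fits a dependency running in a different pod:

```yaml
# Hypothetical sketch: hold the whole pod (including the opal-client sidecar)
# until the upstream data API responds. Image, URL, and path are placeholders.
initContainers:
  - name: wait-for-data-api
    image: busybox:1.36
    command:
      - sh
      - -c
      - until wget -q -O /dev/null http://service-a.multi-tenant.svc.cluster.local/healthz; do sleep 2; done
```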
b
It always works if the first try succeeds. In my particular case I have 3 services, each with their own pods, and each pod has an OPAL Client sidecar container; 2 of those services have APIs referenced in `OPAL_DATA_CONFIG_SOURCES`. I can't ever guarantee which other pods will be available at the point that OPAL Client starts. According to the docs at https://docs.opal.ac/tutorials/healthcheck_policy_and_update_callbacks#-opa-healthcheck-policy, if I enable the health check policy it'll only return `ready` when the initial data sources from `OPAL_DATA_CONFIG_SOURCES` have been successfully synced. Doesn't this suggest that there should be infinite retries until this succeeds? Otherwise the pod would be stuck with a failed healthcheck indefinitely.
Sidenote: I just tried to implement the readiness probe in my k8s config, but the ready endpoint `/v1/data/system/opal/healthy` returns HTTP 200 with a body of `{"result":false}` when it's not ready. This means the endpoint can't be used as an HTTP readinessProbe target, since Kubernetes treats any HTTP 200 response as "ready".
I've worked around this with a derived Dockerfile that installs `curl`, which lets me configure a readinessProbe like so:
```yaml
readinessProbe:
  exec:
    command:
      - sh
      - -c
      - curl -s http://localhost:8181/v1/data/system/opal/healthy | grep 'true'
  failureThreshold: 10
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 5
```
This doesn't solve the problem though - it just means I can now correctly identify the application pod as unsuitable for use because the OPA data hasn't been populated.
So I think I understand the issue more now: the reason it gives up after 10 seconds is the default value of `OPAL_FETCHING_CALLBACK_TIMEOUT`. It doesn't care if the fetch succeeds after that time, which is why the OPA transaction fails. However, I think I have a bigger problem. As I mentioned earlier, some of my services with OPAL Client sidecar containers are the source of part of their own authorization data. I have the fetcher config configured with k8s service addresses, which causes a problem for an OPAL Client sidecar that needs to fetch from the service within its own pod: when doing a new deployment, the cluster service DNS still points to the previous instance of the pod because the new one isn't ready yet. I think I might need to look at using topics to allow those services to use `localhost` instead of a cluster DNS address. Perhaps I'm doing something more fundamentally wrong with this design though 😅
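Assuming `OPAL_FETCHING_CALLBACK_TIMEOUT` is set on the client sidecar like any other environment variable, raising it might look like this minimal sketch (the value 60 is arbitrary):

```yaml
# Sketch: give slow-starting data sources more time than the ~10s default
# observed above. The value is illustrative only.
containers:
  - name: opal-client
    env:
      - name: OPAL_FETCHING_CALLBACK_TIMEOUT
        value: "60"
```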
r
First of all, the `healthy` endpoint returning 200 is a problem, I'll prioritize this ASAP. The thing with your deployment can be easily solved with `localhost` like you said. There's no reason to put 2 (or more) containers within one Pod and have them communicate through the load-balancing component (service or deployment FQDN); that's bad practice when it comes to sidecar containers.
b
I found the problem. It was directly related to having the OPAL Client container within the same pod as the service container it needs to query data from: that service container had `initialDelaySeconds` set to 30, meaning it was never available before `OPAL_FETCHING_CALLBACK_TIMEOUT` expired. As for why I'm using the service FQDN: the initial configuration specified in `OPAL_DATA_CONFIG_SOURCES` comes from OPAL Server, right? There's only one instance of that configuration, and if I have Service A, Service B and Service C that all have OPAL Client sidecars needing data from Service A, then I need a URL in that data config that can be accessed by all 3 services, not just Service A, so localhost wouldn't work
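A hedged sketch of the topic-splitting idea mentioned earlier in the thread: publish Service A's data under two topics, one entry pointing at localhost for A's own sidecar and one at the cluster FQDN for everyone else. OPAL Client picks its topics via `OPAL_DATA_TOPICS`; the topic names and URLs here are hypothetical:

```yaml
# Server side (hypothetical): the same data published as two entries,
#   topic "service_a_local"  -> http://localhost:8080/internal/opal-data
#   topic "service_a_remote" -> http://service-a.multi-tenant.svc.cluster.local/internal/opal-data
# Client side: Service A's own sidecar subscribes to the local topic...
env:
  - name: OPAL_DATA_TOPICS
    value: "service_a_local"
# ...while Service B's and C's sidecars subscribe to "service_a_remote" instead.
```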
I've solved the problem with a hack: using `hostAliases` I can configure a pod to resolve its own service FQDN to 127.0.0.1:
```yaml
hostAliases:
  - ip: "127.0.0.1"
    hostnames:
      - "service-a.multi-tenant.svc.cluster.local"
```
This means that OPAL Server can give out the same service FQDN to Service A, Service B, and Service C, but Service A won't be dependent on its own service FQDN (which won't be available until that pod's OPAL Client's livenessProbe succeeds). This kind of setup doesn't seem like it would be unique to what we're doing, so I'd be interested to know if there's an obvious design issue here.

To give a bit more context: Service A is responsible for serving user roles/permissions, which are used as policy data fed into OPA via OPAL. Service B has ABAC-style policy data that is also fed into OPA via OPAL. Both of those services' public APIs have policy checks using OPA from their sidecar, and I have a cluster-internal API on both services that serves data in the correct JSON format to store into OPA via OPAL. Is this an unusual setup? It doesn't seem particularly unusual, but anyone using this kind of setup would surely hit the same issue of a service needing a sidecar that needs data from itself?
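For completeness, a sketch of where the `hostAliases` hack would sit in a full Deployment manifest; the names and images are placeholders from this thread, and it assumes the Service port matches the container port, since after the alias the sidecar's requests land on localhost:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-a
spec:
  selector:
    matchLabels:
      app: service-a
  template:
    metadata:
      labels:
        app: service-a
    spec:
      hostAliases:              # resolve the pod's own service FQDN locally
        - ip: "127.0.0.1"
          hostnames:
            - "service-a.multi-tenant.svc.cluster.local"
      containers:
        - name: service-a
          image: example/service-a      # placeholder application image
        - name: opal-client
          image: permitio/opal-client   # the OPAL Client sidecar
```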
@Raz Co do you know if the above issue of failed healthchecks returning HTTP 200 is planned to be fixed?
r
Hey Ben. Our team is putting a lot of resources right now into the next major release of OPAL, which will include a lot of changes; most of its components will actually be replaced. If this is something very urgent that might block you from deploying OPAL to production, I can try to prioritize it with the team. Please let me know your status and how critical it is for you.
Also, sorry for missing your last message. Do you still need our help with this?
b
It's not a major issue - I just came across the workaround we put in place while writing some internal documentation and wanted to see if there's an update. Is there any published information about the changes in the next release of OPAL that you mentioned? How extensive are the changes?