# troubleshoot
g
Anyone upgraded to 0.8.28 and seeing the `mae-consumer` and `mce-consumer` failing? It looks like the springboot app starts fine, but /actuator/health returns 404. Reverting to 0.8.27 and they work fine.
g
were there any logs?
g
I tried comparing the working vs non-working logs and they seem to be the same. The pod's health check fails although from the logs it looks like everything is fine. There is even a message about 3 endpoints being created on the /actuator path: metrics, info, health.
```
curl localhost:9090/actuator/health
{"timestamp":"2022-03-09T01:35:10.978+00:00","status":404,"error":"Not Found","message":"Not Found","path":"/actuator/health"}
```
g
out of curiosity, are the pods functional?
e.g., if you run ingestion, will the entities be indexed by the mae consumer (even though it's failing the health check)?
or is it non-functional?
also, are you running your gms container with `MAE_CONSUMER_ENABLED=false`? If not, the gms container will run the mae consumer inside of it
that may be the problem 🤔
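(Roughly what that looks like at the container level; this env block is an illustrative sketch, and the image reference and names are assumptions, not the actual DataHub manifests.)

```yaml
# Kubernetes container spec excerpt - illustrative sketch only
containers:
  - name: datahub-gms
    image: linkedin/datahub-gms:v0.8.28   # image reference assumed for illustration
    env:
      # when the consumers run as standalone pods, the embedded ones in gms
      # are switched off via these flags
      - name: MAE_CONSUMER_ENABLED
        value: "false"
      - name: MCE_CONSUMER_ENABLED
        value: "false"
```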
g
The helm chart is configured with global.datahub_standalone_consumers_enabled = true and is not setting MCE/MAE_CONSUMER_ENABLED=true
I will test whether they are working tomorrow; they don't run very long though, because they get killed after a few minutes by the health check
They run for like 4 minutes or so but never reach ready
```
datahub-datahub-mce-consumer-7cc5475595-c9cwk      0/1     Running   5 (3m21s ago)    20m
```
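(The restart count lines up with kubelet probes hitting the health endpoint; an illustrative probe block is shown below - the timings are assumptions, not the chart's actual values.)

```yaml
# Illustrative readiness/liveness probes - timings are assumptions,
# not the chart's actual values
readinessProbe:
  httpGet:
    path: /actuator/health
    port: 9090
  initialDelaySeconds: 60
  periodSeconds: 30
livenessProbe:
  httpGet:
    path: /actuator/health
    port: 9090
  initialDelaySeconds: 60
  periodSeconds: 30
```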
e
yeah, global.datahub_standalone_consumers_enabled = true auto-sets the above env variables
It is very strange that both are failing and no real error msgs are coming out. We will also try some testing ourselves
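(A minimal values.yaml sketch of that flag; only the key itself is quoted from this thread, and the comment is an assumption about what the chart does with it.)

```yaml
# values.yaml - sketch; only the flag itself is quoted from the thread
global:
  # deploys mae/mce consumers as their own pods and is expected to set
  # MAE_CONSUMER_ENABLED / MCE_CONSUMER_ENABLED=false on the gms container
  datahub_standalone_consumers_enabled: true
```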
g
It's probably something with my environment; however, this is the first time I've run into this after running releases since 0.8.20
s
faced the same issue with `mae`/`mce` while migrating 0.8.27 -> 0.8.28. any resolution for it?
e
We have reproduced it as well. For some reason it seems like the health endpoints are not working. We will update you after further investigation
g
My theory is that it's perhaps related to the Spring libraries. Spring/Spring Boot libraries are like a house of cards. There was a change made here where a later version of Spring Boot was introduced. It might have broken something with the Spring Boot autoconfiguration of the endpoints. Not sure though.
e
We found the issue. While adding the openAPI servlet to GMS, we set the following, which got picked up by the consumer jobs as well. This made the actuator expose /openapi/actuator/health instead of /actuator/health 😞
```yaml
spring:
  mvc:
    servlet:
      path: /openapi
```
we are sending out a fix now. will create a new release afterwards
sorry about the issue and thanks for reporting it!!
cc @orange-night-91387 who is working on getting the fix out!
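(To make the failure mode concrete: with `spring.mvc.servlet.path: /openapi` picked up by the consumers, the same endpoints answer under the prefixed path, so probes hitting /actuator/health get a 404. Below is a sketch of a pre-0.8.29 workaround, assuming Spring's relaxed binding lets the property be overridden via an environment variable on the consumer containers; this is not the released fix. Pointing the probes at the prefixed path would be another option.)

```yaml
# Pre-0.8.29 workaround sketch for the standalone consumer pods
# (assumption, not the released fix): override the servlet path back
# to the default so the actuator answers on /actuator/health again.
env:
  - name: SPRING_MVC_SERVLET_PATH   # relaxed binding for spring.mvc.servlet.path
    value: "/"
```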
o
g
Hey @gentle-night-56466 , @shy-parrot-64120 - the fix has just been released in Datahub 0.8.29!
David - your Java 11 changes are in that release as well.
g
@green-football-43791 - I believe introducing the java 11 change is resulting in a timeout while the gradle 6 toolchain takes time to bootstrap the java 11 jdk. This doesn't seem to happen locally when the jdk11 compiler is already present. The failures look like this:
```
> Task :metadata-integration:java:datahub-client:test

datahub.client.rest.RestEmitterTest > testTimeoutOnGet FAILED
    org.mockserver.client.SocketCommunicationException at RestEmitterTest.java:318
```
g
I see.
would bumping this timeout resolve the issue?
if so, do you have a recommendation for the timeout duration?
g
or, in any case, something around the jdk11 change happened to increase it
g
I'm surprised this wouldn't have come up in CI 😕
g
It did run several times at least, but I see two cases of this today
My other PR and another one that ran today
The jdk 11 changes did not directly touch that module, so maybe I’m just being paranoid. That said, it seems to be happening today.
g
ok — cc @careful-pilot-86309 - I believe you contributed this test
Mugdha would you be able to take a look here?
g
Both these PRs ran today and failed with the same condition: PR1, PR2
g
thanks for the heads up David
s
verifying on our env
works like a charm - thanks a lot folks
c
I checked and I think mockserver is struggling to get a port to start on. I have raised a PR with the quick fix. @mammoth-bear-12532 Please review the PR. This is the same thing we did in the spark-lineage test