# troubleshoot
w
Hi Team! We are repeatedly losing secret values that we enter via the frontend. We suspect a container restart to be the cause, but we can’t tell for sure. It’s hard to imagine that this is common behavior given the static nature of secrets. The version we use is v0.10.2, and we use AWS OpenSearch/RDS/MSK as prerequisites.
a
Hmm, this is potentially a bug. Did this start happening after an update, or is it ongoing?
w
Hi @astonishing-answer-96712, I’m sorry it took a while, but I wanted to collect some more data to help narrow it down as early as possible in the process. We updated to the most recent Helm chart (DataHub v0.10.3) and ran our tests by restarting only one Pod at a time. It looks like restarting the GMS Pod triggers our issue. We ran the same ingestion multiple times after each Pod restart. Now that GMS has been restarted, we see this error:
```
{'exec_id': '7529c81e-9fd3-4038-9702-9aa7035b2e78',
 'infos': ['2023-06-02 12:40:09.819865 INFO: Starting execution for task with name=RUN_INGEST',
           '2023-06-02 12:40:09.828182 INFO: Caught exception EXECUTING task_id=7529c81e-9fd3-4038-9702-9aa7035b2e78, name=RUN_INGEST, '
           'stacktrace=Traceback (most recent call last):\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
           '    task_event_loop.run_until_complete(task_future)\n'
           '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
           '    return future.result()\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 69, in execute\n'
           '    recipe: dict = SubProcessTaskUtil._resolve_recipe(validated_args.recipe, ctx, self.ctx)\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_task_common.py", line 100, in _resolve_recipe\n'
           '    raise TaskError(f"Failed to resolve secret with name {match}. Aborting recipe execution.")\n'
           'acryl.executor.execution.task.TaskError: Failed to resolve secret with name SINK_TOKEN. Aborting recipe execution.\n'],
 'errors': []}
```
“SINK_TOKEN” is the secret we are observing. We entered it prior to our tests, and it was successfully consumed in past ingestion runs.
@astonishing-answer-96712 any chance for a hint here? Thanks in advance
h
I am also seeing this same issue - @astonishing-answer-96712 any ideas?
l
@bulky-soccer-26729 any chance you have context here?
a
hmmm not really but I can try to repro this myself and see if i can see anything going on
@white-guitar-82227 / @helpful-tent-87247 - have you seen this with any other upgrade or just upgrading to 0.10.3?
h
we're on v0.9.6.1
w
We’ve encountered it on 0.10.2 and 0.10.3. The upgrade from 0.10.2 to 0.10.3 didn’t change anything for us. What triggered the loss of the secret was a Pod restart
@bulky-soccer-26729 - we tried to rerun our ingestion and now encounter an issue where it cannot find a secret (“SINK_TOKEN”) at all, even though we freshly entered it.
```
source:
    type: mongodb
    config:
        connect_uri: 'xxxxxxx'
        username: '${USER}'
        password: '${PASSWORD}'
        enableSchemaInference: true
        useRandomSampling: true
        maxSchemaSize: 500
        database_pattern:
            allow:
                - xxxxxx
            deny:
                - admin|local|config|system
            ignoreCase: true
        collection_pattern:
            allow:
                - xxxxxx
            deny:
                - 'system.*'
            ignoreCase: true
sink:
    type: datahub-rest
    config:
        server: 'http://datahub-app-datahub-gms:8080'
        token: '${SINK_TOKEN}'
```
This is our ingestion. The error:
```
~~~~ Execution Summary - RUN_INGEST ~~~~
Execution finished with errors.
{'exec_id': 'e4f45ba9-675a-4080-83ca-dd96948d247f',
 'infos': ['2023-06-27 13:15:26.499701 INFO: Starting execution for task with name=RUN_INGEST',
           '2023-06-27 13:15:26.508139 INFO: Caught exception EXECUTING task_id=e4f45ba9-675a-4080-83ca-dd96948d247f, name=RUN_INGEST, '
           'stacktrace=Traceback (most recent call last):\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
           '    task_event_loop.run_until_complete(task_future)\n'
           '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
           '    return future.result()\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 69, in execute\n'
           '    recipe: dict = SubProcessTaskUtil._resolve_recipe(validated_args.recipe, ctx, self.ctx)\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_task_common.py", line 100, in _resolve_recipe\n'
           '    raise TaskError(f"Failed to resolve secret with name {match}. Aborting recipe execution.")\n'
           'acryl.executor.execution.task.TaskError: Failed to resolve secret with name SINK_TOKEN. Aborting recipe execution.\n'],
 'errors': []}

~~~~ Ingestion Logs ~~~~
```
The exact error message from the GMS pod is:
```
2023-06-27 13:15:26,505 [ForkJoinPool.commonPool-worker-23] 
  ERROR c.datahub.graphql.GraphQLController:107 - 
  Errors while executing graphQL query: \"query getSecretValues($input: GetSecretValuesInput!) {
  getSecretValues(input: $input) { name value  } }\",
  result:
  {errors=[{message=An unknown error occurred., 
  locations=[{line=3,column=17}], 
   path=[getSecretValues], 
   extensions={code=500, 
   type=SERVER_ERROR, 
   classification=DataFetchingException}}], 
   data={getSecretValues=null}, 
   extensions={tracing={version=1, 
   startTime=2023-06-27T13:15:26.501925Z, 
   endTime=2023-06-27T13:15:26.505901Z, 
   duration=3978201, 
   parsing={startOffset=244000, 
   duration=211824}, 
   validation={startOffset=425286, duration=162932}, 
   execution={resolvers=[{path=[getSecretValues], parentType=Query, returnType=[SecretValue!], 
   fieldName=getSecretValues, 
   startOffset=507388, 
   duration=2968027}]}}}}, 
   errors: [DataHubGraphQLError{path=[getSecretValues], code=SERVER_ERROR, locations=[SourceLocation{line=3, column=17}]}]",
```
a
okay thank you all! this is a big help. Peter, are you able to see the secret in the UI still?
w
Yes, all the time
I am about to connect to the DB with DBeaver to see if there is anything interesting in there
I mean helpful in that regard
b
that would be great
w
I will need a bit. It’s RDS and I need to modify SecurityGroups etc
Thank you very much for your assistance so far
a
totally, and no problem! thanks for your patience as we work through this!
w
There is only one table “metadata_aspect_v2” in the database “datahub” under the “public” schema. I would expect the secrets to be in there somewhere, but they aren’t
I am using the same database user that was configured for Datahub and also used by the postgres setup job
Entered a fresh new secret. The only log line I see is as follows:
```
2023-06-27 14:04:13,805 [I/O dispatcher 1] INFO  c.l.m.s.e.update.BulkListener:47 - Successfully fed bulk request. Number of events: 3 Took time ms: -1
```
a
hmm so in mysql, you don't see anything starting with the urn urn:li:dataHubSecret: ?
if you're seeing it in the UI it should be in the db
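For reference, a check along these lines should confirm whether the secret aspects are in the metadata store at all (a sketch; the host, user, and database names are taken from the Helm values shared later in this thread):
```sql
-- Secrets are stored as ordinary entity aspects in metadata_aspect_v2,
-- so their URNs should show up alongside everything else.
-- Run via e.g.: psql -h datahubdb.example.local -U svc_datahub -d datahub
SELECT urn, aspect, version
FROM metadata_aspect_v2
WHERE urn LIKE 'urn:li:dataHubSecret:%';
```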
w
We use Postgres, but that should not matter I guess
a
yeah that shouldn't matter
w
yes, it’s in there
And so is SINK_TOKEN
a
okay cool so that's good.
also if it's showing up in your UI, then it should be in elasticsearch since we use search under the hood to list your secrets
w
what HTTP request would I use to obtain it?
a
the request we make through the ui is the graphql endpoint
listSecrets
docs here: https://datahubproject.io/docs/graphql/queries/#listsecrets
but if you want to look directly at elasticsearch, we use a tool called elasticvue: https://elasticvue.com/
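For reference, a listSecrets query of roughly this shape can be sent to the /api/graphql endpoint (a sketch based on the linked docs; note it returns secret names only, never the stored values, which are only resolved via getSecretValues during ingestion):
```graphql
query {
  listSecrets(input: { start: 0, count: 100 }) {
    total
    secrets {
      urn
      name
    }
  }
}
```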
when are you getting that GMS error you listed above around getSecretValues?
w
when running an ingestion
i can access my ES via curl
ok, connected to ES via elasticvue
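For anyone following along, a quick curl-based sanity check against the search cluster might look like this (endpoint and credentials are placeholders):
```sh
# List all indices, then do a cluster-wide search for the secret's name.
curl -s -u 'datahub:<password>' 'https://<opensearch-endpoint>/_cat/indices?v'
curl -s -u 'datahub:<password>' 'https://<opensearch-endpoint>/_search?q=SINK_TOKEN&pretty'
```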
a
okay gotcha. do you have any more logs around that error in GMS when it calls getSecretValues?
w
ES is empty it seems
a
hmm that doesn't seem right. if you can search for things and your UI is looking normal, ES should be populated
w
```
2023-06-27 13:22:10,352 [ForkJoinPool.commonPool-worker-27] ERROR c.l.d.g.e.DataHubDataFetcherExceptionHandler:21 - Failed to execute DataFetcher
java.util.concurrent.CompletionException: java.lang.RuntimeException: Failed to perform update against input com.linkedin.datahub.graphql.generated.GetSecretValuesInput@7125083c
        at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
        at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
        at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1702)
        at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1692)
        at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
        at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
        at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
Caused by: java.lang.RuntimeException: Failed to perform update against input com.linkedin.datahub.graphql.generated.GetSecretValuesInput@7125083c
        at com.linkedin.datahub.graphql.resolvers.ingest.secret.GetSecretValuesResolver.lambda$get$2(GetSecretValuesResolver.java:87)
        at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
        ... 6 common frames omitted
Caused by: java.lang.RuntimeException: Failed to decrypt value using provided secret!
        at com.linkedin.metadata.secret.SecretService.decrypt(SecretService.java:80)
        at com.linkedin.datahub.graphql.resolvers.ingest.secret.GetSecretValuesResolver.decryptSecret(GetSecretValuesResolver.java:95)
        at com.linkedin.datahub.graphql.resolvers.ingest.secret.GetSecretValuesResolver.lambda$get$1(GetSecretValuesResolver.java:77)
        at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
        at java.base/java.util.HashMap$ValueSpliterator.forEachRemaining(HashMap.java:1693)
        at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
        at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
        at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
        at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
        at com.linkedin.datahub.graphql.resolvers.ingest.secret.GetSecretValuesResolver.lambda$get$2(GetSecretValuesResolver.java:85)
        ... 7 common frames omitted
Caused by: javax.crypto.BadPaddingException: Given final block not properly padded. Such issues can arise if a bad key is used during decryption.
        at java.base/com.sun.crypto.provider.CipherCore.unpad(CipherCore.java:975)
        at java.base/com.sun.crypto.provider.CipherCore.fillOutputBuffer(CipherCore.java:1056)
        at java.base/com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:853)
        at java.base/com.sun.crypto.provider.AESCipher.engineDoFinal(AESCipher.java:446)
        at java.base/javax.crypto.Cipher.doFinal(Cipher.java:2202)
        at com.linkedin.metadata.secret.SecretService.decrypt(SecretService.java:78)
        ... 17 common frames omitted
2023-06-27 13:22:10,352 [ForkJoinPool.commonPool-worker-1] ERROR c.datahub.graphql.GraphQLController:107 - Errors while executing graphQL query: "query getSecretValues($input: GetSecretValuesInput!) {\n\n                getSecretValues(input: $input) {\n\n                    name\n\n                    value\n\n                }\n\n            }", result: {errors=[{message=An unknown error occurred., locations=[{line=3, column=17}], path=[getSecretValues], extensions={code=500, type=SERVER_ERROR, classification=DataFetchingException}}], data={getSecretValues=null}, extensions={tracing={version=1, startTime=2023-06-27T13:22:10.349460Z, endTime=2023-06-27T13:22:10.352648Z, duration=3190479, parsing={startOffset=185674, duration=154469}, validation={startOffset=400365, duration=195731}, execution={resolvers=[{path=[getSecretValues], parentType=Query, returnType=[SecretValue!], fieldName=getSecretValues, startOffset=483188, duration=2147348}]}}}}, errors: [DataHubGraphQLError{path=[getSecretValues], code=SERVER_ERROR, locations=[SourceLocation{line=3, column=17}]}]
2023-06-27 13:22:10,356 [qtp1645547422-520] INFO  c.l.m.r.entity.AspectResource:171 - INGEST PROPOSAL proposal: {aspectName=dataHubExecutionRequestResult, entityKeyAspect={contentType=application/json, value=ByteString(length=46,bytes=7b226964...6237227d)}, entityType=dataHubExecutionRequest, aspect={contentType=application/json, value=ByteString(length=1582,bytes=7b227374...2032327d)}, changeType=UPSERT}
2023-06-27 13:22:10,369 [pool-13-thread-13] INFO  c.l.m.filter.RestliLoggingFilter:55 - POST /aspects?action=ingestProposal - ingestProposal - 200 - 13ms
2023-06-27 13:22:10,774 [I/O dispatcher 1] INFO  c.l.m.s.e.update.BulkListener:47 - Successfully fed bulk request. Number of events: 8 Took time ms: -1
```
AWS metrics say there are approx 6k documents. Maybe I am using elasticvue wrong
OK. Now I see many indices (via curl)
I do see all my secrets in ES
h
so was this ever resolved?
w
nope 😞
h
damn
w
If you look at my last log snippet - there is an error message that might be a hint
I am not familiar enough with Datahub internals to understand it in detail
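The BadPaddingException at the bottom of that trace is the classic symptom of decrypting AES data with a different key than the one it was encrypted with. A minimal sketch of the effect (illustrative only, not DataHub's actual SecretService code):
```python
# Encrypt with one key, decrypt with another: the padding check on the final
# block fails, which Java surfaces as javax.crypto.BadPaddingException. If GMS
# comes back up with a freshly generated encryption key, every stored secret
# hits exactly this.
import os
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

iv = os.urandom(16)
padder = padding.PKCS7(128).padder()
padded = padder.update(b"SINK_TOKEN value") + padder.finalize()
enc = Cipher(algorithms.AES(os.urandom(32)), modes.CBC(iv)).encryptor()
ciphertext = enc.update(padded) + enc.finalize()

dec = Cipher(algorithms.AES(os.urandom(32)), modes.CBC(iv)).decryptor()  # wrong key
garbage = dec.update(ciphertext) + dec.finalize()
unpadder = padding.PKCS7(128).unpadder()
unpadder.update(garbage) + unpadder.finalize()  # almost always raises ValueError: Invalid padding bytes.
```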
h
@astonishing-answer-96712 @bulky-soccer-26729 @little-megabyte-1074 would it be possible to get your team to create a bug ticket to look into this?
a
hey! yes definitely creating a bug ticket now for us to review internally. i appreciate all of your patience on this, i'll keep you updated as we dig into it more
w
Thank you @bulky-soccer-26729 & @helpful-tent-87247
a
in the meantime, Peter, it would be really helpful if you could recreate that and share the GMS error from your last message. that looks like just the graphql error, but there should be more information above it about where this error is coming from, i.e. lower-level logging about what's going on before that
w
The log pasted above is from the GMS log
Would our Helm values be helpful?
a
i don't think it would hurt if you have those easily available
w
Values for prerequisites:
```
elasticsearch:
  enabled: false
neo4j:
  enabled: false
neo4j-community:
  enabled: false
mysql:
  enabled: false
postgresql:
  enabled: false
cp-helm-charts:
  enabled: true
  cp-schema-registry:
    enabled: true
    resources:
      requests:
        cpu: "100m"
        memory: "512Mi"
      limits:
        memory: "512Mi"
    kafka:
      bootstrapServers: "<http://b-3.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:9092,b-2.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:9092,b-1.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:9092|b-3.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:9092,b-2.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:9092,b-1.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:9092>"
  cp-kafka:
    enabled: false
  cp-zookeeper:
    enabled: false
  cp-kafka-rest:
    enabled: false
  cp-kafka-connect:
    enabled: false
  cp-ksql-server:
    enabled: false
  cp-control-center:
    enabled: false
kafka:
  enabled: false
```
Values for Datahub itself:
```
datahub-gms:
  enabled: true
  service:
    type: ClusterIP
  image:
    repository: linkedin/datahub-gms
  extraEnvs:
    - name: METADATA_SERVICE_AUTH_ENABLED
      value: 'true'
  resources:
    limits:
      memory: 4Gi
    requests:
      cpu: 100m
      memory: 4Gi
datahub-frontend:
  enabled: true
  image:
    repository: linkedin/datahub-frontend-react
  extraEnvs:
    - name: METADATA_SERVICE_AUTH_ENABLED
      value: 'true'
  resources:
    limits:
      memory: 1400Mi
    requests:
      cpu: 100m
      memory: 1400Mi
  ingress:
    enabled: true
    annotations:
      alb.ingress.kubernetes.io/ssl-redirect: '443'
      alb.ingress.kubernetes.io/certificate-arn: 'arn:aws:acm:eu-central-1:xxxxxx'
      alb.ingress.kubernetes.io/group.name: infrastructure
      alb.ingress.kubernetes.io/healthcheck-path: '/admin'
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
      alb.ingress.kubernetes.io/scheme: 'internal'
      alb.ingress.kubernetes.io/target-type: 'ip'
      kubernetes.io/ingress.class: 'alb'
    hosts:
      - host: datahub.example.local
        paths:
          - '/*'
  extraVolumes:
    - name: datahub-users
      secret:
       defaultMode: 0444
       secretName: datahub-users-secret
  extraVolumeMounts:
    - name: datahub-users
      mountPath: /datahub-frontend/conf/user.props
      subPath: user.props
  service:
    type: "ClusterIP"
acryl-datahub-actions:
  enabled: true
  image:
    repository: acryldata/datahub-actions
    tag: "v0.0.11"
  resources:
    limits:
      memory: 2Gi
    requests:
      cpu: 300m
      memory: 2Gi
datahub-mae-consumer:
  image:
    repository: linkedin/datahub-mae-consumer
  resources:
    limits:
      memory: 1536Mi
    requests:
      cpu: 100m
      memory: 1536Mi
datahub-mce-consumer:
  image:
    repository: linkedin/datahub-mce-consumer
  resources:
    limits:
      memory: 1536Mi
    requests:
      cpu: 100m
      memory: 1536Mi
datahub-ingestion-cron:
  enabled: false
  image:
    repository: acryldata/datahub-ingestion
elasticsearchSetupJob:
  enabled: true
  image:
    repository: linkedin/datahub-elasticsearch-setup
  extraEnvs:
    - name: USE_AWS_ELASTICSEARCH
      value: "true"
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 300m
      memory: 512Mi
  podSecurityContext:
    fsGroup: 1000
  securityContext:
    runAsUser: 1000
  podAnnotations: {}
kafkaSetupJob:
  enabled: true
  image:
    repository: linkedin/datahub-kafka-setup
  resources:
    limits:
      cpu: 500m
      memory: 1024Mi
    requests:
      cpu: 300m
      memory: 1024Mi
  podSecurityContext:
    fsGroup: 1000
  securityContext:
    runAsUser: 1000
  podAnnotations: {}
mysqlSetupJob:
  enabled: false
  image:
    repository: acryldata/datahub-mysql-setup
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 300m
      memory: 512Mi
  podSecurityContext:
    fsGroup: 1000
  securityContext:
    runAsUser: 1000
  podAnnotations: {}
postgresqlSetupJob:
  enabled: true
  image:
    repository: acryldata/datahub-postgres-setup
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 300m
      memory: 512Mi
  podSecurityContext:
    fsGroup: 1000
  securityContext:
    runAsUser: 1000
  podAnnotations: {}
datahubUpgrade:
  enabled: true
  image:
    repository: acryldata/datahub-upgrade
  batchSize: 1000
  batchDelayMs: 100
  noCodeDataMigration:
    sqlDbType: "POSTGRES"
  podSecurityContext: {}
  securityContext: {}
  podAnnotations: {}
  restoreIndices:
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
      requests:
        cpu: 300m
        memory: 512Mi
global:
  strict_mode: true
  graph_service_impl: elasticsearch
  datahub_analytics_enabled: false
  datahub_standalone_consumers_enabled: false
  elasticsearch:
    host: "<http://vpc-datahub-xxxxxxxxxxxxxxxxxxxxxxxxxx.eu-central-1.es.amazonaws.com|vpc-datahub-xxxxxxxxxxxxxxxxxxxxxxxxxx.eu-central-1.es.amazonaws.com>"
    port: "443"
    skipcheck: "false"
    insecure: "false"
    useSSL: "true"
    region: eu-central-1
    auth:
      username: datahub
      password:
        secretRef: opensearch-secret
        secretKey: opensearch-password
    index:
      enableMappingsReindex: true
      enableSettingsReindex: true
      upgrade:
        cloneIndices: true
        allowDocCountMismatch: false
    search:
      maxTermBucketSize: 20
      exactMatch:
        exclusive: false
        withPrefix: true
        exactFactor: 2.0
        prefixFactor: 1.6
        caseSensitivityFactor: 0.7
        enableStructured: true
      graph:
        timeoutSeconds: 50
        batchSize: 1000
        maxResult: 10000
  kafka:
    bootstrap:
      server: "<http://b-2.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:9092,b-3.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:9092,b-1.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:9092|b-2.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:9092,b-3.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:9092,b-1.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:9092>"
    zookeeper:
      server: "<http://z-3.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:2181,z-2.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:2181,z-1.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:2181|z-3.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:2181,z-2.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:2181,z-1.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:2181>"
    topics:
      metadata_change_event_name: "MetadataChangeEvent_v4"
      failed_metadata_change_event_name: "FailedMetadataChangeEvent_v4"
      metadata_audit_event_name: "MetadataAuditEvent_v4"
      datahub_usage_event_name: "DataHubUsageEvent_v1"
      metadata_change_proposal_topic_name: "MetadataChangeProposal_v1"
      failed_metadata_change_proposal_topic_name: "FailedMetadataChangeProposal_v1"
      metadata_change_log_versioned_topic_name: "MetadataChangeLog_Versioned_v1"
      metadata_change_log_timeseries_topic_name: "MetadataChangeLog_Timeseries_v1"
      platform_event_topic_name: "PlatformEvent_v1"
      datahub_upgrade_history_topic_name: "DataHubUpgradeHistory_v1"
    schemaregistry:
      url: "<http://datahub-prerequisites-cp-schema-registry:8081>"
      type: KAFKA
    partitions: 3
    replicationFactor: 3
  neo4j:
    host: "prerequisites-neo4j-community:7474"
    uri: "<bolt://prerequisites-neo4j-community>"
    username: "neo4j"
    password:
      secretRef: neo4j-secrets
      secretKey: neo4j-password
  sql:
    datasource:
      host: "datahubdb.example.local:5432"
      hostForpostgresqlClient: "datahubdb.example.local"
      port: "5432"
      url: "jdbc:<postgresql://datahubdb.example.local:5432/datahub>"
      driver: "org.postgresql.Driver"
      username: "svc_datahub"
      password:
        secretRef: postgres-secret
        secretKey: postgres-root-password
  datahub:
    gms:
      port: "8080"
      nodePort: "30001"
    monitoring:
      enablePrometheus: true
    mae_consumer:
      port: "9091"
      nodePort: "30002"
    encryptionKey:
      secretRef: "datahub-encryption-secrets"
      secretKey: "encryption_key_secret"
      provisionSecret:
        enabled: true
        autoGenerate: true
    managed_ingestion:
      enabled: true
    metadata_service_authentication:
      enabled: false
      systemClientId: "__datahub_system"
      systemClientSecret:
        secretRef: "datahub-auth-secrets"
        secretKey: "token_service_signing_key"
      tokenService:
        signingKey:
          secretRef: "datahub-auth-secrets"
          secretKey: "token_service_signing_key"
        salt:
          secretRef: "datahub-auth-secrets"
          secretKey: "token_service_salt"
      provisionSecrets:
        enabled: true
        autoGenerate: true
    alwaysEmitChangeLog: true
    enableGraphDiffMode: true
```
We use AWS RDS (Postgres), MSK and OpenSearch and only need the schema-registry from datahub-prerequisites
h
please keep this thread posted on findings and potential resolution!
i
Hello Peter, how are you deploying these Helm charts? For secret provisioning, our Helm charts default to a `lookup` method which requires access to the underlying Kubernetes cluster. If that access is somehow not direct, then it will not work. To fix this issue I would recommend setting global.datahub.encryptionKey.provisionSecret.autoGenerate and global.datahub.metadata_service_authentication.provisionSecrets.autoGenerate to false. This will force you to either specify the secret values in the `values.yaml` file OR provision the secrets yourself and reference them like this: https://github.com/acryldata/datahub-helm/blob/d56333b25996172ae68c01b4aa2f3d0d7de51b05/charts/datahub/values.yaml#L576 For more information on the Helm lookup function, see: https://helm.sh/docs/chart_template_guide/functions_and_pipelines/#using-the-lookup-function
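A sketch of the lookup-based provisioning pattern (not the chart's exact template) shows why this bites: when the chart is rendered without direct cluster access, as in some GitOps flows, `lookup` returns nothing, so a fresh random key is emitted on every sync, replacing the key that encrypted the existing secrets.
```yaml
# Illustrative only. With no cluster access (e.g. `helm template`), lookup
# returns an empty result and the randAlphaNum branch runs on each render.
{{- $existing := lookup "v1" "Secret" .Release.Namespace "datahub-encryption-secrets" }}
{{- $key := randAlphaNum 32 | b64enc }}
{{- if $existing }}
{{- $key = index $existing.data "encryption_key_secret" }}
{{- end }}
apiVersion: v1
kind: Secret
metadata:
  name: datahub-encryption-secrets
data:
  encryption_key_secret: {{ $key }}
```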
w
Hi @incalculable-ocean-74010, thank you for following up. We deploy Datahub via ArgoCD and all secrets (I mean postgres and ES) are created via ExternalSecretsOperator. The secret “datahub-encryption-secrets” is present on our cluster and I guess previously created via that autoGenerate setting/feature
i
Correct. In which case you should make the encryption secret an external secret and disable auto generation + secret provisioning
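For reference, a sketch of what that could look like with External Secrets Operator (the secret store name and remote key path are hypothetical):
```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: datahub-encryption-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager        # hypothetical store
  target:
    name: datahub-encryption-secrets # the name referenced by the Helm values
  data:
    - secretKey: encryption_key_secret
      remoteRef:
        key: datahub/encryption-key  # hypothetical remote path
```
With this in place, global.datahub.encryptionKey.provisionSecret.enabled and autoGenerate should both be false so the chart never touches the secret.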
w
OK. will do so and get back to you ASAP
What about the datahub-app-gms-secret?
BTW, I applied the changes you suggested. datahub-encryption-secret is now a separately deployed secret
h
@white-guitar-82227 can you tell if this resolved the issue?
w
@helpful-tent-87247 we have now rerun our tests. Restarted the GMS pod. Entered the SINK_TOKEN. Ran ingestions. Restarted the GMS pod again and reran the tests to see if the Pod restart killed the SINK_TOKEN secret. A Postgres ingestion ran fine both times. A MongoDB ingestion failed the second time due to a missing SINK_TOKEN (the same token that the Postgres ingestion ran fine with). The GMS log output:
```
2023-07-11 13:18:31,250 [qtp1645547422-279] INFO  c.l.m.r.entity.AspectResource:171 - INGEST PROPOSAL proposal: {aspectName=dataHubExecutionRequestResult, entityKeyAspect={contentType=application/json, value=ByteString(length=46,bytes=7b226964...6138227d)}, entityType=dataHubExecutionRequest, aspect={contentType=application/json, value=ByteString(length=51,bytes=7b227374...3234367d)}, changeType=UPSERT}
2023-07-11 13:18:31,261 [pool-13-thread-5] INFO  c.l.m.filter.RestliLoggingFilter:55 - POST /aspects?action=ingestProposal - ingestProposal - 200 - 11ms
2023-07-11 13:18:31,269 [ForkJoinPool.commonPool-worker-19] ERROR c.l.d.g.e.DataHubDataFetcherExceptionHandler:21 - Failed to execute DataFetcher
java.util.concurrent.CompletionException: java.lang.RuntimeException: Failed to perform update against input com.linkedin.datahub.graphql.generated.GetSecretValuesInput@6e4557d4
	at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
	at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1702)
	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1692)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
Caused by: java.lang.RuntimeException: Failed to perform update against input com.linkedin.datahub.graphql.generated.GetSecretValuesInput@6e4557d4
	at com.linkedin.datahub.graphql.resolvers.ingest.secret.GetSecretValuesResolver.lambda$get$2(GetSecretValuesResolver.java:87)
	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
	... 6 common frames omitted
Caused by: java.lang.RuntimeException: Failed to decrypt value using provided secret!
	at com.linkedin.metadata.secret.SecretService.decrypt(SecretService.java:80)
	at com.linkedin.datahub.graphql.resolvers.ingest.secret.GetSecretValuesResolver.decryptSecret(GetSecretValuesResolver.java:95)
	at com.linkedin.datahub.graphql.resolvers.ingest.secret.GetSecretValuesResolver.lambda$get$1(GetSecretValuesResolver.java:77)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
	at java.base/java.util.HashMap$ValueSpliterator.forEachRemaining(HashMap.java:1693)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
	at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
	at com.linkedin.datahub.graphql.resolvers.ingest.secret.GetSecretValuesResolver.lambda$get$2(GetSecretValuesResolver.java:85)
	... 7 common frames omitted
Caused by: javax.crypto.BadPaddingException: Given final block not properly padded. Such issues can arise if a bad key is used during decryption.
	at java.base/com.sun.crypto.provider.CipherCore.unpad(CipherCore.java:975)
	at java.base/com.sun.crypto.provider.CipherCore.fillOutputBuffer(CipherCore.java:1056)
	at java.base/com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:853)
	at java.base/com.sun.crypto.provider.AESCipher.engineDoFinal(AESCipher.java:446)
	at java.base/javax.crypto.Cipher.doFinal(Cipher.java:2202)
	at com.linkedin.metadata.secret.SecretService.decrypt(SecretService.java:78)
	... 17 common frames omitted
2023-07-11 13:18:31,269 [ForkJoinPool.commonPool-worker-23] ERROR c.datahub.graphql.GraphQLController:107 - Errors while executing graphQL query: "query getSecretValues($input: GetSecretValuesInput!) {\n\n                getSecretValues(input: $input) {\n\n                    name\n\n                    value\n\n                }\n\n            }", result: {errors=[{message=An unknown error occurred., locations=[{line=3, column=17}], path=[getSecretValues], extensions={code=500, type=SERVER_ERROR, classification=DataFetchingException}}], data={getSecretValues=null}, extensions={tracing={version=1, startTime=2023-07-11T13:18:31.265782Z, endTime=2023-07-11T13:18:31.269727Z, duration=3947655, parsing={startOffset=330873, duration=292201}, validation={startOffset=572721, duration=220571}, execution={resolvers=[{path=[getSecretValues], parentType=Query, returnType=[SecretValue!], fieldName=getSecretValues, startOffset=670785, duration=2645164}]}}}}, errors: [DataHubGraphQLError{path=[getSecretValues], code=SERVER_ERROR, locations=[SourceLocation{line=3, column=17}]}]
2023-07-11 13:18:31,274 [qtp1645547422-279] INFO  c.l.m.r.entity.AspectResource:171 - INGEST PROPOSAL proposal: {aspectName=dataHubExecutionRequestResult, entityKeyAspect={contentType=application/json, value=ByteString(length=46,bytes=7b226964...6138227d)}, entityType=dataHubExecutionRequest, aspect={contentType=application/json, value=ByteString(length=1568,bytes=7b227374...2032377d)}, changeType=UPSERT}
2023-07-11 13:18:31,290 [pool-13-thread-4] INFO  c.l.m.filter.RestliLoggingFilter:55 - POST /aspects?action=ingestProposal - ingestProposal - 200 - 16ms
2023-07-11 13:18:31,691 [I/O dispatcher 1] INFO  c.l.m.s.e.update.BulkListener:47 - Successfully fed bulk request. Number of events: 7 Took time ms: -1
```
h
very interesting that the same secret was available for one ingestion job and not for another
w
Indeed
h
ok so i'm going to set these values:
```
encryptionKey:
      secretRef: "datahub-encryption-secrets"
      secretKey: "encryption_key_secret"
      # Set to false if you'd like to provide your own secret.
      provisionSecret:
        enabled: false
        autoGenerate: false
        annotations: {}
      # Only specify if autoGenerate set to false
      #  secretValues:
      #    encryptionKey: <encryption key value>
```
but was there anything else you did to make sure the datahub-encryption-secret was available as an external secret?
w
I reused the secret value from the time the secret was auto-generated. It’s just that the secret object itself was now deployed externally
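That step matters: the externally managed secret has to carry the same key that originally encrypted the stored values. One way to capture the auto-generated value before switching over, sketched with kubectl:
```sh
# Read the current auto-generated encryption key so it can be re-created
# verbatim as an externally managed secret.
kubectl get secret datahub-encryption-secrets \
  -o jsonpath='{.data.encryption_key_secret}' | base64 -d
```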
The ingestion YAML is exactly the same in both ingestions
@helpful-tent-87247 is there anything you’d need from me apart from the above provided information?
h
we fixed it
w
Hi @helpful-tent-87247, does version 0.10.5 contain this fix? So far we cannot confirm our problem got resolved
w
Hi @helpful-tent-87247 - is this fix part of version 0.11.0? We (@white-guitar-82227 is a colleague of mine) were still seeing the issue with v0.10.5
b
Hi @helpful-tent-87247, can you elaborate on what you did? Simply saying you fixed it, and nothing else, really doesn’t help anyone else who is facing this issue.
w
@bumpy-manchester-97826 as for our part we can confirm that the problem is gone with release v0.11.0
b
Thanks for the reply @white-guitar-82227 We’re on v0.12.0 now and everything is working as expected