# troubleshoot
w
Hi Team! We are repeatedly losing secret values that we enter via the frontend. We suspect a container restart to be the cause, but we can’t tell for sure. It’s hard to imagine that this is common behavior given the static nature of secrets. The version we use is v0.10.2, and we use AWS OpenSearch/RDS/MSK as prerequisites.
a
Hmm, this is potentially a bug. Did this start happening after an update, or is it ongoing?
w
Hi @astonishing-answer-96712, I’m sorry it took a while, but I wanted to collect some more data to help narrow it down as early as possible in the process. We updated to the most recent Helm chart (DataHub v0.10.3) and ran our tests by restarting only one Pod at a time. It looks like restarting the GMS Pod triggers our issue. We ran the same ingestion multiple times after each Pod restart. Now that GMS has been restarted, we see this error:
```
{'exec_id': '7529c81e-9fd3-4038-9702-9aa7035b2e78',
 'infos': ['2023-06-02 12:40:09.819865 INFO: Starting execution for task with name=RUN_INGEST',
           '2023-06-02 12:40:09.828182 INFO: Caught exception EXECUTING task_id=7529c81e-9fd3-4038-9702-9aa7035b2e78, name=RUN_INGEST, '
           'stacktrace=Traceback (most recent call last):\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
           '    task_event_loop.run_until_complete(task_future)\n'
           '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
           '    return future.result()\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 69, in execute\n'
           '    recipe: dict = SubProcessTaskUtil._resolve_recipe(validated_args.recipe, ctx, self.ctx)\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_task_common.py", line 100, in _resolve_recipe\n'
           '    raise TaskError(f"Failed to resolve secret with name {match}. Aborting recipe execution.")\n'
           'acryl.executor.execution.task.TaskError: Failed to resolve secret with name SINK_TOKEN. Aborting recipe execution.\n'],
 'errors': []}
```
“SINK_TOKEN” is the secret we are observing. We entered it prior to our tests, and it was successfully consumed in past ingestion runs.
@astonishing-answer-96712 any chance for a hint here? Thanks in advance
h
I am also seeing this same issue - @astonishing-answer-96712 any ideas?
l
@bulky-soccer-26729 any chance you have context here?
a
hmmm not really but I can try to repro this myself and see if i can see anything going on
@white-guitar-82227 / @helpful-tent-87247 - have you seen this with any other upgrade or just upgrading to 0.10.3?
h
we're on v0.9.6.1
w
We’ve encountered it on 0.10.2 and 0.10.3. The upgrade from 0.10.2 to 0.10.3 didn’t change anything for us. What triggered the loss of the secret was a Pod restart
@bulky-soccer-26729 - we tried to rerun our ingestion and now encounter an issue where it cannot find a secret (“SINK_TOKEN”) at all, even though we freshly entered it.
```
source:
    type: mongodb
    config:
        connect_uri: 'xxxxxxx'
        username: '${USER}'
        password: '${PASSWORD}'
        enableSchemaInference: true
        useRandomSampling: true
        maxSchemaSize: 500
        database_pattern:
            allow:
                - xxxxxx
            deny:
                - admin|local|config|system
            ignoreCase: true
        collection_pattern:
            allow:
                - xxxxxx
            deny:
                - 'system.*'
            ignoreCase: true
sink:
    type: datahub-rest
    config:
        server: 'http://datahub-app-datahub-gms:8080'
        token: '${SINK_TOKEN}'
```
This is our ingestion. The error:
```
~~~~ Execution Summary - RUN_INGEST ~~~~
Execution finished with errors.
{'exec_id': 'e4f45ba9-675a-4080-83ca-dd96948d247f',
 'infos': ['2023-06-27 13:15:26.499701 INFO: Starting execution for task with name=RUN_INGEST',
           '2023-06-27 13:15:26.508139 INFO: Caught exception EXECUTING task_id=e4f45ba9-675a-4080-83ca-dd96948d247f, name=RUN_INGEST, '
           'stacktrace=Traceback (most recent call last):\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
           '    task_event_loop.run_until_complete(task_future)\n'
           '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
           '    return future.result()\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 69, in execute\n'
           '    recipe: dict = SubProcessTaskUtil._resolve_recipe(validated_args.recipe, ctx, self.ctx)\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_task_common.py", line 100, in _resolve_recipe\n'
           '    raise TaskError(f"Failed to resolve secret with name {match}. Aborting recipe execution.")\n'
           'acryl.executor.execution.task.TaskError: Failed to resolve secret with name SINK_TOKEN. Aborting recipe execution.\n'],
 'errors': []}

~~~~ Ingestion Logs ~~~~
```
The exact error message from the GMS pod is:
```
2023-06-27 13:15:26,505 [ForkJoinPool.commonPool-worker-23] 
  ERROR c.datahub.graphql.GraphQLController:107 - 
  Errors while executing graphQL query: \"query getSecretValues($input: GetSecretValuesInput!) {
  getSecretValues(input: $input) { name value  } }\",
  result:
  {errors=[{message=An unknown error occurred., 
  locations=[{line=3,column=17}], 
   path=[getSecretValues], 
   extensions={code=500, 
   type=SERVER_ERROR, 
   classification=DataFetchingException}}], 
   data={getSecretValues=null}, 
   extensions={tracing={version=1, 
   startTime=2023-06-27T13:15:26.501925Z, 
   endTime=2023-06-27T13:15:26.505901Z, 
   duration=3978201, 
   parsing={startOffset=244000, 
   duration=211824}, 
   validation={startOffset=425286, duration=162932}, 
   execution={resolvers=[{path=[getSecretValues], parentType=Query, returnType=[SecretValue!], 
   fieldName=getSecretValues, 
   startOffset=507388, 
   duration=2968027}]}}}}, 
   errors: [DataHubGraphQLError{path=[getSecretValues], code=SERVER_ERROR, locations=[SourceLocation{line=3, column=17}]}]",
```
a
okay thank you all! this is a big help. Peter, are you able to see the secret in the UI still?
w
Yes, all the time
I am about to connect to the DB with DBeaver to see if there is anything interesting in there
I mean helpful in that regard
b
that would be great
w
I will need a bit. It’s RDS and I need to modify SecurityGroups etc
Thank you very much for your assistance so far
a
totally, and no problem! thanks for your patience as we work through this!
w
There is only one table “metadata_aspect_v2” in the database “datahub” under the “public” schema. I would expect the secrets to be in there somewhere, but they aren’t
I am using the same database user that was configured for Datahub and also used by the postgres setup job
Entered a fresh new secret. The only log line I see is as follows:
```
2023-06-27 14:04:13,805 [I/O dispatcher 1] INFO  c.l.m.s.e.update.BulkListener:47 - Successfully fed bulk request. Number of events: 3 Took time ms: -1
```
a
hmm so in mysql, you don't see anything starting with the urn urn:li:dataHubSecret: ?
if you're seeing it in the UI it should be in the db
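For reference, a check along these lines should confirm whether the secret aspects are in the metadata store at all (a sketch; the host, user, and database names are taken from the Helm values shared later in this thread):
```sql
-- Secrets are stored as ordinary entity aspects in metadata_aspect_v2,
-- so their URNs should show up alongside everything else.
-- Run via e.g.: psql -h datahubdb.example.local -U svc_datahub -d datahub
SELECT urn, aspect, version
FROM metadata_aspect_v2
WHERE urn LIKE 'urn:li:dataHubSecret:%';
```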
w
We use Postgres, but that should not matter I guess
a
yeah that shouldn't matter
w
yes, it’s in there
And so is SINK_TOKEN
a
okay cool so that's good.
also if it's showing up in your UI, then it should be in elasticsearch since we use search under the hood to list your secrets
w
what HTTP request would I use to obtain it?
a
the request we make through the ui is the graphql endpoint
listSecrets
docs here: https://datahubproject.io/docs/graphql/queries/#listsecrets
but if you want to look directly at elasticsearch, we use a tool called elasticvue: https://elasticvue.com/
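For reference, a listSecrets query of roughly this shape can be sent to the /api/graphql endpoint (a sketch based on the linked docs; note it returns secret names only, never the stored values, which are only resolved via getSecretValues during ingestion):
```graphql
query {
  listSecrets(input: { start: 0, count: 100 }) {
    total
    secrets {
      urn
      name
    }
  }
}
```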
when are you getting that GMS error you listed above around getSecretValues?
w
when running an ingestion
i can access my ES via curl
ok, connected to ES via elasticvue
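For anyone following along, a quick curl-based sanity check against the search cluster might look like this (endpoint and credentials are placeholders):
```sh
# List all indices, then do a cluster-wide search for the secret's name.
curl -s -u 'datahub:<password>' 'https://<opensearch-endpoint>/_cat/indices?v'
curl -s -u 'datahub:<password>' 'https://<opensearch-endpoint>/_search?q=SINK_TOKEN&pretty'
```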
a
okay gotcha. do you have any more logs around that error in GMS when it calls getSecretValues?
w
ES is empty it seems
a
hmm that doesn't seem right. if you can search for things and your UI is looking normal, ES should be populated
w
```
2023-06-27 13:22:10,352 [ForkJoinPool.commonPool-worker-27] ERROR c.l.d.g.e.DataHubDataFetcherExceptionHandler:21 - Failed to execute DataFetcher
java.util.concurrent.CompletionException: java.lang.RuntimeException: Failed to perform update against input com.linkedin.datahub.graphql.generated.GetSecretValuesInput@7125083c
        at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
        at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
        at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1702)
        at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1692)
        at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
        at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
        at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
Caused by: java.lang.RuntimeException: Failed to perform update against input com.linkedin.datahub.graphql.generated.GetSecretValuesInput@7125083c
        at com.linkedin.datahub.graphql.resolvers.ingest.secret.GetSecretValuesResolver.lambda$get$2(GetSecretValuesResolver.java:87)
        at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
        ... 6 common frames omitted
Caused by: java.lang.RuntimeException: Failed to decrypt value using provided secret!
        at com.linkedin.metadata.secret.SecretService.decrypt(SecretService.java:80)
        at com.linkedin.datahub.graphql.resolvers.ingest.secret.GetSecretValuesResolver.decryptSecret(GetSecretValuesResolver.java:95)
        at com.linkedin.datahub.graphql.resolvers.ingest.secret.GetSecretValuesResolver.lambda$get$1(GetSecretValuesResolver.java:77)
        at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
        at java.base/java.util.HashMap$ValueSpliterator.forEachRemaining(HashMap.java:1693)
        at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
        at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
        at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
        at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
        at com.linkedin.datahub.graphql.resolvers.ingest.secret.GetSecretValuesResolver.lambda$get$2(GetSecretValuesResolver.java:85)
        ... 7 common frames omitted
Caused by: javax.crypto.BadPaddingException: Given final block not properly padded. Such issues can arise if a bad key is used during decryption.
        at java.base/com.sun.crypto.provider.CipherCore.unpad(CipherCore.java:975)
        at java.base/com.sun.crypto.provider.CipherCore.fillOutputBuffer(CipherCore.java:1056)
        at java.base/com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:853)
        at java.base/com.sun.crypto.provider.AESCipher.engineDoFinal(AESCipher.java:446)
        at java.base/javax.crypto.Cipher.doFinal(Cipher.java:2202)
        at com.linkedin.metadata.secret.SecretService.decrypt(SecretService.java:78)
        ... 17 common frames omitted
2023-06-27 13:22:10,352 [ForkJoinPool.commonPool-worker-1] ERROR c.datahub.graphql.GraphQLController:107 - Errors while executing graphQL query: "query getSecretValues($input: GetSecretValuesInput!) {\n\n                getSecretValues(input: $input) {\n\n                    name\n\n                    value\n\n                }\n\n            }", result: {errors=[{message=An unknown error occurred., locations=[{line=3, column=17}], path=[getSecretValues], extensions={code=500, type=SERVER_ERROR, classification=DataFetchingException}}], data={getSecretValues=null}, extensions={tracing={version=1, startTime=2023-06-27T13:22:10.349460Z, endTime=2023-06-27T13:22:10.352648Z, duration=3190479, parsing={startOffset=185674, duration=154469}, validation={startOffset=400365, duration=195731}, execution={resolvers=[{path=[getSecretValues], parentType=Query, returnType=[SecretValue!], fieldName=getSecretValues, startOffset=483188, duration=2147348}]}}}}, errors: [DataHubGraphQLError{path=[getSecretValues], code=SERVER_ERROR, locations=[SourceLocation{line=3, column=17}]}]
2023-06-27 13:22:10,356 [qtp1645547422-520] INFO  c.l.m.r.entity.AspectResource:171 - INGEST PROPOSAL proposal: {aspectName=dataHubExecutionRequestResult, entityKeyAspect={contentType=application/json, value=ByteString(length=46,bytes=7b226964...6237227d)}, entityType=dataHubExecutionRequest, aspect={contentType=application/json, value=ByteString(length=1582,bytes=7b227374...2032327d)}, changeType=UPSERT}
2023-06-27 13:22:10,369 [pool-13-thread-13] INFO  c.l.m.filter.RestliLoggingFilter:55 - POST /aspects?action=ingestProposal - ingestProposal - 200 - 13ms
2023-06-27 13:22:10,774 [I/O dispatcher 1] INFO  c.l.m.s.e.update.BulkListener:47 - Successfully fed bulk request. Number of events: 8 Took time ms: -1
```
AWS metrics say there are approx 6k documents. Maybe I am using elasticvue wrong
OK. Now I see many indices (via curl)
I do see all my secrets in ES
h
so was this ever resolved?
w
nope 😞
h
damn
w
If you look at my last log snippet - there is an error message that might be a hint
I am not familiar enough with Datahub internals to understand it in detail
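The BadPaddingException at the bottom of that trace is the classic symptom of decrypting AES data with a different key than the one it was encrypted with. A minimal sketch of the effect (illustrative only, not DataHub's actual SecretService code):
```python
# Encrypt with one key, decrypt with another: the padding check on the final
# block fails, which Java surfaces as javax.crypto.BadPaddingException. If GMS
# comes back up with a freshly generated encryption key, every stored secret
# hits exactly this.
import os
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

iv = os.urandom(16)
padder = padding.PKCS7(128).padder()
padded = padder.update(b"SINK_TOKEN value") + padder.finalize()
enc = Cipher(algorithms.AES(os.urandom(32)), modes.CBC(iv)).encryptor()
ciphertext = enc.update(padded) + enc.finalize()

dec = Cipher(algorithms.AES(os.urandom(32)), modes.CBC(iv)).decryptor()  # wrong key
garbage = dec.update(ciphertext) + dec.finalize()
unpadder = padding.PKCS7(128).unpadder()
unpadder.update(garbage) + unpadder.finalize()  # almost always raises ValueError: Invalid padding bytes.
```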
h
@astonishing-answer-96712 @bulky-soccer-26729 @little-megabyte-1074 would it be possible to get your team to create a bug ticket to look into this?
a
hey! yes definitely creating a bug ticket now for us to review internally. i appreciate all of your patience on this, i'll keep you updated as we dig into it more
w
Thank you @bulky-soccer-26729 & @helpful-tent-87247
a
in the meantime, Peter, it would be really helpful if you could recreate that and share the GMS error from your last message. that looks like just the graphql error, but there should be more information above it about where this error is coming from, i.e. lower-level logging about what's going on before that
w
The log pasted above is from the GMS log
Would our Helm values be helpful?
a
i don't think it would hurt if you have those easily available
w
Values for prerequisites:
```
elasticsearch:
  enabled: false
neo4j:
  enabled: false
neo4j-community:
  enabled: false
mysql:
  enabled: false
postgresql:
  enabled: false
cp-helm-charts:
  enabled: true
  cp-schema-registry:
    enabled: true
    resources:
      requests:
        cpu: "100m"
        memory: "512Mi"
      limits:
        memory: "512Mi"
    kafka:
      bootstrapServers: "<http://b-3.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:9092,b-2.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:9092,b-1.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:9092|b-3.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:9092,b-2.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:9092,b-1.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:9092>"
  cp-kafka:
    enabled: false
  cp-zookeeper:
    enabled: false
  cp-kafka-rest:
    enabled: false
  cp-kafka-connect:
    enabled: false
  cp-ksql-server:
    enabled: false
  cp-control-center:
    enabled: false
kafka:
  enabled: false
```
Values for Datahub itself:
```
datahub-gms:
  enabled: true
  service:
    type: ClusterIP
  image:
    repository: linkedin/datahub-gms
  extraEnvs:
    - name: METADATA_SERVICE_AUTH_ENABLED
      value: 'true'
  resources:
    limits:
      memory: 4Gi
    requests:
      cpu: 100m
      memory: 4Gi
datahub-frontend:
  enabled: true
  image:
    repository: linkedin/datahub-frontend-react
  extraEnvs:
    - name: METADATA_SERVICE_AUTH_ENABLED
      value: 'true'
  resources:
    limits:
      memory: 1400Mi
    requests:
      cpu: 100m
      memory: 1400Mi
  ingress:
    enabled: true
    annotations:
      alb.ingress.kubernetes.io/ssl-redirect: '443'
      alb.ingress.kubernetes.io/certificate-arn: 'arn:aws:acm:eu-central-1:xxxxxx'
      alb.ingress.kubernetes.io/group.name: infrastructure
      alb.ingress.kubernetes.io/healthcheck-path: '/admin'
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
      alb.ingress.kubernetes.io/scheme: 'internal'
      alb.ingress.kubernetes.io/target-type: 'ip'
      kubernetes.io/ingress.class: 'alb'
    hosts:
      - host: datahub.example.local
        paths:
          - '/*'
  extraVolumes:
    - name: datahub-users
      secret:
       defaultMode: 0444
       secretName: datahub-users-secret
  extraVolumeMounts:
    - name: datahub-users
      mountPath: /datahub-frontend/conf/user.props
      subPath: user.props
  service:
    type: "ClusterIP"
acryl-datahub-actions:
  enabled: true
  image:
    repository: acryldata/datahub-actions
    tag: "v0.0.11"
  resources:
    limits:
      memory: 2Gi
    requests:
      cpu: 300m
      memory: 2Gi
datahub-mae-consumer:
  image:
    repository: linkedin/datahub-mae-consumer
  resources:
    limits:
      memory: 1536Mi
    requests:
      cpu: 100m
      memory: 1536Mi
datahub-mce-consumer:
  image:
    repository: linkedin/datahub-mce-consumer
  resources:
    limits:
      memory: 1536Mi
    requests:
      cpu: 100m
      memory: 1536Mi
datahub-ingestion-cron:
  enabled: false
  image:
    repository: acryldata/datahub-ingestion
elasticsearchSetupJob:
  enabled: true
  image:
    repository: linkedin/datahub-elasticsearch-setup
  extraEnvs:
    - name: USE_AWS_ELASTICSEARCH
      value: "true"
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 300m
      memory: 512Mi
  podSecurityContext:
    fsGroup: 1000
  securityContext:
    runAsUser: 1000
  podAnnotations: {}
kafkaSetupJob:
  enabled: true
  image:
    repository: linkedin/datahub-kafka-setup
  resources:
    limits:
      cpu: 500m
      memory: 1024Mi
    requests:
      cpu: 300m
      memory: 1024Mi
  podSecurityContext:
    fsGroup: 1000
  securityContext:
    runAsUser: 1000
  podAnnotations: {}
mysqlSetupJob:
  enabled: false
  image:
    repository: acryldata/datahub-mysql-setup
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 300m
      memory: 512Mi
  podSecurityContext:
    fsGroup: 1000
  securityContext:
    runAsUser: 1000
  podAnnotations: {}
postgresqlSetupJob:
  enabled: true
  image:
    repository: acryldata/datahub-postgres-setup
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 300m
      memory: 512Mi
  podSecurityContext:
    fsGroup: 1000
  securityContext:
    runAsUser: 1000
  podAnnotations: {}
datahubUpgrade:
  enabled: true
  image:
    repository: acryldata/datahub-upgrade
  batchSize: 1000
  batchDelayMs: 100
  noCodeDataMigration:
    sqlDbType: "POSTGRES"
  podSecurityContext: {}
  securityContext: {}
  podAnnotations: {}
  restoreIndices:
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
      requests:
        cpu: 300m
        memory: 512Mi
global:
  strict_mode: true
  graph_service_impl: elasticsearch
  datahub_analytics_enabled: false
  datahub_standalone_consumers_enabled: false
  elasticsearch:
    host: "<http://vpc-datahub-xxxxxxxxxxxxxxxxxxxxxxxxxx.eu-central-1.es.amazonaws.com|vpc-datahub-xxxxxxxxxxxxxxxxxxxxxxxxxx.eu-central-1.es.amazonaws.com>"
    port: "443"
    skipcheck: "false"
    insecure: "false"
    useSSL: "true"
    region: eu-central-1
    auth:
      username: datahub
      password:
        secretRef: opensearch-secret
        secretKey: opensearch-password
    index:
      enableMappingsReindex: true
      enableSettingsReindex: true
      upgrade:
        cloneIndices: true
        allowDocCountMismatch: false
    search:
      maxTermBucketSize: 20
      exactMatch:
        exclusive: false
        withPrefix: true
        exactFactor: 2.0
        prefixFactor: 1.6
        caseSensitivityFactor: 0.7
        enableStructured: true
      graph:
        timeoutSeconds: 50
        batchSize: 1000
        maxResult: 10000
  kafka:
    bootstrap:
      server: "<http://b-2.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:9092,b-3.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:9092,b-1.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:9092|b-2.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:9092,b-3.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:9092,b-1.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:9092>"
    zookeeper:
      server: "<http://z-3.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:2181,z-2.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:2181,z-1.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:2181|z-3.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:2181,z-2.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:2181,z-1.datahubcluster.xxxxxx.c3.kafka.eu-central-1.amazonaws.com:2181>"
    topics:
      metadata_change_event_name: "MetadataChangeEvent_v4"
      failed_metadata_change_event_name: "FailedMetadataChangeEvent_v4"
      metadata_audit_event_name: "MetadataAuditEvent_v4"
      datahub_usage_event_name: "DataHubUsageEvent_v1"
      metadata_change_proposal_topic_name: "MetadataChangeProposal_v1"
      failed_metadata_change_proposal_topic_name: "FailedMetadataChangeProposal_v1"
      metadata_change_log_versioned_topic_name: "MetadataChangeLog_Versioned_v1"
      metadata_change_log_timeseries_topic_name: "MetadataChangeLog_Timeseries_v1"
      platform_event_topic_name: "PlatformEvent_v1"
      datahub_upgrade_history_topic_name: "DataHubUpgradeHistory_v1"
    schemaregistry:
      url: "<http://datahub-prerequisites-cp-schema-registry:8081>"
      type: KAFKA
    partitions: 3
    replicationFactor: 3
  neo4j:
    host: "prerequisites-neo4j-community:7474"
    uri: "<bolt://prerequisites-neo4j-community>"
    username: "neo4j"
    password:
      secretRef: neo4j-secrets
      secretKey: neo4j-password
  sql:
    datasource:
      host: "datahubdb.example.local:5432"
      hostForpostgresqlClient: "datahubdb.example.local"
      port: "5432"
      url: "jdbc:<postgresql://datahubdb.example.local:5432/datahub>"
      driver: "org.postgresql.Driver"
      username: "svc_datahub"
      password:
        secretRef: postgres-secret
        secretKey: postgres-root-password
  datahub:
    gms:
      port: "8080"
      nodePort: "30001"
    monitoring:
      enablePrometheus: true
    mae_consumer:
      port: "9091"
      nodePort: "30002"
    encryptionKey:
      secretRef: "datahub-encryption-secrets"
      secretKey: "encryption_key_secret"
      provisionSecret:
        enabled: true
        autoGenerate: true
    managed_ingestion:
      enabled: true
    metadata_service_authentication:
      enabled: false
      systemClientId: "__datahub_system"
      systemClientSecret:
        secretRef: "datahub-auth-secrets"
        secretKey: "token_service_signing_key"
      tokenService:
        signingKey:
          secretRef: "datahub-auth-secrets"
          secretKey: "token_service_signing_key"
        salt:
          secretRef: "datahub-auth-secrets"
          secretKey: "token_service_salt"
      provisionSecrets:
        enabled: true
        autoGenerate: true
    alwaysEmitChangeLog: true
    enableGraphDiffMode: true
```
We use AWS RDS (Postgres), MSK and OpenSearch and only need the schema-registry from datahub-prerequisites
h
please keep this thread posted on findings and potential resolution!
i
Hello Peter, how are you deploying these Helm charts? For secret provisioning, our Helm charts default to a `lookup` method which requires access to the underlying Kubernetes cluster. If that access is somehow not direct, then it will not work. To fix this issue I would recommend setting global.datahub.encryptionKey.provisionSecret.autoGenerate and global.datahub.metadata_service_authentication.provisionSecrets.autoGenerate to false. This will force you to either specify the secret values in the `values.yaml` file OR provision the secrets yourself and reference them like this: https://github.com/acryldata/datahub-helm/blob/d56333b25996172ae68c01b4aa2f3d0d7de51b05/charts/datahub/values.yaml#L576 For more information on the Helm lookup function, see: https://helm.sh/docs/chart_template_guide/functions_and_pipelines/#using-the-lookup-function
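A sketch of the lookup-based provisioning pattern (not the chart's exact template) shows why this bites: when the chart is rendered without direct cluster access, as in some GitOps flows, `lookup` returns nothing, so a fresh random key is emitted on every sync, replacing the key that encrypted the existing secrets.
```yaml
# Illustrative only. With no cluster access (e.g. `helm template`), lookup
# returns an empty result and the randAlphaNum branch runs on each render.
{{- $existing := lookup "v1" "Secret" .Release.Namespace "datahub-encryption-secrets" }}
{{- $key := randAlphaNum 32 | b64enc }}
{{- if $existing }}
{{- $key = index $existing.data "encryption_key_secret" }}
{{- end }}
apiVersion: v1
kind: Secret
metadata:
  name: datahub-encryption-secrets
data:
  encryption_key_secret: {{ $key }}
```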
w
Hi @incalculable-ocean-74010, thank you for following up. We deploy Datahub via ArgoCD and all secrets (I mean postgres and ES) are created via ExternalSecretsOperator. The secret “datahub-encryption-secrets” is present on our cluster and I guess previously created via that autoGenerate setting/feature
i
Correct. In which case you should make the encryption secret an external secret and disable auto generation + secret provisioning
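For reference, a sketch of what that could look like with External Secrets Operator (the secret store name and remote key path are hypothetical):
```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: datahub-encryption-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager        # hypothetical store
  target:
    name: datahub-encryption-secrets # the name referenced by the Helm values
  data:
    - secretKey: encryption_key_secret
      remoteRef:
        key: datahub/encryption-key  # hypothetical remote path
```
With this in place, global.datahub.encryptionKey.provisionSecret.enabled and autoGenerate should both be false so the chart never touches the secret.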
w
OK. will do so and get back to you ASAP
What about the datahub-app-gms-secret?
BTW, I applied the changes you suggested. datahub-encryption-secret is now a separately deployed secret
h
@white-guitar-82227 can you tell if this resolved the issue?
w
@helpful-tent-87247 we have now rerun our tests. Restarted the GMS pod. Entered the SINK_TOKEN. Ran ingestions. Restarted the GMS pod again and reran the tests to see if the Pod restart killed the SINK_TOKEN secret. A Postgres ingestion ran fine both times. A MongoDB ingestion failed the second time due to a missing SINK_TOKEN (the same token that the Postgres ingestion ran fine with). The GMS log output:
```
2023-07-11 13:18:31,250 [qtp1645547422-279] INFO  c.l.m.r.entity.AspectResource:171 - INGEST PROPOSAL proposal: {aspectName=dataHubExecutionRequestResult, entityKeyAspect={contentType=application/json, value=ByteString(length=46,bytes=7b226964...6138227d)}, entityType=dataHubExecutionRequest, aspect={contentType=application/json, value=ByteString(length=51,bytes=7b227374...3234367d)}, changeType=UPSERT}
2023-07-11 13:18:31,261 [pool-13-thread-5] INFO  c.l.m.filter.RestliLoggingFilter:55 - POST /aspects?action=ingestProposal - ingestProposal - 200 - 11ms
2023-07-11 13:18:31,269 [ForkJoinPool.commonPool-worker-19] ERROR c.l.d.g.e.DataHubDataFetcherExceptionHandler:21 - Failed to execute DataFetcher
java.util.concurrent.CompletionException: java.lang.RuntimeException: Failed to perform update against input com.linkedin.datahub.graphql.generated.GetSecretValuesInput@6e4557d4
	at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
	at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1702)
	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1692)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
Caused by: java.lang.RuntimeException: Failed to perform update against input com.linkedin.datahub.graphql.generated.GetSecretValuesInput@6e4557d4
	at com.linkedin.datahub.graphql.resolvers.ingest.secret.GetSecretValuesResolver.lambda$get$2(GetSecretValuesResolver.java:87)
	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
	... 6 common frames omitted
Caused by: java.lang.RuntimeException: Failed to decrypt value using provided secret!
	at com.linkedin.metadata.secret.SecretService.decrypt(SecretService.java:80)
	at com.linkedin.datahub.graphql.resolvers.ingest.secret.GetSecretValuesResolver.decryptSecret(GetSecretValuesResolver.java:95)
	at com.linkedin.datahub.graphql.resolvers.ingest.secret.GetSecretValuesResolver.lambda$get$1(GetSecretValuesResolver.java:77)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
	at java.base/java.util.HashMap$ValueSpliterator.forEachRemaining(HashMap.java:1693)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
	at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
	at com.linkedin.datahub.graphql.resolvers.ingest.secret.GetSecretValuesResolver.lambda$get$2(GetSecretValuesResolver.java:85)
	... 7 common frames omitted
Caused by: javax.crypto.BadPaddingException: Given final block not properly padded. Such issues can arise if a bad key is used during decryption.
	at java.base/com.sun.crypto.provider.CipherCore.unpad(CipherCore.java:975)
	at java.base/com.sun.crypto.provider.CipherCore.fillOutputBuffer(CipherCore.java:1056)
	at java.base/com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:853)
	at java.base/com.sun.crypto.provider.AESCipher.engineDoFinal(AESCipher.java:446)
	at java.base/javax.crypto.Cipher.doFinal(Cipher.java:2202)
	at com.linkedin.metadata.secret.SecretService.decrypt(SecretService.java:78)
	... 17 common frames omitted
2023-07-11 13:18:31,269 [ForkJoinPool.commonPool-worker-23] ERROR c.datahub.graphql.GraphQLController:107 - Errors while executing graphQL query: "query getSecretValues($input: GetSecretValuesInput!) {\n\n                getSecretValues(input: $input) {\n\n                    name\n\n                    value\n\n                }\n\n            }", result: {errors=[{message=An unknown error occurred., locations=[{line=3, column=17}], path=[getSecretValues], extensions={code=500, type=SERVER_ERROR, classification=DataFetchingException}}], data={getSecretValues=null}, extensions={tracing={version=1, startTime=2023-07-11T13:18:31.265782Z, endTime=2023-07-11T13:18:31.269727Z, duration=3947655, parsing={startOffset=330873, duration=292201}, validation={startOffset=572721, duration=220571}, execution={resolvers=[{path=[getSecretValues], parentType=Query, returnType=[SecretValue!], fieldName=getSecretValues, startOffset=670785, duration=2645164}]}}}}, errors: [DataHubGraphQLError{path=[getSecretValues], code=SERVER_ERROR, locations=[SourceLocation{line=3, column=17}]}]
2023-07-11 13:18:31,274 [qtp1645547422-279] INFO  c.l.m.r.entity.AspectResource:171 - INGEST PROPOSAL proposal: {aspectName=dataHubExecutionRequestResult, entityKeyAspect={contentType=application/json, value=ByteString(length=46,bytes=7b226964...6138227d)}, entityType=dataHubExecutionRequest, aspect={contentType=application/json, value=ByteString(length=1568,bytes=7b227374...2032377d)}, changeType=UPSERT}
2023-07-11 13:18:31,290 [pool-13-thread-4] INFO  c.l.m.filter.RestliLoggingFilter:55 - POST /aspects?action=ingestProposal - ingestProposal - 200 - 16ms
2023-07-11 13:18:31,691 [I/O dispatcher 1] INFO  c.l.m.s.e.update.BulkListener:47 - Successfully fed bulk request. Number of events: 7 Took time ms: -1
```
h
very interesting that the same secret was available for one ingestion job and not for another
w
Indeed
h
ok so i'm going to set these values:
```
encryptionKey:
      secretRef: "datahub-encryption-secrets"
      secretKey: "encryption_key_secret"
      # Set to false if you'd like to provide your own secret.
      provisionSecret:
        enabled: false
        autoGenerate: false
        annotations: {}
      # Only specify if autoGenerate set to false
      #  secretValues:
      #    encryptionKey: <encryption key value>
```
but was there anything else you did to make sure the datahub-encryption-secret was available as an external secret?
w
I reused the secret value from the time the secret was auto-generated. It’s just that the secret object itself was now deployed externally
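That step matters: the externally managed secret has to carry the same key that originally encrypted the stored values. One way to capture the auto-generated value before switching over, sketched with kubectl:
```sh
# Read the current auto-generated encryption key so it can be re-created
# verbatim as an externally managed secret.
kubectl get secret datahub-encryption-secrets \
  -o jsonpath='{.data.encryption_key_secret}' | base64 -d
```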
The ingestion YAML is exactly the same in both ingestions
@helpful-tent-87247 is there anything you’d need from me apart from the above provided information?
h
we fixed it
w
Hi @helpful-tent-87247, does version 0.10.5 contain this fix? So far we cannot confirm our problem got resolved
w
Hi @helpful-tent-87247 - is this fix part of version 0.11.0? We (@white-guitar-82227 is a colleague of mine) were still seeing the issue with v0.10.5
b
Hi @helpful-tent-87247, can you elaborate on what you did? Simply saying you fixed it, and nothing else, really doesn’t help anyone else who is facing this issue.
w
@bumpy-manchester-97826 as for our part we can confirm that the problem is gone with release v0.11.0
b
Thanks for the reply @white-guitar-82227 We’re on v0.12.0 now and everything is working as expected