creamy-machine-95935
03/23/2023, 4:46 PM
white-shampoo-69122
03/23/2023, 5:16 PM
fieldPaths as version 2:
fields:
  0:
    description: null
    fieldPath: "[version=2.0].[type=string].a_column_x"
    globalTags: null
    glossaryTerms: null
    isPartOfKey: false
    jsonPath: null
    label: null
    nativeDataType: "string"
    nullable: true
    recursive: false
    type: "STRING"
    __typename: "SchemaField"
While the dbt sibling entity has fieldPaths version 1(?):
siblings:
  isPrimary: false
  siblings:
    0:
      ...
      schemaMetadata:
        fields:
          ...
          60:
            description: "A description for column x"
            fieldPath: "a_column_x"
            globalTags: null
            glossaryTerms: null
            isPartOfKey: false
            jsonPath: null
            label: null
            nativeDataType: "varchar"
            nullable: false
            recursive: false
            type: "STRING"
            __typename: "SchemaField"
Not really sure how it was before, but I found that interesting.
Also might be worth mentioning that we have enabled stateful ingestion for both dbt and Glue, and that it worked well before.
Any ideas what might be going wrong?
flaky-portugal-377
03/23/2023, 6:32 PM
acceptable-football-40437
03/23/2023, 6:39 PM
urn:li:container:<alphanumeric-string>
, but from the docs it seems more human-readable (and programmatically guessable) versions of URNs exist. How does one put together the latter type of URN for, say, a BigQuery dataset?
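A minimal sketch of building such a human-readable dataset URN with the DataHub Python SDK's make_dataset_urn helper; the project, dataset, and table names below are hypothetical placeholders:
# Minimal sketch, assuming the acryl-datahub package is installed.
from datahub.emitter.mce_builder import make_dataset_urn

# For BigQuery the dataset name is typically "project.dataset.table" (assumption).
urn = make_dataset_urn(platform="bigquery", name="my-project.my_dataset.my_table", env="PROD")
print(urn)
# urn:li:dataset:(urn:li:dataPlatform:bigquery,my-project.my_dataset.my_table,PROD)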
If this belongs in a different channel, I'll happily cross-post!
strong-hospital-52301
03/23/2023, 6:45 PM
bumpy-activity-74405
03/24/2023, 7:01 AM
HTTP header value exceeds the configured limit of 8192 characters
in the frontend. I was able to work around it with the env variables introduced in this PR, and it worked on version v0.8.44.
After upgrading to v0.9.6.1, the issue is back. I suspect it has to do with renaming the configuration option in this commit. Not sure why it was done, since the Akka documentation states it should be max-header-value-length.
swift-dream-78272
03/24/2023, 1:53 PM
from datahub.ingestion.run.pipeline import Pipeline
# The pipeline configuration is similar to the recipe YAML files provided to the CLI tool.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "mysql",
            "config": {
                "username": "user",
                "password": "pass",
                "database": "db_name",
                "host_port": "localhost:3306",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "${DATAHUB_GMS_URL}"},
        },
    }
)
# Run the pipeline and report the results.
pipeline.run()
pipeline.pretty_print_summary()
brief-bear-90340
03/24/2023, 4:07 PM
[12:05 PM] DefaultCredentialsError: ('Failed to load service account credentials from /tmp/tmp53gmekv8', ValueError('Could not deserialize key data. The data may be in an incorrect format, it may be encrypted with an unsupported algorithm, or it may be an unsupported key type (e.g. EC curves with explicit parameters).', [<OpenSSLError(code=503841036, lib=60, reason=524556, reason_text=unsupported)>]))
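A quick local sanity check, as a sketch under assumptions: it uses the google-auth library rather than anything DataHub-specific, and "service_account.json" is a hypothetical path to the same key file fed to the ingestion recipe:
# Verify that the service-account key file itself can be parsed by google-auth.
from google.oauth2 import service_account

creds = service_account.Credentials.from_service_account_file("service_account.json")
print(creds.service_account_email)  # if this parses, the key format itself is fine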
Any help with this would be appreciated.
rapid-zoo-88437
03/25/2023, 3:28 AM
import org.apache.spark.sql.SaveMode

val ds1 = spark.read
.format("jdbc")
.option("driver","com.mysql.cj.jdbc.Driver")
.option("url", "jdbc:mysql://{myhost}:3306/xxx")
.option("dbtable", "Persons")
.option("user", "xxx")
.option("password", "xxx")
.load()
ds1.write.mode(SaveMode.Append)
.format("jdbc")
.option("driver","com.mysql.cj.jdbc.Driver")
.option("url", "jdbc:mysql://{myhost}:3306/xxx")
.option("dbtable", "Persons1")
.option("user", "xxx")
.option("password", "xxx")
.save()
ds1.write.mode(SaveMode.Append)
.format("jdbc")
.option("driver","com.mysql.cj.jdbc.Driver")
.option("url", "jdbc:mysql://{myhost}:3306/xxx")
.option("dbtable", "Persons2")
.option("user", "xxx")
.option("password", "xxx")
.save()
fresh-cricket-75926
03/27/2023, 11:35 AM
wonderful-quill-11255
03/27/2023, 12:57 PM
vX.Y.Z
that the regular datahub code does.
Up until now that hasn't mattered a lot since that value was mainly used in the UI to show the version running.
But it seems that recently this value has become more important, controlling a step in the bootstrap process.
I'm wondering if anyone else has encountered this and how you chose to deal with it.
Best regards
limited-refrigerator-50812
03/27/2023, 3:59 PM
./gradlew quickstartDebug --stacktrace -x yarnTest -x yarnLint
I get an error that I don't know how to deal with. Including the error message(s) below. Any idea how I can find out what I did wrong?
> Task :datahub-web-react:yarnGenerate
yarn run v1.22.0
$ graphql-codegen --config codegen.yml
(node:4132) ExperimentalWarning: stream/web is an experimental feature. This feature could change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
[15:13:03] Parse configuration [started]
[15:13:03] Parse configuration [completed]
[15:13:03] Generate outputs [started]
[15:13:03] Generate src/types.generated.ts [started]
[15:13:03] Generate to src/ (using EXPERIMENTAL preset "near-operation-file") [started]
[15:13:03] Load GraphQL schemas [started]
[15:13:03] Load GraphQL schemas [started]
[15:13:03] Load GraphQL schemas [failed]
[15:13:03] → Failed to load schema
[15:13:03] Generate to src/ (using EXPERIMENTAL preset "near-operation-file") [failed]
[15:13:03] → Failed to load schema
[15:13:03] Load GraphQL schemas [failed]
[15:13:03] → Failed to load schema
[15:13:03] Generate src/types.generated.ts [failed]
[15:13:03] → Failed to load schema
[15:13:03] Generate outputs [failed]
Something went wrong
error Command failed with exit code 1.
info Visit <https://yarnpkg.com/en/docs/cli/run> for documentation about this command.
> :datahub-web-react:yarnGenerate
> Task :datahub-web-react:yarnGenerate FAILED
> :datahub-web-react:yarnGenerate
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for task ':datahub-web-react:yarnGenerate'.
> Process 'command '/mnt/c/Users/dries528/Documents/Code/datahub_fresh/datahub/datahub-web-react/.gradle/yarn/yarn-v1.22.0/bin/yarn'' finished with non-zero exit value 1
* Try:
Run with --info or --debug option to get more log output. Run with --scan to get full insights.
bulky-grass-52762
03/27/2023, 6:59 PM
0.9.3
to 0.10.1
, we discovered that certain nodes in the lineage UI have disappeared. These nodes were not entities themselves, but rather were connected to other entities as upstream/downstream dependencies.
For example, in our use case as attached in the screenshot, we used the s3 lineage aspect to complete the flow of hive -> s3 -> redshift, but that flow seems to be broken because in 0.10.1
the lineage aspects seem to be missing in the lineage UI. I believe this is because of the implementation of showing an error message if the entity is not found. IMHO, this shouldn’t have impacted the nodes in the lineage UI, since the original redshift ingestion is still offloading the related s3 upstream lineage aspect without the entity itself.
TIA for your future efforts looking at this, thank you!
cuddly-butcher-39945
03/27/2023, 10:26 PM
numerous-account-62719
03/28/2023, 6:28 AM
microscopic-room-90690
03/28/2023, 8:01 AM
bumpy-activity-74405
03/28/2023, 10:04 AM
v0.9.6.1
. Having issues with the download csv feature when trying to download ~5k datasets. It's my understanding that it tries to batch these queries in chunks of 1000, but each chunk takes longer and longer until I get a timeout error in gms:
09:50:18.820 [qtp71399214-1199] WARN o.s.w.s.m.s.DefaultHandlerExceptionResolver:208 - Resolved [org.springframework.web.context.request.async.AsyncRequestTimeoutException]
I am getting similar results when trying to run a GraphQL query - if I set the count
to 10000 (everything in one chunk) it times out. If I try to batch my queries using offsets (start/count)
I can observe that with increasing offsets I also get increasing query run times, which eventually time out when reaching 30s. Is there something that I could do about this - increase the timeout somehow, or should I somehow scale Elasticsearch?
bright-morning-76046
03/28/2023, 11:47 AM
Unable to run quickstart - the following issues were detected:
- datahub-gms is running but not yet healthy
- datahub-upgrade is still running
If you think something went wrong, please file an issue at <https://github.com/datahub-project/datahub/issues>
or send a message in our Slack <https://slack.datahubproject.io/>
Be sure to attach the logs from /var/folders/c2/3gbwy5wj5dbfvgjzz3kctd000000gp/T/tmpsdxu45ta.log
My version is DataHub CLI version: 0.10.1
Thank you so much!
mysterious-advantage-78411
03/28/2023, 12:36 PM
best-wire-59738
03/29/2023, 1:50 AM
fierce-monkey-46092
03/29/2023, 6:40 AM
busy-mechanic-8014
03/29/2023, 9:00 AM
metadata_service_authentication:
  enabled: true
  systemClientId: "__datahub_system"
  systemClientSecret:
    secretRef: "datahub-auth-secrets"
    secretKey: "token_service_signing_key"
  tokenService:
    signingKey:
      secretRef: "datahub-auth-secrets"
      secretKey: "token_service_signing_key"
    salt:
      secretRef: "datahub-auth-secrets"
      secretKey: "token_service_salt"
  # Set to false if you'd like to provide your own auth secrets
  provisionSecrets:
    enabled: true
    autoGenerate: true
  # Only specify if autoGenerate set to false
  # secretValues:
  #   secret: <secret value>
  #   signingKey: <signing key value>
  #   salt: <salt value>
=> I now have a secret with token_service_signing_key: f2E0BZoNKlr7CEu71kjZjAduRNCsePKS
Create the access token programmatically
• Decode an access token created on the UI and get the payload
{
"actorType": "USER",
"actorId": "datahub",
"type": "PERSONAL",
"version": "2",
"jti": "6ec82917-d39a-4c52-9a5e-5d4caacf6b7d",
"sub": "datahub",
"exp": 1680015431,
"iss": "datahub-metadata-service"
}
• I validated the service key by recreating the token by my own means (just used https://jwt.io/ with payload, header and token signing key)
• Create a new token in Python
import jwt

# I noticed that you have to encode the service key in ASCII to get the same verified signature
# as the token created on the UI (anyway I tested with or without for the same result)
secret_signing_key = "f2E0BZoNKlr7CEu71kjZjAduRNCsePKS".encode('ascii')
payload = {
    "actorType": "USER",
    "actorId": "datahub",
    "type": "PERSONAL",
    "version": "2",
    "jti": "6ec82917-d39a-4c52-9a5e-5d4caacf6b7d",
    "sub": "datahub",
    "exp": 1680015431,
    "iss": "datahub-metadata-service"
}
header = { "alg": "HS256" }
token = jwt.encode(payload, secret_signing_key, headers=header)
print(token)
eyJhbGciOiJIUzI1NiJ9…
• Decode my new access token to check if it is well built => all looks good
cURL (curl proposed when creating a token on the UI)
curl -X POST "<http://datahub-front-url/api/graphql>" --header 'Authorization: Bearer eyJhbGciOiJIUzI1NiJ9… ' --header 'Content-Type: application/json' --data-raw '{"query": "{\n me {\n corpUser {\n username\n }\n }\n}","variables":{}}'
=> HTTP ERROR 401 Unauthorized to perform this action
Datahub API
datahub ingest -c /tmp/ch_recipe.yml
ch_recipe.yml:
source:
  type: clickhouse
  config:
    host_port: "clickhouse-install.clickhouse.svc.cluster.local:8123"
    username: ****
    password: ****
    platform_instance: DatabaseNameToBeIngested
    include_views: true
    include_tables: true
sink:
  type: "datahub-rest"
  config:
    server: "<http://datahub-gms.datahub.svc.cluster.local:8080>"
    token: "eyJhbGciOiJIUzI1NiJ9…."
=> 401 Client Error: Unauthorized for url
All works fine if I put a token created on the UI.
Questions
Has anyone managed to create a token programmatically and used it for queries? Is it really possible to do that now?
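For comparison, a hedged sketch of requesting a token from the metadata service itself via the createAccessToken GraphQL mutation, rather than signing one locally; the GMS URL, the existing token, and the token name are placeholders, and the mutation fields follow the DataHub token-management docs as I understand them:
import requests

# Hypothetical GMS endpoint and an existing valid token (e.g. one created in the UI).
GMS_GRAPHQL = "http://datahub-gms.datahub.svc.cluster.local:8080/api/graphql"
EXISTING_TOKEN = "eyJhbGciOiJIUzI1NiJ9..."

mutation = """
mutation {
  createAccessToken(input: {
    type: PERSONAL,
    actorUrn: "urn:li:corpuser:datahub",
    duration: ONE_MONTH,
    name: "programmatic-token"
  }) {
    accessToken
  }
}
"""

resp = requests.post(
    GMS_GRAPHQL,
    json={"query": mutation},
    headers={"Authorization": f"Bearer {EXISTING_TOKEN}"},
)
print(resp.json())  # the new token is under data.createAccessToken.accessToken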
I also noticed (if I understood correctly) that if I create a token via the UI, retrieve it, but delete it immediately afterwards, that effectively simulates creating the token programmatically, and I get this same result. If we can really create our own token with the token signing key, we should be able to use this token (present or not in the UI) to query DataHub. On my side it doesn't work.
I remain available if you need more information! 🙂
Thanks for your time and I hope someone can help me out!
astonishing-dusk-99990
03/29/2023, 9:21 AM
datahub-frontend:
  enabled: true
  image:
    repository: linkedin/datahub-frontend-react
    tag: "v0.10.0" # defaults to .global.datahub.version
  resources:
    limits:
      memory: 1400Mi
    requests:
      cpu: 100m
      memory: 512Mi
  # Set up ingress to expose react front-end
  ingress:
    enabled: false
  oidcAuthentication: # OIDC auth based on <https://datahubproject.io/docs/authentication/guides/sso/configure-oidc-react>
    enabled: false
  extraEnvs:
    - name: AUTH_JAAS_ENABLED
      value: "true"
    - name: AUTH_OIDC_ENABLED
      value: "true"
    - name: AUTH_OIDC_CLIENT_ID
      value: "your_oidc_client_id"
    - name: AUTH_OIDC_CLIENT_SECRET
      value: your_client_secret
    - name: AUTH_OIDC_DISCOVERY_URI
      value: "<https://accounts.google.com/.well-known/openid-configuration>"
    - name: AUTH_OIDC_BASE_URL
      value: "<http://localhost:9002>"
    - name: AUTH_OIDC_USER_NAME_CLAIM
      value: "email"
    - name: AUTH_OIDC_USER_NAME_CLAIM_REGEX
      value: "([^@]+)"
  extraVolumes:
    - name: datahub-users
      secret:
        defaultMode: 0444
        secretName: datahub-users-secret
  extraVolumeMounts:
    - name: datahub-users
      mountPath: /datahub-frontend/conf/user.props
      #mountPath: /etc/datahub/plugins/frontend/auth/user.props
      subPath: user.props
And then I followed this article to set up Google, and I already set up my Authorized JavaScript Origins and Authorized Redirect URLs as in the attachment below.
However, when I tested it, Google sign-in showed both my personal Gmail and my work Gmail. First I tried with my personal Gmail and the result was as expected (access blocked), but when I use my work Gmail the connection is always refused, as in the attachment below.
My question: what's the problem here? Can anyone help me?
Notes:
• I already allow port 9002 in the firewall rule
• My image version is 0.10.0
• Deployed using the Helm chart on a Kubernetes cluster
powerful-cat-68806
03/29/2023, 10:26 AM
helm upgrade
from my local, but I’m not seeing the latest updates from the announcement
This is my chart
apiVersion: v2
name: jfrog-datahub
description: A Helm chart for Acryl DataHub
type: application
version: 0.0.1
appVersion: latest #0.3.1
dependencies:
  - name: datahub
    version: 0.2.148
    repository: <https://helm.datahubproject.io>
icy-flag-80360
03/29/2023, 12:10 PM
Suppressed: org.elasticsearch.client.ResponseException: method [POST], host [<http://elasticsearch-master:9200>], URI [/datahubpolicyindex_v2/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&ignore_throttled=true&search_type=query_then_fetch&batched_reduce_size=512&ccs_minimize_roundtrips=true], status line [HTTP/1.1 400 Bad Request] {"error":{"root_cause":[{"type":"query_shard_exception","reason":"[simple_query_string] analyzer [query_word_delimited] not found","index_uuid":"GZJPC-CBTtekUqWRmtZGfA","index":"datahubpolicyindex_v2"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"datahubpolicyindex_v2","node":"MhPwkRJ4T8WYHY0QwONrOg","reason":{"type":"query_shard_exception","reason":"[simple_query_string] analyzer [query_word_delimited] not found","index_uuid":"GZJPC-CBTtekUqWRmtZGfA","index":"datahubpolicyindex_v2"}}]},"status":400}
But if I check with curl from the GMS pod, all is OK - Elasticsearch returns data with the existing policies, but without any index_uuid. Example:
curl -XGET 'http://elasticsearch-master:9200/datahubpolicyindex_v2/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&ignore_throttled=true&search_type=query_then_fetch&batched_reduce_size=512&ccs_minimize_roundtrips=true'
Is there any way to repair it? I've tried many ways to recover, including fully erasing the Elasticsearch data.
adventurous-waiter-4058
03/29/2023, 12:16 PM
microscopic-leather-94537
03/29/2023, 12:17 PM
fast-midnight-10167
03/29/2023, 1:31 PM
MetadataChangeProposalWrapper, emit changes to add new custom properties. But the problem is, when I give the entityUrn the name (via make_dataset_urn), it treats the filepath both as the filepath and the name of the object. So instead of ending up with <env>/<folderpath>/<obj_name> in DataHub, I end up with that path, but the object name itself includes the folder path as if it were part of the object name.
wide-optician-47025
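A minimal sketch of the pattern fast-midnight-10167 describes, assuming the DataHub Python SDK; the platform, path, and GMS URL are hypothetical placeholders. Note that whatever string is passed as name (here the full folder path) becomes the dataset name:
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# The whole path is used as the dataset name, which is why the UI shows the
# folder path as part of the object name as well.
urn = make_dataset_urn(platform="s3", name="folderpath/obj_name", env="PROD")

mcp = MetadataChangeProposalWrapper(
    entityUrn=urn,
    aspect=DatasetPropertiesClass(customProperties={"my_property": "my_value"}),
)
DatahubRestEmitter(gms_server="http://localhost:8080").emit(mcp)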
03/29/2023, 5:29 PM
glamorous-microphone-33484
03/30/2023, 12:51 AM