# ingestion
m
Hi all..... I've got a strange problem which I hope someone can help with. I can successfully ingest a table using a recipe and the CLI and can view this table on the UI. However, when I try to ingest the same table using an Airflow DAG, I get the following error trying to view it on the UI:
And the following error appears in the datahub UI log:
05:42:06 [Thread-14773] ERROR c.l.datahub.graphql.GmsGraphQLEngine - Failed to load Entities of type: Dataset, keys: [urn:li:dataset:(urn:li:dataPlatform:redshift,datalake.dms_fxcu_arbor_fxcu_arbor.cmf_dms,DEV)] Failed to batch load Datasets
05:42:06 [Thread-14773] WARN  n.g.e.SimpleDataFetcherExceptionHandler - Exception while fetching data (/browse/entities) : java.lang.RuntimeException: Failed to retrieve entities of type Dataset
java.util.concurrent.CompletionException: java.lang.RuntimeException: Failed to retrieve entities of type Dataset
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Failed to retrieve entities of type Dataset
	at com.linkedin.datahub.graphql.GmsGraphQLEngine.lambda$null$102(GmsGraphQLEngine.java:719)
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
	... 1 common frames omitted
Caused by: java.lang.RuntimeException: Failed to batch load Datasets
	at com.linkedin.datahub.graphql.types.dataset.DatasetType.batchLoad(DatasetType.java:102)
	at com.linkedin.datahub.graphql.GmsGraphQLEngine.lambda$null$102(GmsGraphQLEngine.java:716)
	... 2 common frames omitted
Caused by: java.lang.NullPointerException: null
05:42:06 [application-akka.actor.default-dispatcher-97029] ERROR react.controllers.GraphQLController - Errors while executing graphQL query: "query getBrowseResults($input: BrowseInput!) {\n  browse(input: $input) {\n    entities {\n      urn\n      type\n      ... on Dataset {\n        name\n        origin\n        description\n        platform {\n          name\n          info {\n            logoUrl\n            __typename\n          }\n          __typename\n        }\n        tags\n        ownership {\n          ...ownershipFields\n          __typename\n        }\n        globalTags {\n          ...globalTagsFields\n          __typename\n        }\n        __typename\n      }\n      ... on Dashboard {\n        urn\n        type\n        tool\n        dashboardId\n        info {\n          name\n          description\n          externalUrl\n          access\n          lastModified {\n            time\n            __typename\n          }\n          __typename\n        }\n        ownership {\n          ...ownershipFields\n          __typename\n        }\n        globalTags {\n          ...globalTagsFields\n          __typename\n        }\n        __typename\n      }\n      ... on GlossaryTerm {\n        name\n        ownership {\n          ...ownershipFields\n          __typename\n        }\n        glossaryTermInfo {\n          definition\n          termSource\n          sourceRef\n          sourceUrl\n          customProperties {\n            key\n            value\n            __typename\n          }\n          __typename\n        }\n        __typename\n      }\n      ... 
on Chart {\n        urn\n        type\n        tool\n        chartId\n        info {\n          name\n          description\n          externalUrl\n          type\n          access\n          lastModified {\n            time\n            __typename\n          }\n          __typename\n        }\n        ownership {\n          ...ownershipFields\n          __typename\n        }\n        globalTags {\n          ...globalTagsFields\n          __typename\n        }\n        __typename\n      }\n      ... on DataFlow {\n        urn\n        type\n        orchestrator\n        flowId\n        cluster\n        info {\n          name\n          description\n          project\n          __typename\n        }\n        ownership {\n          ...ownershipFields\n          __typename\n        }\n        globalTags {\n          ...globalTagsFields\n          __typename\n        }\n        __typename\n      }\n      ... on MLFeatureTable {\n        urn\n        type\n        name\n        description\n        featureTableProperties {\n          description\n          mlFeatures {\n            urn\n            __typename\n          }\n          mlPrimaryKeys {\n            urn\n            __typename\n          }\n          __typename\n        }\n        ownership {\n          ...ownershipFields\n          __typename\n        }\n        platform {\n          name\n          info {\n            logoUrl\n            __typename\n          }\n          __typename\n        }\n        __typename\n      }\n      ... on MLModel {\n        name\n        origin\n        description\n        tags\n        ownership {\n          ...ownershipFields\n          __typename\n        }\n        globalTags {\n          ...globalTagsFields\n          __typename\n        }\n        platform {\n          name\n          info {\n            logoUrl\n            __typename\n          }\n          __typename\n        }\n        __typename\n      }\n      ... 
on MLModelGroup {\n        name\n        origin\n        description\n        ownership {\n          ...ownershipFields\n          __typename\n        }\n        platform {\n          name\n          info {\n            logoUrl\n            __typename\n          }\n          __typename\n        }\n        __typename\n      }\n      __typename\n    }\n    groups {\n      name\n      count\n      __typename\n    }\n    start\n    count\n    total\n    metadata {\n      path\n      totalNumEntities\n      __typename\n    }\n    __typename\n  }\n}\n\nfragment ownershipFields on Ownership {\n  owners {\n    owner {\n      ... on CorpUser {\n        urn\n        type\n        username\n        info {\n          active\n          displayName\n          title\n          email\n          firstName\n          lastName\n          fullName\n          __typename\n        }\n        editableInfo {\n          pictureLink\n          __typename\n        }\n        __typename\n      }\n      ... on CorpGroup {\n        urn\n        type\n        name\n        info {\n          email\n          admins {\n            urn\n            username\n            info {\n              active\n              displayName\n              title\n              email\n              firstName\n              lastName\n              fullName\n              __typename\n            }\n            editableInfo {\n              pictureLink\n              teams\n              skills\n              __typename\n            }\n            __typename\n          }\n          members {\n            urn\n            username\n            info {\n              active\n              displayName\n              title\n              email\n              firstName\n              lastName\n              fullName\n              __typename\n            }\n            editableInfo {\n              pictureLink\n              teams\n              skills\n              __typename\n            }\n            __typename\n         
 }\n          groups\n          __typename\n        }\n        __typename\n      }\n      __typename\n    }\n    type\n    __typename\n  }\n  lastModified {\n    time\n    __typename\n  }\n  __typename\n}\n\nfragment globalTagsFields on GlobalTags {\n  tags {\n    tag {\n      urn\n      name\n      description\n      __typename\n    }\n    __typename\n  }\n  __typename\n}\n", result: {errors=[{message=Exception while fetching data (/browse/entities) : java.lang.RuntimeException: Failed to retrieve entities of type Dataset, locations=[{line=3, column=5}], path=[browse, entities], extensions={classification=DataFetchingException}}], data={browse=null}}, errors: [ExceptionWhileDataFetching{path=[browse, entities], exception=java.util.concurrent.CompletionException: java.lang.RuntimeException: Failed to retrieve entities of type Dataset, locations=[SourceLocation{line=3, column=5}]}]
And the following entries in the gms log:
05:42:01.699 [qtp544724190-76] INFO  c.l.m.r.entity.EntityResource - GET urn:li:corpuser:datahub
05:42:01.703 [qtp544724190-13] INFO  c.l.m.r.entity.EntityResource - GET BROWSE RESULTS for dataset at path /dev/redshift
05:42:01.703 [pool-11-thread-1] INFO  c.l.metadata.filter.LoggingFilter - GET /entities/urn%3Ali%3Acorpuser%3Adatahub - get - 200 - 4ms
05:42:01.706 [qtp544724190-202] INFO  c.l.m.r.entity.EntityResource - GET urn:li:corpuser:datahub
05:42:01.708 [pool-11-thread-1] INFO  c.l.metadata.filter.LoggingFilter - POST /entities?action=browse - browse - 200 - 5ms
05:42:01.710 [pool-11-thread-1] INFO  c.l.metadata.filter.LoggingFilter - GET /entities/urn%3Ali%3Acorpuser%3Adatahub - get - 200 - 4ms
05:42:01.712 [I/O dispatcher 1] INFO  c.l.m.k.e.ElasticsearchConnector - Successfully feeded bulk request. Number of events: 1 Took time ms: -1
05:42:01.719 [I/O dispatcher 1] INFO  c.l.m.k.e.ElasticsearchConnector - Successfully feeded bulk request. Number of events: 1 Took time ms: -1
05:42:04.269 [qtp544724190-212] INFO  c.l.m.r.entity.EntityResource - GET urn:li:corpuser:datahub
05:42:04.271 [qtp544724190-204] INFO  c.l.m.r.entity.EntityResource - GET BROWSE RESULTS for dataset at path /dev/redshift/datalake
05:42:04.273 [pool-11-thread-1] INFO  c.l.metadata.filter.LoggingFilter - GET /entities/urn%3Ali%3Acorpuser%3Adatahub - get - 200 - 4ms
05:42:04.274 [qtp544724190-77] INFO  c.l.m.r.entity.EntityResource - GET urn:li:corpuser:datahub
05:42:04.275 [pool-11-thread-1] INFO  c.l.metadata.filter.LoggingFilter - POST /entities?action=browse - browse - 200 - 4ms
05:42:04.278 [pool-11-thread-1] INFO  c.l.metadata.filter.LoggingFilter - GET /entities/urn%3Ali%3Acorpuser%3Adatahub - get - 200 - 4ms
05:42:04.281 [I/O dispatcher 1] INFO  c.l.m.k.e.ElasticsearchConnector - Successfully feeded bulk request. Number of events: 1 Took time ms: -1
05:42:04.288 [I/O dispatcher 1] INFO  c.l.m.k.e.ElasticsearchConnector - Successfully feeded bulk request. Number of events: 1 Took time ms: -1
05:42:06.750 [qtp544724190-13] INFO  c.l.m.r.entity.EntityResource - GET urn:li:corpuser:datahub
05:42:06.752 [qtp544724190-12] INFO  c.l.m.r.entity.EntityResource - GET BROWSE RESULTS for dataset at path /dev/redshift/datalake/dms_fxcu_arbor_fxcu_arbor
05:42:06.754 [pool-11-thread-1] INFO  c.l.metadata.filter.LoggingFilter - GET /entities/urn%3Ali%3Acorpuser%3Adatahub - get - 200 - 4ms
05:42:06.756 [qtp544724190-67] INFO  c.l.m.r.entity.EntityResource - GET urn:li:corpuser:datahub
05:42:06.758 [pool-11-thread-1] INFO  c.l.metadata.filter.LoggingFilter - POST /entities?action=browse - browse - 200 - 6ms
05:42:06.759 [pool-11-thread-1] INFO  c.l.metadata.filter.LoggingFilter - GET /entities/urn%3Ali%3Acorpuser%3Adatahub - get - 200 - 3ms
05:42:06.759 [qtp544724190-10] INFO  c.l.m.r.entity.EntityResource - BATCH GET [urn:li:dataset:(urn:li:dataPlatform:redshift,datalake.dms_fxcu_arbor_fxcu_arbor.cmf_dms,DEV)]
05:42:06.762 [I/O dispatcher 1] INFO  c.l.m.k.e.ElasticsearchConnector - Successfully feeded bulk request. Number of events: 1 Took time ms: -1
05:42:06.765 [pool-11-thread-1] INFO  c.l.metadata.filter.LoggingFilter - GET /entities?ids=List(urn%3Ali%3Adataset%3A%28urn%3Ali%3AdataPlatform%3Aredshift%2Cdatalake.dms_fxcu_arbor_fxcu_arbor.cmf_dms%2CDEV%29) - batchGet - 200 - 6ms
05:42:06.769 [I/O dispatcher 1] INFO  c.l.m.k.e.ElasticsearchConnector - Successfully feeded bulk request. Number of events: 1 Took time ms: -1
I am able to successfully get details on both occurrences of this table using a curl request:
curl --location --request GET 'http://<MY_SERVER>:8080/entities/urn%3Ali%3Adataset%3A(urn%3Ali%3AdataPlatform%3Aredshift%2Cdatalake.dms_fxcu_arbor_fxcu_arbor.cmf_dms%2CDEV)'
'DEV' is the table I added using Airflow (and it doesn't work on the UI), and 'PROD' is the table I added using the CLI (and it works on the UI)....
Does anyone have any ideas?
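For anyone reproducing the curl check above, here is a minimal Python sketch that builds the same kind of entity URL for both the DEV and PROD URNs. The `entity_url` helper and the `gms_base` placeholder are illustrative only, not part of DataHub. Note that `urllib.parse.quote` also percent-encodes the parentheses, which matches the form that shows up in the gms log rather than the curl command.

```python
from urllib.parse import quote

# The platform and table name are taken from the thread; gms_base is a placeholder.
TABLE = "datalake.dms_fxcu_arbor_fxcu_arbor.cmf_dms"


def dataset_urn(platform: str, name: str, origin: str) -> str:
    """Build a dataset URN in the shape seen in the logs above."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{origin})"


def entity_url(gms_base: str, urn: str) -> str:
    """Hypothetical helper: percent-encode the URN and append it to /entities/."""
    return f"{gms_base.rstrip('/')}/entities/{quote(urn, safe='')}"


if __name__ == "__main__":
    for origin in ("DEV", "PROD"):
        print(entity_url("http://<MY_SERVER>:8080", dataset_urn("redshift", TABLE, origin)))
```

The printed URLs can be passed to curl or any HTTP client; only the origin segment differs between the two.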
b
hey simon. i can see there's an NPE in the frontend server... will take a look!
are you on the latest version?
Also do you mind pasting what you get back on your CURL request shown ^? Of course feel free to redact anything that may be sensitive
m
Also @magnificent-camera-71872: do you have the same pip package version in your Airflow installation as the local recipe?
m
oh.. that's interesting. On the airflow worker:
acryl-datahub==0.8.10.0
On the cli:
acryl-datahub==0.8.8.0
The UI was installed at a similar time as the CLI machine and is most likely at a similar level. Is there any way I can check?
@big-carpet-38439 I'm not sure I can share the schema on a public forum, but I saved the output from both curl requests and then diff'd the 2 files:
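A quick way to compare the client-side package on each machine (this does not tell you the UI/GMS server version) is to ask Python's own package metadata. This is a small sketch using only the standard library; run it with the same interpreter that Airflow or the CLI uses. The `installed_version` helper name is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "acryl-datahub") -> Optional[str]:
    """Return the installed version of a pip package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Run once on the Airflow worker and once on the CLI machine, then compare.
    print("acryl-datahub:", installed_version())
```

If the two machines print different versions, as they do in this thread, the emitted metadata may differ as well.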
diff cmf_dms_cli_formatted.json cmf_dms_airflow_formatted.json
4c4
<             "urn": "urn:li:dataset:(urn:li:dataPlatform:redshift,datalake.dms_fxcu_arbor_fxcu_arbor.cmf_dms,PROD)",
---
>             "urn": "urn:li:dataset:(urn:li:dataPlatform:redshift,datalake.dms_fxcu_arbor_fxcu_arbor.cmf_dms,DEV)",
8c8
<                         "origin": "PROD",
---
>                         "origin": "DEV",
13a14,20
>                     "com.linkedin.common.BrowsePaths": {
>                         "paths": [
>                             "/dev/redshift/datalake/dms_fxcu_arbor_fxcu_arbor/cmf_dms"
>                         ]
>                     }
>                 },
>                 {
31a39
>                                 "isPartOfKey": false,
42a51
>                                 "isPartOfKey": false,
53a63
>                                 "isPartOfKey": false,
64a75
>                                 "isPartOfKey": false,
75a87
>                                 "isPartOfKey": false,
.....
.....
>                                 "isPartOfKey": false,
1252a1371
>                                 "isPartOfKey": false,
1266,1272d1384
<                     }
<                 },
<                 {
<                     "com.linkedin.common.BrowsePaths": {
<                         "paths": [
<                             "/prod/redshift/datalake/dms_fxcu_arbor_fxcu_arbor/cmf_dms"
<                         ]
`isPartOfKey` seems like it could be the culprit. It's in the JSON for the dataset added from Airflow, but not in the one added from the CLI. Was there a change in this area between acryl-datahub==0.8.8.0 and acryl-datahub==0.8.10.0?
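When the JSON is too large or too sensitive to eyeball, comparing the sets of key paths can surface a field like `isPartOfKey` directly. The sketch below is a generic structural comparison, not a DataHub tool; the two sample documents are made-up stand-ins for the saved curl outputs, and the field names other than `isPartOfKey` are placeholders.

```python
def key_paths(node, prefix=""):
    """Collect the set of dotted key paths in a nested JSON-like structure."""
    paths = set()
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= key_paths(value, path)
    elif isinstance(node, list):
        for item in node:
            # List indices are dropped so repeated fields collapse into one path.
            paths |= key_paths(item, prefix + "[]")
    return paths


def extra_keys(left, right):
    """Key paths that appear in `right` but not in `left`, sorted."""
    return sorted(key_paths(right) - key_paths(left))


# Made-up stand-ins for the two saved responses (assumed shape, not real output).
cli_doc = {"fields": [{"fieldPath": "id", "nativeDataType": "int"}]}
airflow_doc = {"fields": [{"fieldPath": "id", "nativeDataType": "int", "isPartOfKey": False}]}

if __name__ == "__main__":
    print(extra_keys(cli_doc, airflow_doc))
```

With real files, load each one with `json.load` and pass the resulting objects to `extra_keys` in both directions.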
b
So origin is also different.. is airflow producing the "PROD" one?
m
airflow was producing the DEV one....
b
okay got it
let me look at the isPartOfKey thing..
You mentioned you can view the PROD one in the UI, right?
m
I installed acryl-datahub==0.8.8.0 on the airflow server and re-ran the DAG... and hey-ho, I can now see the metadata on the UI 🙂
yeah, it was the PROD one I could see on the UI....
seems there are some compatibility issues between those 2 levels and the UI....
b
yep - great catch
this was recently added indeed. i'll follow up to figure out a fix! thanks for reporting simon
m
nws.... thanks @big-carpet-38439 & @mammoth-bear-12532 for fantastic support - you guys are so quick to respond 🙂