# ingestion
h
Hi everyone, trying to add lineage between a data job and a dataset (specifically an S3 location). Is there a convention to follow for the S3 path (i.e. what is usually present in AWS)? I see that a dot convention has been used in the S3 samples ingested in the demo DataHub.
e
It should match whatever is in S3, like urn:li:dataset:(urn:li:dataPlatform:s3,project/root/events/logging_events_bckp,PROD)
These have this syntax because they are backups of the Snowflake datasets of the same name. If you take a look at the urn, it still has the full path rather than just the final name.
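The urn convention described above can be sketched as a small helper (a minimal sketch: the acryl-datahub client ships a similar helper, datahub.emitter.mce_builder.make_dataset_urn, but this just shows the string layout):

```python
# Minimal sketch of the DataHub dataset urn layout for an S3-backed
# dataset. For S3, the dataset name is the bucket/key path rather than
# a dotted table name.

def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a DataHub dataset urn from its three components."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

urn = make_dataset_urn("s3", "project/root/events/logging_events_bckp")
print(urn)
# → urn:li:dataset:(urn:li:dataPlatform:s3,project/root/events/logging_events_bckp,PROD)
```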
h
Thanks for clarifying @early-lamp-41924 !
@early-lamp-41924 - I am trying to ingest lineage between an S3 dataset and a Glue dataset. Though the lineage emission seems successful, I am unable to view the S3 dataset or the lineage.
e
was the s3 dataset ingested?
can you try fetching the “upstreamLineage” aspect for the glue dataset?
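Fetching a single aspect like that can be sketched with the stdlib alone (a hedged sketch: it assumes GMS is reachable at localhost:8080 and uses its GET /aspects/{urn}?aspect=…&version=0 endpoint; the urn and host are illustrative placeholders):

```python
import urllib.parse

# Hedged sketch: build the GMS URL for fetching a stored aspect.
GMS = "http://localhost:8080"

def aspect_url(dataset_urn: str, aspect: str, version: int = 0) -> str:
    # The urn must be fully percent-encoded, including ':', ',' and parens.
    encoded = urllib.parse.quote(dataset_urn, safe="")
    return f"{GMS}/aspects/{encoded}?aspect={aspect}&version={version}"

url = aspect_url(
    "urn:li:dataset:(urn:li:dataPlatform:glue,dbname.table1,PROD)",
    "upstreamLineage",
)
# urllib.request.urlopen(url) would then return the aspect as JSON.
print(url)
```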
h
No, the S3 dataset was not ingested. Would this not create a dataset placeholder?
@early-lamp-41924
GMS Logs -
20:55:05.509 [mae-consumer-job-client-0-C-1] ERROR c.l.m.k.MetadataAuditEventsProcessor:109 - Message: {"auditHeader": null, "oldSnapshot": {"urn": "urn:li:dataset:(urn:li:dataPlatform:glue,dbname.table1,PROD)", "aspects": [{"platform": "urn:li:dataPlatform:glue", "name": "dbname.table2", "origin": "PROD"}, {"upstreams": [{"auditStamp": {"time": 0, "actor": "urn:li:corpuser:unknown", "impersonator": null}, "dataset": "urn:li:dataset:(urn:li:dataPlatform:s3,['s3', 's3path/part-00000-6520dc45-b7b7-42df-95bd-a95c4a2b4220-c000.snappy.parquet'],PROD),", "type": "TRANSFORMED"}]}]}, "oldSystemMetadata": {"lastObserved": 1645735449539, "runId": null, "registryName": null, "registryVersion": null, "properties": null}, "newSnapshot": {"urn": "urn:li:dataset:(urn:li:dataPlatform:glue,dbname.table1,PROD)", "aspects": [{"platform": "urn:li:dataPlatform:glue", "name": "dbname.table2", "origin": "PROD"}, {"upstreams": [{"auditStamp": {"time": 0, "actor": "urn:li:corpuser:unknown", "impersonator": null}, "dataset": "urn:li:dataset:(urn:li:dataPlatform:s3,['s3', 's3path.part-00000-6520dc45-b7b7-42df-95bd-a95c4a2b4220-c000.snappy.parquet'],PROD),", "type": "TRANSFORMED"}]}]}, "newSystemMetadata": {"lastObserved": 1645736105486, "runId": null, "registryName": null, "registryVersion": null, "properties": null}, "operation": "UPDATE"}
e
this urn looks a bit off? urn:li:dataset:(urn:li:dataPlatform:s3,['s3', 's3path.part-00000-6520dc45-b7b7-42df-95bd-a95c4a2b4220-c000.snappy.parquet'],PROD),
@orange-night-91387 is this the new data platform instance urn?
h
We are on version 0.8.19
I also see this error - 20:44:09.562 [mae-consumer-job-client-0-C-1] ERROR c.l.m.k.MetadataAuditEventsProcessor:108 - Error deserializing message: java.lang.RuntimeException: Failed to execute method for class [com.linkedin.dataset.Upstream], field [dataset]
o
@early-lamp-41924 It wouldn't be if they're using patch 19 - platform instance didn't go in until patch 25.
e
Can it be that we have a version mismatch between the ingestion client library and the server?
what is the client version you are using @handsome-football-66174?
h
@early-lamp-41924 - Using version 0.8.19
@orange-night-91387 / @early-lamp-41924 - let me know if we can connect on this
e
Hmm, then I have no idea how that urn came to be. Is this a custom ingestion source?
So by client I mean your Python library version - it's on 0.8.19, right?
h
@early-lamp-41924 -
pip3 list
WARNING: pip is being invoked by an old script wrapper. This will fail in a future version of pip.
Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.
Package        Version
--------------------- -----------
acryl-datahub     0.8.19.1
Python version is 2.7.18
e
oh interesting
thought we only allow python3
what is the connector you are using to get this lineage?
h
Rest Emitter
python3 --version Python 3.7.9
e
Ah so you are writing this logic right?
so seems like your glue urn is wrong
can you check it?
not glue
s3
take a look at "urn:li:dataset:(urn:li:dataPlatform:s3,['s3', 's3path.part-00000-6520dc45-b7b7-42df-95bd-a95c4a2b4220-c000.snappy.parquet'],PROD),"
there shouldn't be a list there, so it probably should be
urn:li:dataset:(urn:li:dataPlatform:s3,s3path.part-00000-6520dc45-b7b7-42df-95bd-a95c4a2b4220-c000.snappy.parquet,PROD)
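That broken urn looks like a Python list was interpolated where a plain string path was expected. A minimal sketch of the difference (the path here is an illustrative placeholder, not the real one from the logs):

```python
def make_dataset_urn(platform: str, name, env: str = "PROD") -> str:
    # If 'name' is accidentally a list, f-string interpolation embeds
    # its repr, producing the broken "['s3', '...']" urn seen above.
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

parts = ["s3", "s3path/part-00000.snappy.parquet"]  # hypothetical input

bad = make_dataset_urn("s3", parts)     # list repr leaks into the urn
good = make_dataset_urn("s3", parts[1]) # pass the path string itself

print(bad)
print(good)
# → urn:li:dataset:(urn:li:dataPlatform:s3,s3path/part-00000.snappy.parquet,PROD)
```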
h
@early-lamp-41924 -
19:05:00.961 [pool-8-thread-1] INFO c.l.m.filter.RestliLoggingFilter:56 - POST /entities?action=ingest - ingest - 200 - 26ms
19:05:00.964 [mae-consumer-job-client-0-C-1] INFO c.l.m.k.MetadataAuditEventsProcessor:100 - {urn=urn:li:dataset:(urn:li:dataPlatform:glue,covid_19.enigma_aggregation_global_countries,PROD), aspects=[{com.linkedin.metadata.key.DatasetKey={name=covid_19.enigma_aggregation_global_countries, platform=urn:li:dataPlatform:glue, origin=PROD}}, {com.linkedin.dataset.UpstreamLineage={upstreams=[{auditStamp={actor=urn:li:corpuser:unknown, time=0}, type=TRANSFORMED, dataset=urn:li:dataset:(urn:li:dataPlatform:s3,samples3bucket/cp,PROD),}]}}]}
19:05:00.965 [mae-consumer-job-client-0-C-1] ERROR c.l.m.k.MetadataAuditEventsProcessor:186 - Error in getting documents from snapshot: java.lang.RuntimeException: Failed to execute method for class [com.linkedin.dataset.Upstream], field [dataset] for snapshot {urn=urn:li:dataset:(urn:li:dataPlatform:glue,covid_19.enigma_aggregation_global_countries,PROD), aspects=[{com.linkedin.metadata.key.DatasetKey={name=covid_19.enigma_aggregation_global_countries, platform=urn:li:dataPlatform:glue, origin=PROD}}, {com.linkedin.dataset.UpstreamLineage={upstreams=[{auditStamp={actor=urn:li:corpuser:unknown, time=0}, type=TRANSFORMED, dataset=urn:li:dataset:(urn:li:dataPlatform:s3,samples3bucket/cp,PROD),}]}}]}
19:05:00.966 [mae-consumer-job-client-0-C-1] ERROR c.l.m.k.MetadataAuditEventsProcessor:108 - Error deserializing message: java.lang.RuntimeException: Failed to execute method for class [com.linkedin.dataset.Upstream], field [dataset]
19:05:00.967 [mae-consumer-job-client-0-C-1] ERROR c.l.m.k.MetadataAuditEventsProcessor:109 - Message: {"auditHeader": null, "oldSnapshot": null, "oldSystemMetadata": null, "newSnapshot": {"urn": "urn:li:dataset:(urn:li:dataPlatform:glue,covid_19.enigma_aggregation_global_countries,PROD)", "aspects": [{"platform": "urn:li:dataPlatform:glue", "name": "covid_19.enigma_aggregation_global_countries", "origin": "PROD"}, {"upstreams": [{"auditStamp": {"time": 0, "actor": "urn:li:corpuser:unknown", "impersonator": null}, "dataset": "urn:li:dataset:(urn:li:dataPlatform:s3,samples3bucket/cp,PROD),", "type": "TRANSFORMED"}]}]}, "newSystemMetadata": {"lastObserved": 1645815900935, "runId": null, "registryName": null, "registryVersion": null, "properties": null}, "operation": "UPDATE"}
19:05:01.008 [qtp544724190-62] INFO c.l.metadata.entity.EntityService:386 - INGEST urn urn:li:dataFlow:(airflow,mdms_dataset_to_dataset_lineages,prod) with system metadata {lastObserved=1645815901008}
19:05:01.040 [pool-8-thread-1] INFO c.l.m.filter.RestliLoggingFilter:56 - POST /entities?action=ingest - ingest - 200 - 32ms
19:05:01.047 [qtp544724190-10] INFO c.l.metadata.entity.EntityService:386 - INGEST urn urn:li:dataJob:(urn:li:dataFlow:(airflow,mdms_dataset_to_dataset_lineages,prod),ingest_dataset_to_dataset_lineage) with system metadata {lastObserved=1645815901046}
👍 1
e
so with ^ it still doesn’t show lineage?
h
Nope
o
field [dataset] for snapshot {urn=urn:li:dataset:(urn:li:dataPlatform:glue,covid_19.enigma_aggregation_global_countries,PROD)

dataset=urn:li:dataset:(urn:li:dataPlatform:s3,samples3bucket/cp,PROD)
It makes sense it doesn't have lineage if it's failing to deserialize the upstream lineage aspect
I don't think the weird looking s3 Urn is the problem here, looks like it's failing for a different urn
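One possible culprit worth checking (an assumption on my part, not something confirmed in the thread): the dataset urns in the failing messages all carry a stray trailing comma after PROD), e.g. "…,PROD),", which would make Upstream.dataset an invalid urn. A quick stdlib sanity check before emitting:

```python
import re

# Hedged sketch: validate the rough shape of a dataset urn before
# emitting lineage. The trailing comma visible in the failing log
# lines ("...,PROD),") fails this check.
DATASET_URN = re.compile(
    r"^urn:li:dataset:\(urn:li:dataPlatform:[^,]+,.+,[A-Z]+\)$"
)

def looks_like_dataset_urn(urn: str) -> bool:
    return bool(DATASET_URN.match(urn))

print(looks_like_dataset_urn(
    "urn:li:dataset:(urn:li:dataPlatform:s3,samples3bucket/cp,PROD)"))   # → True
print(looks_like_dataset_urn(
    "urn:li:dataset:(urn:li:dataPlatform:s3,samples3bucket/cp,PROD),"))  # → False
```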
h
@orange-night-91387 Yes, that's true. I tried with other datasets and they are also not emitting lineage. Still getting
21:07:50.753 [mae-consumer-job-client-0-C-1] ERROR c.l.m.k.MetadataAuditEventsProcessor:186 - Error in getting documents from snapshot: java.lang.RuntimeException: Failed to execute method for class [com.linkedin.dataset.Upstream], field [dataset] for snapshot {urn=urn:li:dataset:(urn:li:dataPlatform:glue,db.table1,PROD), aspects=[{com.linkedin.metadata.key.DatasetKey={name=db.table1, platform=urn:li:dataPlatform:glue, origin=PROD}}, {com.linkedin.dataset.UpstreamLineage={upstreams=[{auditStamp={actor=urn:li:corpuser:unknown, time=0}, type=TRANSFORMED, dataset=urn:li:dataset:(urn:li:dataPlatform:glue,db.table1,PROD),}]}}]}
21:07:50.754 [mae-consumer-job-client-0-C-1] ERROR c.l.m.k.MetadataAuditEventsProcessor:108 - Error deserializing message: java.lang.RuntimeException: Failed to execute method for class [com.linkedin.dataset.Upstream], field [dataset]
@orange-night-91387 - Do you have some time to connect on this? I tried clearing the Elasticsearch indices and data and restarted the application. Still getting the same issue.
@orange-night-91387 - NVM, I modified the code to use the MetadataChangeProposalWrapper instead of the MetadataChangeEvent. I am able to emit dataset-to-dataset lineage without issues.
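For anyone following along, the shape of what an MCP-based emit sends can be sketched with the stdlib alone (a sketch under assumptions: real code would use datahub.emitter.mcp.MetadataChangeProposalWrapper with a REST emitter; the field names below mirror the GMS ingestProposal payload as I understand it, and the urns are the ones from this thread):

```python
import json

def upstream_lineage_proposal(downstream_urn: str, upstream_urn: str) -> dict:
    """Assemble an upstreamLineage change proposal as a plain dict."""
    aspect = {
        "upstreams": [
            {"dataset": upstream_urn, "type": "TRANSFORMED"},
        ]
    }
    return {
        "entityType": "dataset",
        "entityUrn": downstream_urn,
        "changeType": "UPSERT",
        "aspectName": "upstreamLineage",
        # GMS expects the aspect body as an escaped JSON string.
        "aspect": {
            "contentType": "application/json",
            "value": json.dumps(aspect),
        },
    }

proposal = upstream_lineage_proposal(
    "urn:li:dataset:(urn:li:dataPlatform:glue,covid_19.enigma_aggregation_global_countries,PROD)",
    "urn:li:dataset:(urn:li:dataPlatform:s3,samples3bucket/cp,PROD)",
)
# A REST emitter would POST {"proposal": proposal} to the GMS
# /aspects?action=ingestProposal endpoint.
print(json.dumps(proposal, indent=2))
```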
🎉 1