# ingestion
a
Hello! I have a question about updating existing Dataset, for example its list of owners. Is there a way to add a new Owner to the property "owners" of Ownership aspect without listing the entire list of owners during ingestion?
b
Not currently. Such a thing will need to be added via delta: https://github.com/linkedin/datahub/blob/master/docs/what/delta.md
a
Many thanks for your quick reply! I guess I still have to explore this deltas concept...
b
Yeah it's definitely a more advanced concept and we didn't include any example in the OSS. That said, we use it fairly often for these membership alteration patterns internally.
a
I am not sure what "without listing the entire list of owners during ingestion" means. Say, before we have

```
ownership: {'owner': ['owner A']}
```

can I then just ingest like this?

```
ownership: {'owner': ['owner A', 'owner B', 'owner C']}
```

I thought MAE will pick up the difference?
a
It will rewrite the whole 'owners' list. Deltas allow you to specify only the new elements you want to add, without mentioning the elements already present in the list. I found some PDLs and Java code for `UpstreamLineageDelta`, but I'm not sure how to put all the pieces together, or what steps are needed in the general case. Any help is welcome...
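To make the distinction concrete, here is a minimal Python sketch of the two write semantics discussed above. The function names are hypothetical illustrations, not DataHub APIs:

```python
def full_aspect_write(stored: dict, new_aspect: dict) -> dict:
    """Snapshot-style ingestion replaces the whole aspect: any owner
    missing from new_aspect is silently dropped."""
    return new_aspect

def delta_update(stored: dict, owners_to_add: list) -> dict:
    """Delta-style update names only the new elements; owners already
    present in the stored aspect are preserved."""
    existing = stored.get("owners", [])
    merged = dict(stored)
    merged["owners"] = existing + [o for o in owners_to_add if o not in existing]
    return merged

stored = {"owners": ["owner A"]}
# A full write must repeat "owner A", otherwise it disappears:
full_aspect_write(stored, {"owners": ["owner B"]})  # owner A is lost
# A delta mentions only the addition and keeps owner A:
delta_update(stored, ["owner B"])
```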
b
Added https://github.com/linkedin/datahub/issues/1906 to track the documentation work. Meanwhile, the example you found in code is the best thing we have.
s
@bumpy-keyboard-50565 are you saying deltas are not supported at all in open-source DataHub right now, or that deltas are not well documented and are supported for some things but not others? I'm asking because I'm looking at deltas for lineage. This is nice for setting "downstream" lineage, since every usage of a dataset is a separate "downstream" lineage and those usages might be scattered across many places.
Some context: it looks like in the pegasus schema for the "MetadataChangeEvent" a "Delta" exists: https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/mxe/MetadataChangeEvent.pdl#L25 But the "Delta" is an empty union, which makes me think it is not implemented: https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/metadata/delta/Delta.pdl#L6
b
It is "supported" in the sense that it works if implemented correctly. There is a lack of documentation on this front. It's more involved than your usual aspects.
s
@bumpy-keyboard-50565 do you mind double checking the delta functionality is open sourced? The MXE pegasus schema I linked above shows a delta is an empty union. Additionally the MXEProcessor job does not seem to process proposed deltas and write them to GMS: https://github.com/linkedin/datahub/blob/master/metadata-jobs/mce-consumer-job/src/main/java/com/linkedin/metadata/kafka/MetadataChangeEventsProcessor.java#L54-L58
b
Got it. Most likely that part of the code wasn't open sourced properly. Let me update the issue.
s
Cool thanks for checking!
a
Hi Alex, even though the delta concept is implemented in the gms, when it comes down to how it is persisted in the MySQL database, each `ownership` aspect is a row of record, as shown in the attachment. I think the logic rewrites the whole row to persist. Or I might have misunderstood something.
b
Your understanding is correct. Delta does an atomic partial update before writing the whole aspect back to the DB.
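That read-modify-write behavior can be sketched as follows. This is an illustrative Python toy, not DataHub code: the dict stands in for the one-row-per-aspect MySQL table observed above, and the lock stands in for the DB transaction the gms would use:

```python
import threading

# Toy in-memory stand-in for the aspect table: one row per aspect.
db = {"ownership": {"owners": ["owner A"]}}
_lock = threading.Lock()  # stand-in for the DB transaction

def apply_ownership_delta(owners_to_add: list) -> dict:
    """Atomic read-modify-write: merge the delta in memory, then write
    the WHOLE aspect row back, matching the persistence behavior above."""
    with _lock:
        current = db["ownership"]["owners"]
        merged = current + [o for o in owners_to_add if o not in current]
        db["ownership"] = {"owners": merged}  # full row rewritten
        return db["ownership"]
```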
a
So far it seems `UpstreamLineageDelta` has almost all the elements to allow partial updates of the `UpstreamLineage` aspect. There is `UpstreamLineageResource.deltaUpdate()`, which implements the actual update of the lineage list, and the REST API gets generated in `com.linkedin.dataset.datasets.snapshot.json`. Completing this use case would be a good example to build on.
b
Yup that's the intention 🙂
a
Hmm, I finally found how to perform a partial update on the `UpstreamLineage` aspect (currently only add/update is implemented) via the REST API. Something like this does the trick:

```shell
curl -s -H 'X-RestLi-Protocol-Version:2.0.0' -XPOST \
  'http://localhost:8080/datasets/($params:(),name:SampleHiveDataset,origin:PROD,platform:urn%3Ali%3AdataPlatform%3Ahive)/upstreamLineage?action=deltaUpdate' -d '{
  "delta": {
    "upstreamsToUpdate": [
      {
        "auditStamp": {
          "actor": "urn:li:corpuser:jdoe",
          "time": 1581407189000
        },
        "type": "VIEW",
        "dataset": "urn:li:dataset:(urn:li:dataPlatform:hdfs,MyNewHdfsDataset,PROD)"
      }
    ]
  }
}' | jq
```

It returns the updated `UpstreamLineage` aspect. I still have to trace all the elements needed to implement this for some other aspect - the PDL to define, the resource class, etc. Can I use the Kafka API to perform partial updates, or does that require more work? It would be really great if it works with Kafka too 🙂
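For reference, the request above can also be assembled programmatically. A hedged Python sketch that only builds the URL and body (it does not send anything; the host, path layout, and Rest.li 2.0.0 complex-key encoding are taken from the curl example above):

```python
import json
import urllib.parse

def build_delta_update_request(platform_urn: str, name: str, origin: str,
                               upstreams_to_update: list):
    """Build the URL and JSON body for the deltaUpdate action call
    shown in the curl example. Sketch only, not a DataHub client."""
    encoded_platform = urllib.parse.quote(platform_urn, safe="")
    key = f"($params:(),name:{name},origin:{origin},platform:{encoded_platform})"
    url = f"http://localhost:8080/datasets/{key}/upstreamLineage?action=deltaUpdate"
    body = json.dumps({"delta": {"upstreamsToUpdate": upstreams_to_update}})
    return url, body
```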
b
Once you add `UpstreamLineageDelta` to `Delta`, you can start emitting MCEs with the delta info. There's one more step: registering the mapping of `UpstreamLineageDelta` to the corresponding `Action` rest.li method, which is currently missing in `mce-consumer-job`.
a
Well, it is a bit more complicated. Currently in `MetadataChangeEventsProcessor.java`, the `consume()` method handles only snapshots: it calls `processProposedSnapshot()` -> `BaseRemoteWriterDAO.create()`. And the class `RestliRemoteWriterDAO`, the actual implementer of the abstract class `BaseRemoteWriterDAO` (which has only one method, `create()`), handles only snapshots... So if I use a similar Kafka->Rest.li DAO, I would need to add another method besides `create()` to handle deltas. Am I on the right track?
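The dispatch being described could look roughly like the sketch below. This is hypothetical Python, not the actual Java consumer (which, per this thread, currently handles only the snapshot path), and the handler names are illustrative:

```python
def process_proposed_snapshot(snapshot):
    # Stand-in for processProposedSnapshot() -> BaseRemoteWriterDAO.create()
    return ("create", snapshot)

def process_proposed_delta(delta):
    # The missing path: a real implementation would invoke the rest.li
    # Action method registered for this delta type.
    return ("deltaUpdate", delta)

def consume(mce: dict):
    """Route an incoming MCE to the snapshot or delta handler."""
    if mce.get("proposedSnapshot") is not None:
        return process_proposed_snapshot(mce["proposedSnapshot"])
    if mce.get("proposedDelta") is not None:
        return process_proposed_delta(mce["proposedDelta"])
    raise ValueError("MCE carries neither a snapshot nor a delta")
```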
b
Actually it's a bit more involved than that. You'll use the client generated by gms to invoke the specific `Action` method of your choice. See this for more details: https://linkedin.github.io/rest.li/user_guide/restli_client
a
Hello, I was busy with other things for the last several weeks and have now come back to DataHub. I finally implemented the needed pieces of code for delta updates via Kafka. Strangely, PARTIAL_UPDATE Rest.li RequestBuilders aren't generated; maybe there is an option somewhere to activate them. So I implemented it myself, and it doesn't add too much code indeed; I will prepare a PR for UpstreamLineage as an example. I have another question: can we have complete human-readable documentation generated somewhere for all the metadata handled in DataHub? Apparently Pegasus has this feature, but I don't see how to activate it...
b
Thanks, looking forward to the PR. A corresponding RequestBuilder should be generated, assuming you have the appropriate annotation. As for Pegasus documentation, I'm only aware of it in the context of rest.li endpoints. Let me dig more.