# ingestion
c
Is this more granular implementation and update of aspects something that is on the roadmap? I remember some posts about this. From my understanding, whenever two independent projects push, for example, extra properties to an entity, the last one wins. This is a blocker for a distributed, push-based approach where each team could push some properties (and have ownership over that part of the data) without overwriting another team's properties. (Or, say we as a central team push some properties and want to allow other teams to enhance them with whatever information they think is relevant.) Is there another approach we can take here? (Or is this an anti-pattern?)
b
Hi Adriaan. Yes, you're right. To support finer-grained updates we'd need a way to represent sub-aspect-level patches. While this is something we'd ideally want, it's a nontrivial amount of work and complexity to arrive at. We want to ensure the effort would be well spent, so we're still in the process of understanding use cases around this, and where things work vs. don't work well when it comes to pushing metadata. Is this a use case you folks are going to have? One (non-ideal) approach is to do a read-write loop to make updates to a particular aspect.
c
Yes, it is an issue we are starting to face. We currently have a good baseline of data in for some platforms, but we would like to enrich it with information coming from other platforms, for example details from a Glue job or its last-run date. Being able to do this without deleting our already-ingested properties would be extremely useful. And to distribute ownership, it would make sense that we can merge these aspects; otherwise it will be pretty difficult for multiple teams to enrich parts of the same metadata. (Like I said: us as the central team, plus some data owners who might have extra information they want to add.) A read-write loop is indeed a solution; I'll look into that this week. šŸ™‚ But I'm curious, hasn't such a use case popped up before? It seems like a common problem when everything works with a push-based approach. When I pull and use the ingestion framework, I can just add as many transformers as I want, so then it's fine.
b
It has come up. The initially proposed solution was something called a "Delta" which was basically a patch model for a particular aspect. This model encoded both the data that needed to change on a particular update type as well as the operation that needed to be performed, defined in a custom manner on a per-aspect basis. For example, you could have a patch model for dataset "properties" that looks like this:
```
DatasetPropertiesDelta {
   addEntries: [ { key: "myNewKey", value: "myNewValue" } ],
   removeEntriesWithKey: [ "key1", "key2" ]
}
```
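Applying such a delta might look like the following sketch. This is illustrative only, not the actual per-aspect Delta code that existed in DataHub; the field names follow the example above.

```python
def apply_properties_delta(properties: dict, delta: dict) -> dict:
    """Apply a DatasetPropertiesDelta-style patch to a properties map.

    Sketch only: in practice this merge logic had to be hand-written
    separately for every aspect that supported a Delta.
    """
    result = dict(properties)  # copy so the stored aspect isn't mutated
    for entry in delta.get("addEntries", []):
        result[entry["key"]] = entry["value"]
    for key in delta.get("removeEntriesWithKey", []):
        result.pop(key, None)  # removing an absent key is a no-op
    return result

current = {"key1": "a", "key2": "b", "key3": "c"}
delta = {
    "addEntries": [{"key": "myNewKey", "value": "myNewValue"}],
    "removeEntriesWithKey": ["key1", "key2"],
}
patched = apply_properties_delta(current, delta)
```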
However, the downside of this approach is that it's not easily generalizable. For each aspect that needs partial updates, you have to define a new "delta" model. Further, you have to hand-write logic to merge a "delta" received over Kafka or via Rest with the existing aspect. This basically comes down to putting a read-merge-write loop on the server instead of the client side. If you have many clients that need to do this, it can make sense, but it comes at a high cost to the server maintainers. A more desirable solution would be a true patch model wherein field-level operations can be specified, for example:
```
DatasetPropertiesPatch {
   properties: { $add: [ { "key": "value" } ] }
}
```
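A generic interpreter for that kind of field-level patch could look roughly like this. The `$add`/`$remove` operation names are illustrative (echoing the example above), not an existing DataHub or Rest.li feature.

```python
def apply_patch(aspect: dict, patch: dict) -> dict:
    """Generically apply field-level ops like {"$add": [...]} to map fields.

    Sketch of the idea only. A real implementation would need to handle
    arrays, nested records, and validation against the aspect schema,
    which is where the generalizability gets hard.
    """
    # Shallow-copy the aspect (and its map fields) so the input isn't mutated.
    result = {k: dict(v) if isinstance(v, dict) else v for k, v in aspect.items()}
    for field, ops in patch.items():
        target = dict(result.get(field, {}))
        for entry in ops.get("$add", []):
            target.update(entry)
        for key in ops.get("$remove", []):
            target.pop(key, None)
        result[field] = target
    return result

aspect = {"properties": {"existing": "1"}}
patch = {"properties": {"$add": [{"key": "value"}]}}
patched = apply_patch(aspect, patch)
```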
But this is much trickier to achieve in a generalizable way... and Rest.li / PDL has no built-in help šŸ˜ž (To be fair, neither does Protobuf or Avro.)
c
Interesting, thanks for the explanation! I'll try to add something on my side then and see how it works. We are working on exposing a more high-level / customised API for our data producers, so it can fit quite well in there. And since for now we only need this for the custom properties, I think it will be relatively straightforward for us to add, even if it's not so clean. šŸ™‚
b
Got it - Well let us know how it goes. We are definitely interested in solving this in a more generalizable way šŸ™‚ The toughest cases are exactly what you're working with: collections like maps and arrays which cannot be easily replaced in full
a
Hello, has there been any update or solution to this problem? We are dealing with similar issues while trying to allow multiple writers to update the same aspect(s)
c
@big-carpet-38439 curious if there is anything planned to address this problem where multiple sources/writers may want to update the same aspect. this thread seemed to have a possible solution proposal of keeping track of a map of <writerURN, fieldValue> for each field within an aspect.. https://datahubspace.slack.com/archives/C01HPV8EKTK/p1618847271014600
w
As a user highly interested in this topic, just sharing an idea: multiple writers have existed from the very beginning, since there are both backend ingestors and the UI. Back then, the issue was fixed by separating the write path for each writer, so we have e.g. DatasetProperties and EditableDatasetProperties, and solving the merge at read time, in this case by giving priority to updates from EditableDatasetProperties. We could scale this solution to N > 2 writers:
• An aspect is associated with a writer (some ideas: ingestion pipeline, API token, …), so multiple perspectives/views of that aspect are actually stored in separate write paths.
• The merge strategy is decided at read time: read all, merge, latest, …
Would something like that work? šŸ¤”
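The idea above can be sketched as follows. This is an illustration of the proposal, not an existing DataHub feature; the writer identifiers and strategy names are made up for the example.

```python
# One stored copy of the aspect per (urn, writer); merge strategy chosen at
# read time, generalizing today's DatasetProperties vs.
# EditableDatasetProperties split to N writers.
store = {}  # (urn, writer) -> aspect dict

def write(urn: str, writer: str, aspect: dict) -> None:
    store[(urn, writer)] = aspect

def read(urn: str, strategy: str = "merge", priority: tuple = ()) -> dict:
    views = {w: a for (u, w), a in store.items() if u == urn}
    if strategy == "latest":
        raise NotImplementedError("would need write timestamps per view")
    # "merge": apply lower-priority writers first, so values from
    # higher-priority writers win on conflicting keys.
    merged = {}
    for writer in reversed(priority):
        merged.update(views.get(writer, {}))
    return merged

write("urn:li:dataset:x", "ingestion", {"desc": "from pipeline", "rows": "100"})
write("urn:li:dataset:x", "ui", {"desc": "edited by a human"})
# UI edits take priority, like EditableDatasetProperties does today.
view = read("urn:li:dataset:x", priority=("ui", "ingestion"))
```

As the rest of the thread notes, the hard part is not the storage but defining a merge policy that makes sense for every aspect and writer combination.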
c
Something like that seems right...although when implemented for aspects like GlossaryTerms and GlobalTags I wonder how that would affect search. You'd only want the "winner" as chosen by the merge strategy to get indexed for search
@big-carpet-38439 any thoughts or plans around this?
b
So the only plans around this are to support aspect-writes based on a version identifier via the GMS ingestion APIs
This RFC from our friends at G-Research details the challenge and the proposal -> https://github.com/datahub-project/datahub/pulls?q=is%3Apr+is%3Aopen+RFC
Separately we do have basic Patch support, the only trickiness is that it has some level of semantics so it's not truly no-code generalizable. @orange-night-91387 can detail that more
We've considered solutions like what @witty-butcher-82399 has proposed, but found that they ultimately increase the complexity of the system (and reduce its understandability) dramatically. The trickiest part is defining a read-time merge strategy that actually makes sense all the time. For the same reason git merges require human input, we would need some way to define a "merge conflict policy", for example by establishing an explicit priority level for each writer and having the highest-priority writer always win.
c
Yeah, I see we can patch or do a read-update-write. But in many cases I want a given ingestion source to do a full overwrite of the fields within an aspect that it originally wrote. For example, if SourceA emitted Tag1, Tag2 and SourceB emitted Tag3, and SourceA later changes to emit only Tag4, I would want the resulting GlobalTag aspect to contain just Tag3, Tag4. As far as I can tell, we aren't keeping enough information in the metadata model to accomplish this.
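The semantics being asked for here can be sketched with per-entry writer attribution. This is hypothetical: as confirmed in the thread, the current metadata model does not store which source emitted each entry.

```python
# Track which source emitted each tag, so a source's new emit replaces
# only its own previous entries instead of the whole list.
tags = {}  # tag -> source that emitted it

def emit_tags(source: str, new_tags: list) -> None:
    """Replace all tags previously attributed to `source` with `new_tags`."""
    # Drop this source's old entries first (collect keys before deleting).
    for tag in [t for t, s in tags.items() if s == source]:
        del tags[tag]
    for tag in new_tags:
        tags[tag] = source

emit_tags("SourceA", ["Tag1", "Tag2"])
emit_tags("SourceB", ["Tag3"])
emit_tags("SourceA", ["Tag4"])  # SourceA now emits only Tag4
result = sorted(tags)           # SourceB's Tag3 survives
```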
b
That's correct, we are not
Yeah, you want attribution of changes at a very fine-grained level (e.g. at the level of single entries in a list)
c
right
b
I'm wondering if this case is most notable when there are relationships involved
Like what you described
c
Currently we are trying to solve this for the Ownership, GlobalTag, and GlossaryTerm arrays, since our org has different sources that maintain some of that data. But I don't necessarily think it will be limited to those in the future.
BTW, I realize this is likely a large effort to solve generically; just wondering if it's planned (sounds like no) and, if not, how we could advocate for / help design it
If I'm reading this RFC correctly, it is related (sounds kind of like advanced optimistic locking?) but wouldn't address the issue above, where we need to know which "writer" owns each field/value: https://github.com/datahub-project/datahub/pull/5818
b
Yes that's true
It's only in this ballpark, not exactly this issue
So yeah I think having explicit ownership of edges is a bit of a larger initiative, but we'd definitely entertain ideas around it. Internally, we've had many debates about it šŸ™‚
a
Do you have any news? @big-carpet-38439