# ingestion
c
Is this more granular implementation and update of aspects something that is on the roadmap? I remember some posts about this. From my understanding, whenever two independent projects push, for example, extra properties to an entity, the last one wins. This is a blocker for a distributed, push-based approach where each team could push some properties (and have ownership over that part of the data) without overwriting another team's properties. (Or, say we as a central team push some properties and want to allow other teams to enhance them with whatever information they think is relevant.) Is there another approach we can take here? (Or is this an anti-pattern?)
b
Hi Adriaan. Yes, you're right. To support finer-grained updates we'd need a way to represent sub-aspect-level patches. While this is something we'd ideally want, it's a nontrivial amount of work and complexity to arrive at. We want to ensure the effort would be well spent, so we're still in the process of understanding use cases around this, and where things work vs. don't work well when it comes to pushing metadata. Is this a use case you folks are going to have? One (non-ideal) approach is to do a read-write loop to make updates to a particular aspect.
c
Yes, it is an issue we are starting to face. We currently have a good baseline of data in for some platforms, but we would like to enrich it with information coming from other platforms, for example details from a Glue job or its last-run date. Being able to do this without deleting our already-ingested properties would be extremely useful. And to distribute ownership, it would make sense that we can merge these aspects; otherwise it will be pretty difficult for multiple teams to enrich parts of the same metadata. (Like I said: us as the central team, plus some data owners who might have extra information they want to add.) A read-write loop is indeed a solution; I'll look into that this week. šŸ™‚ But I'm curious, hasn't such a use case popped up before? It seems like a common problem when everything works with a push-based approach. When I pull and use the ingestion framework, I can just add as many transformers as I want, so then it's fine.
b
It has come up. The initially proposed solution was something called a "Delta" which was basically a patch model for a particular aspect. This model encoded both the data that needed to change on a particular update type as well as the operation that needed to be performed, defined in a custom manner on a per-aspect basis. For example, you could have a patch model for dataset "properties" that looks like this:
```
DatasetPropertiesDelta {
   addEntries: [ { key: "myNewKey", value: "myNewValue" } ],
   removeEntriesWithKey: [ "key1", "key2" ]
}
```
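Applying such a delta might look like the following sketch. This is illustrative only, not the actual per-aspect Delta code that existed in DataHub; the field names follow the example above.

```python
def apply_properties_delta(properties: dict, delta: dict) -> dict:
    """Apply a DatasetPropertiesDelta-style patch to a properties map.

    Sketch only: in practice this merge logic had to be hand-written
    separately for every aspect that supported a Delta.
    """
    result = dict(properties)  # copy so the stored aspect isn't mutated
    for entry in delta.get("addEntries", []):
        result[entry["key"]] = entry["value"]
    for key in delta.get("removeEntriesWithKey", []):
        result.pop(key, None)  # removing an absent key is a no-op
    return result

current = {"key1": "a", "key2": "b", "key3": "c"}
delta = {
    "addEntries": [{"key": "myNewKey", "value": "myNewValue"}],
    "removeEntriesWithKey": ["key1", "key2"],
}
patched = apply_properties_delta(current, delta)
```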
However, the downside of this approach is that it's not easily generalizable. For each aspect that needs partial updates, you have to define a new "delta" model. Further, you have to hand-write logic to merge a "delta" received over Kafka or via Rest with the existing aspect. This basically comes down to putting a read-merge-write loop on the server instead of the client side. If you have many clients that need to do this, it can make sense, but it comes at a high cost to the server maintainers. A more desirable solution would be a true patch model wherein field-level operations can be specified, for example:
```
DatasetPropertiesPatch {
   properties: { $add: [ { "key": "value" } ] }
}
```
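A generic interpreter for that kind of field-level patch could look roughly like this. The `$add`/`$remove` operation names are illustrative (echoing the example above), not an existing DataHub or Rest.li feature.

```python
def apply_patch(aspect: dict, patch: dict) -> dict:
    """Generically apply field-level ops like {"$add": [...]} to map fields.

    Sketch of the idea only. A real implementation would need to handle
    arrays, nested records, and validation against the aspect schema,
    which is where the generalizability gets hard.
    """
    # Shallow-copy the aspect (and its map fields) so the input isn't mutated.
    result = {k: dict(v) if isinstance(v, dict) else v for k, v in aspect.items()}
    for field, ops in patch.items():
        target = dict(result.get(field, {}))
        for entry in ops.get("$add", []):
            target.update(entry)
        for key in ops.get("$remove", []):
            target.pop(key, None)
        result[field] = target
    return result

aspect = {"properties": {"existing": "1"}}
patch = {"properties": {"$add": [{"key": "value"}]}}
patched = apply_patch(aspect, patch)
```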
But this is much trickier to achieve in a generalizable way... and Rest.li / PDL has no built-in help šŸ˜ž (To be fair, neither does Protobuf or Avro.)
c
Interesting, thanks for the explanation! I'll try to add something on my side then and see how it works. We are working on exposing a more high-level / customised API for our data producers, so it can fit quite well in there. And since for now we only need this for the custom properties, I think it will be relatively straightforward for us to add, even if it's not so clean. šŸ™‚
b
Got it - Well let us know how it goes. We are definitely interested in solving this in a more generalizable way šŸ™‚ The toughest cases are exactly what you're working with: collections like maps and arrays which cannot be easily replaced in full
a
Hello, has there been any update or solution to this problem? We are dealing with similar issues while trying to allow multiple writers to update the same aspect(s)
c
@big-carpet-38439 curious if there is anything planned to address this problem where multiple sources/writers may want to update the same aspect. this thread seemed to have a possible solution proposal of keeping track of a map of <writerURN, fieldValue> for each field within an aspect.. https://datahubspace.slack.com/archives/C01HPV8EKTK/p1618847271014600
w
As a user highly interested in this topic, just sharing an idea: multiple writers have existed from the very beginning, since there are both backend ingestors and the UI. Back then, the issue was fixed by separating the write path for each writer, so we have e.g. DatasetProperties and EditableDatasetProperties, and solving the merge at read time, in this case by giving priority to updates from EditableDatasetProperties. We could scale this solution to N > 2 writers:
• An aspect is associated with a writer (some ideas: ingestion pipeline, API token, …), so multiple perspectives/views of that aspect are actually stored in separate write paths.
• The merge strategy is decided at read time: read all, merge, latest, …
Would something like that work? šŸ¤”
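The idea above can be sketched as follows. This is an illustration of the proposal, not an existing DataHub feature; the writer identifiers and strategy names are made up for the example.

```python
# One stored copy of the aspect per (urn, writer); merge strategy chosen at
# read time, generalizing today's DatasetProperties vs.
# EditableDatasetProperties split to N writers.
store = {}  # (urn, writer) -> aspect dict

def write(urn: str, writer: str, aspect: dict) -> None:
    store[(urn, writer)] = aspect

def read(urn: str, strategy: str = "merge", priority: tuple = ()) -> dict:
    views = {w: a for (u, w), a in store.items() if u == urn}
    if strategy == "latest":
        raise NotImplementedError("would need write timestamps per view")
    # "merge": apply lower-priority writers first, so values from
    # higher-priority writers win on conflicting keys.
    merged = {}
    for writer in reversed(priority):
        merged.update(views.get(writer, {}))
    return merged

write("urn:li:dataset:x", "ingestion", {"desc": "from pipeline", "rows": "100"})
write("urn:li:dataset:x", "ui", {"desc": "edited by a human"})
# UI edits take priority, like EditableDatasetProperties does today.
view = read("urn:li:dataset:x", priority=("ui", "ingestion"))
```

As the rest of the thread notes, the hard part is not the storage but defining a merge policy that makes sense for every aspect and writer combination.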
c
Something like that seems right...although when implemented for aspects like GlossaryTerms and GlobalTags I wonder how that would affect search. You'd only want the "winner" as chosen by the merge strategy to get indexed for search
@big-carpet-38439 any thoughts or plans around this?
b
So the only plans around this are to support aspect-writes based on a version identifier via the GMS ingestion APIs
This RFC from our friends at G-Research details the challenge and the proposal -> https://github.com/datahub-project/datahub/pulls?q=is%3Apr+is%3Aopen+RFC
Separately we do have basic Patch support, the only trickiness is that it has some level of semantics so it's not truly no-code generalizable. @orange-night-91387 can detail that more
We've considered solutions like what @witty-butcher-82399 has proposed, but found that they ultimately increase the complexity of the system (and reduce its understandability) dramatically. The trickiest part is defining a read-time merge strategy that actually makes sense all the time. For the same reason git merges require human input, we would need some way to define a "merge conflict policy", for example by establishing an explicit priority level for each writer and having the highest-priority writer always win.
c
Yeah, I see we can patch or do a read-update-write. But in many cases I want a given ingestion source to do a full overwrite of the fields within an aspect that it originally wrote. For example, if SourceA emitted Tag1, Tag2 and SourceB emitted Tag3, and SourceA later changes to emit only Tag4, I would want the resulting GlobalTag aspect to contain just Tag3, Tag4. As far as I can tell, we aren't keeping enough information in the metadata model to accomplish this.
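The semantics being asked for here can be sketched with per-entry writer attribution. This is hypothetical: as confirmed in the thread, the current metadata model does not store which source emitted each entry.

```python
# Track which source emitted each tag, so a source's new emit replaces
# only its own previous entries instead of the whole list.
tags = {}  # tag -> source that emitted it

def emit_tags(source: str, new_tags: list) -> None:
    """Replace all tags previously attributed to `source` with `new_tags`."""
    # Drop this source's old entries first (collect keys before deleting).
    for tag in [t for t, s in tags.items() if s == source]:
        del tags[tag]
    for tag in new_tags:
        tags[tag] = source

emit_tags("SourceA", ["Tag1", "Tag2"])
emit_tags("SourceB", ["Tag3"])
emit_tags("SourceA", ["Tag4"])  # SourceA now emits only Tag4
result = sorted(tags)           # SourceB's Tag3 survives
```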
b
That's correct, we are not
Yeah, you want attribution of changes at a very fine-grained level (e.g. at the level of single entries in a list)
c
right
b
I'm wondering if this case is most notable when there are relationships involved
Like what you described
c
Currently we are trying to solve this for the Ownership, GlobalTag, and GlossaryTerm arrays, since our org has different sources that maintain some of that data. But I don't necessarily think it will be limited to those in the future.
BTW, I realize this is likely a large effort to solve generically; just wondering if it's planned (sounds like no) and, if not, how we could advocate for / help design it
If I'm reading this RFC correctly, it is related (sounds kind of like advanced optimistic locking?) but wouldn't address the issue above, where we need to know which "writer" owns each field/value: https://github.com/datahub-project/datahub/pull/5818
b
Yes that's true
It's only in this ballpark, not exactly this issue
So yeah I think having explicit ownership of edges is a bit of a larger initiative, but we'd definitely entertain ideas around it. Internally, we've had many debates about it šŸ™‚
a
Do you have any news? @big-carpet-38439