# troubleshoot
w
Hi! I would like to share a possible issue we have found when combining stateful ingestion and transformers. This is how you can reproduce it, along with the evidence we have gathered:
• Set up an ingestion recipe with stateful ingestion enabled for the source, plus any transformer (e.g. the add_dataset_ownership transformer).
• Run the recipe with a deny pattern added (just to force the soft deletion of some dataset). At this point we checked the logs and the generated events look totally correct, with two events for the affected dataset: one from the soft removal with Status(removed=True), and a second one for the Ownership aspect (nothing about the status in the second upsert).
• The dataset is wrongly shown in the UI as a valid dataset (not soft-deleted). We also checked the backend and the dataset has Status(removed=False).
So if the issue is not in the ingestion, it must be the backend that decides to re-enable the dataset for some reason. Looking for something to support our assumption, we found this in the source code: https://github.com/datahub-project/datahub/blob/8c8f1b987a0c9fc29f4005aa8d132ad2550f3f05/metadata-io/src/main/java/com/linkedin/metadata/entity/EntityService.java#L1097
I could be wrong, but it looks like in some cases the backend sets the removal flag to false. It's as if it decides to re-enable the dataset because other aspects are being updated. If that's true, and while it could make sense in some cases, it causes our simple use case to misbehave. WDYT? Could that be the root cause of the issue?
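To make the hypothesis concrete, here is a minimal toy model of the behavior we suspect. It is only a sketch: ToyAspectStore and its upsert method are made up for illustration and are not DataHub's actual EntityService, and the URN is a hypothetical example. It just shows how a status upsert followed by an ownership upsert could leave the dataset active again.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToyAspectStore:
    """Toy in-memory stand-in for the backend (hypothetical, not DataHub code)."""

    aspects: Dict[str, Dict[str, dict]] = field(default_factory=dict)

    def upsert(self, urn: str, aspect_name: str, value: dict) -> None:
        entity = self.aspects.setdefault(urn, {})
        entity[aspect_name] = value
        # Suspected behavior: writing any non-status aspect marks the
        # entity as active again, overwriting an earlier soft delete.
        if aspect_name != "status":
            entity["status"] = {"removed": False}


# Hypothetical URN, used only for this illustration.
urn = "urn:li:dataset:(urn:li:dataPlatform:hive,example_db.example_table,PROD)"
store = ToyAspectStore()
store.upsert(urn, "status", {"removed": True})   # event 1: soft delete
store.upsert(urn, "ownership", {"owners": []})   # event 2: transformer output
print(store.aspects[urn]["status"])
```

In this toy model the final status is removed=False even though no event ever asked for it, which matches what we see in the UI and in the backend.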
My teammate @stocky-energy-24880 first raised the issue here https://datahubspace.slack.com/archives/CUMUWQU66/p1655463500037069 There was likely some misunderstanding at the time about what we actually meant.
Please, @big-carpet-38439 @dazzling-judge-80093 could you have a look? If our hypothesis is correct, this issue is affecting all users combining stateful ingestion sources and transformers in a recipe.
d
Sure, will take a look
w
My team is having this issue as well, so we wanted to use the entitiesv2 endpoint to check for URN existence and status before insertion. But this endpoint always returns a result, even for a URN that doesn't exist: at a minimum it always returns a datasetKey.
We experienced the same issue as Sergio: the status removed flag gets flipped back to false when another aspect is updated. In our case it was an ownership update.
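One way we could work around the "always returns a datasetKey" behavior is to treat a response that carries only the key aspect as "does not exist". The sketch below assumes a response shaped like {"urn": ..., "aspects": {"<aspectName>": {"value": {...}}}}; that shape and the entity_state helper are assumptions for illustration and have not been checked against the actual API.

```python
from typing import Any, Dict


def entity_state(response: Dict[str, Any], key_aspect: str = "datasetKey") -> str:
    """Classify a lookup response as 'missing', 'removed' or 'active'.

    Assumed (unverified) shape:
    {"urn": ..., "aspects": {"<aspectName>": {"value": {...}}}}
    """
    aspects = response.get("aspects") or {}
    # Only the key aspect (or nothing at all) came back: treat as non-existent.
    if not set(aspects) - {key_aspect}:
        return "missing"
    status = aspects.get("status", {}).get("value", {})
    return "removed" if status.get("removed") else "active"
```

Note this heuristic would also report "missing" for an entity that really exists but has nothing except its key aspect, so it is only a rough check.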
w
Hi @dazzling-judge-80093, is there any update on this? Using transformers sets the status removed flag back to false, which in practice disables the stateful ingestion feature. I assume the use of transformers is very common, so this would be impacting many users.
We have new insights on this. We noted that some connectors generate the following logs:
/usr/local/lib/python3.8/site-packages/datahub/ingestion/transformer/add_dataset_browse_path.py:33: DeprecationWarning: Call to deprecated class DatasetTransformer. (Legacy transformer that supports transforming MCE-s using transform_one method. Use BaseTransformer directly and implement the transform_aspect method)
 return cls(config, ctx)
 /usr/local/lib/python3.8/site-packages/datahub/ingestion/transformer/add_dataset_ownership.py:174: DeprecationWarning: Call to deprecated class DatasetTransformer. (Legacy transformer that supports transforming MCE-s using transform_one method. Use BaseTransformer directly and implement the transform_aspect method)
However, there are other transformers that don't complain about deprecated classes, such as pattern_add_dataset_schema_tags. Our experience is that stateful ingestion does not work properly with the transformers showing the deprecation warning, while other transformers (we have only tested the one adding tags to the schema) do work correctly. However, this somewhat invalidates our previous conclusion that the problem was in the backend and not in the ingestion. So, quite confusing. WDYT?
d
Sorry, I'm busy with some other stuff, but it is definitely on my list to check
w
Just adding more insights in order to be helpful 😄 We have noted that the deprecation of these transformer classes was introduced in this PR https://github.com/datahub-project/datahub/pull/4337 It could be that the new support for MCP in the transformers introduced some side effect. Adding @loud-island-88694 to the loop since he was the author of that PR.
m
Thanks for flagging this @witty-butcher-82399, we're taking a look
d
ok, I think I know what the problem is. For legacy reasons, we convert the MCP to a fake MCE, and because of this, the transformer generates ownership for the delete aspects, which basically reactivates the node.
And that is why you only see it with legacy transformers
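A rough toy sketch of that mechanism, and of one possible guard. Both functions and the simplified event dicts are hypothetical and do not mirror the real transformer classes: the first emits an ownership aspect after every event it sees, including soft deletes; the second skips entities whose incoming event is a status with removed=true.

```python
from typing import Any, Dict, Iterable, Iterator

Event = Dict[str, Any]


def _ownership_for(event: Event, owner: str) -> Event:
    """Build a simplified ownership event for the same entity."""
    return {
        "entityUrn": event["entityUrn"],
        "aspectName": "ownership",
        "aspect": {"owners": [owner]},
    }


def legacy_style_add_owner(events: Iterable[Event], owner: str) -> Iterator[Event]:
    """Toy model of the problem: ownership is emitted even for soft deletes."""
    for event in events:
        yield event
        yield _ownership_for(event, owner)


def guarded_add_owner(events: Iterable[Event], owner: str) -> Iterator[Event]:
    """Toy guard: do not add ownership right after a soft-delete status event."""
    for event in events:
        yield event
        is_soft_delete = (
            event["aspectName"] == "status" and event["aspect"].get("removed")
        )
        if not is_soft_delete:
            yield _ownership_for(event, owner)
```

This is only meant to illustrate the idea; it is not necessarily how the actual fix or workaround is implemented.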
w
So the problem is exclusively with the ownership transform?
I don't see where the ownership transform sets the status flag back. Is that something happening in the backend because of it being a fake MCE?
d
it doesn't set it back
but I think that because it sends an ownership aspect, DataHub thinks the dataset is not deleted anymore and changes the flag back
d
This is what I can see:
{
    "auditHeader": null,
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:bigquery,myproject.partition_test.users,PROD)",
    "entityKeyAspect": null,
    "changeType": "UPSERT",
    "aspectName": "status",
    "aspect": {
        "value": "{\"removed\": true}",
        "contentType": "application/json"
    },
    "systemMetadata": {
        "lastObserved": 1656583428073,
        "runId": "bigquery-2022_06_30-12_03_31",
        "registryName": null,
        "registryVersion": null,
        "properties": null
    }
},
{
    "auditHeader": null,
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:bigquery,myproject.partition_test.users,PROD)",
    "entityKeyAspect": null,
    "changeType": "UPSERT",
    "aspectName": "ownership",
    "aspect": {
        "value": "{\"owners\": [{\"owner\": \"urn:li:corpuser:username1\", \"type\": \"PRODUCER\"}, {\"owner\": \"urn:li:corpGroup:groupname\", \"type\": \"PRODUCER\"}], \"lastModified\": {\"time\": 0, \"actor\": \"urn:li:corpuser:unknown\"}}",
        "contentType": "application/json"
    },
    "systemMetadata": {
        "lastObserved": 1656583428073,
        "runId": "bigquery-2022_06_30-12_03_31",
        "registryName": null,
        "registryVersion": null,
        "properties": null
    }
},
yeah, the ownership aspect reactivates it
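For anyone who wants to check their own runs for this pattern, here is a small sketch that scans a list of emitted events shaped like the two above (entityUrn, aspectName, and a JSON string in aspect.value) and reports URNs that are soft-deleted and then receive another aspect. The reactivation_candidates helper is hypothetical, written just for this thread.

```python
import json
from typing import Any, Dict, List, Set


def reactivation_candidates(events: List[Dict[str, Any]]) -> Set[str]:
    """Return URNs soft-deleted in this batch that later get another aspect."""
    removed: Set[str] = set()
    flagged: Set[str] = set()
    for event in events:
        urn = event["entityUrn"]
        if event["aspectName"] == "status":
            # aspect.value is a JSON-encoded string, as in the events above.
            value = json.loads(event["aspect"]["value"])
            if value.get("removed"):
                removed.add(urn)
            else:
                removed.discard(urn)
        elif urn in removed:
            flagged.add(urn)
    return flagged
```

Running it over the two events above would flag the bigquery dataset URN, since the ownership upsert follows the removed=true status for the same entity.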
w
So, what's the conclusion so far?
• The problem is exclusively with the ownership transform
• Or the problem is with the transforms using the fake MCE function (so all dataset transforms?)
• Or the problem is with all the transforms
And more importantly: is there an ETA for a fix? Or is it still too soon for that?
d
The issue is more with those transformers which can generate additional aspect(s) and not just change a specific aspect. I’m thinking/working on a workaround.
w
OK, I see. Thank you!
What do you mean exactly by “transformers which can generate additional aspects”? I guess there are some transformers that enrich the MCE for a given entity by adding or updating the aspects within the event. However, there are other transforms that produce one event per aspect. Is that it?
d
for example, the owner transform can add an owner even if there was no owner before
as far as I know, we are considering removing the backend logic that flips the soft-delete flag back.
w
Adding an aspect such as owner shouldn't be a problem, or at least no more of a problem than updating one. IMO the problem is when transforms add them in different events: the status aspect and the owner aspect arrive separately at the backend, and so the backend decides to fill in the second event with the status(removed=false) aspect.
We did some tests and now I understand what you mean by “like owner transform can add owner even if there were no owner before” 👍
Hi @little-megabyte-1074 This thread is about an issue affecting stateful ingestion, which is quite a critical feature for us (and for many others, I assume 😅), so I'm following up. I see Tamas is off. Do you know if the team is actively working on this and what the status is? Thanks
l
Hi @witty-butcher-82399! I believe this has been addressed in this PR & will be included in our next release! We’re a bit short-staffed this week due to some much-deserved PTO, but we should have one out within the next couple of weeks!
w
Thanks Maggie for sharing the update and the PR 👍