# getting-started
i
Hello, does datahub support deleting concrete entities? From https://github.com/linkedin/datahub/tree/master/gms I see get/search/update & list but no delete.
b
This is a very important question... @microscopic-receptionist-23548 is this supported anywhere? Is there a recommended approach? If there's no formal delete, the closest thing I can think of is to ingest an empty snapshot, which would effectively replace all of the metadata that exists about an urn with nothing.
i
IIRC Rest.li already supports deletes; it's just that the GMS clients throw a not-implemented exception as a placeholder. How hard would it be to implement the logic and propagate that delete to all 3 databases?
b
Yes, I think it's just that the BaseClient class doesn't declare a delete method. We'd have to add one and then probably just implement some base class logic to delete across each store.
m
usually we just have a `Status` aspect and set `removed` to `true` (soft deletion)
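A minimal sketch of the soft delete described above, assuming the aspect is addressed as `com.linkedin.common.Status` and that the snapshot body has `urn` and `aspects` fields. Both are assumptions for illustration, as are the helper name `build_soft_delete_snapshot` and the example urn. The sketch only builds the payload and sends nothing to GMS.

```python
import json

# Assumed fully qualified name of the Status aspect (not confirmed in this thread).
STATUS_ASPECT = "com.linkedin.common.Status"


def build_soft_delete_snapshot(urn: str) -> dict:
    """Return a snapshot body that sets Status.removed to true for `urn`.

    Hypothetical helper: the payload shape is an assumption, not a verified
    GMS contract.
    """
    return {
        "snapshot": {
            "urn": urn,
            "aspects": [{STATUS_ASPECT: {"removed": True}}],
        }
    }


if __name__ == "__main__":
    # Example urn made up for illustration.
    example_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,example_table,PROD)"
    print(json.dumps(build_soft_delete_snapshot(example_urn), indent=2))
```

Because only the `Status` aspect is written, the rest of the entity's metadata stays in the primary store, which is what makes this a soft delete rather than a purge.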
i
For my use-case soft deletion is not really what we want. Essentially we found a bug in the ingestion framework and populated datahub incorrectly for some entities. Now we want to rollback/delete those inserted entities
b
@microscopic-receptionist-23548 Do you know if setting the status to removed will purge the Elasticsearch docs and Neo4j nodes?
m
it will not
we rebuild indices every so often at LI from scratch
(with the way lucene works this is actually "better"; iirc even ES says to "delete indices not documents")
i
DataHub's documentation refers to garbage-collection logic that purges metadata some time after it has been marked with Status.removed = true. Is there any version of this available, internally at LinkedIn or otherwise? I think that could fit the use-case here: run the garbage collection manually and parameterize the time-since-marked-as-removed variable.
sadly this seems to be more of a marker interface (`Retention` has no methods) and is thus limited without edits to GMA 😐
this retention is also only about purging the actual DB, not the secondary indices, afaik. If this is a one-time thing, I'd highly suggest you just do that manually, e.g. with a SQL query
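The manual cleanup suggested above could look roughly like this. It is a sketch under stated assumptions: the table name `metadata_aspect` and its `urn` column are assumed, an in-memory SQLite database stands in for MySQL so the example runs on its own, and `hard_delete_urns` and `demo` are hypothetical helpers.

```python
import sqlite3


def hard_delete_urns(conn, urns):
    """Delete every aspect row for the given urns; return the number of rows removed.

    Assumes a table named `metadata_aspect` with a `urn` column.
    """
    urns = list(urns)
    if not urns:
        return 0
    placeholders = ",".join("?" for _ in urns)
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            f"DELETE FROM metadata_aspect WHERE urn IN ({placeholders})", urns
        )
    return cur.rowcount


def demo():
    # SQLite stands in for MySQL here; the schema is a simplified assumption.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metadata_aspect "
        "(urn TEXT, aspect TEXT, version INTEGER, metadata TEXT)"
    )
    conn.executemany(
        "INSERT INTO metadata_aspect VALUES (?, ?, ?, ?)",
        [
            ("urn:li:dataset:bad", "Status", 0, "{}"),
            ("urn:li:dataset:bad", "Ownership", 0, "{}"),
            ("urn:li:dataset:good", "Status", 0, "{}"),
        ],
    )
    deleted = hard_delete_urns(conn, ["urn:li:dataset:bad"])
    remaining = [
        row[0] for row in conn.execute("SELECT DISTINCT urn FROM metadata_aspect")
    ]
    return deleted, remaining
```

Passing the urns as bound parameters avoids pasting them into the SQL string, and running the delete inside a transaction means a failure leaves the table untouched.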
i
For now I guess manual deletes will have to do. Is there a reason why purging can’t also affect secondary indices?
m
see above for elastic; that's not really how elastic is meant to work. I'm not sure about neo4j @steep-airplane-62865
(I'm not saying we can't set it up for elastic, just that it's less than ideal)
i
Just trying to figure out what is not ideal about deleting. Is this an Elasticsearch limitation?
m
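You can do it, it just isn't recommended iirc. Elastic builds on Lucene, which writes immutable segments (similar in spirit to SSTables), so a delete only marks the document as deleted and the space is reclaimed later when segments merge.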
In either case the efficiency concern is not why we don't have it. We probably don't have it due to a previous lack of a need for it.
s
For secondary stores (both ES and Neo4j), I would suggest just recreating the indices. For MySQL, you can manually delete the data using a SQL script.
Afaik, the lack of support for hard-deleting metadata is an intentional decision, because most of the time you don't want to lose valuable metadata.
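The "recreate the indices" suggestion for the secondary stores can be pictured with a small, store-agnostic sketch. It assumes primary-store rows are available as `(urn, aspect, metadata_json)` tuples and that a soft-deleted entity carries a `Status` aspect with `removed` set to true. The function `rebuild_index` and the row layout are hypothetical, and a plain dictionary stands in for the real search or graph index.

```python
import json


def rebuild_index(rows):
    """Build a fresh urn -> {aspect: metadata} index from primary-store rows.

    `rows` must be a list of (urn, aspect, metadata_json) tuples; it is
    scanned twice. Urns whose Status aspect has removed == true are skipped.
    """
    # First pass: collect urns that are marked as removed.
    removed = {
        urn
        for urn, aspect, metadata in rows
        if aspect == "Status" and json.loads(metadata).get("removed") is True
    }
    # Second pass: index everything else from scratch.
    index = {}
    for urn, aspect, metadata in rows:
        if urn in removed:
            continue
        index.setdefault(urn, {})[aspect] = json.loads(metadata)
    return index
```

Since the index is rebuilt from the primary store rather than edited in place, rows that were hard-deleted from the database and entities that were only soft-deleted both disappear from the new index without any per-document delete.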