# getting-started
h
We’re thinking about the level of persistence needed, especially for Neo4J, as we might end up hosting it ourselves. In case we need to regenerate the data in Neo4J, what’s the best approach? I guess we can keep a full history in the Kafka topic and reset the cursor manually, but that feels suboptimal from a scaling perspective. Is there a way to tell GMS to resend the MAE messages based on what’s in MySQL? Backups are of course nice, but we see a risk of the ES and Neo4J backups getting out of sync in case of a disaster. Therefore, it would be nice to have a way to repopulate the DBs as a fallback. Or maybe we’re just overthinking this 😅
b
There is a per-URN backfill API available (e.g. https://github.com/linkedin/datahub/blob/master/gms/impl/src/main/java/com/linkedin/metadata/resources/dataset/Datasets.java#L242). That said, we're adding the ability to mass backfill, so the API is subject to change in the near future, which is why it's not well documented yet. @steep-airplane-62865 can shed more light here.
👍 1
h
Wrong thread? ☝️
Thanks @bumpy-keyboard-50565, I like the sound of that backfill possibility!
b
Sorry, my bad. Early morning 😛
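For reference, driving that per-URN backfill action over HTTP could look roughly like the sketch below. This is an assumption-laden example, not the documented API: it assumes GMS is reachable on localhost:8080, that the Rest.li action is named `backfill`, and that the request body just carries the URN — check the `@Action` annotation in Datasets.java for the actual signature before relying on it.

```python
# Hypothetical sketch: ask GMS to re-emit MAE messages for a list of dataset URNs.
# The exact Rest.li action name, path, and body shape depend on the GMS version --
# verify against the @Action annotation in Datasets.java.
import requests

GMS_URL = "http://localhost:8080"  # assumption: GMS running locally


def backfill_dataset(urn: str) -> None:
    """Trigger the (assumed) per-URN backfill action for a single dataset."""
    resp = requests.post(
        f"{GMS_URL}/datasets",
        params={"action": "backfill"},          # Rest.li actions are invoked via ?action=...
        headers={"X-RestLi-Method": "ACTION"},  # standard Rest.li header for action calls
        json={"urn": urn},                      # assumed request body shape
        timeout=30,
    )
    resp.raise_for_status()


if __name__ == "__main__":
    # For a full rebuild you would iterate over every URN stored in MySQL
    # (e.g. exported from the metadata aspect table) instead of a hard-coded list.
    for urn in [
        "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)",
    ]:
        backfill_dataset(urn)
```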
m
@high-hospital-85984: We ETL the metadata topics to a data lake... so that is always there as a way to "backfill"
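A minimal sketch of that replay path, under a couple of assumptions not stated above: the data-lake archive keeps each MAE record as the raw bytes that were originally on the topic (otherwise you'd have to re-encode them against the schema registry first), and the topic uses the default name `MetadataAuditEvent_v4`.

```python
# Hypothetical sketch: re-publish archived MAE records from a data-lake export back
# onto the MAE topic so the ES/Neo4J indexing jobs re-consume them.
from pathlib import Path

from kafka import KafkaProducer  # pip install kafka-python

TOPIC = "MetadataAuditEvent_v4"          # assumption: default DataHub MAE topic name
ARCHIVE_DIR = Path("/mnt/datalake/mae")  # assumption: one serialized record per file

producer = KafkaProducer(bootstrap_servers="localhost:9092")

for record_file in sorted(ARCHIVE_DIR.glob("*.bin")):
    # Re-produce the raw bytes exactly as they were archived.
    producer.send(TOPIC, value=record_file.read_bytes())

producer.flush()
```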