Hello all I have questions regarding graph service impl usin DataHub #all-things-deployment

boundless-piano-94348

06/08/2023, 8:56 AM

Hello all. I have questions regarding `graph_service_impl` using Neo4j vs. ES.
1. The value of `graph_service_impl` changed from `neo4j` to `elasticsearch` starting from v0.10.0, while the default value in the subcharts is still `neo4j`. Is there a reason for the change? Also, the docs mention that `neo4j` is still the default for backward compatibility. What is the recommended `graph_service_impl` going forward?
2. In what situations does Neo4j have an advantage over ES? For which specific features and scenarios is Neo4j more beneficial?

Another question: which schema registry is recommended, internal or kafka? What are the advantages and disadvantages of each?

delightful-ram-75848

06/09/2023, 1:52 AM

@brainy-tent-14503 might be able to speak to this!

aloof-gpu-11378

06/09/2023, 11:33 PM

Yes, the elasticsearch implementation has more features and has more testing. We’re not actively maintaining feature parity with neo4j other than to review community PRs against it. The helm subcharts should be updated to reflect the same default as the higher level setting at some point.
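
For reference, a minimal sketch of pinning the graph backend in Helm values. It assumes the setting lives under the `global` block of the top-level DataHub chart; the exact key path is an assumption and should be checked against the chart version in use.

```yaml
# values.yaml (sketch): select the graph implementation explicitly
# instead of relying on chart or subchart defaults.
global:
  graph_service_impl: elasticsearch  # or "neo4j" for existing Neo4j-backed installs
```

Setting the value explicitly sidesteps the mismatch between the top-level default and the subchart defaults mentioned above.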

boundless-piano-94348

06/10/2023, 1:16 AM

How about the schema registry between internal and kafka?

aloof-gpu-11378

06/10/2023, 10:48 PM

For new instances, I would prefer internal. However, if you already have an existing installation, we don’t yet support migrating the existing schemas. We hope to address this limitation shortly.

boundless-piano-94348

06/12/2023, 2:37 AM

Is there any expected effect of migrating the schema registry from kafka to internal? I am migrating the existing stateful dependencies (ES, kafka, neo4j) from kube to VM. I want to:
1. Change the schema registry from kafka to internal. I know this is effectively a fresh restart for our kafka dependency.
2. Change the graph_impl from neo4j to ES.
What is the expected effect of these schema registry and graph_impl changes?

boundless-piano-94348

06/12/2023, 3:09 AM

Actually, I have tried the migration and ran the restore indices job. It seems that all assets are restored and displayed in the UI. However, all the ingestions seem to be gone. What is the problem here?

aloof-gpu-11378

06/12/2023, 3:25 PM

1. It is possible to migrate from schema-registry to internal if you’re willing to purge all topic messages (or re-create the topics without old messages) and do not copy the `_schemas` topic to the new instance. You likely wouldn’t bother copying the topic’s data anyway.
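
A sketch of what the switch might look like in Helm values. The `global.kafka.schemaregistry.type` key and its `INTERNAL` / `KAFKA` values are assumptions about the chart layout, not something confirmed in this thread; verify them against your chart's values.yaml before applying.

```yaml
# values.yaml (sketch, key names assumed)
global:
  kafka:
    schemaregistry:
      type: INTERNAL  # assumed value; "KAFKA" would keep an external schema registry
```

As noted in the message above, this only makes sense when the topics start without old messages and the `_schemas` topic is not carried over.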

aloof-gpu-11378

06/12/2023, 3:27 PM

What do you mean by `ingestions`? Are you talking about the previously run history of the ingestion sources? As for the sources themselves, I would have to check, but these are likely stored in SQL.

boundless-piano-94348

06/14/2023, 1:12 AM

@brainy-tent-14503 I mean all the ingestion jobs in managed ingestion, on this page.
Edit: I just realized that I set `datahub-ingestion-cron.enabled` to false during redeployment. Does this cause all the ingestion jobs to be gone?

boundless-piano-94348

06/14/2023, 2:26 AM

More questions:
1. What should I put in the neo4j host and uri if neo4j is not used as graph_impl?

neo4j:
    host: ""
    uri: ""
    username: "neo4j"
    password:
      secretRef: neo4j-secrets
      secretKey: neo4j-password

2. What would you recommend for the MAE and MCE consumers regarding `datahub_standalone_consumers_enabled`? Is it possible to set it to False and change it to True later, when scaling is needed? What would be the drawback of switching to standalone consumers only at that point?

aloof-gpu-11378

06/14/2023, 10:10 PM

1. Nothing needs to be defined for neo4j when it is not in use.
2. For small deployments, I would not enable standalone consumers. These are primarily used when GMS is being overwhelmed by ingestion requests to the point where the UI is impacted. The benefit of standalone consumers is that they can be scaled to lower ingestion latency: adding replicas adds parallelism, up to the number of partitions in the Kafka topic.
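
A sketch of the corresponding Helm values for a small deployment, assuming the flag sits under `global` (an assumed key path; check your chart version):

```yaml
# values.yaml (sketch, key path assumed)
global:
  # false: GMS processes MAE/MCE events itself (fine for small deployments).
  # true: separate MAE/MCE consumer deployments, whose replicas can be
  #       increased up to the number of partitions of the Kafka topic.
  datahub_standalone_consumers_enabled: false
```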

aloof-gpu-11378

06/14/2023, 10:16 PM

Digging into the data model, ingestion jobs are called `dataHubIngestionSource`. They are stored in SQL and indexed in elasticsearch (`datahubingestionsourceindex_v2`). They have urns similar to `urn:li:dataHubIngestionSource:0b935dbc-cbb5-4cc8-8282-827c866d3433`. Running restoreIndices on these urns should theoretically restore them from SQL to the index; however, I have not tested this myself.
