# all-things-deployment
b
Hello all. I have questions regarding graph_service_impl using Neo4j vs ES.
1. The value of graph_service_impl changed from neo4j to elasticsearch starting from v0.10.0, while the default value in the subcharts is still neo4j. Is there a reason for the change? Also, the docs mention that neo4j is still the default for backward compatibility. What is the recommended graph_service_impl from now on?
2. In what situations does Neo4j have an advantage over ES? Which specific features and scenarios benefit more from Neo4j?
Another question: what is the recommended schema registry, internal or kafka? What are the advantages and disadvantages of each?
d
@brainy-tent-14503 might be able to speak to this!
a
Yes, the elasticsearch implementation has more features and has more testing. We’re not actively maintaining feature parity with neo4j other than to review community PRs against it. The helm subcharts should be updated to reflect the same default as the higher level setting at some point.
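As a rough sketch, the higher-level setting in a Helm values file might look like the following. The nesting under global is an assumption about the chart layout, so check it against the values file of the chart version you deploy.

```yaml
# Sketch only: placement under `global` is assumed, verify against your chart.
global:
  graph_service_impl: elasticsearch  # was neo4j in older values files
```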
b
How about the schema registry between internal and kafka?
a
For new instances, I would prefer internal. However, if you already have an existing installation, we don’t yet support migrating the existing schemas. We hope to address this limitation shortly.
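For a new instance, selecting the internal schema registry in Helm values might look roughly like this. Both the key path and the spelling of the value are assumptions, not something confirmed in this thread, so compare them with your chart's values file before using them.

```yaml
# Sketch only: key path and value spelling are assumed, not confirmed here.
global:
  kafka:
    schemaregistry:
      type: INTERNAL  # the alternative discussed in this thread is kafka
```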
b
Is there any expected effect of migrating the schema registry from kafka to internal? I am migrating the existing stateful dependencies (ES, kafka, neo4j) from kube to VM. I want to: 1. Change the schema registry from kafka to internal. I know this is kind of a fresh restart for our kafka dependency. 2. Change the graph_impl from neo4j to ES. What is the expected effect of this schema registry and graph_impl change?
Actually, I have tried the migration and ran the restore indices job. It seems that all assets are restored and displayed in the UI. However, all the ingestions seem to be gone. What is the problem here?
a
1. It is possible to migrate from schema-registry to internal if you’re willing to purge all topic messages (or re-create the topics without the old messages) and do not copy the _schemas topic to the new instance. Likely you wouldn’t be bothering to copy the topic’s data anyway.
What do you mean by ingestions? Are you talking about the previously run history of the ingestion sources? As for the sources themselves, I would have to check, but likely these should be in SQL.
b
@brainy-tent-14503 I mean all ingestion jobs in managed ingestion, in this page. Edit: I just realized that I set datahub-ingestion-cron.enabled to false during redeployment. Does this cause all the ingestion jobs to be gone?
More questions, 1. What should I put in the neo4j host and uri if neo4j is not used as graph_impl?
neo4j:
    host: ""
    uri: ""
    username: "neo4j"
    password:
      secretRef: neo4j-secrets
      secretKey: neo4j-password
2. What would you recommend for the MAE and MCE consumers regarding datahub_standalone_consumers_enabled? Is it possible to set it to false and change it to true later, when scaling is needed? What would be the drawback of switching to standalone consumers at that point?
a
1. Nothing needs to be defined for neo4j when not using it. 2. For small deployments, I would not enable standalone consumers. These are primarily used when GMS is being overwhelmed by ingestion requests to the point where the UI is impacted. The benefit of standalone consumers is that they can be scaled to lower ingestion latency by adding parallelism using replicas, up to the number of partitions in the kafka topic.
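If scaling is needed later, the change might look roughly like the sketch below. The placement of the flag under global, the subchart names, and the replicaCount keys are assumptions about the chart layout, and the replica count of 3 is purely illustrative. Replicas beyond the number of partitions in the topic would sit idle, so keep the count at or below that number.

```yaml
# Sketch only: key names and placement are assumed, verify against your chart.
global:
  datahub_standalone_consumers_enabled: true

# Illustrative replica counts; keep them <= the number of topic partitions.
datahub-mae-consumer:
  replicaCount: 3
datahub-mce-consumer:
  replicaCount: 3
```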
Digging into the data model, ingestion jobs are called dataHubIngestionSource. They are stored in SQL and indexed in elasticsearch (datahubingestionsourceindex_v2). They have urns similar to urn:li:dataHubIngestionSource:0b935dbc-cbb5-4cc8-8282-827c866d3433. Running restoreIndices on these urns should theoretically restore them from SQL to the index; however, I have not tested this myself.
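An untested sketch of how the restore job could be narrowed to these urns is below. It assumes the job accepts a urnLike argument with SQL-style % wildcards and that arguments are passed as a list like this; neither is confirmed in this thread, so check the restore indices documentation for your version first.

```yaml
# Hypothetical argument list for the restore indices job (untested).
# Assumes a `urnLike` argument exists and takes a SQL LIKE pattern.
args:
  - "-u"
  - "RestoreIndices"
  - "-a"
  - "urnLike=urn:li:dataHubIngestionSource:%"
```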