# troubleshoot
bitter-lizard-32293
👋 Hey folks. We've been trying to build some functionality on top of lineage search (the `searchAcrossLineage` GraphQL query) and we've been seeing super high latencies (> 10s) while executing that query. We spent some time digging into things and it looks like we're spending the bulk of our time in the `getLineage` call in the `ESGraphQueryDao` class (as we use ES as our graph store too). I did find one minor bug in that the search lineage results were meant to be cached but that is actually not being done - https://github.com/datahub-project/datahub/pull/5892. This does help with repeated calls for the same URN, but first-time calls are still taking a while. Does anyone have any recommendations on how we could tune / speed things up here? Ballpark-wise, our `graph_service_v1` index has around 36M docs (4.8GB on disk) and is currently running 1 shard and 1 replica (I wonder if this is too low).
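For reference, this is roughly how I've been checking the current shard / replica layout (the host is a placeholder for our cluster endpoint):
```python
# Rough sketch for inspecting the graph index layout; ES_HOST is a placeholder.
import requests

ES_HOST = "http://localhost:9200"  # placeholder for our ES endpoint

# Per-shard breakdown: primary vs replica, doc count, size on disk
print(requests.get(f"{ES_HOST}/_cat/shards/graph_service_v1?v").text)

# Index-level settings: number_of_shards / number_of_replicas
print(requests.get(f"{ES_HOST}/graph_service_v1/_settings").json())
```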
l
Hi @bitter-lizard-32293 - thanks for opening a PR! In terms of tuning recommendations - I’ll defer to @mammoth-bear-12532 & @orange-night-91387
orange-night-91387
Sharding recommendations from Elastic suggest starting to increase shards in the 20-40 GB per-shard range, so I'm not sure that sharding will make a significant impact here. Replicas can help if query load is high, but with a low query load I wouldn't expect a small number of high-latency queries to be significantly improved. How many levels of lineage do the entities that are resulting in slow queries have?
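(If you do want to experiment with replicas, `number_of_replicas` is a dynamic index setting, so something like the sketch below can be applied without a reindex - the host and replica count are just placeholders.)
```python
# Hedged sketch: bump replicas on the graph index via a dynamic settings update.
# ES_HOST and the replica count are placeholders for your setup.
import requests

ES_HOST = "http://localhost:9200"  # placeholder

# number_of_replicas is dynamic (no reindex needed); number_of_shards is not.
resp = requests.put(
    f"{ES_HOST}/graph_service_v1/_settings",
    json={"index": {"number_of_replicas": 2}},
)
print(resp.json())  # expect {"acknowledged": true}
```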
bitter-lizard-32293
Yeah I did test with more replicas and it didn't help much. I tried to run:
```graphql
query getDatasetUpstreams($urn: String!) {
  searchAcrossLineage(input: {urn: $urn, direction: UPSTREAM, count: 1000, types: DATA_JOB}) {
    total
    searchResults {
      degree
      entity {
        type
        urn
      }
    }
  }
}
```
for one of our entities and I see degree going up to 5. Total number of results is 17:
```json
{
  "data": {
    "searchAcrossLineage": {
      "total": 17,
      "searchResults": [
        {
          "degree": 3,
          ...
```
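(For context, I'm timing this by hitting the GraphQL endpoint directly, roughly like below - the GMS URL, token, and URN are placeholders for our deployment.)
```python
# Rough sketch of how I'm timing searchAcrossLineage outside the UI.
# GMS_URL, TOKEN, and the URN are placeholders.
import time
import requests

GMS_URL = "http://localhost:8080/api/graphql"  # placeholder for our GMS endpoint
TOKEN = "<personal-access-token>"              # placeholder

QUERY = """
query getDatasetUpstreams($urn: String!) {
  searchAcrossLineage(input: {urn: $urn, direction: UPSTREAM, count: 1000, types: DATA_JOB}) {
    total
    searchResults { degree entity { type urn } }
  }
}
"""

start = time.time()
resp = requests.post(
    GMS_URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"query": QUERY, "variables": {"urn": "urn:li:dataset:(...)"}},  # placeholder URN
)
elapsed = time.time() - start
print(f"{elapsed:.2f}s, total={resp.json()['data']['searchAcrossLineage']['total']}")
```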
(Our query load is quite low - this is mainly powering a page in the UI and we have < 20 active users at the moment)
What are the typical latencies you folks have seen for large-ish graphs on these lineage queries with ES? I wonder if we are trying to push a technology here that isn't suited for this use-case and should instead explore hooking up something like Neo4J / Neptune? It does entail a bit of work, so I'm trying to see if there's a way we can get the ES setup to work though...
orange-night-91387
We're generally seeing a few seconds as the max rather than double digit seconds, but scale can vary pretty widely. We do have integrations with Neo4J and DGraph, but like you said they do take some effort to stand up and Neo4J's licensing can be annoying.
Neo4J vs Elastic also has significant implications on ingestion time from my experience
bitter-lizard-32293
Yeah, I've been looking into hooking up Neptune as it seems like it's sorta compatible with openCypher.
> Neo4J vs Elastic also has significant implications on ingestion time from my experience
Interesting, is Neo4J a lot slower than ES?
> We're generally seeing a few seconds as the max
So it's likely the graphs we're hitting have a somewhat larger / different shape that's making things a little slower. But given you've seen O(seconds), it doesn't seem like we're too far off. Are there any plans to try to optimize this a bit? I'm waiting on one of our infra teams to give us the go-ahead to gather OpenTelemetry traces in Honeycomb in prod, but in QA I did see that the BFS across n hops was a bit slow (partly because we can't push down entity-type predicates to ES as we go hop by hop).
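To illustrate what I mean by the hop-by-hop cost, the traversal is shaped roughly like this (just a sketch, not the actual ESGraphQueryDao code - `fetch_edges` is a stand-in for one ES query against the graph index):
```python
# Conceptual sketch of the multi-hop lineage BFS. Each hop is a separate
# round trip to ES; fetch_edges() stands in for one ES query and is NOT the
# real ESGraphQueryDao implementation.
from typing import Callable, Iterable, Set, Tuple

def lineage_bfs(
    start_urn: str,
    max_hops: int,
    fetch_edges: Callable[[Set[str]], Iterable[Tuple[str, str]]],
) -> Set[Tuple[str, int]]:
    """Return (urn, degree) pairs reachable within max_hops."""
    visited = {start_urn}
    frontier = {start_urn}
    results: Set[Tuple[str, int]] = set()
    for hop in range(1, max_hops + 1):
        if not frontier:
            break
        # One ES round trip per hop (~100ms+ each in our traces), fetching all
        # edges whose source is in the current frontier.
        next_frontier: Set[str] = set()
        for src, dst in fetch_edges(frontier):
            if dst not in visited:
                visited.add(dst)
                next_frontier.add(dst)
                results.add((dst, hop))
        frontier = next_frontier
    # Entity-type filtering (e.g. types: DATA_JOB) can only happen here at the
    # end: intermediate hops may pass through other entity types, so the
    # predicate can't be pushed down into the per-hop ES queries.
    return results
```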
orange-night-91387
Yes, writes don't scale super well since it's a master-slave design, and the bulk loading is designed oddly in that you write through Kafka: https://neo4j.com/labs/kafka/4.0/architecture/sinkconsume/ - it's not really optimized for write performance. This is definitely an area we want to improve, but I don't have an exact timeline to give. There were some recent improvements around lineage in this PR: https://github.com/datahub-project/datahub/pull/5858
bitter-lizard-32293
Yeah, an unknown on our end would be how the ingest perf is if we use AWS Neptune (as I think the Neo4J licensing might make it a bit cumbersome to get sorted out in a reasonable timeframe for us internally). Our changelog timeseries volume isn't high (5-10 events/s); versioned is ~20-30 events/s. I was toying with the idea of slapping a memcache in front of the lineage lookup. We currently have a per-instance in-memory cache, but that doesn't work across instances. The first call for a URN would still be slow, but we might amortize it a bit.
Thanks for sharing the PR, let me check that out.
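What I had in mind for the cache is roughly the following - key on (urn, direction, hops) with a short TTL, backed by something shared like memcached instead of the per-instance map (`cache_client`, the helper names, and the TTL here are all hypothetical):
```python
# Sketch of the shared-cache idea: cache lineage results by (urn, direction,
# hops) with a TTL so repeated lookups across instances skip the multi-hop ES
# traversal. cache_client is a placeholder for a shared store (memcached/Redis).
import hashlib
import json

LINEAGE_TTL_SECONDS = 300  # first call per URN is still slow; repeats are cheap

def lineage_cache_key(urn: str, direction: str, max_hops: int) -> str:
    raw = f"{urn}|{direction}|{max_hops}"
    return "lineage:" + hashlib.sha1(raw.encode()).hexdigest()

def get_lineage_cached(urn, direction, max_hops, cache_client, compute_lineage):
    key = lineage_cache_key(urn, direction, max_hops)
    cached = cache_client.get(key)
    if cached is not None:
        return json.loads(cached)
    result = compute_lineage(urn, direction, max_hops)  # the slow ES BFS path
    cache_client.set(key, json.dumps(result), expire=LINEAGE_TTL_SECONDS)
    return result
```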
OK, circling back @orange-night-91387 - I cherry-picked the PR and the dependent PR (https://github.com/datahub-project/datahub/pull/5539 and https://github.com/datahub-project/datahub/pull/5858) and it looks like for a couple of URNs I tested, the searchAcrossLineage calls are down from ~10s to around a second!
orange-night-91387
Awesome! Glad to hear this made such a dramatic improvement for you 😄 we definitely still want to see how we can bring this down more, but glad to see this brings us to a relatively comfortable spot.
bitter-lizard-32293
Yeah, looking at our Honeycomb traces in our QA env, it looks like the multi-hop nature of the BFS means it's likely going to take a few hundred ms to a second or so. Each BFS hop seems to be ~100-120ms, so with degree going up to 5 that's roughly 5 × ~120ms ≈ 600ms before any other overhead.