Hey folks! We are currently running datahub `v0.8.38` and we ... (DataHub #troubleshoot)

big-ocean-9800

07/20/2022, 6:12 PM

Hey folks! We are currently running datahub @ `v0.8.38` and we have about 7k data assets loaded. We are seeing a pattern where loading the home page is extremely slow (on the order of 5-10 seconds). I checked metrics around our datahub infrastructure and everything was running at about 10-20% utilization. Our Elasticsearch cluster is at low utilization, its disks are less than 10% utilized, and I don’t see any IO throttling from our cloud provider. Same story with our Postgres instance. I took a look at the calls that hang the longest on the home page, and the consistently slow one is the graphql call `searchAcrossEntities`. From a cursory look through the code, it seems to interact with just Elasticsearch.

I’m wondering if anyone has experienced similar behavior or has any troubleshooting tips. Is this expected performance with the number of assets we have? Are there any changes we can make to our Elasticsearch cluster to help alleviate these problems? I looked through the slack history in this channel and couldn’t find any messages that seem similar (same with github issues, both open and closed). Please let me know if any more information would be helpful. Cheers!
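One way to put numbers on the slow call described above is to replay it outside the browser and time it. The following is only a minimal sketch: `post_graphql`, `time_calls`, and `summarize` are hypothetical helper names, and the URL, token, and query text are assumptions that would have to be copied from the failing request in the browser's network tab.

```python
import json
import statistics
import time
import urllib.request
from typing import Callable, Dict, List


def post_graphql(url: str, token: str, query: str, variables: dict) -> dict:
    """POST a GraphQL request and return the decoded JSON response.

    `url` and `token` are assumptions: use whatever endpoint and
    credentials your own deployment exposes.
    """
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def time_calls(call: Callable[[], object], runs: int = 5) -> List[float]:
    """Run `call` `runs` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        started = time.perf_counter()
        call()
        durations.append(time.perf_counter() - started)
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Reduce a list of durations to min, median, and max."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


# Example usage (placeholders, not real values):
# query = "<searchAcrossEntities query copied from the network tab>"
# variables = {}  # the variables sent with that same request
# call = lambda: post_graphql("<graphql endpoint url>", "<token>", query, variables)
# print(summarize(time_calls(call, runs=10)))
```

Repeating the call several times helps separate a consistently slow query from a one-off cold cache.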

orange-night-91387

07/20/2022, 7:35 PM

We're aware of this latency and very recently put in some measures to address this. Can you update to latest and let us know if it improves? cc: @big-carpet-38439

big-ocean-9800

07/20/2022, 7:50 PM

Thanks, @orange-night-91387! I missed this release update from datahub! Reading through the `v0.8.41` release notes now and will try it out.

big-carpet-38439

07/20/2022, 10:29 PM

Yes Hunter, let me know how it goes. We also found that K8s was arbitrarily throttling our CPU. We upped our GMS CPU requests, which really helped!
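CPU throttling of the kind mentioned above can be checked from inside the container, since the Linux CFS scheduler records it in the cgroup `cpu.stat` file. Below is a minimal sketch: the function names are hypothetical, and the file location is an assumption that depends on the cgroup version (commonly `/sys/fs/cgroup/cpu.stat` on cgroup v2 and `/sys/fs/cgroup/cpu/cpu.stat` on cgroup v1). Note that these counters reflect the CFS quota, which Kubernetes derives from the CPU limit. The thread mentions raising requests, so which setting mattered in that deployment is not stated.

```python
from typing import Dict


def parse_cpu_stat(text: str) -> Dict[str, int]:
    """Parse the `key value` lines of a cgroup cpu.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_fraction(stats: Dict[str, int]) -> float:
    """Fraction of CFS enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods


# Example usage inside the GMS container (path is an assumption, see above):
# with open("/sys/fs/cgroup/cpu.stat") as handle:
#     print(throttled_fraction(parse_cpu_stat(handle.read())))
```

A non-trivial fraction here can coexist with low average utilization, because the quota is enforced per scheduling period rather than averaged over minutes.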

big-ocean-9800

07/25/2022, 8:16 PM

@orange-night-91387 @big-carpet-38439 I upped our CPU request (running at about 20% cpu utilization for the GMS pod) and upgraded to the latest datahub release. I’m still seeing about 2 seconds of latency for the home page `listRecommendations` graphql call. I’m pulling up the traces for it now to see where it’s spending most of the time. One question for y’all: we are running our Elasticsearch cluster on non-SSDs. This may be the ultimate bottleneck, but I wanted to gauge whether you are running Elasticsearch with SSDs for datahub. I just want to see what a deployment from another group looks like before I start testing out different hardware.

orange-night-91387

07/25/2022, 8:23 PM

We're using SSD storage. Not sure if we have any head-to-head comparisons floating around, but since Elastic can be pretty disk heavy depending on the setup, that could be impacting your deployment. Worth a shot at least if it's not an on-prem deployment that requires you to go out and buy SSDs to test with 😅

big-ocean-9800

07/25/2022, 8:25 PM

Ha, yeah! We are in a cloud provider so it shouldn’t be too big of a deal outside of cost. I see in some of the traces that a lot of time is spent querying elastic. I see extremely low disk utilization for our cluster and no write throttling from the cloud provider. So, it must just be the overall access speed to storage for elastic.
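Before changing hardware, it may help to check how much of the observed query time is spent inside Elasticsearch itself. Search responses carry a `took` field with the server-side execution time in milliseconds, so comparing it with the client's wall-clock time gives a rough split. This is only a sketch: `split_latency` is a hypothetical helper, and the wall-clock measurement is assumed to come from your own tracing or timing code.

```python
from typing import Dict


def split_latency(wall_ms: float, es_response: dict) -> Dict[str, float]:
    """Split a measured request time into Elasticsearch execution and the rest.

    `wall_ms` is the client-side duration of the search request;
    `es_response` is the decoded JSON body of the search response.
    """
    took_ms = float(es_response["took"])
    return {
        "es_took_ms": took_ms,
        # Network, serialization, and client overhead; clamped at zero
        # in case of clock or rounding differences.
        "outside_es_ms": max(wall_ms - took_ms, 0.0),
    }
```

If `es_took_ms` dominates, slow query execution (which storage speed could contribute to) becomes more plausible. If `outside_es_ms` dominates, the time is going somewhere other than query execution.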

big-ocean-9800

07/25/2022, 8:25 PM

Thanks, @orange-night-91387
