# troubleshoot
Mahmoud:
Hello, we noticed that this Kubernetes job, datahub-restore-indices-job-template (acryldata/datahub-upgrade:v0.9.1), is taking a very long time. How can we speed it up given a large set of records? Some stats: the job has run for 22 hours and successfully sent MAEs for 1076000/1282980 rows (83.87% of total). 0 rows ignored (0.00% of total).
Pedro (@incalculable-ocean-74010):
Hello Mahmoud, the restore indices job is limited by the amount of resources it has, as well as the scale of the GMS and Elasticsearch clusters. Are you using the default values? Based on that I can look for some recommendations.
Mahmoud:
Yes, we use the default values, but we scaled GMS and Elasticsearch to 3 pods each.
Mahmoud's colleague:
Hello Pedro, I'm a colleague of Mahmoud. Another example entry from the GMS logs is:
```
2022-12-07T09:12:49.415Z | Args are RestoreIndicesArgs(start=1081000, batchSize=1000, numThreads=1, batchDelayMs=100, aspectName=null, urn=null, urnLike=null)
2022-12-07T09:12:49.415Z | Reading rows 1081000 through 1082000 from the aspects table started.
2022-12-07T09:12:49.415Z | Reading rows 1081000 through 1082000 from the aspects table completed.
2022-12-07T09:12:49.415Z | metrics so far RestoreIndicesResult(ignored=0, rowsMigrated=1081000, timeSqlQueryMs=641, timeGetRowMs=0, timeUrnMs=2354, timeEntityRegistryCheckMs=827, aspectCheckMs=580, createRecordMs=62871, sendMessageMs=308002)
2022-12-07T09:12:49.415Z | Successfully sent MAEs for 1081000/1282980 rows (84.26% of total). 0 rows ignored (0.00% of total)
2022-12-07T09:12:49.415Z | 1394.20 mins taken. 260.50 est. mins to completion. Total mins est. = 1654.70.
```
Mahmoud:
The v0.9.1 Helm charts were used.
Pedro:
The restore indices job works by:
• Reading metadata from the SQL DB that DataHub uses.
• For each aspect in the metadata DB, generating MCL messages that get sent to Kafka, essentially building up a backlog of work (MCL messages).
You can follow this logic in https://github.com/datahub-project/datahub/blob/master/datahub-upgrade/src/main/java/com/linkedin/datahub/upgrade/restoreindices/SendMAEStep.java
DataHub's backend service listens to MCL messages on the Kafka topic and processes them using idempotent operations that write to Elastic. This is async.
From your logs, it seems the restore indices job took 22 hours to read 84.26% of the metadata graph in DataHub, not necessarily to process all those messages. So this is a read-side limitation.
There are a few things that can be done to speed this up. You can increase the number of threads the restore indices job uses to trigger reads from the DB (by default it's 1). You can also increase the number of messages in each batch of work that the restore indices job generates (by default it's 1000 aspects).
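To make that concrete, here is a rough, self-contained sketch of the read-then-produce loop described above. Every class and method name in it is a hypothetical stand-in, not the actual code in SendMAEStep.java:
```java
import java.util.List;
import java.util.stream.LongStream;

// Hypothetical sketch of the restore-indices read loop; the real logic
// lives in SendMAEStep.java and GMS's EntityService.
public class RestoreLoopSketch {

    record AspectRow(long id) {}         // stand-in for a row in the aspects table
    record MclMessage(long aspectId) {}  // stand-in for a Kafka MCL message

    public static void main(String[] args) throws InterruptedException {
        long totalRows = 5_000;   // small demo number; your table has ~1.28M rows
        int batchSize = 1_000;    // default batchSize, as in RestoreIndicesArgs
        long batchDelayMs = 100;  // default pause between batches

        for (long start = 0; start < totalRows; start += batchSize) {
            // 1. Read one page of aspect rows from the SQL DB.
            List<AspectRow> rows = readAspectRows(start, batchSize);
            // 2. Convert each row into an MCL message and produce it to Kafka;
            //    this builds the async backlog that the GMS consumers drain.
            for (AspectRow row : rows) {
                produce(new MclMessage(row.id()));
            }
            // 3. Back off before the next page so the DB is not saturated.
            Thread.sleep(batchDelayMs);
        }
    }

    static List<AspectRow> readAspectRows(long start, int size) {
        // Placeholder for the real paged SELECT against the aspects table.
        return LongStream.range(start, start + size).mapToObj(AspectRow::new).toList();
    }

    static void produce(MclMessage msg) {
        // Placeholder for a Kafka producer send of the MCL message.
    }
}
```
The two knobs map directly onto the loop: batchSize is the page size of each SQL read, and batchDelayMs is idle time between pages.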
Mahmoud:
Ideally, we want to distribute the workload, e.g., multiple jobs where each job processes a subset and sends it to GMS. Is that possible?
Pedro:
That is what the parameters of the job are for. A restore indices job triggers a `KafkaJob`:
```java
public class KafkaJob implements Callable<RestoreIndicesResult>
```
that processes a subset of the data from the DB: https://github.com/datahub-project/datahub/blob/626a06445a39457e276c59352ff58a2fd2[…]va/com/linkedin/datahub/upgrade/restoreindices/SendMAEStep.java
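To sketch that fan-out concretely (again with hypothetical names, not DataHub's actual classes): partition the row range into batches and submit each one as a Callable to a fixed pool of numThreads workers:
```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of batches fanned out across numThreads workers,
// each behaving like a KafkaJob over its own row range.
public class ParallelRestoreSketch {

    record Result(long rowsMigrated) {}  // stand-in for RestoreIndicesResult

    public static void main(String[] args) throws Exception {
        long totalRows = 1_282_980;  // the row count from your logs
        int batchSize = 1_000;
        int numThreads = 4;          // raised from the default of 1

        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        List<Future<Result>> futures = new ArrayList<>();

        // One Callable per batch; the pool caps how many run concurrently.
        for (long start = 0; start < totalRows; start += batchSize) {
            final long batchStart = start;
            final long batchEnd = Math.min(start + batchSize, totalRows);
            Callable<Result> job = () -> restoreBatch(batchStart, batchEnd);
            futures.add(pool.submit(job));
        }

        long migrated = 0;
        for (Future<Result> f : futures) {
            migrated += f.get().rowsMigrated();
        }
        pool.shutdown();
        System.out.printf("Sent MAEs for %d/%d rows%n", migrated, totalRows);
    }

    // Placeholder for the real KafkaJob: read rows [start, end) from the
    // aspects table and produce one MCL message per aspect.
    static Result restoreBatch(long start, long end) {
        return new Result(end - start);
    }
}
```
Note that f.get() returning only means the MCL messages were produced to Kafka; the Elastic writes still happen asynchronously on the consumer side.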
Note that each one of these `KafkaJob`s will call a method (`restoreIndices`) in GMS's EntityService, which uses a SQL connection to the DB. This means that if the DB is configured with, say, 50 connections and GMS has the same config, you can run at most 50 KafkaJobs at the same time, and doing so would likely have unintended side effects whenever GMS needs a SQL connection for other work. I would highly recommend keeping the number of concurrent `KafkaJob`s fairly low, but this depends on your config.
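To illustrate that bound with made-up numbers (not your actual pool config): if the SQL pool is modeled as a semaphore with 50 permits, any KafkaJob beyond the 50th simply blocks until a connection is returned, and regular GMS queries would be stuck in the same queue:
```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical model of the shared SQL connection pool: the numbers are
// made up, but the bound is real, so leave permits free for normal GMS work.
public class ConnectionBoundSketch {

    static final Semaphore SQL_POOL = new Semaphore(50); // e.g. 50 connections

    public static void main(String[] args) throws InterruptedException {
        ExecutorService jobs = Executors.newFixedThreadPool(60);
        for (int i = 0; i < 60; i++) {
            final int id = i;
            jobs.submit(() -> {
                SQL_POOL.acquire();      // job 51+ blocks here, waiting
                try {
                    Thread.sleep(100);   // pretend to run restoreIndices
                    System.out.println("KafkaJob " + id + " finished");
                } finally {
                    SQL_POOL.release();  // hand the connection back
                }
                return null;             // Callable, so acquire() may throw
            });
        }
        jobs.shutdown();
        jobs.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```
In practice, that means keeping numThreads comfortably below the pool size so normal GMS traffic never starves.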
Mahmoud:
Thanks for the info, @incalculable-ocean-74010!