# troubleshoot
Mahmoud:
Hello, we noticed that this Kubernetes job, datahub-restore-indices-job-template (acryldata/datahub-upgrade:v0.9.1), is taking a very long time. How can we speed it up given a large set of records? Some stats: the job has run for 22 hours and successfully sent MAEs for 1076000/1282980 rows (83.87% of total). 0 rows ignored (0.00% of total).
Pedro (@incalculable-ocean-74010):
Hello Mahmoud, the restore indices job is limited by the amount of resources it has, as well as the scale of the GMS and Elasticsearch clusters. Are you using the default values? Based on that I can look for some recommendations.
Mahmoud:
Yes, we use the default values, but we scaled GMS and Elasticsearch to 3 pods each.
Mahmoud's colleague:
Hello Pedro, I'm a colleague of Mahmoud. Another example entry from the GMS logs is:
```
2022-12-07T09:12:49.415Z | Args are RestoreIndicesArgs(start=1081000, batchSize=1000, numThreads=1, batchDelayMs=100, aspectName=null, urn=null, urnLike=null)
2022-12-07T09:12:49.415Z | Reading rows 1081000 through 1082000 from the aspects table started.
2022-12-07T09:12:49.415Z | Reading rows 1081000 through 1082000 from the aspects table completed.
2022-12-07T09:12:49.415Z | metrics so far RestoreIndicesResult(ignored=0, rowsMigrated=1081000, timeSqlQueryMs=641, timeGetRowMs=0, timeUrnMs=2354, timeEntityRegistryCheckMs=827, aspectCheckMs=580, createRecordMs=62871, sendMessageMs=308002)
2022-12-07T09:12:49.415Z | Successfully sent MAEs for 1081000/1282980 rows (84.26% of total). 0 rows ignored (0.00% of total)
2022-12-07T09:12:49.415Z | 1394.20 mins taken. 260.50 est. mins to completion. Total mins est. = 1654.70.
```
Mahmoud:
The v0.9.1 Helm charts were used.
Pedro:
The restore indices job works by:
• Reading metadata from the SQL DB that DataHub uses.
• For each aspect in the metadata DB, generating MCL messages that get sent to Kafka, essentially building up a backlog of work (MCL messages).
You can follow this logic in https://github.com/datahub-project/datahub/blob/master/datahub-upgrade/src/main/java/com/linkedin/datahub/upgrade/restoreindices/SendMAEStep.java
DataHub's backend service listens to MCL messages on the Kafka topic and processes them using idempotent operations that write to Elastic. This is async.
From your logs, it seems the restore indices job took 22 hours to read 84.26% of the metadata graph in DataHub, not necessarily to process all those messages. So this is a read-side limitation.
There are a few things that can be done to speed this up. You can increase the number of threads the restore indices job uses to trigger reads from the DB (by default it's 1). You can also increase the number of messages in each batch of work that the restore indices job generates (by default it's 1000 aspects).
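To make that concrete, here is a rough, self-contained sketch of the read-then-produce loop described above. Every class and method name in it is a hypothetical stand-in, not the actual code in SendMAEStep.java:
```java
import java.util.List;
import java.util.stream.LongStream;

// Hypothetical sketch of the restore-indices read loop; the real logic
// lives in SendMAEStep.java and GMS's EntityService.
public class RestoreLoopSketch {

    record AspectRow(long id) {}         // stand-in for a row in the aspects table
    record MclMessage(long aspectId) {}  // stand-in for a Kafka MCL message

    public static void main(String[] args) throws InterruptedException {
        long totalRows = 5_000;   // small demo number; your table has ~1.28M rows
        int batchSize = 1_000;    // default batchSize, as in RestoreIndicesArgs
        long batchDelayMs = 100;  // default pause between batches

        for (long start = 0; start < totalRows; start += batchSize) {
            // 1. Read one page of aspect rows from the SQL DB.
            List<AspectRow> rows = readAspectRows(start, batchSize);
            // 2. Convert each row into an MCL message and produce it to Kafka;
            //    this builds the async backlog that the GMS consumers drain.
            for (AspectRow row : rows) {
                produce(new MclMessage(row.id()));
            }
            // 3. Back off before the next page so the DB is not saturated.
            Thread.sleep(batchDelayMs);
        }
    }

    static List<AspectRow> readAspectRows(long start, int size) {
        // Placeholder for the real paged SELECT against the aspects table.
        return LongStream.range(start, start + size).mapToObj(AspectRow::new).toList();
    }

    static void produce(MclMessage msg) {
        // Placeholder for a Kafka producer send of the MCL message.
    }
}
```
The two knobs map directly onto the loop: batchSize is the page size of each SQL read, and batchDelayMs is idle time between pages.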
Mahmoud:
Ideally, we want to distribute the workload, e.g., multiple jobs where each job processes a subset and sends it to GMS. Is that possible?
Pedro:
That is what the parameters of the job are for. A restore indices job triggers a `KafkaJob`:
```java
public class KafkaJob implements Callable<RestoreIndicesResult>
```
that processes a subset of the data from the DB: https://github.com/datahub-project/datahub/blob/626a06445a39457e276c59352ff58a2fd2[…]va/com/linkedin/datahub/upgrade/restoreindices/SendMAEStep.java
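To sketch that fan-out concretely (again with hypothetical names, not DataHub's actual classes): partition the row range into batches and submit each one as a Callable to a fixed pool of numThreads workers:
```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of batches fanned out across numThreads workers,
// each behaving like a KafkaJob over its own row range.
public class ParallelRestoreSketch {

    record Result(long rowsMigrated) {}  // stand-in for RestoreIndicesResult

    public static void main(String[] args) throws Exception {
        long totalRows = 1_282_980;  // the row count from your logs
        int batchSize = 1_000;
        int numThreads = 4;          // raised from the default of 1

        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        List<Future<Result>> futures = new ArrayList<>();

        // One Callable per batch; the pool caps how many run concurrently.
        for (long start = 0; start < totalRows; start += batchSize) {
            final long batchStart = start;
            final long batchEnd = Math.min(start + batchSize, totalRows);
            Callable<Result> job = () -> restoreBatch(batchStart, batchEnd);
            futures.add(pool.submit(job));
        }

        long migrated = 0;
        for (Future<Result> f : futures) {
            migrated += f.get().rowsMigrated();
        }
        pool.shutdown();
        System.out.printf("Sent MAEs for %d/%d rows%n", migrated, totalRows);
    }

    // Placeholder for the real KafkaJob: read rows [start, end) from the
    // aspects table and produce one MCL message per aspect.
    static Result restoreBatch(long start, long end) {
        return new Result(end - start);
    }
}
```
Note that f.get() returning only means the MCL messages were produced to Kafka; the Elastic writes still happen asynchronously on the consumer side.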
Note that each one of these `KafkaJob`s will call a method (`restoreIndices`) in GMS's EntityService, which uses a SQL connection to the DB. This means that if the DB is configured with, say, 50 connections and GMS has the same config, you can run at most 50 KafkaJobs at the same time, and doing so would likely have unintended side effects whenever GMS needs a SQL connection for other work. I would highly recommend keeping the number of concurrent `KafkaJob`s fairly low, but this depends on your config.
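To illustrate that bound with made-up numbers (not your actual pool config): if the SQL pool is modeled as a semaphore with 50 permits, any KafkaJob beyond the 50th simply blocks until a connection is returned, and regular GMS queries would be stuck in the same queue:
```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical model of the shared SQL connection pool: the numbers are
// made up, but the bound is real, so leave permits free for normal GMS work.
public class ConnectionBoundSketch {

    static final Semaphore SQL_POOL = new Semaphore(50); // e.g. 50 connections

    public static void main(String[] args) throws InterruptedException {
        ExecutorService jobs = Executors.newFixedThreadPool(60);
        for (int i = 0; i < 60; i++) {
            final int id = i;
            jobs.submit(() -> {
                SQL_POOL.acquire();      // job 51+ blocks here, waiting
                try {
                    Thread.sleep(100);   // pretend to run restoreIndices
                    System.out.println("KafkaJob " + id + " finished");
                } finally {
                    SQL_POOL.release();  // hand the connection back
                }
                return null;             // Callable, so acquire() may throw
            });
        }
        jobs.shutdown();
        jobs.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```
In practice, that means keeping numThreads comfortably below the pool size so normal GMS traffic never starves.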
Mahmoud:
Thanks for the info, @incalculable-ocean-74010!