# ingestion
m
Hello, I have run ingestion from Hive a few times with profiling enabled. Everything executes correctly except for the duration of the process. One run took 1569s (about 26 minutes), but the latest one took 3311s (about 55 minutes). I have tried increasing the max_workers property, but the times haven't improved. I don't think it is a problem with the quantity of data (I only have 4 tables, and the largest has 30 rows) or with my deployment, as it didn't use to be this slow. Any tips on how to either speed up the ingestion or determine the actual cause of the problem? Profiling on other sources runs at normal speed, so the problem seems specific to Hive ingestion rather than to ingestion or profiling in general.
h
Hi @microscopic-mechanic-13766, how many columns are there in these tables? Have you checked the logs already? You can also enable debug-level logs to see more details.
m
The maximum number of columns in a table is 7. The logs don't indicate any type of error. It is true that the new "live" logs in the ingestion tab sometimes warn that they could be stale because the last update was some time ago. That is the only "bad" thing that appears in the logs; the rest is a regular log of the steps taken in the process (I already have debug enabled).
For example, I launched a new ingestion around 13:00 (it is now 15:52), and the last messages it printed were the following:
[2022-09-20 13:23:29,781] DEBUG    {datahub.ingestion.run.pipeline:43} -  sink wrote workunit profile-default.prueba
[2022-09-20 13:23:29,783] DEBUG    {datahub.emitter.rest_emitter:235} - Attempting to emit to DataHub GMS; using curl equivalent to:
curl -X POST -H 'User-Agent: python-requests/2.28.0' -H 'Accept-Encoding: gzip, deflate' -H 'Accept: */*' -H 'Connection: keep-alive' -H 'X-RestLi-Protocol-Version: 2.0.0' -H 'Content-Type: application/json' -H 'Authorization: Basic __datahub_system:JohnSnowKnowsNothing' --data '{"proposal": {"entityType": "dataset", "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:hive,default.prueba2,PROD)", "changeType": "UPSERT", "aspectName": "datasetProfile", "aspect": {"value": "{\"timestampMillis\": 1663671701130, \"partitionSpec\": {\"type\": \"FULL_TABLE\", \"partition\": \"FULL_TABLE_SNAPSHOT\"}, \"rowCount\": 0, \"columnCount\": 2, \"fieldProfiles\": [{\"fieldPath\": \"name\", \"uniqueCount\": 0, \"nullCount\": 0, \"sampleValues\": []}, {\"fieldPath\": \"age\", \"uniqueCount\": 0, \"nullCount\": 0, \"sampleValues\": []}]}", "contentType": "application/json"}, "systemMetadata": {"lastObserved": 1663673009688, "runId": "hive-2022_09_20-13_01_02"}}}' '<http://datahub-gms:8080/aspects?action=ingestProposal>'
[2022-09-20 13:23:29,798] DEBUG    {datahub.ingestion.run.pipeline:43} -  sink wrote workunit profile-default.prueba2
I am guessing that for some reason it stopped executing or it got stuck at that point.
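One way to check where the run stalls is to measure the time between consecutive log entries and look at the largest gaps. The following is only a minimal sketch: it assumes the debug log has been saved to a file and that every entry starts with the `[YYYY-MM-DD HH:MM:SS,mmm]` timestamp shown above. The names `parse_ts` and `largest_gaps` and the file name `ingestion.log` are hypothetical, not part of DataHub.

```python
import re
from datetime import datetime

# Matches the leading timestamp of a log entry, e.g. "[2022-09-20 13:23:29,781]".
TS = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})\]")


def parse_ts(line):
    """Return the entry's timestamp, or None if the line has no timestamp."""
    m = TS.match(line)
    if not m:
        return None
    base = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    return base.replace(microsecond=int(m.group(2)) * 1000)


def largest_gaps(lines, top=3):
    """Return the `top` longest pauses as (seconds, entry_before_the_pause)."""
    gaps = []
    prev_ts, prev_line = None, None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            # Continuation lines (such as the curl command) carry no timestamp.
            continue
        if prev_ts is not None:
            gaps.append(((ts - prev_ts).total_seconds(), prev_line.strip()))
        prev_ts, prev_line = ts, line
    return sorted(gaps, key=lambda g: g[0], reverse=True)[:top]


# Example usage (hypothetical file name):
# with open("ingestion.log") as f:
#     for seconds, entry in largest_gaps(f, top=5):
#         print(f"{seconds:10.3f}s after: {entry[:120]}")
```

Each gap is attributed to the entry logged just before the pause, so a long gap points at the step that was running while nothing was printed. A run that is still stuck will not show its final pause this way, since there is no later entry to compare against.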