# ingestion
f
Hello! I have a problem with Hive ingestion. There are tables in the Hive DB, but when I ingest them, I cannot see the tables in DataHub. If I instead add the 'include_views: true' option, the table information is brought into DataHub, but as views... I am attaching a screenshot of the ingestion I ran. I would be very grateful if you could tell me the solution!
g
Could you paste the logs from the actual ingestion run?
f
When I put the 'include_views: true' option, the log looks like this:
~~~~ Execution Summary ~~~~

RUN_INGEST - {'errors': [],
 'exec_id': '26cb9a6b-395a-4a11-a539-62025765b1f2',
 'infos': ['2022-09-27 04:18:07.857268 [exec_id=26cb9a6b-395a-4a11-a539-62025765b1f2] INFO: Starting execution for task with name=RUN_INGEST',
           '2022-09-27 04:18:24.032800 [exec_id=26cb9a6b-395a-4a11-a539-62025765b1f2] INFO: stdout=Elapsed seconds = 0\n'
           '  --report-to TEXT                Provide an destination to send a structured\n'
           'This version of datahub supports report-to functionality\n'
           'datahub  ingest run -c /tmp/datahub/ingest/26cb9a6b-395a-4a11-a539-62025765b1f2/recipe.yml --report-to '
           '/tmp/datahub/ingest/26cb9a6b-395a-4a11-a539-62025765b1f2/ingestion_report.json\n'
           '[2022-09-27 04:18:11,108] INFO     {datahub.cli.ingest_cli:179} - DataHub CLI version: 0.8.44\n'
           '[2022-09-27 04:18:11,139] INFO     {datahub.ingestion.run.pipeline:165} - Sink configured successfully. DataHubRestEmitter: configured '
           'to talk to <http://datahub-datahub-gms:8080>\n'
           '[2022-09-27 04:18:16,524] INFO     {datahub.ingestion.run.pipeline:190} - Source configured successfully.\n'
           '[2022-09-27 04:18:16,526] INFO     {datahub.cli.ingest_cli:126} - Starting metadata ingestion\n'
           '[2022-09-27 04:18:22,081] INFO     {datahub.ingestion.source.state_provider.datahub_ingestion_checkpointing_provider:73} - Querying for '
           "the latest ingestion checkpoint for pipelineName:'urn:li:dataHubIngestionSource:0f7c7bfb-d1d0-4e4f-93d1-0e248952aa26', "
           "platformInstanceId:'hive_192.168.91.140:10000_gsc_ods', job_name:'common_ingest_from_sql_source'\n"
           '[2022-09-27 04:18:22,095] INFO     {datahub.ingestion.source.state_provider.datahub_ingestion_checkpointing_provider:93} - The last '
           "committed ingestion checkpoint for pipelineName:'urn:li:dataHubIngestionSource:0f7c7bfb-d1d0-4e4f-93d1-0e248952aa26', "
           "platformInstanceId:'hive_192.168.91.140:10000_gsc_ods', job_name:'common_ingest_from_sql_source' found with start_time: 2022-09-26 "
           '10:52:01.128000+00:00 and a bucket duration of None.\n'
           '[2022-09-27 04:18:22,096] INFO     {datahub.ingestion.source.state.checkpoint:130} - Successfully constructed last checkpoint state for '
           'job common_ingest_from_sql_source\n'
           '[2022-09-27 04:18:22,159] INFO     {datahub.ingestion.run.pipeline:420} - Processing commit request for '
           'DatahubIngestionCheckpointingProvider. Commit policy = CommitPolicy.ON_NO_ERRORS, has_errors=False, has_warnings=False\n'
           '[2022-09-27 04:18:22,159] INFO     {datahub.ingestion.source.state_provider.datahub_ingestion_checkpointing_provider:140} - Committing '
           'ingestion checkpoint for '
           "pipeline:'urn:li:dataHubIngestionSource:0f7c7bfb-d1d0-4e4f-93d1-0e248952aa26',instance:'hive_192.168.91.140:10000_gsc_ods', "
           "job:'common_ingest_from_sql_source'\n"
           '[2022-09-27 04:18:22,169] INFO     {datahub.ingestion.source.state_provider.datahub_ingestion_checkpointing_provider:166} - Committed '
           'ingestion checkpoint for '
           "pipeline:'urn:li:dataHubIngestionSource:0f7c7bfb-d1d0-4e4f-93d1-0e248952aa26',instance:'hive_192.168.91.140:10000_gsc_ods', "
           "job:'common_ingest_from_sql_source'\n"
           '[2022-09-27 04:18:22,169] INFO     {datahub.ingestion.run.pipeline:440} - Successfully committed changes for '
           'DatahubIngestionCheckpointingProvider.\n'
           '[2022-09-27 04:18:22,170] INFO     {datahub.ingestion.reporting.file_reporter:54} - Wrote SUCCESS report successfully to '
           "<_io.TextIOWrapper name='/tmp/datahub/ingest/26cb9a6b-395a-4a11-a539-62025765b1f2/ingestion_report.json' mode='w' encoding='UTF-8'>\n"
           '[2022-09-27 04:18:22,170] INFO     {datahub.cli.ingest_cli:147} - Finished metadata ingestion\n'
           '\n'
           'Cli report:\n'
           "{'cli_version': '0.8.44',\n"
           " 'cli_entry_location': '/tmp/datahub/ingest/venv-hive-0.8.44/lib/python3.9/site-packages/datahub/__init__.py',\n"
           " 'py_version': '3.9.9 (main, Dec 21 2021, 10:03:34) \\n[GCC 10.2.1 20210110]',\n"
           " 'py_exec_path': '/tmp/datahub/ingest/venv-hive-0.8.44/bin/python3',\n"
           " 'os_details': 'Linux-5.4.0-65-generic-x86_64-with-glibc2.31'}\n"
           'Source (hive) report:\n'
           "{'events_produced': '42',\n"
           " 'events_produced_per_sec': '7',\n"
           " 'event_ids': ['gsc_ods.sstp_stp_item_sbc-subtypes',\n"
           "               'gsc_ods.sstp_stp_mst-subtypes',\n"
           "               'sstp_stp_item-viewProperties',\n"
           '               '
           "'container-urn:li:container:21c2cec8d1e1252753fdf82a6eb422af-to-urn:li:dataset:(urn:li:dataPlatform:hive,gsc_ods.sstp_stp_item_cls,PROD)',\n"
           "               'sstp_stp_item_cls-subtypes',\n"
           '               '
           "'container-urn:li:container:21c2cec8d1e1252753fdf82a6eb422af-to-urn:li:dataset:(urn:li:dataPlatform:hive,gsc_ods.sstp_stp_item_sbc,PROD)',\n"
           '               '
           "'container-urn:li:container:21c2cec8d1e1252753fdf82a6eb422af-to-urn:li:dataset:(urn:li:dataPlatform:hive,gsc_ods.sstp_stp_mst,PROD)',\n"
           "               'gsc_ods.sstp_stp_rslt',\n"
           "               'sstp_stp_rslt-subtypes',\n"
           "               'sstp_stp_rslt-viewProperties',\n"
           "               '... sampled of 42 total elements'],\n"
           " 'warnings': {},\n"
           " 'failures': {},\n"
           " 'tables_scanned': '5',\n"
           " 'views_scanned': '5',\n"
           " 'entities_profiled': '0',\n"
           " 'filtered': [],\n"
           " 'soft_deleted_stale_entities': [],\n"
           " 'start_time': '2022-09-27 04:18:16.274423',\n"
           " 'running_time_in_seconds': '6'}\n"
           'Sink (datahub-rest) report:\n'
           "{'total_records_written': '42',\n"
           " 'records_written_per_second': '3',\n"
           " 'warnings': [],\n"
           " 'failures': [],\n"
           " 'start_time': '2022-09-27 04:18:10.041446',\n"
           " 'current_time': '2022-09-27 04:18:22.418061',\n"
           " 'total_duration_in_seconds': '12.38',\n"
           " 'gms_version': 'v0.8.44',\n"
           " 'pending_requests': '0'}\n"
           '\n'
           ' Pipeline finished successfully ; produced 42 events in 6 seconds.\n',
           "2022-09-27 04:18:24.033103 [exec_id=26cb9a6b-395a-4a11-a539-62025765b1f2] INFO: Successfully executed 'datahub ingest'"],
 'structured_report': '{"source": {"type": "hive", "report": {"events_produced": "42", "events_produced_per_sec": "8", "event_ids": '
                      '["gsc_ods.sstp_stp_item_sbc-subtypes", "gsc_ods.sstp_stp_mst-subtypes", "sstp_stp_item-viewProperties", '
                      '"container-urn:li:container:21c2cec8d1e1252753fdf82a6eb422af-to-urn:li:dataset:(urn:li:dataPlatform:hive,gsc_ods.sstp_stp_item_cls,PROD)", '
                      '"sstp_stp_item_cls-subtypes", '
                      '"container-urn:li:container:21c2cec8d1e1252753fdf82a6eb422af-to-urn:li:dataset:(urn:li:dataPlatform:hive,gsc_ods.sstp_stp_item_sbc,PROD)", '
                      '"container-urn:li:container:21c2cec8d1e1252753fdf82a6eb422af-to-urn:li:dataset:(urn:li:dataPlatform:hive,gsc_ods.sstp_stp_mst,PROD)", '
                      '"gsc_ods.sstp_stp_rslt", "sstp_stp_rslt-subtypes", "sstp_stp_rslt-viewProperties", "... sampled of 42 total elements"], '
                      '"warnings": {}, "failures": {}, "tables_scanned": "5", "views_scanned": "5", "entities_profiled": "0", "filtered": [], '
                      '"soft_deleted_stale_entities": [], "start_time": "2022-09-27 04:18:16.274423", "running_time_in_seconds": "5"}}, "sink": '
                      '{"type": "datahub-rest", "report": {"total_records_written": "42", "records_written_per_second": "3", "warnings": [], '
                      '"failures": [], "start_time": "2022-09-27 04:18:10.041446", "current_time": "2022-09-27 04:18:22.169722", '
                      '"total_duration_in_seconds": "12.13", "gms_version": "v0.8.44", "pending_requests": "0"}}}'}
Execution finished successfully!
When there is no such option, the log is as follows.
~~~~ Execution Summary ~~~~

RUN_INGEST - {'errors': [],
 'exec_id': 'dd500399-b2b8-4393-91b8-344f6bc4b9e3',
 'infos': ['2022-09-27 04:21:55.525345 [exec_id=dd500399-b2b8-4393-91b8-344f6bc4b9e3] INFO: Starting execution for task with name=RUN_INGEST',
           '2022-09-27 04:22:05.634332 [exec_id=dd500399-b2b8-4393-91b8-344f6bc4b9e3] INFO: stdout=Elapsed seconds = 0\n'
           '  --report-to TEXT                Provide an destination to send a structured\n'
           'This version of datahub supports report-to functionality\n'
           'datahub  ingest run -c /tmp/datahub/ingest/dd500399-b2b8-4393-91b8-344f6bc4b9e3/recipe.yml --report-to '
           '/tmp/datahub/ingest/dd500399-b2b8-4393-91b8-344f6bc4b9e3/ingestion_report.json\n'
           '[2022-09-27 04:21:57,486] INFO     {datahub.cli.ingest_cli:179} - DataHub CLI version: 0.8.44\n'
           '[2022-09-27 04:21:57,515] INFO     {datahub.ingestion.run.pipeline:165} - Sink configured successfully. DataHubRestEmitter: configured '
           'to talk to <http://datahub-datahub-gms:8080>\n'
           '[2022-09-27 04:21:59,473] INFO     {datahub.ingestion.run.pipeline:190} - Source configured successfully.\n'
           '[2022-09-27 04:21:59,474] INFO     {datahub.cli.ingest_cli:126} - Starting metadata ingestion\n'
           '[2022-09-27 04:22:02,982] INFO     {datahub.ingestion.source.state_provider.datahub_ingestion_checkpointing_provider:73} - Querying for '
           "the latest ingestion checkpoint for pipelineName:'urn:li:dataHubIngestionSource:0f7c7bfb-d1d0-4e4f-93d1-0e248952aa26', "
           "platformInstanceId:'hive_192.168.91.140:10000_gsc_ods', job_name:'common_ingest_from_sql_source'\n"
           '[2022-09-27 04:22:02,997] INFO     {datahub.ingestion.source.state_provider.datahub_ingestion_checkpointing_provider:93} - The last '
           "committed ingestion checkpoint for pipelineName:'urn:li:dataHubIngestionSource:0f7c7bfb-d1d0-4e4f-93d1-0e248952aa26', "
           "platformInstanceId:'hive_192.168.91.140:10000_gsc_ods', job_name:'common_ingest_from_sql_source' found with start_time: 2022-09-27 "
           '04:18:22.097000+00:00 and a bucket duration of None.\n'
           '[2022-09-27 04:22:02,997] INFO     {datahub.ingestion.source.state.checkpoint:130} - Successfully constructed last checkpoint state for '
           'job common_ingest_from_sql_source\n'
           '[2022-09-27 04:22:02,997] INFO     {datahub.ingestion.source.sql.sql_common:626} - Soft-deleting stale entity of type view - '
           'urn:li:dataset:(urn:li:dataPlatform:hive,gsc_ods.sstp_stp_rslt,PROD).\n'
           '[2022-09-27 04:22:02,998] INFO     {datahub.ingestion.source.sql.sql_common:626} - Soft-deleting stale entity of type view - '
           'urn:li:dataset:(urn:li:dataPlatform:hive,gsc_ods.sstp_stp_item_cls,PROD).\n'
           '[2022-09-27 04:22:02,998] INFO     {datahub.ingestion.source.sql.sql_common:626} - Soft-deleting stale entity of type view - '
           'urn:li:dataset:(urn:li:dataPlatform:hive,gsc_ods.sstp_stp_mst,PROD).\n'
           '[2022-09-27 04:22:02,998] INFO     {datahub.ingestion.source.sql.sql_common:626} - Soft-deleting stale entity of type view - '
           'urn:li:dataset:(urn:li:dataPlatform:hive,gsc_ods.sstp_stp_item_sbc,PROD).\n'
           '[2022-09-27 04:22:02,999] INFO     {datahub.ingestion.source.sql.sql_common:626} - Soft-deleting stale entity of type view - '
           'urn:li:dataset:(urn:li:dataPlatform:hive,gsc_ods.sstp_stp_item,PROD).\n'
           '[2022-09-27 04:22:03,132] INFO     {datahub.ingestion.run.pipeline:420} - Processing commit request for '
           'DatahubIngestionCheckpointingProvider. Commit policy = CommitPolicy.ON_NO_ERRORS, has_errors=False, has_warnings=False\n'
           '[2022-09-27 04:22:03,132] INFO     {datahub.ingestion.source.state_provider.datahub_ingestion_checkpointing_provider:140} - Committing '
           'ingestion checkpoint for '
           "pipeline:'urn:li:dataHubIngestionSource:0f7c7bfb-d1d0-4e4f-93d1-0e248952aa26',instance:'hive_192.168.91.140:10000_gsc_ods', "
           "job:'common_ingest_from_sql_source'\n"
           '[2022-09-27 04:22:03,140] INFO     {datahub.ingestion.source.state_provider.datahub_ingestion_checkpointing_provider:166} - Committed '
           'ingestion checkpoint for '
           "pipeline:'urn:li:dataHubIngestionSource:0f7c7bfb-d1d0-4e4f-93d1-0e248952aa26',instance:'hive_192.168.91.140:10000_gsc_ods', "
           "job:'common_ingest_from_sql_source'\n"
           '[2022-09-27 04:22:03,140] INFO     {datahub.ingestion.run.pipeline:440} - Successfully committed changes for '
           'DatahubIngestionCheckpointingProvider.\n'
           '[2022-09-27 04:22:03,141] INFO     {datahub.ingestion.reporting.file_reporter:54} - Wrote SUCCESS report successfully to '
           "<_io.TextIOWrapper name='/tmp/datahub/ingest/dd500399-b2b8-4393-91b8-344f6bc4b9e3/ingestion_report.json' mode='w' encoding='UTF-8'>\n"
           '[2022-09-27 04:22:03,141] INFO     {datahub.cli.ingest_cli:147} - Finished metadata ingestion\n'
           '\n'
           'Cli report:\n'
           "{'cli_version': '0.8.44',\n"
           " 'cli_entry_location': '/tmp/datahub/ingest/venv-hive-0.8.44/lib/python3.9/site-packages/datahub/__init__.py',\n"
           " 'py_version': '3.9.9 (main, Dec 21 2021, 10:03:34) \\n[GCC 10.2.1 20210110]',\n"
           " 'py_exec_path': '/tmp/datahub/ingest/venv-hive-0.8.44/bin/python3',\n"
           " 'os_details': 'Linux-5.4.0-65-generic-x86_64-with-glibc2.31'}\n"
           'Source (hive) report:\n'
           "{'events_produced': '27',\n"
           " 'events_produced_per_sec': '6',\n"
           " 'event_ids': ['container-platforminstance-gsc_ods-urn:li:container:93a7f4080f9d1d30c44551ff89691612',\n"
           "               'container-subtypes-gsc_ods-urn:li:container:93a7f4080f9d1d30c44551ff89691612',\n"
           "               'container-info-gsc_ods-urn:li:container:21c2cec8d1e1252753fdf82a6eb422af',\n"
           '               '
           "'container-parent-container-gsc_ods-urn:li:container:21c2cec8d1e1252753fdf82a6eb422af-urn:li:container:93a7f4080f9d1d30c44551ff89691612',\n"
           "               'gsc_ods.sstp_stp_item',\n"
           "               'gsc_ods.sstp_stp_item-subtypes',\n"
           "               'gsc_ods.sstp_stp_item_sbc',\n"
           "               'gsc_ods.sstp_stp_mst-subtypes',\n"
           '               '
           "'container-urn:li:container:21c2cec8d1e1252753fdf82a6eb422af-to-urn:li:dataset:(urn:li:dataPlatform:hive,gsc_ods.sstp_stp_rslt,PROD)',\n"
           "               'soft-delete-view-urn:li:dataset:(urn:li:dataPlatform:hive,gsc_ods.sstp_stp_rslt,PROD)',\n"
           "               '... sampled of 27 total elements'],\n"
           " 'warnings': {},\n"
           " 'failures': {},\n"
           " 'tables_scanned': '5',\n"
           " 'views_scanned': '0',\n"
           " 'entities_profiled': '0',\n"
           " 'filtered': [],\n"
           " 'soft_deleted_stale_entities': ['urn:li:dataset:(urn:li:dataPlatform:hive,gsc_ods.sstp_stp_rslt,PROD)',\n"
           "                                 'urn:li:dataset:(urn:li:dataPlatform:hive,gsc_ods.sstp_stp_item_cls,PROD)',\n"
           "                                 'urn:li:dataset:(urn:li:dataPlatform:hive,gsc_ods.sstp_stp_mst,PROD)',\n"
           "                                 'urn:li:dataset:(urn:li:dataPlatform:hive,gsc_ods.sstp_stp_item_sbc,PROD)',\n"
           "                                 'urn:li:dataset:(urn:li:dataPlatform:hive,gsc_ods.sstp_stp_item,PROD)'],\n"
           " 'start_time': '2022-09-27 04:21:59.259988',\n"
           " 'running_time_in_seconds': '4'}\n"
           'Sink (datahub-rest) report:\n'
           "{'total_records_written': '27',\n"
           " 'records_written_per_second': '3',\n"
           " 'warnings': [],\n"
           " 'failures': [],\n"
           " 'start_time': '2022-09-27 04:21:56.553535',\n"
           " 'current_time': '2022-09-27 04:22:03.351593',\n"
           " 'total_duration_in_seconds': '6.8',\n"
           " 'gms_version': 'v0.8.44',\n"
           " 'pending_requests': '0'}\n"
           '\n'
           ' Pipeline finished successfully ; produced 27 events in 4 seconds.\n',
           "2022-09-27 04:22:05.634597 [exec_id=dd500399-b2b8-4393-91b8-344f6bc4b9e3] INFO: Successfully executed 'datahub ingest'"],
 'structured_report': '{"source": {"type": "hive", "report": {"events_produced": "27", "events_produced_per_sec": "9", "event_ids": '
                      '["container-platforminstance-gsc_ods-urn:li:container:93a7f4080f9d1d30c44551ff89691612", '
                      '"container-subtypes-gsc_ods-urn:li:container:93a7f4080f9d1d30c44551ff89691612", '
                      '"container-info-gsc_ods-urn:li:container:21c2cec8d1e1252753fdf82a6eb422af", '
                      '"container-parent-container-gsc_ods-urn:li:container:21c2cec8d1e1252753fdf82a6eb422af-urn:li:container:93a7f4080f9d1d30c44551ff89691612", '
                      '"gsc_ods.sstp_stp_item", "gsc_ods.sstp_stp_item-subtypes", "gsc_ods.sstp_stp_item_sbc", "gsc_ods.sstp_stp_mst-subtypes", '
                      '"container-urn:li:container:21c2cec8d1e1252753fdf82a6eb422af-to-urn:li:dataset:(urn:li:dataPlatform:hive,gsc_ods.sstp_stp_rslt,PROD)", '
                      '"soft-delete-view-urn:li:dataset:(urn:li:dataPlatform:hive,gsc_ods.sstp_stp_rslt,PROD)", "... sampled of 27 total elements"], '
                      '"warnings": {}, "failures": {}, "tables_scanned": "5", "views_scanned": "0", "entities_profiled": "0", "filtered": [], '
                      '"soft_deleted_stale_entities": ["urn:li:dataset:(urn:li:dataPlatform:hive,gsc_ods.sstp_stp_rslt,PROD)", '
                      '"urn:li:dataset:(urn:li:dataPlatform:hive,gsc_ods.sstp_stp_item_cls,PROD)", '
                      '"urn:li:dataset:(urn:li:dataPlatform:hive,gsc_ods.sstp_stp_mst,PROD)", '
                      '"urn:li:dataset:(urn:li:dataPlatform:hive,gsc_ods.sstp_stp_item_sbc,PROD)", '
                      '"urn:li:dataset:(urn:li:dataPlatform:hive,gsc_ods.sstp_stp_item,PROD)"], "start_time": "2022-09-27 04:21:59.259988", '
                      '"running_time_in_seconds": "3"}}, "sink": {"type": "datahub-rest", "report": {"total_records_written": "27", '
                      '"records_written_per_second": "4", "warnings": [], "failures": [], "start_time": "2022-09-27 04:21:56.553535", '
                      '"current_time": "2022-09-27 04:22:03.141100", "total_duration_in_seconds": "6.59", "gms_version": "v0.8.44", '
                      '"pending_requests": "0"}}}'}
Execution finished successfully!
g
On both runs, the logs report 'tables_scanned': '5'. I suspect there's an issue with stateful ingestion marking the tables as soft-deleted. Could you try running ingestion with 'include_views: false' and 'stateful_ingestion.ignore_old_state' set to false?
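For reference, a minimal sketch of a recipe with those two settings might look like the following. The host, port, and database are assumptions inferred from the platformInstanceId in the logs, and the sink address is taken from the "Sink configured successfully" log line; the actual recipe was not shared in this thread, so adjust these to match yours.

```yaml
source:
  type: hive
  config:
    # Assumed from platformInstanceId 'hive_192.168.91.140:10000_gsc_ods' in the logs.
    host_port: "192.168.91.140:10000"
    database: gsc_ods
    include_tables: true
    # Views are switched off, as suggested above.
    include_views: false
    stateful_ingestion:
      enabled: true
      # Value as suggested in this thread.
      ignore_old_state: false
sink:
  type: datahub-rest
  config:
    # Taken from the sink log line; replace with your own GMS address.
    server: "http://datahub-datahub-gms:8080"
```

After the run, the 'soft_deleted_stale_entities' list in the source report is the place to check that the five gsc_ods tables are no longer being soft-deleted.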
f
OMG!! All tables are shown!! Thank you very much!!
g
Amazing 🙂