# ingestion
g
Hi guys, I am new to DataHub and have some doubts while learning. I'm still working through the official documentation, but I'm curious about the answers to these simple questions:
1. How does DataHub know the lineage? For example, I have two tables, tbA and tbB, where all the data in tbB comes from tbA, and tbB is recreated every hour. When ingesting data, can DataHub analyze that relationship, or does the lineage have to be described some other way?
2. I have built DataHub and ingested my Hive data. When I create a new table, I can't find it in DataHub. Before reading the documentation, I assumed the data change log would be picked up automatically. Is something wrong, or do I need to configure anything?
3. Our Hive tables change frequently: hundreds of tasks generate new tables every hour, and these tables are cleared out regularly. How does DataHub handle them?
I would appreciate it if you could help me clear up these doubts, and I will keep following, learning, and using DataHub.
s
Lineage has to be ingested. If you are using Airflow you can use https://datahubproject.io/docs/metadata-ingestion/#lineage-with-airflow. Or, if you want to manually link tables without a job in between, you can send lineage directly, like this example.
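As a rough illustration of sending lineage directly, here is a minimal sketch using the Python REST emitter. The GMS address, platform, and table names are placeholder assumptions; it requires the `acryl-datahub` package.

```python
# Hypothetical example: declare that hive table db.tbB is built from db.tbA.
import datahub.emitter.mce_builder as builder
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Build a lineage MCE: upstream dataset(s) -> downstream dataset.
lineage_mce = builder.make_lineage_mce(
    [builder.make_dataset_urn("hive", "db.tbA")],  # upstream
    builder.make_dataset_urn("hive", "db.tbB"),    # downstream
)

# Point the emitter at your DataHub GMS endpoint (placeholder URL).
emitter = DatahubRestEmitter("http://localhost:8080")
emitter.emit_mce(lineage_mce)
```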
> data change log can be obtained automatically
You have to schedule a job somewhere using a scheduler that runs your recipe regularly. This recurring job will pick up and ingest the new tables.
DataHub will only ingest what you tell it to ingest. There are schema and table name filters; you can use them to ingest only the tables you want in DataHub.
You can check the Hive docs at https://datahubproject.io/docs/metadata-ingestion/source_docs/hive to see what options are available.
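As an illustration of a recurring ingestion job with such filters, here is a rough sketch using DataHub's programmatic pipeline API. The host, schema name, and regex patterns are assumptions you would replace with your own; it requires `acryl-datahub[hive]`.

```python
# Hypothetical recurring ingestion job: run this from cron or any scheduler
# so newly created Hive tables show up in DataHub.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "hive",
            "config": {
                "host_port": "my-hive-server:10000",          # placeholder HiveServer2 address
                "schema_pattern": {"allow": ["^warehouse$"]},  # only ingest this schema
                "table_pattern": {"deny": [".*_tmp_.*"]},      # skip hourly temp tables
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},     # placeholder GMS address
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```

Scheduling this with something like cron (`0 * * * * python ingest_hive.py`) keeps newly created tables flowing into DataHub on an hourly cadence.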
g
OK, thanks very much for the answer. I haven't used Airflow, so I need to learn more about it and read the Hive docs.
s
You don't need Airflow, it is just commonly used. You can use the code example to send the lineage directly.
s
For question 1: some lineage relationships can be captured automatically, from dashboards down to the databases, based on the connection map parameters. If you are building destination tables from source tables, then that lineage has to be ingested.
g
Hi MilanSahu, about your suggestion: do I need to work out the lineage relationships between all my tables first and then ingest the lineage? My current situation is that my production Hive has thousands of tables. They are a mess; I only know that some of them are related and some have a three-tier relationship. The relationships are also in the data dictionary, but that is a mess too. How do I deal with this scenario? And do I need to configure the connection map parameters myself? It's hard for me to understand.
s
You can try to resolve these issues one by one.
• First, ingest all the tables. Validate that the table metadata is ingested correctly.
• Try to ingest the dashboards from the BI platform you are using (this is where the connection map works; you may check the Looker documentation).
• Then try to add the remaining lineage from your orchestrator (Airflow). This will be a continuous process, not a one-time effort.
p
@square-activity-64562 does Hive ingestion support capturing column descriptions? Does this need a transform function?
s
Yes, Hive should support column descriptions. No transform function needed.