# ingestion
g
Hi guys, I am new to DataHub and have some doubts while learning. I'm still working through the official documentation, but I'm curious about the answers to these simple questions:
1. How does DataHub know the lineage? For example, I have two tables, tbA and tbB, where all the data in tbB comes from tbA, and tbB is recreated every hour. When ingesting data, can DataHub analyze that relationship, or does the lineage have to be described some other way?
2. I have built DataHub and ingested my Hive data. When I create a new table, I can't find it in DataHub. Before reading the documentation, I assumed the data change log would be picked up automatically. Is something wrong, or do I need to configure anything?
3. Our Hive tables change frequently: hundreds of tasks generate new tables every hour, and these tables are cleared out regularly. How does DataHub handle them?
I would appreciate it if you could help me clear up these doubts, and I will keep following, learning, and using DataHub.
s
Lineage has to be ingested. If you are using Airflow you can use https://datahubproject.io/docs/metadata-ingestion/#lineage-with-airflow. Or, if you want to manually link tables without a job in between, you can send lineage directly, like this example.
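As a rough illustration of sending lineage directly, here is a minimal sketch using the Python REST emitter. The GMS address, platform, and table names are placeholder assumptions; it requires the `acryl-datahub` package.

```python
# Hypothetical example: declare that hive table db.tbB is built from db.tbA.
import datahub.emitter.mce_builder as builder
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Build a lineage MCE: upstream dataset(s) -> downstream dataset.
lineage_mce = builder.make_lineage_mce(
    [builder.make_dataset_urn("hive", "db.tbA")],  # upstream
    builder.make_dataset_urn("hive", "db.tbB"),    # downstream
)

# Point the emitter at your DataHub GMS endpoint (placeholder URL).
emitter = DatahubRestEmitter("http://localhost:8080")
emitter.emit_mce(lineage_mce)
```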
> data change log can be obtained automatically
You have to schedule a job somewhere using a scheduler that runs your recipe regularly. This recurring job will pick up and ingest the new tables.
DataHub will only ingest what you tell it to ingest. There are schema and table name filters; you can use them to ingest only the tables you want in DataHub.
You can check the Hive docs at https://datahubproject.io/docs/metadata-ingestion/source_docs/hive to see what options are available.
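As an illustration of a recurring ingestion job with such filters, here is a rough sketch using DataHub's programmatic pipeline API. The host, schema name, and regex patterns are assumptions you would replace with your own; it requires `acryl-datahub[hive]`.

```python
# Hypothetical recurring ingestion job: run this from cron or any scheduler
# so newly created Hive tables show up in DataHub.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "hive",
            "config": {
                "host_port": "my-hive-server:10000",          # placeholder HiveServer2 address
                "schema_pattern": {"allow": ["^warehouse$"]},  # only ingest this schema
                "table_pattern": {"deny": [".*_tmp_.*"]},      # skip hourly temp tables
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},     # placeholder GMS address
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```

Scheduling this with something like cron (`0 * * * * python ingest_hive.py`) keeps newly created tables flowing into DataHub on an hourly cadence.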
g
OK, thanks very much for the answer. I haven't used Airflow, so I need to learn more about it and read the Hive docs.
s
You don't need Airflow, it is just commonly used. You can use the code example to send the lineage directly.
s
For question 1: some lineage relationships can be captured automatically, from dashboards down to the databases, based on the connection map parameters. If you are building destination tables from source tables, then that lineage has to be ingested.
g
Hi MilanSahu, about your suggestion: do I need to work out the lineage relationships between all my tables first and then ingest the lineage? My current situation is that my production Hive has thousands of tables. They are a mess; I only know that some of them are related and some have a three-tier relationship. The relationships are also in the data dictionary, but that is a mess too. How do I deal with this scenario? And do I need to configure the connection map parameters myself? It's hard for me to understand.
s
You can try to resolve these issues one by one.
• First, ingest all the tables. Validate that the table metadata is ingested correctly.
• Try to ingest the dashboards from the BI platform you are using (this is where the connection map works; you may check the Looker documentation).
• Then try to add the remaining lineage from your orchestrator (Airflow). This will be a continuous process, not a one-time effort.
p
@square-activity-64562 does Hive ingestion support capturing column descriptions? Does this need a transform function?
s
Yes, Hive should support column descriptions. No transform function needed.