# ingestion
e
Hi, I have some questions about DataHub. How does DataHub get MySQL/Hive lineage information? Is there some hook processing like Atlas? I can't find that in DataHub's code or docs.
g
We don’t have anything off the shelf, but @acceptable-architect-70237 has done this internally and wrote about it here https://firststr.com/2021/04/26/apache-hive-lineage-to-datahub/
For other sources, it's generally a much more ad-hoc process. If you're running ETL using Airflow, you can use our native integration https://datahubproject.io/docs/metadata-ingestion/#lineage-with-airflow. Otherwise, you'll need to collect lineage on your own and then push it into DataHub using emitters https://datahubproject.io/docs/metadata-ingestion/#using-as-a-library (example)
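A rough sketch of that emitter path, assuming the quickstart GMS at localhost:8080 and `pip install acryl-datahub`; the dataset names are placeholders and the helper names reflect the Python package of that era, so check the linked docs for your version:

```python
# Minimal sketch: push lineage into DataHub with the Python REST emitter.
# Assumes `pip install acryl-datahub` and GMS running at localhost:8080 (quickstart).
import datahub.emitter.mce_builder as builder
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Placeholder datasets - replace with your own platform/table names.
upstream_urns = [
    builder.make_dataset_urn("mysql", "world_x.city"),
    builder.make_dataset_urn("mysql", "world_x.country"),
]
downstream_urn = builder.make_dataset_urn("hive", "reporting.city_by_country")

# Build a lineage MetadataChangeEvent and send it to GMS over REST.
lineage_mce = builder.make_lineage_mce(upstream_urns, downstream_urn)
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.emit_mce(lineage_mce)
```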
e
Okay, I see. Thank you very much @gray-shoe-75895
m
@gray-shoe-75895 I seem to have an issue with viewing the lineage in the UI. The metadata ingestion seems to be working OK, and I see the data populated. I am using Airflow with a MySQL data source and the mysql_sample_dag.py from the repo. I am using the quickstart Docker images and followed the procedure for Airflow integration using this. Attached all the relevant logs and screenshots. Appreciate any help on this. I was assuming I would be able to see the source of the data (the MySQL DB it's ingested from and the Airflow job) as part of the lineage view.
Attached the Airflow DAG and the logs from Airflow and GMS showing the ingestion
g
Hi @many-pilot-7340. With the Airflow lineage backend, you need to annotate certain tasks with their inlets and outlets, and then that lineage will show up in DataHub - see https://github.com/linkedin/datahub/blob/master/metadata-ingestion/src/datahub_provider/example_dags/lineage_backend_demo.py as an example
Some of the other metadata ingestion sources, e.g. Snowflake, also automatically populate lineage data, but unfortunately MySQL is not one of them
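As a rough illustration of that annotation (not taken verbatim from the linked demo - the table names are made up, and it assumes the DataHub lineage backend is already configured in airflow.cfg per the docs above):

```python
# Sketch of task-level lineage annotation for the DataHub Airflow lineage backend.
# Assumes the backend is configured per the DataHub Airflow docs; names are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

from datahub_provider.entities import Dataset

with DAG(
    dag_id="mysql_lineage_demo",
    start_date=datetime(2021, 12, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Declaring inlets/outlets on a task is what makes lineage show up in DataHub.
    transform = BashOperator(
        task_id="transform_city_data",
        bash_command="echo 'run transformation here'",
        inlets=[Dataset("mysql", "world_x.city")],
        outlets=[Dataset("mysql", "world_x.city_summary")],
    )
```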
m
@gray-shoe-75895 Thank you. It does work now with inlets and outlets configured for MySQL. Is there a documented list of sources where lineage is automatically populated?
@gray-shoe-75895 For those sources that do support auto-population of lineage, is it necessary to use Airflow for ingestion, or can I just use the CLI (datahub ingest -c <recipe-file>), and would that auto-populate the lineage as well?
@gray-shoe-75895 For those sources that do support auto-population of lineage, if we do want to use Airflow for ingestion, e.g. Snowflake, would the attached sample DAG work so that lineage is auto-populated, or are additional changes required?
g
For sources like Snowflake that auto-populate lineage, it is not necessary to use Airflow for ingestion - the CLI will also work just fine
The source docs will include information about lineage if it is available e.g. https://datahubproject.io/docs/metadata-ingestion/source_docs/snowflake/#capabilities
m
@gray-shoe-75895 Is there an API exposed that can be used for the ingestion? Any pointers on where to find it? I see the GMS logs point to this API when I ingest through the CLI. Is this API exposed to be called from a script, and where is the exact format documented?
19:32:55.904 [pool-8-thread-1] INFO c.l.m.filter.RestliLoggingFilter:56 - POST /entities?action=ingest - ingest - 200 - 15ms
19:32:55.916 [qtp544724190-10] INFO c.l.metadata.entity.EntityService:549 - INGEST urn urn:li:dataset:(urn:li:dataPlatform:mysql,performance_schema.users,PROD) with system metadata {lastObserved=1640028767335, runId=mysql-2021_12_20-11_32_26}
@gray-shoe-75895 Also, with the below recipe, I see that it is ingesting a lot more data, rather than just the database specified in the recipe. Is this correct if I want to ingest just one database from MySQL?
---
# see https://datahubproject.io/docs/metadata-ingestion/source_docs/mysql for complete documentation
source:
  type: "mysql"
  config:
    host_port: localhost:3306
    database: world_x
    username: datahub
    password: datahub
    include_tables: True
    include_views: True

# see https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for complete documentation
sink:
  type: "datahub-rest"
  config:
    server: "<http://localhost:8080>"
g
There is an API for ingestion at /entities?action=ingest, and the docs are here: https://datahubproject.io/docs/metadata-service/#restli-api. You can also programmatically call that API through Python - there's a bunch of examples here https://datahubproject.io/docs/lineage/sample_code
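For illustration, a hedged sketch of doing that from a script with the Python REST emitter, which POSTs MetadataChangeEvents to that same /entities?action=ingest endpoint; the dataset and description below are placeholders, and class names may differ in newer releases:

```python
# Sketch: ingest a dataset description from a script via the REST emitter,
# which posts MetadataChangeEvents to GMS at /entities?action=ingest.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetPropertiesClass,
    DatasetSnapshotClass,
    MetadataChangeEventClass,
)

# Placeholder dataset - matches the MySQL example discussed in this thread.
dataset_urn = make_dataset_urn("mysql", "world_x.city")

mce = MetadataChangeEventClass(
    proposedSnapshot=DatasetSnapshotClass(
        urn=dataset_urn,
        aspects=[DatasetPropertiesClass(description="City table from the world_x sample DB")],
    )
)

DatahubRestEmitter(gms_server="http://localhost:8080").emit_mce(mce)
```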
For the second question about mysql: it seems like a bug if it is ingesting more data than expected, but @helpful-optician-78938 can probably confirm
h
Hi @many-pilot-7340, could you elaborate a bit more on "ingesting a lot more data"? What additional data are you seeing?
m
Hi @helpful-optician-78938, attached is my recipe file specifying the database to be ingested as world_x from MySQL. But per the attached logs, it seems like it's ingesting a lot more data. I would expect only the below to be ingested and written to the sink for the given config. Is the config correct per that expectation?
---
# see https://datahubproject.io/docs/metadata-ingestion/source_docs/mysql for complete documentation
source:
  type: "mysql"
  config:
    host_port: localhost:3306
    database: world_x
    username: datahub
    password: datahub
    include_tables: True
    include_views: True

# see https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for complete documentation
sink:
  type: "datahub-rest"
  config:
    server: "<http://localhost:8080>"
[2021-12-21 10:57:10,977] INFO     {datahub.ingestion.run.pipeline:77} - sink wrote workunit world_x.city
[2021-12-21 10:57:11,009] INFO     {datahub.ingestion.run.pipeline:77} - sink wrote workunit world_x.country
[2021-12-21 10:57:11,046] INFO     {datahub.ingestion.run.pipeline:77} - sink wrote workunit world_x.countryinfo
[2021-12-21 10:57:11,083] INFO     {datahub.ingestion.run.pipeline:77} - sink wrote workunit world_x.countrylanguage
@gray-shoe-75895 Thanks. Regarding the API for ingestion (/entities?action=ingest), how can I use it to, say, ingest all the data from a particular database in MySQL, like I do with the recipe above? I did not spot any sample for that in the links. Any pointers on whether that's possible, since I may not know the details of the data I am ingesting?
h
Hi @many-pilot-7340, thanks for reporting the issue. For now, you could use schema_pattern.deny to exclude other tables from getting ingested.
Regarding the ingest endpoint, you can see this example.
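As a sketch of the "drive it from a script" option, the same recipe can also be run programmatically through the Python Pipeline API rather than the CLI; the values below simply mirror the recipe shared earlier and are placeholders:

```python
# Sketch: run the MySQL ingestion recipe from Python instead of `datahub ingest -c ...`.
# Config mirrors the YAML recipe above; host and credentials are placeholders.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "mysql",
            "config": {
                "host_port": "localhost:3306",
                "database": "world_x",
                "username": "datahub",
                "password": "datahub",
                "include_tables": True,
                "include_views": True,
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```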
m
@helpful-optician-78938 schema_pattern.deny: { "mysql", "information_schema", "performance_schema", "sys" }
Is this the correct syntax? I'm getting this error: 1 validation error for MySQLConfig schema_pattern.deny extra fields not permitted (type=value_error.extra)
h
You could simplify it by specifying only the allow pattern. Please follow the example here.
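For reference, a guessed sketch of the corrected config fragment - schema_pattern is a nested allow/deny block of regex patterns rather than a flat schema_pattern.deny key (shown here as the Python dict equivalent; the YAML nesting is the same):

```python
# Sketch of the corrected MySQL source config: schema_pattern takes nested
# allow/deny lists of regexes (same structure applies in the YAML recipe).
source_config = {
    "host_port": "localhost:3306",
    "username": "datahub",
    "password": "datahub",
    # Option 1: deny the MySQL system schemas explicitly.
    "schema_pattern": {
        "deny": ["^mysql$", "^information_schema$", "^performance_schema$", "^sys$"]
    },
    # Option 2 (simpler, as suggested above): allow only the schema you want.
    # "schema_pattern": {"allow": ["^world_x$"]},
}
```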
b
@gray-shoe-75895 Hi, these links cannot be accessed: https://firststr.com/2021/04/26/apache-hive-lineage-to-datahub/ and https://datahubproject.io/docs/lineage/sample_code. May I ask if the link addresses have changed?
g