# troubleshoot
h
Hi, I am running the following command with version 0.8.16 of the code: helm upgrade --install --namespace default --create-namespace datahub ./datahub -f ./datahub/values.y*ml -f ./datahub/values-dev.y*ml. The frontend UI ends up with only Datasets, Dashboards, and Charts - no Pipelines tab. Any direction on what I could be doing wrong?
e
Have you ingested any pipelines?
From 0.8.16 onwards we only show entity types that have entities ingested (except for the 3 shown above, which are enabled by default)
h
No Dexter, we have not ingested any pipelines yet
e
Yeah. We wanted to reduce clutter on the home page, so we removed all entity types that have not been ingested
It will appear the moment you ingest pipelines!
h
@early-lamp-41924 - No change even after Datasets were ingested.
e
Yeah, you need to ingest pipelines
h
How do we ingest pipelines? (I am confused)
e
Pipelines are things like Airflow pipelines that you ingest - ones that process datasets and generate new ones
h
We are using Airflow DAGs to ingest metadata
e
So just to clarify
that card used to return nothing, right?
We just removed the cards where, before, you would click into them and get nothing back
because no entities of that type were ingested
this reduces clutter
e
Using a DAG to ingest doesn’t ingest information about the DAG itself
h
We configured Airflow to emit lineage, following the steps outlined in the above page
e
So you're not using the lineage backend, right?
Yeah, in that case it just emits an edge between datasets, not the pipeline itself
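(For context: "the pipeline itself" is a DataFlow entity in DataHub. A minimal sketch of emitting one directly with a recent acryl-datahub Python client might look like the following; the GMS URL and DAG name are illustrative, and this is roughly what the Airflow lineage backend does under the hood.)
# Sketch only - assumes a recent acryl-datahub client and datahub-gms at localhost:8080.
from datahub.emitter.mce_builder import make_data_flow_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DataFlowInfoClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # illustrative URL

# A DataFlow is the "pipeline" entity; the lineage backend emits one per DAG.
flow_urn = make_data_flow_urn(orchestrator="airflow", flow_id="example_dag", cluster="prod")
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=flow_urn,
        aspect=DataFlowInfoClass(name="example_dag", description="Illustrative DAG"),
    )
)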
h
So with this configuration and generic_recipe_sample_dag.py - that would ingest the metadata. We are using the following configuration (a minimal example DAG for this setup is sketched right after this list) -
• Install plugin on Airflow
pip install acryl-datahub[airflow]
• Create Airflow hooks for DataHub
# For REST-based:
airflow connections add --conn-type 'datahub_rest' 'datahub_rest_default' --conn-host 'http://localhost:8080'
# For Kafka-based (standard Kafka sink config can be passed via extras):
airflow connections add --conn-type 'datahub_kafka' 'datahub_kafka_default' --conn-host 'broker:9092' --conn-extra '{}'
• Configuration in airflow.cfg
[lineage]
backend = datahub_provider.lineage.datahub.DatahubLineageBackend
datahub_kwargs = {
    "datahub_conn_id": "datahub_rest_default",
    "cluster": "prod",
    "capture_ownership_info": true,
    "capture_tags_info": true,
    "graceful_exceptions": true
}
# The above indentation is important!
• Plugins - need to be installed like
pip install 'acryl-datahub[kafka]'
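(For reference, a minimal sketch of a DAG that this lineage backend setup would pick up, assuming the acryl-datahub[airflow] plugin is installed on the workers; the dataset platform/names and the task are purely illustrative.)
# Illustrative DAG: with the lineage backend configured above, running this task
# should emit a DataFlow (the DAG), a DataJob (the task), and lineage edges
# between the declared inlets and outlets.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from datahub_provider.entities import Dataset  # ships with acryl-datahub[airflow]

with DAG(
    "lineage_backend_demo",                # illustrative DAG id
    start_date=datetime(2021, 11, 1),
    schedule_interval=timedelta(days=1),
    catchup=False,
) as dag:
    transform = BashOperator(
        task_id="transform_table",
        bash_command="echo 'pretend this transforms data'",
        # Hypothetical upstream/downstream datasets, just to show how edges are declared:
        inlets=[Dataset("snowflake", "mydb.schema.tableA")],
        outlets=[Dataset("snowflake", "mydb.schema.tableC")],
    )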
With the above configuration I see the following logs in the previous environment, but not in the new environment (the highlighted portion is missing) -
[2021-11-18 18:03:09,766] {pipeline.py:44} INFO - sink wrote workunit d.a
[2021-11-18 18:03:09,835] {pipeline.py:44} INFO - sink wrote workunit d.b
[2021-11-18 18:03:09,911] {pipeline.py:44} INFO - sink wrote workunit d.c
[2021-11-18 18:03:09,913] {python.py:151} INFO - Done. Returned value was: None
[2021-11-18 18:03:09,920] {_lineage_core.py:220} INFO - DataHub lineage backend - emitting metadata:
{"auditHeader": null, "proposedSnapshot": {"com.linkedin.pegasus2avro.metadata.snapshot.DataFlowSnapshot": {"urn": "urn:li:dataFlow:(airflow,datahub_glue_ingestion,prod)", "aspects": [{"com.linkedin.pegasus2avro.datajob.DataFlowInfo": {"customProperties": {"start_date": "1637020800.0", "tags": "['airflow_tagging']", "catchup": "False", "timezone": "'UTC'", "fileloc": "'/home/ec2-user/airflow/dags/datahub_glue_ingestion.py'", "_concurrency": "64", "is_paused_upon_creation": "None", "_access_control": "None", "_default_view": "'tree'"}, "externalUrl": "https://<airflowhostname>:443/tree?dag_id=datahub_glue_ingestion", "name": "datahub_glue_ingestion", "description": "An example DAG which ingests metadata from Glue to DataHub\n\n", "project": null}}, {"com.linkedin.pegasus2avro.common.Ownership": {"owners": [{"owner": "urn:li:corpuser:airflow", "type": "DEVELOPER", "source": {"type": "SERVICE", "url": "datahub_glue_ingestion.py"}}], "lastModified": {"time": 1637258587166, "actor": "urn:li:corpuser:airflow", "impersonator": null}}}, {"com.linkedin.pegasus2avro.common.GlobalTags": {"tags": [{"tag": "urn:li:tag:airflow_tagging"}]}}]}}, "proposedDelta": null, "systemMetadata": null}
{"auditHeader": null, "proposedSnapshot": {"com.linkedin.pegasus2avro.metadata.snapshot.DataJobSnapshot": {"urn": "urn:li:dataJob:(urn:li:dataFlow:(airflow,datahub_glue_ingestion,prod),ingest_from_glue)", "aspects": [{"com.linkedin.pegasus2avro.datajob.DataJobInfo": {"customProperties": {"task_id": "'ingest_from_glue'", "execution_timeout": "7200.0", "email": "['x']", "_downstream_task_ids": "[]", "_inlets": "[]", "label": "'ingest_from_glue'", "_outlets": "[]", "_task_type": "'PythonOperator'", "_task_module": "'airflow.operators.python'", "start_date": "datetime.datetime(2021, 11, 16, 0, 0, tzinfo=Timezone('UTC'))", "trigger_rule": "'all_success'", "depends_on_past": "False", "end_date": "None", "sla": "None", "wait_for_downstream": "False"}, "externalUrl": "https://<airflowhostname>:443/taskinstance/list/?flt1_dag_id_equals=datahub_glue_ingestion&_flt_3_task_id=ingest_from_glue", "name": "ingest_from_glue", "description": null, "type": {"string": "COMMAND"}, "flowUrn": null, "status": null}}, {"com.linkedin.pegasus2avro.datajob.DataJobInputOutput": {"inputDatasets": [], "outputDatasets": [], "inputDatajobs": []}}, {"com.linkedin.pegasus2avro.common.Ownership": {"owners": [{"owner": "urn:li:corpuser:airflow", "type": "DEVELOPER", "source": {"type": "SERVICE", "url": "datahub_glue_ingestion.py"}}], "lastModified": {"time": 1637258587166, "actor": "urn:li:corpuser:airflow", "impersonator": null}}}, {"com.linkedin.pegasus2avro.common.GlobalTags": {"tags": [{"tag": "urn:li:tag:airflow_tagging"}]}}]}}, "proposedDelta": null, "systemMetadata": null}
[2021-11-18 18:03:09,932] {base.py:78} INFO - Using connection to: id: datahub_rest_default. Host: https://<host>, Port: None, Schema: , Login: , Password: None, extra: {}
[2021-11-18 18:03:09,936] {base.py:78} INFO - Using connection to: id: datahub_rest_default. Host: https://<host>, Port: None, Schema: , Login: , Password: None, extra: {}
[2021-11-18 18:03:10,044] {taskinstance.py:1211} INFO - Marking task as SUCCESS. dag_id=datahub_glue_ingestion, task_id=ingest_from_glue, execution_date=20211118T180307, start_date=20211118T180309, end_date=20211118T180310
[2021-11-18 18:03:10,066] {taskinstance.py:1265} INFO - 0 downstream tasks scheduled from follow-on schedule check
[2021-11-18 18:03:10,106] {local_task_job.py:149} INFO - Task exited with return code 0
e
Hmm, so it is emitting pipelines
Are we sure this ingestion is working? Because if these flows are ingested, the pipelines should show up
Try something like
curl --location --request POST 'http://localhost:8080/entities?action=search' \
--header 'X-RestLi-Protocol-Version: 2.0.0' \
--header 'Content-Type: application/json' \
--data-raw '{
    "input": "*",
    "entity": "dataFlow",
    "start": 0,
    "count": 10
}'
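(A hedged sketch of the same check from Python - hitting the same search endpoint for both dataFlow and dataJob; the GMS host and the response shape assume a default local setup and may differ slightly between versions.)
# Count pipeline entities via the same /entities?action=search endpoint as the curl above.
import json
import requests

GMS = "http://localhost:8080"  # adjust to your datahub-gms address
HEADERS = {
    "X-RestLi-Protocol-Version": "2.0.0",
    "Content-Type": "application/json",
}

for entity in ("dataFlow", "dataJob"):
    body = {"input": "*", "entity": entity, "start": 0, "count": 10}
    resp = requests.post(f"{GMS}/entities?action=search", headers=HEADERS, data=json.dumps(body))
    resp.raise_for_status()
    # numEntities is the total hit count in the search response.
    print(entity, resp.json().get("value", {}).get("numEntities"))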
h
In the older env I am getting the entity results, but not in the newer environment. Let me reconfigure Airflow and try running the DAGs (another team did the configuration, so I suspect something went wrong there)
@early-lamp-41924 - Got the issue. The Airflow version was 2.0.1!
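(For context: if memory serves, the DataHub Airflow lineage backend requires Airflow 2.0.2+ or 1.10.15+, since Airflow 2.0.0/2.0.1 did not invoke pluggable lineage backends - which would explain why the new environment emitted nothing. A quick, illustrative check of the installed version:)
# Print the running Airflow version before digging further.
import airflow
print(airflow.__version__)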
e
Awesome!!
👍 1