# ingestion
  • eager-florist-67924

    03/21/2022, 10:26 AM
    Hi team, I am using DataHub v0.8.23 and trying to add a domain entity via the Java emitter.
    Copy code
    private static void createTracingDomain() throws IOException, ExecutionException, InterruptedException {
        MetadataChangeProposalWrapper mcpw = MetadataChangeProposalWrapper.builder()
                .entityType("domain")
                .entityUrn("urn:li:domain:tracing")
                .upsert()
                .aspect(new DomainProperties()
                        .setName("Tracing domain")
                        .setDescription("Domain for tracing")
                )
                .build();
        emit(mcpw);
    }
    but I get an error 500:
    Copy code
    Caused by: java.lang.IllegalArgumentException: Failed to find entity with name domain in EntityRegistry
    Checking the snapshot directory of the given release in the repo, I am not able to see any domain entity: https://github.com/datahub-project/datahub/tree/v0.8.23/metadata-models/src/main/pegasus/com/linkedin/metadata/snapshot So how can I create a domain entity? Could you please provide an example? Thanks.
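    For reference, a rough Python equivalent of the Java call above (it would hit the same EntityRegistry error unless the server version actually registers the domain entity); module and class names follow the acryl-datahub Python package:
    Copy code
    # Hedged sketch: emit a domainProperties aspect via the Python REST emitter.
    # Assumes a DataHub version whose entity registry includes "domain".
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, DomainPropertiesClass

    emitter = DatahubRestEmitter("http://localhost:8080")  # assumed GMS address

    mcp = MetadataChangeProposalWrapper(
        entityType="domain",
        entityUrn="urn:li:domain:tracing",
        changeType=ChangeTypeClass.UPSERT,
        aspectName="domainProperties",
        aspect=DomainPropertiesClass(
            name="Tracing domain",
            description="Domain for tracing",
        ),
    )
    emitter.emit_mcp(mcp)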
  • hundreds-pillow-5032

    03/21/2022, 2:50 PM
    Hello everyone, I was wondering if it is possible to discover databases + metadata with an mssql recipe, with just the SQL Server login info and an empty "database" field. Currently I am just seeing the system table metadata appear after ingestion.
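    For context, a minimal sketch of the recipe shape being described here, with the database field left unset and run through the programmatic pipeline API; hosts, credentials, and server address are placeholders:
    Copy code
    # Hedged sketch of an mssql recipe with no explicit "database".
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mssql",
                "config": {
                    "host_port": "mssql-host:1433",  # placeholder
                    "username": "datahub_reader",    # placeholder
                    "password": "...",               # placeholder
                    # "database" intentionally left unset, as in the question above
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},  # assumed GMS address
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()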
  • handsome-football-66174

    03/21/2022, 3:33 PM
    Hi everyone, getting this error when ingesting Charts/Dashboards from Superset:
    Copy code
    File "/root/.venvs/airflow/lib/python3.8/site-packages/datahub/ingestion/source/sql/sql_common.py", line 62, in get_platform_from_sqlalchemy_uri
        if sqlalchemy_uri.startswith("bigquery"):
    AttributeError: 'NoneType' object has no attribute 'startswith'
    [2022-03-17 21:03:07,554] {taskinstance.py:1525} INFO - Marking task as FAILED. dag_id=metadata_ingestion_dag, task_id=ingest_metadata, execution_date=20220317T205757, start_date=20220317T210304, end_date=20220317T210307
    [2022-03-17 21:03:07,623] {local_task_job.py:146} INFO - Task exited with return code 1
  • mysterious-australia-30101

    03/21/2022, 3:52 PM
    @here I am running ingestion and the execution count is 67. However, on the DataHub UI I can see only the last 10 entries. How can I see the details of older ingestion executions? Appreciate your help!
  • worried-branch-76677

    03/22/2022, 1:55 PM
    Hi all, is there anyone looking at creating policies via the datahub Python package? Is there currently anything to build dataHubPolicy, along the lines of make_dataset_urn? I am trying to create an MCP to ingest new policies via Python.
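    For illustration, a heavily hedged sketch of emitting a policy as an MCP with the Python package; DataHubPolicyInfoClass, DataHubActorFilterClass, and their fields are assumptions based on the generated schema classes and may differ between versions:
    Copy code
    # Hedged sketch: create a DataHub policy via MetadataChangeProposalWrapper.
    # The class, aspect, and field names below are assumptions; check
    # datahub.metadata.schema_classes in your installed version before relying on them.
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        DataHubActorFilterClass,
        DataHubPolicyInfoClass,
    )

    policy = DataHubPolicyInfoClass(
        displayName="Tag editors",                       # hypothetical policy
        description="Allow everyone to edit tags",
        type="METADATA",
        state="ACTIVE",
        privileges=["EDIT_ENTITY_TAGS"],
        actors=DataHubActorFilterClass(allUsers=True),
    )

    mcp = MetadataChangeProposalWrapper(
        entityType="dataHubPolicy",
        entityUrn="urn:li:dataHubPolicy:tag-editors",    # hypothetical id
        changeType=ChangeTypeClass.UPSERT,
        aspectName="dataHubPolicyInfo",
        aspect=policy,
    )
    DatahubRestEmitter("http://localhost:8080").emit_mcp(mcp)  # assumed GMS address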
  • astonishing-byte-5433

    03/22/2022, 5:26 PM
    Hey everyone, I'm struggling a bit with using Oracle ingestion together with the datahub quickstart setup. Do I need to install the cx_Oracle package and the Oracle Client on the host machine, or into one of the Docker containers? I tried to install and set up oracle-instantclient-basic-21.5.0.0.0-1.x86_64.rpm on the host following the documentation, but I am getting:
    Copy code
    Cannot locate a 64-bit Oracle Client library: "libclntsh.so: cannot open shared object file: No such file or directory"
    Thanks for your help
  • rich-policeman-92383

    03/23/2022, 10:44 AM
    ES ingestion fails with the error "Missing Fields in elasticsearch mappings". Is there a way in the CLI to ignore the errors and ingest the metadata? In our setup we have a few thousand indices, and adding them to the ignore list would require running the datahub CLI multiple times and updating the yml. This leads to a lot of rework.
  • brave-secretary-27487

    03/23/2022, 12:11 PM
    Hey, I was curious what the status of Looker stateful_ingestion is. The documentation says it only supports SQL-based ingestion. Can I assume that this will also be available for other sources such as Looker? https://datahubproject.io/docs/metadata-ingestion/source_docs/stateful_ingestion/
  • calm-sunset-28996

    03/23/2022, 1:18 PM
    Hey, we currently have an issue where our ingestion pipeline overwrites tags that were added manually in the UI. Is it correct that there is still no editable-tags property? (source: thread) And if so, is this on the roadmap?
  • mysterious-lamp-91034

    03/23/2022, 7:02 PM
    Hi, how do I delete all data in production, including all entities such as datasets and tags? Thanks
  • adorable-flower-19656

    03/24/2022, 1:18 AM
    Hi, I have the following datasets. I'd like to show only 'Project A' to a specific group of users in DataHub. What resource should I use for this, and how can I set it up?
  • mysterious-nail-70388

    03/24/2022, 3:05 AM
    Hello everyone, what should I do if I plan to use a local MySQL and Elasticsearch (not the MySQL and Elasticsearch in Docker), but run the GMS service and UI in Docker? Do I need to create the tables and indexes manually?
  • stocky-noon-61140

    03/24/2022, 9:50 AM
    Hi, my question to Snowflake + Datahub experts: Is it possible to ingest the native Snowflake Object Tags into DataHub (without having to create metadata change events on your own)?
  • important-machine-62199

    03/24/2022, 12:00 PM
    Hello! I have a dataset that resembles a time-series type of aspect. I would like to ingest it into DataHub and later run search/queries over REST. Could you point me to a JSON example/template, so that I can extract and transform my original dataset to be ingested over REST? Just wondering whether my question makes sense, and whether ingesting time-series data, and browsing and searching it, is available in DataHub at present.
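    As a reference point, a hedged sketch of emitting one of DataHub's time-series aspects (datasetProfile) with the Python emitter instead of hand-written REST JSON; the dataset urn and numbers are made up:
    Copy code
    # Hedged sketch: datasetProfile is a time-series aspect keyed by timestampMillis.
    # Names and values here are illustrative only.
    import time

    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, DatasetProfileClass

    profile = DatasetProfileClass(
        timestampMillis=int(time.time() * 1000),
        rowCount=1000,
        columnCount=12,
    )

    mcp = MetadataChangeProposalWrapper(
        entityType="dataset",
        entityUrn=make_dataset_urn(platform="hive", name="my_db.my_table", env="PROD"),
        changeType=ChangeTypeClass.UPSERT,
        aspectName="datasetProfile",
        aspect=profile,
    )
    DatahubRestEmitter("http://localhost:8080").emit_mcp(mcp)  # assumed GMS address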
  • mysterious-portugal-30527

    03/24/2022, 5:32 PM
    Greetings all! Trying to run a Snowflake ingestion which succeeded the last time I executed it, but now I get an error:
    Copy code
    'File "/tmp/datahub/ingest/venv-61422838-a72c-4c58-991f-380cbc29cafc/lib/python3.9/site-packages/datahub/ingestion/api/registry.py", line '
               '132, in get\n'
               '    115  def get(self, key: str) -> Type[T]:\n'
               ' (...)\n'
               '    128          raise ConfigurationError(\n'
               '    129              f"{key} is disabled; try running: pip install \'{__package_name__}[{key}]\'"\n'
               '    130          ) from tp\n'
               '    131      elif isinstance(tp, Exception):\n'
               '--> 132          raise ConfigurationError(\n'
               '    133              f"{key} is disabled due to an error in initialization"\n'
    Thoughts??
  • modern-artist-55754

    03/25/2022, 5:03 AM
    Hey team, just a quick question. We are ingesting metadata from Snowflake, Tableau, and dbt. When we extract the metadata, dataset names come in different casing conventions, some UPPER, some lower, and some mixed case (because of user input). This affects the lineage graph; is there any way to convert them to a consistent casing?
  • kind-teacher-18789

    03/25/2022, 9:20 AM
    Hey guys, is DataHub's lineage implemented by listening to Airflow logs?
  • few-grass-66826

    03/25/2022, 10:26 AM
    Hi guys, can you please send me a working config example for Snowflake ingestion? I did everything but am still getting an error: failed to execute datahub ingest.
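    For what it's worth, a hedged sketch of the general shape of a Snowflake recipe run through the programmatic pipeline API; the account, warehouse, role, and server address are placeholders, and option names may differ slightly between versions:
    Copy code
    # Hedged sketch of a Snowflake ingestion recipe via the programmatic pipeline API.
    import os

    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "snowflake",
                "config": {
                    "host_port": "myaccount.us-east-1.snowflakecomputing.com",  # placeholder
                    "warehouse": "MY_WH",                    # placeholder
                    "username": os.environ["SNOWFLAKE_USER"],
                    "password": os.environ["SNOWFLAKE_PASS"],
                    "role": "datahub_role",                  # placeholder
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},  # assumed GMS address
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()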
  • colossal-alligator-29986

    03/25/2022, 3:53 PM
    Hello everyone!! First let me say this tool looks amazing, so kudos and props 🥳 I'm trying to use the bigquery plugin to load lineage data, and I'm unsuccessful in that it's not showing up in the UI. In particular, I ran the ingestion process and I'm seeing this:
    Copy code
    [2022-03-25 15:09:59,922] INFO     {datahub.cli.ingest_cli:91} - Starting metadata ingestion
    [2022-03-25 15:09:59,922] INFO     {datahub.ingestion.source.sql.bigquery:276} - Populating lineage info via GCP audit logs
    [2022-03-25 15:09:59,928] INFO     {datahub.ingestion.source.sql.bigquery:369} - Start loading log entries from BigQuery start_time=2022-03-23T23:45:00Z and end_time=2022-03-26T00:15:00Z
    [2022-03-25 15:19:32,800] INFO     {datahub.ingestion.source.sql.bigquery:380} - Finished loading 12047 log entries from BigQuery so far
    [2022-03-25 15:19:32,800] INFO     {datahub.ingestion.source.sql.bigquery:462} - Parsing BigQuery log entries: number of log entries successfully parsed=12047
    [2022-03-25 15:19:32,800] INFO     {datahub.ingestion.source.sql.bigquery:513} - Creating lineage map: total number of entries=12047, number skipped=1.
    [2022-03-25 15:19:32,800] INFO     {datahub.ingestion.source.sql.bigquery:270} - Built lineage map containing 12015 entries.
  • polite-application-51650

    03/28/2022, 7:05 AM
    Hi guys, I've tried to set up Spark lineage for my local DataHub instance, but it does not seem to be generating the pipeline. Can anyone help me with it?
    Copy code
    spark = SparkSession.builder()
            .appName("test-application")
            .config("spark.master", "<spark://spark-master:7077>")
            .config("spark.jars.packages","io.acryl:datahub-spark-lineage:0.8.23")
            .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
            .config("spark.datahub.rest.server", "<http://localhost:8080>")
            .enableHiveSupport()
            .getOrCreate();
    This is what I set up in my Spark config file.
  • hallowed-analyst-96384

    03/28/2022, 10:07 AM
    Hi everyone, just a quick question: how do you use DatahubEmitterOperator in an Airflow deployment running on Kubernetes?
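    In case it helps frame the question, a hedged sketch of how DatahubEmitterOperator is typically wired into a DAG per the acryl-datahub[airflow] docs; on Kubernetes the operator presumably works the same way, provided the package is installed in the Airflow image and a datahub_rest connection is defined. The URNs below are placeholders:
    Copy code
    # Hedged sketch: DatahubEmitterOperator emitting a lineage MCE from a DAG task.
    # Assumes acryl-datahub[airflow] is installed in the Airflow image and an Airflow
    # connection named "datahub_rest_default" points at GMS.
    from datetime import datetime

    from airflow import DAG

    import datahub.emitter.mce_builder as builder
    from datahub_provider.operators.datahub import DatahubEmitterOperator

    with DAG(
        dag_id="datahub_emit_example",
        start_date=datetime(2022, 3, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        emit_lineage = DatahubEmitterOperator(
            task_id="emit_lineage",
            datahub_conn_id="datahub_rest_default",
            mces=[
                builder.make_lineage_mce(
                    upstream_urns=[builder.make_dataset_urn("hive", "db.upstream")],   # placeholder
                    downstream_urn=builder.make_dataset_urn("hive", "db.downstream"),  # placeholder
                )
            ],
        )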
  • stocky-midnight-78204

    03/28/2022, 10:27 AM
    I am trying to import prestosql (340) metadata and faced the error message TrinoUserError(type=USER_ERROR, name=MISSING_CATALOG_NAME, message="line 26 Catalog must be specified when session catalog is not set", ...). Below is my config:
    Copy code
    source:
        type: sqlalchemy
        config:
            platform: trino
            connect_uri: trino://xxx:xxx@xxxxx:8443/hive/xxx
            options:
                connect_args:
                    catalog: hive
                    schema: xxxx
                    verify: False
            domain:
                "urn:li:domain:xxxx":
                    allow:
                        - ".*"
    sink:
        type: datahub-rest
        config:
            server: 'http://xxxxx:8080'
  • shy-fireman-88724

    03/28/2022, 3:09 PM
    Hello everyone, I have a question. If the description of a field (of a dataset, for example) is edited through the UI, that field's description is not overwritten with the value present in the source when you perform another ingestion. Is there a way to make DataHub behave the opposite way: when I perform an ingestion, the values in the source override the fields that were edited in the UI? Thank you.
  • bitter-toddler-42943

    03/29/2022, 2:15 AM
    ERROR: Could not find a version that satisfies the requirement acryl-datahub[datahub-rest,mssql]==0.8.26.6 (from versions: none)
    ERROR: No matching distribution found for acryl-datahub[datahub-rest,mssql]==0.8.26.6
  • cold-hydrogen-10513

    03/29/2022, 11:52 AM
    Hi, I'm trying to configure metadata ingestion from Snowflake. I installed datahub 0.8.31 and created a recipe:
    Copy code
    source:
        type: snowflake
        config:
            host_port: nonprodcompanyname.us-east-1.snowflakecomputing.com
            warehouse: COMPANY_NAME_NON_PROD_VWH
            username: '${snowflake-user}'
            password: '${snowflake-pass}'
    sink:
        type: datahub-rest
        config:
            server: 'http://datahub-gms.datahub.svc.cluster.local:8080/api/gms'
    and I added it to the ingestion UI. When I execute it I get the following
    Copy code
    ConfigurationError: Unable to connect to http://datahub-gms.datahub.svc.cluster.local:8080/api/gms/config with status_code: 404. Please check your configuration and make sure you are talking to the DataHub GMS (usually <datahub-gms-host>:8080) or Frontend GMS API (usually <frontend>:9002/api/gms).
    Could you please tell me what I can check here?
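    Following the hint in the error message (either GMS directly on port 8080, or the frontend's /api/gms on port 9002), a small hedged sketch for probing which base URL actually answers; the candidate service names are assumptions about this particular deployment:
    Copy code
    # Hedged sketch: probe candidate GMS endpoints before putting one in the recipe sink.
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    candidates = [
        "http://datahub-gms.datahub.svc.cluster.local:8080",                # GMS directly
        "http://datahub-frontend.datahub.svc.cluster.local:9002/api/gms",   # via frontend (assumed service name)
    ]

    for url in candidates:
        try:
            DatahubRestEmitter(url).test_connection()
            print(f"OK: {url}")
        except Exception as e:
            print(f"Failed: {url}: {e}")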
  • mysterious-australia-30101

    03/29/2022, 12:50 PM
    Hi team, I am in the process of doing some tests around semi-structured data such as CSV or Parquet files. Can someone please help me with how to perform ingestion for semi-structured data, and point me to some documentation around it?
  • fresh-memory-10355

    03/29/2022, 4:15 PM
    👋 Hello, team! I am new to this world. I need to ingest datasets onto the DataHub AWS hosting; please guide me on how to achieve this, and please recommend some best practices as well.
  • purple-ghost-64569

    03/29/2022, 9:03 PM
    Hi folks, I noticed that next to no metadata seems to be extracted by the Superset extractor apart from Dashboards, Charts, and associated Datasets. The Documentation tab is empty for Dashboards and Charts; Properties are empty for Dashboards. And the owner of these objects is also not set. Furthermore, it seems that the Description of Charts is not transferred anywhere.
    • Is this the current state of the implementation, or am I doing anything wrong here?
    • How can I check what is being extracted? Can I see the record that has been transferred to DataHub somewhere?
    And a more specific question: it seems that Superset's Virtual Datasets are not handled nicely. They get represented as (Snowflake) tables, but as they do not exist as physical tables/views, we cannot access the underlying SQL code. Where should I file this bug? At datahub-project's or at acryldata's DataHub GitHub repo? Hoping that these many questions are not too overwhelming 🙂
  • better-orange-49102

    03/30/2022, 7:25 AM
    If a user submits an MCP/MCE via the REST endpoint today with metadata authentication enabled, he will need to submit a token that he gets from the UI to verify that he is a user. Is it recorded anywhere (in MySQL) that the MCP was submitted by that user and not by someone else, though? Or does he need to add something to the MCP to ensure that the identity of the user actually shows up? I am interested in making sure that no one can sneak in datasets without being accounted for. If not, I will need to start looking at whitelisting IP addresses for the endpoint.
  • astonishing-plumber-56128

    03/30/2022, 7:54 AM
    Hi guys, how can I remove a glossary term from DataHub?