# ingestion
  • bland-orange-13353 (04/18/2023, 6:48 AM)
    This message was deleted.
  • breezy-kangaroo-27287 (04/18/2023, 9:06 AM)
    Hi there, can I ingest XML metadata files for geodata that use the ISO 19115/19139 metadata standards?
  • agreeable-table-54007 (04/18/2023, 2:02 PM)
    Hello, can we ingest data with CSV files, or is CSV only for glossary terms, etc.?
  • early-hydrogen-27542 (04/18/2023, 3:22 PM)
    👋 all - to clarify, is column-level lineage only available for Snowflake, Databricks, and Looker?
  • miniature-policeman-55414 (04/18/2023, 3:35 PM)
    Hey 👋 team, I am trying to automatically ingest multiple glossary terms for a dbt model with the meta_mapping feature. The following is the meta configuration of the dbt model:
    meta:
          owner: "@sree"
          terms_list: core_transport_gross_profit;core_adjusted_transport_gross_profit
    The following is the meta_mapping section of the dbt metadata ingestion recipe:
    "meta_mapping": {
                            "term": {
                                "match": ".*",
                                "operation": "add_term",
                                "config": 
                                    {
                                        "term": "{{ $match }}"
                                    }
                            },
                            "terms_list": {
                                "match": ".*",
                                "operation": "add_terms",
                                "config": 
                                    {
                                        "seperator": ";"
                                    }
                            }  
                        },
    The glossary terms are defined in the glossary YAML and are being ingested successfully (core_transport_gross_profit and core_adjusted_transport_gross_profit). However, after metadata ingestion I don't see either glossary term added to my dbt model. Is this the right way to add multiple terms? Please correct me or suggest the right way.
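    A minimal Python sketch of the meta_mapping setup described above, run as a programmatic pipeline. The file paths, target platform, and server URL are placeholders, and the documented config key for add_terms is spelled "separator":

        # Sketch only: a programmatic dbt ingestion pipeline with meta_mapping.
        # Paths, platform, and server URL are placeholders to adapt.
        from datahub.ingestion.run.pipeline import Pipeline

        pipeline = Pipeline.create(
            {
                "source": {
                    "type": "dbt",
                    "config": {
                        "manifest_path": "./target/manifest.json",  # placeholder
                        "catalog_path": "./target/catalog.json",    # placeholder
                        "target_platform": "snowflake",             # placeholder
                        "meta_mapping": {
                            "terms_list": {
                                "match": ".*",
                                "operation": "add_terms",
                                # documented spelling of this key is "separator"
                                "config": {"separator": ";"},
                            },
                        },
                    },
                },
                "sink": {
                    "type": "datahub-rest",
                    "config": {"server": "http://localhost:8080"},  # placeholder
                },
            }
        )
        pipeline.run()
        pipeline.raise_from_status()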
  • gray-airplane-39227 (04/18/2023, 6:28 PM)
    Hello team, I have a question on the metadata model `IngestionSource`: currently only the property `name` is indexed and searchable. Is there any reason the property `type` is not searchable by default? Any concerns if I make a contribution to annotate `type` as a searchable field for `IngestionSource`?
    record DataHubIngestionSourceInfo {
      /**
       * The display name of the ingestion source
       */
      @Searchable = {
       "fieldType": "TEXT_PARTIAL"
      }
      name: string
    
      /**
       * The type of the source itself, e.g. mysql, bigquery, bigquery-usage. Should match the recipe.
       */
      type: string
  • early-hydrogen-27542 (04/18/2023, 6:33 PM)
    👋 all - We're thinking about applying stateful ingestion to several sources (specifically dbt and Redshift) that already exist. Does applying stateful ingestion delete the metadata of tables deleted prior to the application of stateful ingestion? This thread seems to indicate it should, if the versions are recent enough, but I've seen other threads that imply the opposite. We are on version 0.10.1.
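    For reference, a minimal sketch of the stateful-ingestion fragment that such a recipe would carry, with stale-entity removal enabled; whether entities deleted before stateful ingestion was first enabled are cleaned up retroactively is exactly the open question above, so treat this as the standard setup rather than an answer:

        # Sketch: stateful-ingestion fragment inside a source config dict.
        # A stable pipeline_name on the recipe is required so state can be
        # matched across runs; remove_stale_metadata soft-deletes entities
        # that stop appearing in later runs.
        source_config = {
            "type": "dbt",  # or "redshift"; placeholder
            "config": {
                # ... connection settings ...
                "stateful_ingestion": {
                    "enabled": True,
                    "remove_stale_metadata": True,
                },
            },
        }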
  • lively-dusk-19162 (04/18/2023, 8:42 PM)
    Hi team, I am facing the certificate error below when building DataHub. Can anyone suggest how to resolve it?
    Could not resolve all files for configuration ':datahub-frontend:compileClasspath'.
    > Could not download javax.ws.rs-api-2.0.1.jar (javax.ws.rs:javax.ws.rs-api:2.0.1)
       > Could not get resource 'https://plugins.gradle.org/m2/javax/ws/rs/javax.ws.rs-api/2.0.1/javax.ws.rs-api-2.0.1.jar'.
          > Could not GET 'https://plugins.gradle.org/m2/javax/ws/rs/javax.ws.rs-api/2.0.1/javax.ws.rs-api-2.0.1.jar'.
             > The server may not support the client's requested TLS protocol versions: (TLSv1.2, TLSv1.3). You may need to configure the client to allow other protocols to be used. See: https://docs.gradle.org/6.9.2/userguide/build_environment.html#gradle_system_properties
             > PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    > Could not download osgi-resource-locator-1.0.1.jar (org.glassfish.hk2:osgi-resource-locator:1.0.1): same TLS/PKIX error for 'https://plugins.gradle.org/m2/org/glassfish/hk2/osgi-resource-locator/1.0.1/osgi-resource-locator-1.0.1.jar'
    > Could not download paranamer-2.8.jar (com.thoughtworks.paranamer:paranamer:2.8): same TLS/PKIX error for 'https://plugins.gradle.org/m2/com/thoughtworks/paranamer/paranamer/2.8/paranamer-2.8.jar'
  • adamant-sugar-28445 (04/19/2023, 2:08 AM)
    Not showing a table as a table in lineage: Hi team. When I read an HDFS-stored table using `spark.sql("select * from db.tableX")`, and tableX is already in DataHub on the Hive platform, DataHub lineage shows this input as an HDFS path rather than a Hive table. How can I make it present the input as a table?
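    For context, a minimal PySpark sketch of how the DataHub Spark lineage agent is normally attached, assuming the documented listener class and REST settings; the package version, server URL, and table names are placeholders, and this alone does not change how HDFS-backed Hive tables are resolved:

        # Sketch: attach the DataHub Spark lineage listener to a PySpark job.
        from pyspark.sql import SparkSession

        spark = (
            SparkSession.builder
            .appName("lineage-demo")
            .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:<version>")  # placeholder version
            .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
            .config("spark.datahub.rest.server", "http://localhost:8080")  # placeholder GMS URL
            .getOrCreate()
        )

        df = spark.sql("select * from db.tableX")  # tableX as in the message above
        df.write.mode("overwrite").saveAsTable("db.tableY")  # placeholder output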
  • late-arm-1146 (04/19/2023, 3:54 AM)
    Hi everyone, I am using `csv-enricher` with v0.8.45. I noticed that the resource description I provide in the CSV overwrites the existing description even if I don't set `write_semantics` to OVERRIDE. Is this expected behavior for descriptions?
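    A minimal sketch of a csv-enricher run with write_semantics pinned explicitly to PATCH, as a programmatic pipeline; the CSV path and server URL are placeholders, and whether v0.8.45 actually patches rather than overrides descriptions is the question above:

        # Sketch: csv-enricher source with explicit PATCH write semantics.
        from datahub.ingestion.run.pipeline import Pipeline

        pipeline = Pipeline.create(
            {
                "source": {
                    "type": "csv-enricher",
                    "config": {
                        "filename": "./enrichment.csv",  # placeholder path
                        "write_semantics": "PATCH",      # the alternative is OVERRIDE
                    },
                },
                "sink": {
                    "type": "datahub-rest",
                    "config": {"server": "http://localhost:8080"},  # placeholder
                },
            }
        )
        pipeline.run()
        pipeline.raise_from_status()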
  • breezy-kangaroo-27287 (04/19/2023, 8:50 AM)
    Hi there, I'm trying to ingest a single XML file into DataHub for testing, and converted it into a JSON file for that. My problem at the moment is that I don't know where to put the source file. The recipe expects a path, but a path from my Ubuntu operating system is not working. I guess the path refers to one of the Docker containers DataHub is running in, but which one? Or do I have to mount something? Sorry if this is a stupid noob question, but I'm not that deep into Docker yet. I'm using the Docker-based quickstart at the moment.
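    One way around the container-path issue above is to run the ingestion from the local Python environment, where the path is resolved on the host machine; a minimal sketch assuming the generic file source (the config key shown here, path, may be filename on older releases) and a placeholder GMS URL:

        # Sketch: run a file-based ingestion locally, so the JSON is read from
        # the host filesystem rather than from inside a quickstart container.
        from datahub.ingestion.run.pipeline import Pipeline

        pipeline = Pipeline.create(
            {
                "source": {
                    "type": "file",
                    "config": {"path": "./my_metadata.json"},  # local path; may be "filename" on older versions
                },
                "sink": {
                    "type": "datahub-rest",
                    "config": {"server": "http://localhost:8080"},  # quickstart GMS; placeholder
                },
            }
        )
        pipeline.run()
        pipeline.raise_from_status()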
  • rapid-airport-61849 (04/19/2023, 9:30 AM)
    datahub.ingestion.run.pipeline.PipelineInitError: Failed to configure the source (mssql): No module named 'pyodbc'
    Hello!!! Has anyone seen this error before? I am using the Docker quickstart.
  • late-furniture-56629 (04/19/2023, 9:50 AM)
    Hi 🙂 I would like to ask: is anyone able to connect for ingestion via the datahub CLI from their local computer? What I did: 1. datahub init 2. filled in my DataHub host + token 3. ran the command:
    datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:mssql,UnifiedJobs.dbo.AccountCompanyMapping,PROD)"
    And I got error:
    raise JSONDecodeError("Expecting value", s, err.value) from None
    json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
    Has somebody had this problem, and how did you fix it? In the end I would like to be able to dump all ingested metadata to a file as a backup 🙂
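    A JSONDecodeError like the one above often means the configured host is returning HTML (for example the UI) instead of JSON; a minimal sketch, using plain requests against the GMS /config endpoint, to check the host and token before retrying the CLI. The host and token values are placeholders:

        # Sketch: sanity-check that the configured DataHub host answers with JSON.
        import requests

        DATAHUB_GMS = "http://localhost:8080"  # or https://<frontend>/api/gms; placeholder
        TOKEN = "<personal-access-token>"      # placeholder

        resp = requests.get(
            f"{DATAHUB_GMS}/config",
            headers={"Authorization": f"Bearer {TOKEN}"},
            timeout=10,
        )
        print(resp.status_code, resp.headers.get("Content-Type"))
        print(resp.json())  # fails with the same JSONDecodeError if the server returns HTML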
  • agreeable-table-54007 (04/19/2023, 10:22 AM)
    Hi all. How do you ingest some JSON data? Are there any examples? If so, do you have a link? And what would the YAML recipe be?
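    A minimal sketch of one way to push metadata described in a JSON file into DataHub using the Python REST emitter instead of a YAML recipe; the JSON layout, file name, and server URL are illustrative assumptions:

        # Sketch: read a simple JSON file of dataset descriptions and emit them.
        import json

        from datahub.emitter.mce_builder import make_dataset_urn
        from datahub.emitter.mcp import MetadataChangeProposalWrapper
        from datahub.emitter.rest_emitter import DatahubRestEmitter
        from datahub.metadata.schema_classes import DatasetPropertiesClass

        emitter = DatahubRestEmitter("http://localhost:8080")  # placeholder GMS URL

        # placeholder file; assumes records like
        # {"platform": "hive", "name": "db.table", "description": "..."}
        with open("datasets.json") as f:
            records = json.load(f)

        for rec in records:
            mcp = MetadataChangeProposalWrapper(
                entityUrn=make_dataset_urn(platform=rec["platform"], name=rec["name"], env="PROD"),
                aspect=DatasetPropertiesClass(description=rec.get("description")),
            )
            emitter.emit(mcp)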
  • brainy-oxygen-20792 (04/19/2023, 10:25 AM)
    Hello DataHub community ☀️ I'm ingesting Looker and LookML into a local Docker instance of DataHub, but I'm finding that I'm missing some Explores. The DataHub service user I've set up has all of the permissions stated in the docs, and if I sudo as the service user on Looker I can access the Explore. I also don't see any errors in the ingestion process to indicate that it tried and failed. If I'm interpreting the code correctly, I think that Explores are obtained indirectly by crawling the dashboards, therefore Explores that aren't being used for dashboards (or, not by any that DH has access to) would be excluded. Is this correct? And is there anything I can do to include these Explores?
  • adamant-sugar-28445 (04/19/2023, 10:34 AM)
    Using the DataHub GraphQL API to ingest data lineage from the Spark agent: as far as I know, the DataHub GraphQL API allows us to add upstream and downstream edges. That's why I want to try using part of the code in the DataHub Spark agent to get the inputs and outputs, build (hard-code, in my case) the URNs, and pass them as parameters to the GraphQL API. I think this is possible, but I'm not clear about its soundness in terms of architecture and maintenance. cc @careful-pilot-86309, @loud-island-88694
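    A minimal sketch of the GraphQL call described above, posting an updateLineage mutation with hard-coded URNs via plain requests; the endpoint, token, and URNs are placeholders and error handling is omitted:

        # Sketch: add a lineage edge between two datasets via the DataHub GraphQL API.
        import requests

        GRAPHQL_URL = "http://localhost:9002/api/graphql"  # frontend GraphQL endpoint; placeholder
        TOKEN = "<personal-access-token>"                  # placeholder

        mutation = """
        mutation updateLineage($input: UpdateLineageInput!) {
          updateLineage(input: $input)
        }
        """

        variables = {
            "input": {
                "edgesToAdd": [
                    {
                        "upstreamUrn": "urn:li:dataset:(urn:li:dataPlatform:hive,db.upstream_table,PROD)",      # placeholder
                        "downstreamUrn": "urn:li:dataset:(urn:li:dataPlatform:hive,db.downstream_table,PROD)",  # placeholder
                    }
                ],
                "edgesToRemove": [],
            }
        }

        resp = requests.post(
            GRAPHQL_URL,
            json={"query": mutation, "variables": variables},
            headers={"Authorization": f"Bearer {TOKEN}"},
            timeout=30,
        )
        print(resp.json())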
  • agreeable-table-54007 (04/19/2023, 1:16 PM)
    Hello community. Is there anyone here who has already ingested pipelines from Azure Data Factory into DataHub? Thanks.
  • gifted-diamond-19544 (04/19/2023, 1:56 PM)
    Hello all! We have upgraded to DataHub version `v0.10.2`, with the actions container on version `v0.0.12`. When we changed the CLI version to match our server version, our Tableau ingestion set up via the UI started failing with the following error:
    Failed to find a registered source for type tableau: 'str' object is not callable
    Anyone got the same problem? cc @ancient-ocean-36062
  • strong-parrot-78481 (04/19/2023, 8:57 PM)
    Hi everyone, is it possible to pass an encrypted password into a custom pipeline config and plug in decryption logic somewhere? For example, I have a JSON config passed to the pipeline where the password is an encrypted value. (I don't use the DataHub database, so I don't store any sources or credentials in DataHub.)
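    Because a programmatic pipeline takes a plain Python dict, one approach to the question above is to decrypt the value in code just before building the config; a minimal sketch where decrypt_password is a hypothetical stand-in for whatever decryption mechanism is available:

        # Sketch: decrypt a password outside DataHub and inject it into the pipeline config.
        import json

        from datahub.ingestion.run.pipeline import Pipeline


        def decrypt_password(ciphertext: str) -> str:
            # Hypothetical placeholder: call a KMS, Vault, Fernet key, etc. here.
            raise NotImplementedError


        with open("pipeline_config.json") as f:  # placeholder config file
            config = json.load(f)

        # Swap the encrypted value for the decrypted one before the pipeline sees it.
        config["source"]["config"]["password"] = decrypt_password(
            config["source"]["config"]["password"]
        )

        Pipeline.create(config).run()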
  • microscopic-machine-90437 (04/20/2023, 9:37 AM)
    Hello everyone, I'm trying to ingest Snowflake metadata (my DataHub setup is in a Kubernetes cluster) and while doing so I'm getting the following storage error:
    ERROR: The ingestion process was killed, likely because it ran out of memory. You can resolve this issue by allocating more memory to the datahub-actions container.
    When I go through the values.yml file, I can see that the datahub-actions container has 512Mi of memory. My questions are: when we ingest metadata, in which container is it stored? If the data we are ingesting from Snowflake is in GBs, how much do we have to scale the memory of the actions container? And is there a way to find out the size of the data/metadata we are trying to ingest (from Snowflake or any other source)? Can someone help me with this?
  • quiet-television-68466 (04/20/2023, 11:05 AM)
    Apologies for another Airflow request from me; I have gotten pretty stuck on this one and I think it would be useful for others down the line! We’ve been able to get our lineage emitting properly when the inlets/outlets are set manually as below, and this shows up in DataHub.
    BashOperator(
        task_id="run_data_task",
        dag=dag,
        bash_command="echo 'hello world'",
        owner='john.claro@checkout.com',
        inlets=[
            Dataset("snowflake", "data_platform.cfirth_test.CFIRTH_TEST_UPLOAD"),
            # You can also put dataset URNs in the inlets/outlets lists.
        ],
        outlets=[Urn("urn:li:dataset:(urn:li:dataPlatform:snowflake,landing.data_platform.cfirth_tbl,PROD)")],
    )
    Currently, we are trying to set up our custom DbtOperator to automatically parse the lineage of Dbt jobs using the manifest file, and then set them as inlets/outlets respectively.
    if self.operation == 'run':
        inlets, outlets = self._get_lineage() # this works as expected and returns lists of Dataset('snowflake', '<snowflake_table_name>')
        self.add_inlets(inlets)
        self.log.info(f"Added inlets: {self.get_inlet_defs()}")
        self.add_oulets(outlets)
        self.log.info(f"Added outlets: {self.get_outlet_defs()}")
    Airflow picks up the inlets and outlets correctly (as seen in the logs here)
    Added inlets: [Dataset(platform='snowflake', name='<db>.<schema>.<table>', env='PROD'), ...]
    But when they are emitted to DataHub (logs in 🧵), it looks like nothing is happening lineage-wise. Anyone have any ideas?
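    For reference, a minimal sketch of the definition-time pattern that the first snippet above reports as working, using the datahub_provider entities; the DAG wiring, dataset name, and URN are placeholders, and it does not address the custom-operator case directly:

        # Sketch: declare DataHub lineage on an Airflow task at definition time,
        # so the lineage plugin/backend can pick up inlets and outlets after the run.
        from datetime import datetime

        from airflow import DAG
        from airflow.operators.bash import BashOperator
        from datahub_provider.entities import Dataset, Urn

        with DAG(dag_id="lineage_demo", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
            run_data_task = BashOperator(
                task_id="run_data_task",
                bash_command="echo 'hello world'",
                inlets=[
                    Dataset("snowflake", "data_platform.some_schema.SOME_TABLE"),  # placeholder
                ],
                outlets=[
                    Urn("urn:li:dataset:(urn:li:dataPlatform:snowflake,landing.data_platform.some_tbl,PROD)"),  # placeholder
                ],
            )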
  • numerous-byte-87938 (04/20/2023, 5:57 PM)
    I learned today that v0.9.6 removed the dependency of the MCE consumer on GMS (through this PR), so the MCE consumer now talks to MySQL and ES directly instead of going through GMS. While trying to understand the reasoning behind it, this comment makes me feel that we are taking ingestion pressure away from GMS so it can serve GraphQL traffic better. This leads to three questions from me:
    1. Are we heading in a direction where GMS is only on the read path?
    2. Does GMS still make any writes directly to MySQL in versions after v0.9.6?
    3. Is there any direct performance gain from stopping MCE making Rest.li calls to GMS in the MCP ingestion workflow?
    Thanks!
  • adorable-megabyte-63781 (04/21/2023, 8:48 AM)
    Any help on how to ingest Tableau Server into DataHub? I have been trying to do it for a week, but it keeps failing with the error below:
    RuntimeError: Query workbooksConnection error: [{'message': "Validation error of type FieldUndefined: Field 'projectLuid' in type 'Workbook' is undefined @ 'workbooksConnection/nodes/projectLuid'", 'locations': [{'line': 9, 'column': 7, 'sourceName': None}], 'description': "Field 'projectLuid' in type 'Workbook' is undefined", 'validationErrorType': 'FieldUndefined', 'queryPath': ['workbooksConnection', 'nodes', 'projectLuid'], 'errorType': 'ValidationError', 'path': None, 'extensions': None}]
    Here is my recipe for reference:
    source:
      type: tableau
      config:
        connect_uri: 'tableau_url'
        ssl_verify: false
        stateful_ingestion:
          enabled: false
        site: site_name
        project_pattern:
          allow:
            - default
          ignoreCase: true
        username: '${my_username}'
        password: '${my_password}'
  • microscopic-machine-90437 (04/21/2023, 9:15 AM)
    Hello everyone, I have a Tableau ingestion that is scheduled four times a day, and all of the runs executed successfully yesterday. However, the last status shown on the UI is 'failed', and the last execution time is also wrong. Can someone help me with this?
  • hundreds-airline-29192 (04/21/2023, 9:55 AM)
    Hello everyone, I just deployed DataHub and set up a connection to BigQuery. The connection is OK, but when I ingest metadata from BigQuery into DataHub, I get the error below. Can anybody help me?
  • hundreds-airline-29192 (04/21/2023, 10:10 AM)
    please help me
  • hundreds-airline-29192 (04/21/2023, 10:10 AM)
    please