# ingestion
  • bland-orange-13353 (04/18/2023, 6:48 AM)
    This message was deleted.
  • breezy-kangaroo-27287 (04/18/2023, 9:06 AM)
    Hi there, can I ingest XML metadata files for geodata that use the ISO 19115/19139 metadata standards?
  • agreeable-table-54007 (04/18/2023, 2:02 PM)
    Hello, can we ingest data with CSV files, or is CSV only for glossary terms, etc.?
  • early-hydrogen-27542 (04/18/2023, 3:22 PM)
    👋 all - to clarify, is column-level lineage only available for Snowflake, Databricks, and Looker?
  • miniature-policeman-55414 (04/18/2023, 3:35 PM)
    Hey 👋 team, I am trying to automatically ingest multiple glossary terms for a dbt model with the meta_mapping feature. The following is the meta configuration of the dbt model:
    meta:
          owner: "@sree"
          terms_list: core_transport_gross_profit;core_adjusted_transport_gross_profit
    The following is the meta_mapping section of the dbt metadata ingestion recipe:
    "meta_mapping": {
                            "term": {
                                "match": ".*",
                                "operation": "add_term",
                                "config": 
                                    {
                                        "term": "{{ $match }}"
                                    }
                            },
                            "terms_list": {
                                "match": ".*",
                                "operation": "add_terms",
                                "config": 
                                    {
                                        "seperator": ";"
                                    }
                            }  
                        },
    The glossary terms are defined in the glossary YAML and are being ingested successfully (core_transport_gross_profit and core_adjusted_transport_gross_profit). However, after metadata ingestion I don't see either glossary term added to my dbt model. Is this the right way to add multiple terms? Please correct me or suggest the right way.
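    A minimal Python sketch of the meta_mapping setup described above, run as a programmatic pipeline. The file paths, target platform, and server URL are placeholders, and the documented config key for add_terms is spelled "separator":

        # Sketch only: a programmatic dbt ingestion pipeline with meta_mapping.
        # Paths, platform, and server URL are placeholders to adapt.
        from datahub.ingestion.run.pipeline import Pipeline

        pipeline = Pipeline.create(
            {
                "source": {
                    "type": "dbt",
                    "config": {
                        "manifest_path": "./target/manifest.json",  # placeholder
                        "catalog_path": "./target/catalog.json",    # placeholder
                        "target_platform": "snowflake",             # placeholder
                        "meta_mapping": {
                            "terms_list": {
                                "match": ".*",
                                "operation": "add_terms",
                                # documented spelling of this key is "separator"
                                "config": {"separator": ";"},
                            },
                        },
                    },
                },
                "sink": {
                    "type": "datahub-rest",
                    "config": {"server": "http://localhost:8080"},  # placeholder
                },
            }
        )
        pipeline.run()
        pipeline.raise_from_status()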
  • gray-airplane-39227 (04/18/2023, 6:28 PM)
    Hello team, I have a question on the metadata model `IngestionSource`: currently only the property `name` is indexed and searchable. Is there any reason the property `type` is not searchable by default? Any concerns if I make a contribution to annotate `type` as a searchable field for `IngestionSource`?
    record DataHubIngestionSourceInfo {
      /**
       * The display name of the ingestion source
       */
      @Searchable = {
       "fieldType": "TEXT_PARTIAL"
      }
      name: string
    
      /**
       * The type of the source itself, e.g. mysql, bigquery, bigquery-usage. Should match the recipe.
       */
      type: string
  • early-hydrogen-27542 (04/18/2023, 6:33 PM)
    👋 all - We're thinking about applying stateful ingestion to several sources (specifically dbt and Redshift) that already exist. Does applying stateful ingestion delete the metadata of tables deleted prior to the application of stateful ingestion? This thread seems to indicate it should, if the versions are recent enough, but I've seen other threads that imply the opposite. We are on version 0.10.1.
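    For reference, a minimal sketch of the stateful-ingestion fragment that such a recipe would carry, with stale-entity removal enabled; whether entities deleted before stateful ingestion was first enabled are cleaned up retroactively is exactly the open question above, so treat this as the standard setup rather than an answer:

        # Sketch: stateful-ingestion fragment inside a source config dict.
        # A stable pipeline_name on the recipe is required so state can be
        # matched across runs; remove_stale_metadata soft-deletes entities
        # that stop appearing in later runs.
        source_config = {
            "type": "dbt",  # or "redshift"; placeholder
            "config": {
                # ... connection settings ...
                "stateful_ingestion": {
                    "enabled": True,
                    "remove_stale_metadata": True,
                },
            },
        }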
  • lively-dusk-19162 (04/18/2023, 8:42 PM)
    Hi team, I am facing the certificate error below when building DataHub. Can anyone suggest how to resolve it?
    Could not resolve all files for configuration ':datahub-frontend:compileClasspath'.
    > Could not download javax.ws.rs-api-2.0.1.jar (javax.ws.rs:javax.ws.rs-api:2.0.1)
       > Could not get resource 'https://plugins.gradle.org/m2/javax/ws/rs/javax.ws.rs-api/2.0.1/javax.ws.rs-api-2.0.1.jar'.
          > Could not GET 'https://plugins.gradle.org/m2/javax/ws/rs/javax.ws.rs-api/2.0.1/javax.ws.rs-api-2.0.1.jar'.
             > The server may not support the client's requested TLS protocol versions: (TLSv1.2, TLSv1.3). You may need to configure the client to allow other protocols to be used. See: https://docs.gradle.org/6.9.2/userguide/build_environment.html#gradle_system_properties
             > PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    > Could not download osgi-resource-locator-1.0.1.jar (org.glassfish.hk2:osgi-resource-locator:1.0.1): same TLS/PKIX error for 'https://plugins.gradle.org/m2/org/glassfish/hk2/osgi-resource-locator/1.0.1/osgi-resource-locator-1.0.1.jar'
    > Could not download paranamer-2.8.jar (com.thoughtworks.paranamer:paranamer:2.8): same TLS/PKIX error for 'https://plugins.gradle.org/m2/com/thoughtworks/paranamer/paranamer/2.8/paranamer-2.8.jar'
  • adamant-sugar-28445 (04/19/2023, 2:08 AM)
    Not showing a table as a table in lineage: Hi team. When I read an HDFS-stored table using `spark.sql("select * from db.tableX")`, and tableX is already in DataHub on the Hive platform, DataHub lineage shows this input as an HDFS path rather than a Hive table. How can I make it present the input as a table?
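    For context, a minimal PySpark sketch of how the DataHub Spark lineage agent is normally attached, assuming the documented listener class and REST settings; the package version, server URL, and table names are placeholders, and this alone does not change how HDFS-backed Hive tables are resolved:

        # Sketch: attach the DataHub Spark lineage listener to a PySpark job.
        from pyspark.sql import SparkSession

        spark = (
            SparkSession.builder
            .appName("lineage-demo")
            .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:<version>")  # placeholder version
            .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
            .config("spark.datahub.rest.server", "http://localhost:8080")  # placeholder GMS URL
            .getOrCreate()
        )

        df = spark.sql("select * from db.tableX")  # tableX as in the message above
        df.write.mode("overwrite").saveAsTable("db.tableY")  # placeholder output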
  • late-arm-1146 (04/19/2023, 3:54 AM)
    Hi everyone, I am using `csv-enricher` with v0.8.45. I noticed that the resource description I provide in the CSV overwrites the existing description even if I don't set `write_semantics` to OVERRIDE. Is this expected behavior for descriptions?
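    A minimal sketch of a csv-enricher run with write_semantics pinned explicitly to PATCH, as a programmatic pipeline; the CSV path and server URL are placeholders, and whether v0.8.45 actually patches rather than overrides descriptions is the question above:

        # Sketch: csv-enricher source with explicit PATCH write semantics.
        from datahub.ingestion.run.pipeline import Pipeline

        pipeline = Pipeline.create(
            {
                "source": {
                    "type": "csv-enricher",
                    "config": {
                        "filename": "./enrichment.csv",  # placeholder path
                        "write_semantics": "PATCH",      # the alternative is OVERRIDE
                    },
                },
                "sink": {
                    "type": "datahub-rest",
                    "config": {"server": "http://localhost:8080"},  # placeholder
                },
            }
        )
        pipeline.run()
        pipeline.raise_from_status()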
  • breezy-kangaroo-27287 (04/19/2023, 8:50 AM)
    Hi there, I'm trying to ingest a single XML file into DataHub for testing, and converted it into a JSON file for that. My problem at the moment is that I don't know where to put the source file. The recipe expects a path, but a path from my Ubuntu operating system is not working. I guess the path refers to one of the Docker containers DataHub is running in, but which one? Or do I have to mount something? Sorry if this is a stupid noob question, but I'm not that deep into Docker yet. I'm using the Docker-based quickstart at the moment.
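    One way around the container-path issue above is to run the ingestion from the local Python environment, where the path is resolved on the host machine; a minimal sketch assuming the generic file source (the config key shown here, path, may be filename on older releases) and a placeholder GMS URL:

        # Sketch: run a file-based ingestion locally, so the JSON is read from
        # the host filesystem rather than from inside a quickstart container.
        from datahub.ingestion.run.pipeline import Pipeline

        pipeline = Pipeline.create(
            {
                "source": {
                    "type": "file",
                    "config": {"path": "./my_metadata.json"},  # local path; may be "filename" on older versions
                },
                "sink": {
                    "type": "datahub-rest",
                    "config": {"server": "http://localhost:8080"},  # quickstart GMS; placeholder
                },
            }
        )
        pipeline.run()
        pipeline.raise_from_status()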
  • rapid-airport-61849 (04/19/2023, 9:30 AM)
    datahub.ingestion.run.pipeline.PipelineInitError: Failed to configure the source (mssql): No module named 'pyodbc'
    Hello!!! Has anyone seen this error before? I am using the Docker quickstart.
  • late-furniture-56629 (04/19/2023, 9:50 AM)
    Hi 🙂 I would like to ask: is anyone able to connect for ingestion via the datahub CLI from their local computer? What I did: 1. datahub init 2. filled in my DataHub host + token 3. ran the command:
    datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:mssql,UnifiedJobs.dbo.AccountCompanyMapping,PROD)"
    And I got error:
    raise JSONDecodeError("Expecting value", s, err.value) from None
    json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
    Has somebody had this problem, and how did you fix it? In the end I would like to be able to dump all ingested metadata to a file as a backup 🙂
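    A JSONDecodeError like the one above often means the configured host is returning HTML (for example the UI) instead of JSON; a minimal sketch, using plain requests against the GMS /config endpoint, to check the host and token before retrying the CLI. The host and token values are placeholders:

        # Sketch: sanity-check that the configured DataHub host answers with JSON.
        import requests

        DATAHUB_GMS = "http://localhost:8080"  # or https://<frontend>/api/gms; placeholder
        TOKEN = "<personal-access-token>"      # placeholder

        resp = requests.get(
            f"{DATAHUB_GMS}/config",
            headers={"Authorization": f"Bearer {TOKEN}"},
            timeout=10,
        )
        print(resp.status_code, resp.headers.get("Content-Type"))
        print(resp.json())  # fails with the same JSONDecodeError if the server returns HTML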
  • agreeable-table-54007 (04/19/2023, 10:22 AM)
    Hi all. How do you ingest some JSON data? Are there any examples? If so, do you have a link? And what would the YAML recipe be?
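    A minimal sketch of one way to push metadata described in a JSON file into DataHub using the Python REST emitter instead of a YAML recipe; the JSON layout, file name, and server URL are illustrative assumptions:

        # Sketch: read a simple JSON file of dataset descriptions and emit them.
        import json

        from datahub.emitter.mce_builder import make_dataset_urn
        from datahub.emitter.mcp import MetadataChangeProposalWrapper
        from datahub.emitter.rest_emitter import DatahubRestEmitter
        from datahub.metadata.schema_classes import DatasetPropertiesClass

        emitter = DatahubRestEmitter("http://localhost:8080")  # placeholder GMS URL

        # placeholder file; assumes records like
        # {"platform": "hive", "name": "db.table", "description": "..."}
        with open("datasets.json") as f:
            records = json.load(f)

        for rec in records:
            mcp = MetadataChangeProposalWrapper(
                entityUrn=make_dataset_urn(platform=rec["platform"], name=rec["name"], env="PROD"),
                aspect=DatasetPropertiesClass(description=rec.get("description")),
            )
            emitter.emit(mcp)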
  • brainy-oxygen-20792 (04/19/2023, 10:25 AM)
    Hello DataHub community ☀️ I'm ingesting Looker and LookML into a local Docker instance of DataHub, but I'm finding that I'm missing some Explores. The DataHub service user I've set up has all of the permissions stated in the docs, and if I sudo as the service user on Looker I can access the Explore. I also don't see any errors in the ingestion process to indicate that it tried and failed. If I'm interpreting the code correctly, I think that Explores are obtained indirectly by crawling the dashboards, therefore Explores that aren't being used for dashboards (or, not by any that DH has access to) would be excluded. Is this correct? And is there anything I can do to include these Explores?
  • adamant-sugar-28445 (04/19/2023, 10:34 AM)
    Using the DataHub GraphQL API to ingest data lineage from the Spark agent: as far as I know, the DataHub GraphQL API allows us to add upstream and downstream edges. That's why I want to try using part of the code in the DataHub Spark agent to get the inputs and outputs, build (hard-code, in my case) the URNs, and pass them as parameters to the GraphQL API. I think this is possible, but I'm not clear about its soundness in terms of architecture and maintenance. cc @careful-pilot-86309, @loud-island-88694
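    A minimal sketch of the GraphQL call described above, posting an updateLineage mutation with hard-coded URNs via plain requests; the endpoint, token, and URNs are placeholders and error handling is omitted:

        # Sketch: add a lineage edge between two datasets via the DataHub GraphQL API.
        import requests

        GRAPHQL_URL = "http://localhost:9002/api/graphql"  # frontend GraphQL endpoint; placeholder
        TOKEN = "<personal-access-token>"                  # placeholder

        mutation = """
        mutation updateLineage($input: UpdateLineageInput!) {
          updateLineage(input: $input)
        }
        """

        variables = {
            "input": {
                "edgesToAdd": [
                    {
                        "upstreamUrn": "urn:li:dataset:(urn:li:dataPlatform:hive,db.upstream_table,PROD)",      # placeholder
                        "downstreamUrn": "urn:li:dataset:(urn:li:dataPlatform:hive,db.downstream_table,PROD)",  # placeholder
                    }
                ],
                "edgesToRemove": [],
            }
        }

        resp = requests.post(
            GRAPHQL_URL,
            json={"query": mutation, "variables": variables},
            headers={"Authorization": f"Bearer {TOKEN}"},
            timeout=30,
        )
        print(resp.json())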
  • agreeable-table-54007 (04/19/2023, 1:16 PM)
    Hello community. Is there anyone here who has already ingested pipelines from Azure Data Factory into DataHub? Thanks.
  • gifted-diamond-19544 (04/19/2023, 1:56 PM)
    Hello all! We have upgraded to DataHub version `v0.10.2`, with the actions container on version `v0.0.12`. When we changed the CLI version to match our server version, our Tableau ingestion set up via the UI started failing with the following error:
    Failed to find a registered source for type tableau: 'str' object is not callable
    Anyone got the same problem? cc @ancient-ocean-36062
  • strong-parrot-78481 (04/19/2023, 8:57 PM)
    Hi everyone, is it possible to pass an encrypted password into a custom pipeline config and plug in decryption logic somewhere? For example, I have a JSON config passed to the pipeline where the password is an encrypted value. (I don't use the DataHub database, so I don't store any sources or credentials in DataHub.)
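    Because a programmatic pipeline takes a plain Python dict, one approach to the question above is to decrypt the value in code just before building the config; a minimal sketch where decrypt_password is a hypothetical stand-in for whatever decryption mechanism is available:

        # Sketch: decrypt a password outside DataHub and inject it into the pipeline config.
        import json

        from datahub.ingestion.run.pipeline import Pipeline


        def decrypt_password(ciphertext: str) -> str:
            # Hypothetical placeholder: call a KMS, Vault, Fernet key, etc. here.
            raise NotImplementedError


        with open("pipeline_config.json") as f:  # placeholder config file
            config = json.load(f)

        # Swap the encrypted value for the decrypted one before the pipeline sees it.
        config["source"]["config"]["password"] = decrypt_password(
            config["source"]["config"]["password"]
        )

        Pipeline.create(config).run()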
  • microscopic-machine-90437 (04/20/2023, 9:37 AM)
    Hello everyone, I'm trying to ingest Snowflake metadata (my DataHub setup is in a Kubernetes cluster) and while doing so I'm getting the following storage error:
    ERROR: The ingestion process was killed, likely because it ran out of memory. You can resolve this issue by allocating more memory to the datahub-actions container.
    When I go through the values.yml file, I can see that the datahub-actions container has 512Mi of memory. My questions are: when we ingest metadata, in which container is it stored? If the data we are ingesting from Snowflake is in GBs, how much do we have to scale the memory of the actions container? And is there a way to find out the size of the data/metadata we are trying to ingest (from Snowflake or any other source)? Can someone help me with this?
  • quiet-television-68466 (04/20/2023, 11:05 AM)
    Apologies for another Airflow request from me; I have gotten pretty stuck on this one and I think it would be useful for others down the line! We’ve been able to get our lineage emitting properly when the inlets/outlets are set manually as below, and this shows up in DataHub.
    BashOperator(
        task_id="run_data_task",
        dag=dag,
        bash_command="echo 'hello world'",
        owner='john.claro@checkout.com',
        inlets=[
            Dataset("snowflake", "data_platform.cfirth_test.CFIRTH_TEST_UPLOAD"),
            # You can also put dataset URNs in the inlets/outlets lists.
        ],
        outlets=[Urn("urn:li:dataset:(urn:li:dataPlatform:snowflake,landing.data_platform.cfirth_tbl,PROD)")],
    )
    Currently, we are trying to set up our custom DbtOperator to automatically parse the lineage of Dbt jobs using the manifest file, and then set them as inlets/outlets respectively.
    if self.operation == 'run':
        inlets, outlets = self._get_lineage() # this works as expected and returns lists of Dataset('snowflake', '<snowflake_table_name>')
        self.add_inlets(inlets)
        self.log.info(f"Added inlets: {self.get_inlet_defs()}")
        self.add_oulets(outlets)
        self.log.info(f"Added outlets: {self.get_outlet_defs()}")
    Airflow picks up the inlets and outlets correctly (as seen in the logs here)
    Added inlets: [Dataset(platform='snowflake', name='<db>.<schema>.<table>', env='PROD'), ...]
    But when they are emitted to DataHub (logs in 🧵), it looks like nothing is happening lineage-wise. Anyone have any ideas?
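    For reference, a minimal sketch of the definition-time pattern that the first snippet above reports as working, using the datahub_provider entities; the DAG wiring, dataset name, and URN are placeholders, and it does not address the custom-operator case directly:

        # Sketch: declare DataHub lineage on an Airflow task at definition time,
        # so the lineage plugin/backend can pick up inlets and outlets after the run.
        from datetime import datetime

        from airflow import DAG
        from airflow.operators.bash import BashOperator
        from datahub_provider.entities import Dataset, Urn

        with DAG(dag_id="lineage_demo", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
            run_data_task = BashOperator(
                task_id="run_data_task",
                bash_command="echo 'hello world'",
                inlets=[
                    Dataset("snowflake", "data_platform.some_schema.SOME_TABLE"),  # placeholder
                ],
                outlets=[
                    Urn("urn:li:dataset:(urn:li:dataPlatform:snowflake,landing.data_platform.some_tbl,PROD)"),  # placeholder
                ],
            )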
  • numerous-byte-87938 (04/20/2023, 5:57 PM)
    I learned today that v0.9.6 removed the dependency of the MCE consumer on GMS (through this PR), so the MCE consumer now talks to MySQL and ES directly instead of going through GMS. While trying to understand the reasoning behind it, this comment makes me feel that we are taking ingestion pressure away from GMS so it can serve GraphQL traffic better. This leads to three questions from me:
    1. Are we heading in a direction where GMS is only on the read path?
    2. Does GMS still make any writes directly to MySQL in versions after v0.9.6?
    3. Is there any direct performance gain from stopping MCE making Rest.li calls to GMS in the MCP ingestion workflow?
    Thanks!
  • adorable-megabyte-63781 (04/21/2023, 8:48 AM)
    Any help on how to ingest Tableau Server into DataHub? I have been trying to do it for a week, but it keeps failing with the error below:
    RuntimeError: Query workbooksConnection error: [{'message': "Validation error of type FieldUndefined: Field 'projectLuid' in type 'Workbook' is undefined @ 'workbooksConnection/nodes/projectLuid'", 'locations': [{'line': 9, 'column': 7, 'sourceName': None}], 'description': "Field 'projectLuid' in type 'Workbook' is undefined", 'validationErrorType': 'FieldUndefined', 'queryPath': ['workbooksConnection', 'nodes', 'projectLuid'], 'errorType': 'ValidationError', 'path': None, 'extensions': None}]
    Here is my recipe for reference:
    source:
      type: tableau
      config:
        connect_uri: 'tableau_url'
        ssl_verify: false
        stateful_ingestion:
          enabled: false
        site: site_name
        project_pattern:
          allow:
            - default
          ignoreCase: true
        username: '${my_username}'
        password: '${my_password}'
  • microscopic-machine-90437 (04/21/2023, 9:15 AM)
    Hello everyone, I have a Tableau ingestion that is scheduled four times a day, and all of the runs executed successfully yesterday. However, the last status shown on the UI is 'failed', and the last execution time is also wrong. Can someone help me with this?
  • hundreds-airline-29192 (04/21/2023, 9:55 AM)
    Hello everyone, I just deployed DataHub and set up a connection to BigQuery. The connection is OK, but when I ingest metadata from BigQuery into DataHub, I get the error below. Can anybody help me?
  • hundreds-airline-29192 (04/21/2023, 10:10 AM)
    please help me
  • hundreds-airline-29192 (04/21/2023, 10:10 AM)
    please