# ingestion
  • s

    sparse-barista-40860

    06/22/2022, 4:32 PM
    error to
  • s

    sparse-barista-40860

    06/22/2022, 4:33 PM
    Please tell me which one of the examples is compatible?
  • s

    sparse-barista-40860

    06/22/2022, 6:03 PM
    I'm trying to deploy the examples; please tell me which error I have in this flow?
  • s

    sparse-barista-40860

    06/22/2022, 6:03 PM
    Copy code
    nano /root/datahub/metadata-ingestion/examples/demo_data/recipe.dhub.yaml
    
    ##
    source:
      type: bigquery
      config:
        # Coordinates
        project_id: my_project_id
    
        # `schema_pattern` for BQ Datasets
        schema_pattern:
          allow:
            - finance_bq_dataset
    
        table_pattern:
          deny:
            # The exact name of the table is revenue_table_name
        # The reason we have this `.*` at the beginning is because the current implementation of table_pattern is testing
            # project_id.dataset_name.table_name
            # We will improve this in the future
            - .*revenue_table_name
    
    sink:
      # sink configs
    
    
    ##
    
    
    datahub ingest -c /root/datahub/metadata-ingestion/examples/demo_data/recipe.dhub.yaml
    
    [2022-06-22 13:00:20,415] INFO     {datahub.cli.ingest_cli:99} - DataHub CLI version: 0.8.38.3
    1 validation error for PipelineConfig
    sink
      none is not an allowed value (type=type_error.none.not_allowed)
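    The "none is not an allowed value" error for sink means the recipe's sink block is empty. A minimal sketch of a complete sink section, assuming a datahub-rest endpoint (swap the server URL for your GMS address):

    sink:
      type: datahub-rest
      config:
        server: http://localhost:8080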
  • m

    mammoth-monitor-62642

    06/22/2022, 11:27 PM
    Nice to meet you. My name is Takehiko, and I work for a real estate tech company. Our company's own DB is accessed via SSH. Please let me know if there is a way to connect via SSH with DataHub. Thank you very much.
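    One common approach, assuming the database is only reachable through a bastion host (hostnames, ports, and credentials below are placeholders): open an SSH tunnel and point the recipe's host_port at the forwarded local port. Sketched here for a MySQL source:

    # forward local port 3307 through the bastion to the database host
    ssh -N -L 3307:internal-db-host:3306 user@bastion-host

    # the recipe then connects through the tunnel
    source:
      type: mysql
      config:
        host_port: localhost:3307
        username: my_user
        password: my_password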
  • b

    bitter-toddler-42943

    06/23/2022, 2:58 AM
    I think I have to move my question here.
  • b

    bitter-toddler-42943

    06/23/2022, 2:58 AM
    Hello, I am trying to delete DataHub datasets using datahub delete --env PROD --entity_type dataset, but when I run the command above and then ingest again, the deleted data keeps coming back. Do you know what command or option I should run to delete all the data? Also, I don't know what a URN is; what should I do to build a command with the hard option by specifying a URN?
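    By default datahub delete performs a soft delete, which only marks entities as removed; re-ingesting restores them. A sketch of the harder variants, assuming a recent CLI (the URN below is a placeholder):

    # soft delete (default): entities are only hidden and reappear after re-ingestion
    datahub delete --env PROD --entity_type dataset

    # hard delete everything matching the same filter
    datahub delete --env PROD --entity_type dataset --hard

    # hard delete a single dataset by its URN
    datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:hive,db.my_table,PROD)" --hard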
  • p

    proud-school-44110

    06/23/2022, 3:43 PM
    Hi team, we are trying to create lineage where the data sources could be files. From the example provided in DataHub, it looks like only a single file would exist on the file platform. But in reality, we could have the same file across different servers and environments. How can this be handled in DataHub?
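    One way to model this, assuming the file platform is kept: encode the server into the dataset name and use the URN's environment field, so each physical copy gets its own URN. An illustrative sketch (paths and environments are placeholders):

    urn:li:dataset:(urn:li:dataPlatform:file,server1/data/customers.csv,PROD)
    urn:li:dataset:(urn:li:dataPlatform:file,server2/data/customers.csv,DEV)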
  • w

    worried-motherboard-80036

    06/23/2022, 6:24 PM
    Hi, does DataHub allow any automatic (ML-based?) classification of the data assets it discovers? If not, is there a plan to incorporate such a feature in the future?
  • b

    big-plumber-87113

    06/23/2022, 7:29 PM
    Hi team, when we ingest Hive sources, is there an easy way to expose the host_port in the Properties tab in the UI? In our org we have a few Hive servers and would like to know which host a table corresponds to.
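    One possible approach, assuming your CLI version ships the simple_add_dataset_properties transformer (worth checking the transformers docs for your release; the host below is a placeholder): attach the host as a custom property in each Hive recipe.

    source:
      type: hive
      config:
        host_port: hive-server-1:10000

    transformers:
      - type: simple_add_dataset_properties
        config:
          properties:
            host_port: hive-server-1:10000

    sink:
      # sink configs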
  • w

    worried-motherboard-80036

    06/23/2022, 7:46 PM
    Hi again, I've read about datahub's integration with Great Expectations, and how you can send assertions and their results to Datahub using the Python Rest emitter. I was wondering if there is any plan to be able to define quality tests from the UI, and perhaps run these as part of the Profiling process, or even at ingestion time?
  • d

    dry-zoo-35797

    06/23/2022, 9:11 PM
    Hello, I am implementing a custom transformer based on the document: https://datahubproject.io/docs/metadata-ingestion/transformers#writing-a-custom-transformer-from-scratch
    After running the recipe file, I am getting the error below:
    Failed to validate record with class com.linkedin.common.Ownership: ERROR :: /owners/0/owner :: "Provided urn 'owners' is invalid"
    Here is my owners.json:
    { "owners": [ { "owner": "urnlicorpuser:mahbub" "type": "DATAOWNER" } ] }
    Appreciate your response. Thanks, Mahbub
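    The error suggests the owner value is not being parsed as a valid URN. Keeping the structure from the message above, a sketch of owners.json with a fully qualified corpuser URN and a comma between the fields:

    {
      "owners": [
        {
          "owner": "urn:li:corpuser:mahbub",
          "type": "DATAOWNER"
        }
      ]
    }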
  • d

    dry-zoo-35797

    06/23/2022, 9:36 PM
    Also, when I use the out-of-the-box type "pattern_add_dataset_ownership", I can only use "ownership_type" = "DATAOWNER". All other ownership types return "Invalid ownership type".
  • m

    microscopic-mechanic-13766

    06/24/2022, 7:14 AM
    Hi, I am trying to integrate DataHub into a Kerberized environment in which some tools like Kafka are also Kerberized. Would any other configuration be needed on the DataHub side to make DataHub able to connect to the Kerberized tools?
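    For the ingestion-recipe side, the kafka source can pass consumer settings straight through to librdkafka; a sketch assuming Kerberos/GSSAPI with a keytab (broker, principal, and paths are placeholders, and DataHub's own Kafka clients may need separate configuration):

    source:
      type: kafka
      config:
        connection:
          bootstrap: broker1.example.com:9092
          consumer_config:
            security.protocol: SASL_PLAINTEXT
            sasl.mechanism: GSSAPI
            sasl.kerberos.service.name: kafka
            sasl.kerberos.principal: datahub@EXAMPLE.COM
            sasl.kerberos.keytab: /etc/security/keytabs/datahub.keytab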
  • e

    elegant-salesmen-99143

    06/24/2022, 8:40 AM
    Hi people, is there a way to mass-add descriptions for the columns and table names within a schema from a CSV file or some other kind of document?
  • c

    cuddly-arm-8412

    06/25/2022, 3:17 AM
    Hi team, our company has its own task scheduling system for things like Spark/Dataflow/shell jobs. I want to import it into DataHub. Is there a corresponding model, just like dataset for MySQL/TiDB?
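    DataHub models pipelines and their tasks as DataFlow and DataJob entities (the scheduler analogue of dataset). A sketch of what the URNs look like for a custom orchestrator, with placeholder names:

    urn:li:dataFlow:(my_scheduler,nightly_revenue_pipeline,prod)
    urn:li:dataJob:(urn:li:dataFlow:(my_scheduler,nightly_revenue_pipeline,prod),spark_aggregate_step)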
  • l

    lemon-zoo-63387

    06/25/2022, 4:09 AM
    Hello everyone, I need your help. This HANA DB ingestion has been running for three days; in the past it has finished in five hours. In addition, clicking cancel can't stop it. Thank you in advance for your help.
  • b

    blue-beach-27940

    06/26/2022, 9:46 AM
    Hello everyone, I am new to DataHub, so I've run into some problems. The DataHub version is 0.8.33, and I tested Spark 2.4.0 with DataHub. The Spark SQL metadata is ingested into DataHub, but I don't know how to fix the following problem:
  • b

    blue-beach-27940

    06/26/2022, 9:49 AM
    So how can I change the URN name of the Spark SQL dataset? Any ideas? I have searched Google but can't find the right suggestion. Much appreciated, thanks.
  • a

    adamant-raincoat-65838

    06/27/2022, 6:41 AM
    Hi team, is there any way to automatically delete outdated datasets? E.g., an updated database should be reflected in the datasets.
  • n

    nutritious-bird-77396

    06/27/2022, 3:17 PM
    Is there a sample YAML file for domain ingestion? I would like to ingest a list of domains through a file using the ingestion framework; any samples would be a great start.
  • b

    brief-cat-57352

    06/27/2022, 3:58 PM
    Hi team, for an Athena-type YAML recipe, is there a way to also pass the session token? I'm passing the access key and secret via username/password.
  • b

    bitter-oxygen-31974

    06/28/2022, 3:55 AM
    Hi all, I am setting up a pipeline to ingest metadata from AWS Redshift into DataHub. Curious to know whether metadata ingestion stores all of the data on disk before ingesting it?
  • b

    blue-beach-27940

    06/28/2022, 6:03 AM
    so where can I get the log?
  • w

    wide-xylophone-61229

    06/28/2022, 6:35 AM
    docker-compose up --build
  • b

    blue-beach-27940

    06/28/2022, 7:03 AM
    👍
  • b

    blue-beach-27940

    06/28/2022, 7:04 AM
    my data is gone due to this action🤣
  • b

    brief-cat-57352

    06/28/2022, 7:51 AM
    Hi all, I've followed this guide (https://datahubproject.io/docs/lineage/airflow) to enable DataHub as my lineage backend for Airflow v2. Emitting metadata gets a 401 when authentication is enabled. Where should I put the bearer access token? I tried putting it as an extra parameter in the hook, but no luck. Thanks in advance.
  • b

    blue-beach-27940

    06/28/2022, 8:04 AM
    First, you should add the configuration below to airflow.cfg:
  • b

    blue-beach-27940

    06/28/2022, 8:04 AM
    [lineage]
    backend = datahub_provider.lineage.datahub.DatahubLineageBackend
    datahub_kwargs = {
        "datahub_conn_id": "datahub_rest_default",
        "cluster": "prod",
        "capture_ownership_info": true,
        "capture_tags_info": true,
        "graceful_exceptions": true }
    # The above indentation is important!
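    When token authentication is enabled on GMS, the access token is usually supplied on the datahub_rest Airflow connection itself rather than in the hook; a sketch assuming the default connection id (host and token are placeholders):

    airflow connections add 'datahub_rest_default' \
      --conn-type 'datahub_rest' \
      --conn-host 'http://datahub-gms:8080' \
      --conn-password '<your-datahub-access-token>'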