# ingestion
  • s

    sparse-barista-40860

    06/22/2022, 4:32 PM
    error to
  • s

    sparse-barista-40860

    06/22/2022, 4:33 PM
    Please tell me which one of the examples is compatible?
  • s

    sparse-barista-40860

    06/22/2022, 6:03 PM
    I'm trying to deploy the examples; please tell me which error I have in this flow?
  • s

    sparse-barista-40860

    06/22/2022, 6:03 PM
    Copy code
    nano /root/datahub/metadata-ingestion/examples/demo_data/recipe.dhub.yaml
    
    ##
    source:
      type: bigquery
      config:
        # Coordinates
        project_id: my_project_id
    
        # `schema_pattern` for BQ Datasets
        schema_pattern:
          allow:
            - finance_bq_dataset
    
        table_pattern:
          deny:
            # The exact name of the table is revenue_table_name
        # The reason we have this `.*` at the beginning is because the current implementation of table_pattern is testing
            # project_id.dataset_name.table_name
            # We will improve this in the future
            - .*revenue_table_name
    
    sink:
      # sink configs
    
    
    ##
    
    
    datahub ingest -c /root/datahub/metadata-ingestion/examples/demo_data/recipe.dhub.yaml
    
    [2022-06-22 13:00:20,415] INFO     {datahub.cli.ingest_cli:99} - DataHub CLI version: 0.8.38.3
    1 validation error for PipelineConfig
    sink
      none is not an allowed value (type=type_error.none.not_allowed)
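    The "none is not an allowed value" error for sink means the recipe's sink block is empty. A minimal sketch of a complete sink section, assuming a datahub-rest endpoint (swap the server URL for your GMS address):

    sink:
      type: datahub-rest
      config:
        server: http://localhost:8080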
  • m

    mammoth-monitor-62642

    06/22/2022, 11:27 PM
    Nice to meet you. My name is Takehiko, and I work for a real estate tech company. Our company's own DB is accessed via SSH. Please let me know if there is a way to connect via SSH with DataHub. Thank you very much.
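    One common approach, assuming the database is only reachable through a bastion host (hostnames, ports, and credentials below are placeholders): open an SSH tunnel and point the recipe's host_port at the forwarded local port. Sketched here for a MySQL source:

    # forward local port 3307 through the bastion to the database host
    ssh -N -L 3307:internal-db-host:3306 user@bastion-host

    # the recipe then connects through the tunnel
    source:
      type: mysql
      config:
        host_port: localhost:3307
        username: my_user
        password: my_password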
  • b

    bitter-toddler-42943

    06/23/2022, 2:58 AM
    I think I have to move my question here.
  • b

    bitter-toddler-42943

    06/23/2022, 2:58 AM
    Hello, I am trying to delete DataHub datasets using datahub delete --env PROD --entity_type dataset, but when I run the command above and then ingest again, the deleted data keeps coming back. Do you know what command or option I should run to delete all the data? Also, I don't know what a URN is; what should I do to build a command with the hard option by specifying a URN?
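    By default datahub delete performs a soft delete, which only marks entities as removed; re-ingesting restores them. A sketch of the harder variants, assuming a recent CLI (the URN below is a placeholder):

    # soft delete (default): entities are only hidden and reappear after re-ingestion
    datahub delete --env PROD --entity_type dataset

    # hard delete everything matching the same filter
    datahub delete --env PROD --entity_type dataset --hard

    # hard delete a single dataset by its URN
    datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:hive,db.my_table,PROD)" --hard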
  • p

    proud-school-44110

    06/23/2022, 3:43 PM
    Hi team, we are trying to create lineage where the data sources could be files. From the example provided in DataHub, it looks like only a single file would exist on the file platform. But in reality, we could have the same file across different servers and environments. How can this be handled in DataHub?
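    One way to model this, assuming the file platform is kept: encode the server into the dataset name and use the URN's environment field, so each physical copy gets its own URN. An illustrative sketch (paths and environments are placeholders):

    urn:li:dataset:(urn:li:dataPlatform:file,server1/data/customers.csv,PROD)
    urn:li:dataset:(urn:li:dataPlatform:file,server2/data/customers.csv,DEV)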
  • w

    worried-motherboard-80036

    06/23/2022, 6:24 PM
    Hi, does DataHub allow any automatic (ML-based?) classification of the data assets it discovers? If not, is there a plan to incorporate such a feature in the future?
  • b

    big-plumber-87113

    06/23/2022, 7:29 PM
    Hi team, when we ingest Hive sources, is there an easy way to expose the host_port in the Properties tab in the UI? In our org we have a few Hive servers and would like to know which host a table corresponds to.
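    One possible approach, assuming your CLI version ships the simple_add_dataset_properties transformer (worth checking the transformers docs for your release; the host below is a placeholder): attach the host as a custom property in each Hive recipe.

    source:
      type: hive
      config:
        host_port: hive-server-1:10000

    transformers:
      - type: simple_add_dataset_properties
        config:
          properties:
            host_port: hive-server-1:10000

    sink:
      # sink configs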
  • w

    worried-motherboard-80036

    06/23/2022, 7:46 PM
    Hi again, I've read about datahub's integration with Great Expectations, and how you can send assertions and their results to Datahub using the Python Rest emitter. I was wondering if there is any plan to be able to define quality tests from the UI, and perhaps run these as part of the Profiling process, or even at ingestion time?
  • d

    dry-zoo-35797

    06/23/2022, 9:11 PM
    Hello, I am implementing a custom transformer based on the document: https://datahubproject.io/docs/metadata-ingestion/transformers#writing-a-custom-transformer-from-scratch
    After running the recipe file, I am getting the error below:
    Failed to validate record with class com.linkedin.common.Ownership: ERROR :: /owners/0/owner :: "Provided urn 'owners' is invalid"
    Here is my owners.json:
    { "owners": [ { "owner": "urnlicorpuser:mahbub" "type": "DATAOWNER" } ] }
    Appreciate your response. Thanks, Mahbub
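    The error suggests the owner value is not being parsed as a valid URN. Keeping the structure from the message above, a sketch of owners.json with a fully qualified corpuser URN and a comma between the fields:

    {
      "owners": [
        {
          "owner": "urn:li:corpuser:mahbub",
          "type": "DATAOWNER"
        }
      ]
    }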
  • d

    dry-zoo-35797

    06/23/2022, 9:36 PM
    Also, when I use the out-of-the-box type "pattern_add_dataset_ownership", I can only use "ownership_type" = "DATAOWNER". All other ownership types return "Invalid ownership type".
  • m

    microscopic-mechanic-13766

    06/24/2022, 7:14 AM
    Hi, I am trying to integrate DataHub into a Kerberized environment in which some tools like Kafka are also Kerberized. Would any other configuration be needed on the DataHub side to make DataHub able to connect to the Kerberized tools?
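    For the ingestion-recipe side, the kafka source can pass consumer settings straight through to librdkafka; a sketch assuming Kerberos/GSSAPI with a keytab (broker, principal, and paths are placeholders, and DataHub's own Kafka clients may need separate configuration):

    source:
      type: kafka
      config:
        connection:
          bootstrap: broker1.example.com:9092
          consumer_config:
            security.protocol: SASL_PLAINTEXT
            sasl.mechanism: GSSAPI
            sasl.kerberos.service.name: kafka
            sasl.kerberos.principal: datahub@EXAMPLE.COM
            sasl.kerberos.keytab: /etc/security/keytabs/datahub.keytab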
  • e

    elegant-salesmen-99143

    06/24/2022, 8:40 AM
    Hi people, is there a way to mass-add descriptions for the columns and table names within a schema from a CSV file or some other kind of document?
  • c

    cuddly-arm-8412

    06/25/2022, 3:17 AM
    Hi team, our company has its own task scheduling system for things like Spark/Dataflow/shell jobs. I want to import it into DataHub. Is there a corresponding model, just like dataset for MySQL/TiDB?
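    DataHub models pipelines and their tasks as DataFlow and DataJob entities (the scheduler analogue of dataset). A sketch of what the URNs look like for a custom orchestrator, with placeholder names:

    urn:li:dataFlow:(my_scheduler,nightly_revenue_pipeline,prod)
    urn:li:dataJob:(urn:li:dataFlow:(my_scheduler,nightly_revenue_pipeline,prod),spark_aggregate_step)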
  • l

    lemon-zoo-63387

    06/25/2022, 4:09 AM
    Hello everyone, I need your help. This HANA DB ingestion has been running for three days; in the past it has finished in five hours. In addition, clicking cancel can't stop it. Thank you in advance for your help.
  • b

    blue-beach-27940

    06/26/2022, 9:46 AM
    Hello everyone, I am new to DataHub, so I've run into some problems. The DataHub version is 0.8.33, and I tested Spark 2.4.0 with DataHub. The Spark SQL metadata is ingested into DataHub, but I don't know how to fix the following problem:
  • b

    blue-beach-27940

    06/26/2022, 9:49 AM
    So how can I change the URN name of the Spark SQL dataset? Any ideas? I have searched Google but can't find the right suggestion. Much appreciated, thanks.
  • a

    adamant-raincoat-65838

    06/27/2022, 6:41 AM
    Hi team, is there any way to automatically delete outdated datasets? E.g., an updated database should be reflected in the datasets.
  • n

    nutritious-bird-77396

    06/27/2022, 3:17 PM
    Is there a sample YAML file for domain ingestion? I would like to ingest a list of domains through a file using the ingestion framework; any samples would be a great start.
  • b

    brief-cat-57352

    06/27/2022, 3:58 PM
    Hi team, for an Athena-type YAML recipe, is there a way to also pass the session token? I'm passing the access key and secret via username/password.
  • b

    bitter-oxygen-31974

    06/28/2022, 3:55 AM
    Hi all, I am setting up a pipeline to ingest metadata from AWS Redshift into DataHub. Curious to know whether metadata ingestion stores all of the data on disk before ingesting it?
  • b

    blue-beach-27940

    06/28/2022, 6:03 AM
    so where can I get the log?
  • w

    wide-xylophone-61229

    06/28/2022, 6:35 AM
    docker-compose up --build
  • b

    blue-beach-27940

    06/28/2022, 7:03 AM
    👍
  • b

    blue-beach-27940

    06/28/2022, 7:04 AM
    my data is gone due to this action🤣
  • b

    brief-cat-57352

    06/28/2022, 7:51 AM
    Hi all, I've followed this guide (https://datahubproject.io/docs/lineage/airflow) to enable DataHub as my lineage backend for Airflow v2. Emitting metadata gets a 401 when authentication is enabled. Where should I put the bearer access token? I tried putting it as an extra parameter in the hook, but no luck. Thanks in advance.
  • b

    blue-beach-27940

    06/28/2022, 8:04 AM
    First, you should add the configuration below to airflow.cfg:
  • b

    blue-beach-27940

    06/28/2022, 8:04 AM
    [lineage]
    backend = datahub_provider.lineage.datahub.DatahubLineageBackend
    datahub_kwargs = {
        "datahub_conn_id": "datahub_rest_default",
        "cluster": "prod",
        "capture_ownership_info": true,
        "capture_tags_info": true,
        "graceful_exceptions": true }
    # The above indentation is important!
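    When token authentication is enabled on GMS, the access token is usually supplied on the datahub_rest Airflow connection itself rather than in the hook; a sketch assuming the default connection id (host and token are placeholders):

    airflow connections add 'datahub_rest_default' \
      --conn-type 'datahub_rest' \
      --conn-host 'http://datahub-gms:8080' \
      --conn-password '<your-datahub-access-token>'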