# ingestion
  • m

    magnificent-plumber-63682

    06/16/2023, 9:59 AM
    Hi, I am trying to run ingestion locally on my system. I have created a recipe for MySQL, and now I am trying to pass the password as a secret, but I don't understand how to generate a secrets file on my local system. Can anyone help me?
    ✅ 1
    g
    • 2
    • 6
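    A minimal sketch for the question above (not from the thread): when a recipe is run locally with the CLI, a separate secrets file isn't required, since ${VAR} placeholders in the recipe are expanded from environment variables; the programmatic equivalent can read os.environ directly. The host, database and user names and the MYSQL_PASSWORD variable below are placeholders.
    Copy code
    # A hedged sketch: read the MySQL password from an environment variable
    # instead of a secrets file. All connection values are placeholders.
    import os

    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {
                    "host_port": "localhost:3306",  # placeholder
                    "database": "my_db",            # placeholder
                    "username": "datahub_reader",   # placeholder
                    # export MYSQL_PASSWORD=... before running this script
                    "password": os.environ["MYSQL_PASSWORD"],
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()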
  • g

    great-notebook-53658

    06/19/2023, 2:26 AM
    Hi, I am trying to ingest PowerBI metadata and am getting the error extra fields not permitted (type=value_error.extra). The data source behind PowerBI is Snowflake. The error I was getting is as follows: Can anyone help? Thanks!
    g
    a
    • 3
    • 51
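    A hedged sketch for the question above (not from the thread): the pydantic error extra fields not permitted usually means the recipe contains a key the powerbi source config does not recognize, so comparing the recipe against a minimal config can isolate the offending field. The tenant/client values are placeholders, and the exact supported fields may vary by DataHub version.
    Copy code
    # A minimal, hedged powerbi config for comparison; credentials are placeholders.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "powerbi",
                "config": {
                    "tenant_id": "00000000-0000-0000-0000-000000000000",  # placeholder
                    "client_id": "00000000-0000-0000-0000-000000000000",  # placeholder
                    "client_secret": "xxx",                               # placeholder
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()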
  • s

    shy-dog-84302

    06/19/2023, 12:17 PM
    Hi, a question related to BigQuery metadata ingestion: can we prevent ingesting project entities when there are no datasets in them, and also prevent ingesting dataset entities when there are no tables/views in them? Maybe a configurable flag to prevent this?
    Copy code
    [2023-06-19 12:07:32,796] INFO     {datahub.ingestion.source.bigquery_v2.bigquery:474} - Processing project: xyz-project
    [2023-06-19 12:07:33,008] WARNING  {datahub.ingestion.source.bigquery_v2.bigquery:589} - No dataset found in xyz-project. Either there are no datasets in this project or missing bigquery.datasets.get permission. You can assign predefined roles/bigquery.metadataViewer role to your service account.
    [2023-06-19 12:07:33,008] INFO     {datahub.ingestion.source.bigquery_v2.bigquery_report:95} - Time spent in stage <xyz-project: Metadata Extraction at 2023-06-19 12:07:32.796768+00:00>: 0.21 seconds
    d
    m
    a
    • 4
    • 14
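    A hedged sketch for the question above (not from the thread): I'm not aware of a flag that skips empty projects/datasets automatically, but the bigquery source does accept allow/deny regex patterns, so known-empty projects or datasets can be excluded explicitly. The project and dataset names are placeholders, and the exact pattern-matching semantics may differ by version.
    Copy code
    # Hedged example of excluding known-empty projects/datasets via deny patterns;
    # this dict is the "source" section of a recipe or programmatic pipeline config.
    bigquery_source = {
        "type": "bigquery",
        "config": {
            "project_id_pattern": {
                # skip projects known to contain no datasets
                "deny": ["^xyz-project$"],
            },
            "dataset_pattern": {
                # skip datasets known to contain no tables/views
                "deny": ["^scratch_.*$"],
            },
        },
    }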
  • m

    millions-city-84223

    06/19/2023, 12:40 PM
    Hi Team, sorry if this is not the right place for this message. We are using DataHub, and one of the ingestion flows we currently have ingests data using the file source. The current file source implementation allows us to read files from the local fs and http(s), but we also need to ingest files located on AWS S3. Could you please clarify whether DataHub has (or maybe has work in progress on) AWS S3 file source support? If not, I would like to add AWS S3 file source support and, maybe, create a convenient mechanism for adding any other file-based source (like GCP, Azure, …) just by implementing an interface.
    d
    a
    • 3
    • 10
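    A hedged workaround sketch for the question above (not from the thread), pending native S3 support in the file source: download the file from S3 first (e.g. with boto3) and point the existing file source at the local copy. The bucket, key and local path are placeholders, and the file source's location key ("path" here) may be named differently in older versions.
    Copy code
    # Hedged workaround: copy the metadata file from S3, then ingest the local copy.
    import boto3

    from datahub.ingestion.run.pipeline import Pipeline

    local_path = "/tmp/metadata.json"                              # placeholder
    boto3.client("s3").download_file(
        "my-metadata-bucket", "exports/metadata.json", local_path  # placeholders
    )

    pipeline = Pipeline.create(
        {
            "source": {"type": "file", "config": {"path": local_path}},
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()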
  • b

    bland-application-65186

    06/19/2023, 2:30 PM
    Hi, a question regarding OpenAPI ingestion: is ingestion of OpenAPI definitions in YAML format on the roadmap?
    ✅ 1
    d
    • 2
    • 1
  • s

    strong-diamond-4751

    06/19/2023, 7:59 PM
    Howdy! I've got a bit of a weird question. Is it possible to set up a gitlab repository as an ingestion source? It would be neat to be able to document pipeline processes and whatnot from here.
    ✅ 1
    d
    m
    • 3
    • 2
  • i

    icy-zoo-92866

    06/20/2023, 7:18 AM
    Hi, I am trying to ingest data from Superset into DataHub. I am sending a request like this:
    Copy code
    source:
        type: superset
        config:
            connect_uri: 'https://superset-xx.xx.xx/'
            username: xxx
            password: xxx
            provider: db
            env: xxx
    The request/auth is successful, but we are not getting any dashboards or charts back. When I log in to Superset with the same user and password I can see all the charts. What could be the issue? TIA
    g
    w
    i
    • 4
    • 9
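    A hedged debugging sketch for the question above (not from the thread): call Superset's REST API directly with the same credentials the recipe uses and check whether the dashboard list comes back non-empty for that user, which helps tell a Superset permission issue apart from a recipe issue. The host and credentials are placeholders.
    Copy code
    # Hedged check of what the Superset API returns for the recipe's credentials.
    import requests

    BASE = "https://superset-xx.xx.xx"  # placeholder

    # Log in the same way the superset source does (database auth provider).
    login = requests.post(
        f"{BASE}/api/v1/security/login",
        json={"username": "xxx", "password": "xxx", "provider": "db", "refresh": True},
    )
    login.raise_for_status()
    token = login.json()["access_token"]

    # List the dashboards visible to this user via the API.
    resp = requests.get(
        f"{BASE}/api/v1/dashboard/",
        headers={"Authorization": f"Bearer {token}"},
    )
    resp.raise_for_status()
    print(resp.json().get("count"), "dashboards visible via the API")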
  • m

    miniature-hair-20451

    06/20/2023, 8:00 AM
    Hi all. This bug affects me too - https://github.com/datahub-project/datahub/issues/6544. It's closed now, please reopen it.
    ✅ 1
    g
    d
    • 3
    • 2
  • j

    jolly-airline-17196

    06/20/2023, 11:52 AM
    hey! had a little query: during ingestion with the FILE method, should the path specified be on the docker container running datahub-gms, or on the machine the datahub containers are currently running on? I tried both ways and always ended up with the following error
    Copy code
    raise Exception(f"Failed to process {path}")
    Exception: Failed to process /home/datahub/students.json
    ✅ 1
    g
    • 2
    • 3
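    A hedged sketch for the question above (not from the thread): the path must be readable by whichever process actually executes the recipe; when ingestion is triggered from the UI that is typically the actions/executor container rather than datahub-gms or the host, while a recipe run with the CLI on the host can use a host path directly. The file path below is taken from the error above and is only illustrative.
    Copy code
    # Hedged example of running the file source with the CLI on the machine where
    # the file actually lives, avoiding the container/host path mismatch.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "file",
                "config": {"path": "/home/datahub/students.json"},  # must exist locally
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()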
  • a

    ancient-queen-15575

    06/20/2023, 1:21 PM
    I'm seeing odd behaviour when ingesting dbt and Snowflake data. Column-level lineage for a lot of tables is not appearing: some columns are greyed out, and for some tables no columns are showing at all, as shown in the first screenshot. In the second screenshot I clicked into the rightmost table, which wasn't showing any columns in the Visualised Lineage view, and I can see the full column list fine. Initially I was ingesting Snowflake first, then dbt, and seeing no column-level lineage. But after reading comments here I saw I should ingest dbt first and then Snowflake. Doing that has led to this situation instead. Does anyone know what could be changed or how I could start debugging this? The dbt and Snowflake ingestions are running fine.
    g
    • 2
    • 2
  • l

    lively-raincoat-33818

    06/20/2023, 3:48 PM
    Hi folks, I'm working on the dbt ingestion and I want to have the compiled code of the queries in the view definition. Is that possible? For now, if I use a macro function I only see the macro and not the full compiled query. I'm using v0.10.3. Thanks in advance!
    g
    g
    d
    • 4
    • 7
  • l

    limited-cricket-18852

    06/20/2023, 4:54 PM
    Hi All! Is there a way to parse/transform/ignore some containers when ingesting data? I am using the Databricks/unity-catalog source type and it generates some containers that I would like to not show up in DataHub. I get something like Datasets/ prod/ databricks/ my_workspace/ global-euwest/ my_catalog/ some_layer/ my_beautiful_table, however my_workspace and global-euwest are not interesting to me. Is there a way to ingest without this information? Thanks!
    ✅ 1
    g
    f
    • 3
    • 15
  • b

    bumpy-hamburger-47757

    06/20/2023, 7:52 PM
    I'm using the Python SDK -- is there a way to filter datasets by exact name matches? I'm using DataHubGraph.get_urns_by_filter() and it's returning partial name matches for dataset names and column names. For example, if my query is "test_table", it will return any datasets with the words test or table in the dataset name or columns (for example, a dataset named users_table or a column named test_value will match). Thanks!
    ✅ 1
    b
    g
    a
    • 4
    • 12
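    A hedged sketch for the question above (not from the thread): get_urns_by_filter's query parameter behaves like full-text search, so one option is to over-fetch and keep only exact table-name matches client-side. The server URL, platform and target name are placeholders, and the quoted-query trick only narrows the search rather than making it exact.
    Copy code
    # Hedged post-filtering of search results down to exact table-name matches.
    from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

    graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

    target = "test_table"
    exact_matches = []
    for urn in graph.get_urns_by_filter(
        entity_types=["dataset"],
        platform="snowflake",   # placeholder platform
        query=f'"{target}"',    # quoting narrows the search but is still not exact
    ):
        # Dataset urns look like urn:li:dataset:(urn:li:dataPlatform:x,<name>,<env>);
        # keep only those whose final name component equals the target exactly.
        name = urn.split(",")[1]
        if name.split(".")[-1] == target:
            exact_matches.append(urn)

    print(exact_matches)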
  • a

    average-nail-72662

    06/20/2023, 9:15 PM
    Hi guys, I'm new to DataHub and I have a question to ask. When I ingest a Glue database, can I upsert its properties metadata?
    ✅ 1
    g
    a
    • 3
    • 25
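    A hedged sketch for the question above (not from the thread) of writing custom properties onto a Glue-ingested dataset with the Python emitter; note that emitting the datasetProperties aspect this way upserts the whole aspect rather than merging individual keys. The urn components and property values are placeholders.
    Copy code
    # Hedged example of upserting dataset properties on an existing Glue dataset.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import DatasetPropertiesClass

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # placeholder

    dataset_urn = make_dataset_urn(
        platform="glue", name="my_database.my_table", env="PROD"      # placeholders
    )

    properties = DatasetPropertiesClass(
        description="Table ingested from Glue",                       # placeholder
        customProperties={"owner_team": "data-platform"},             # placeholder
    )

    emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=properties))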
  • b

    bland-orange-13353

    06/21/2023, 12:47 AM
    This message was deleted.
    ✅ 1
    g
    l
    • 3
    • 2
  • e

    eager-monitor-4683

    06/21/2023, 3:21 AM
    Hi team, I tried to get profiling working in the Redshift ingestion, but it's not working for external tables. I just want to know if there is any specific setting required. Thanks
    g
    d
    +2
    • 5
    • 9
  • r

    refined-gold-30439

    06/21/2023, 8:18 AM
    Hi 👋 Can't I collect metadata from LookerStudio instead of Looker? • ingestion.yaml
    Copy code
    source:
        type: looker
        config:
            base_url: 'https://lookerstudio.google.com/'
            client_id: '${looker_client_id}'
            stateful_ingestion:
                enabled: true
            client_secret: '${looker_client_secret}'
    • Error
    Copy code
    [2023-06-21 08:17:34,423] INFO     {looker_sdk.rtl.requests_transport:72} - POST(https://lookerstudio.google.com//api/4.0/login)
    [2023-06-21 08:17:35,243] ERROR    {datahub.entrypoints:199} - Command failed: Failed to configure the source (looker): Failed to connect/authenticate with looker - check your configuration: )]}'
    {"errorStatus":{"code":9}}
    g
    f
    • 3
    • 2
  • g

    gifted-bird-57147

    06/21/2023, 9:10 AM
    Hi Team, we are using an ingestion recipe to load Athena data into our catalog. There are no documentation properties in the Athena source, so I added documentation manually afterwards. However, when we rerun the ingestion recipe the documentation gets removed. What do I need to change in my recipe to keep the existing (manually edited) documentation?
    Copy code
    source:
      type: athena
      config:
        # Coordinates
        aws_region: eu-west-1
        work_group: ${ATHENA_WG_PROD_BDV}
        username: ${ATHENA_USER_BDV}
        password: ${ATHENA_PW_BDV}
        query_result_location: ${ATHENA_QL_BDV}
        ## Because of a bug in the Athena ingestion we have to specify the database.
        ## That's why we have separate scripts per database (you can only specify one database per script...)
        database: "bdv-prod-topdesk-transformed"
    
        # Options
        #s3_staging_dir: ${ATHENA_QL}
        profiling:
          enabled: true
          turn_off_expensive_profiling_metrics: true
          include_field_distinct_count: true
          include_field_min_value: true
          include_field_max_value: true
          include_field_mean_value: true
          include_field_sample_values: true
          field_sample_values_limit: 2
          profile_if_updated_since_days: 10
        stateful_ingestion:
          enabled: true
          ignore_old_state: false
          ignore_new_state: false
          remove_stale_metadata: true
        env: PROD
    
    pipeline_name: "BDV-prod-topdesk-transformed"
    
    
    transformers: # an array of transformers applied sequentially
      - type: "pattern_add_dataset_terms"
        config:
          term_pattern:
            rules:
              ".*": ["urn:li:glossaryTerm:INTERN_OPEN"]
      - type: simple_add_dataset_tags
        config:
          tag_urns:
            - "urn:li:tag:Bedrijfsvoering"
            - "urn:li:tag:Topdesk"
            - "urn:li:tag:PROD"
            - "urn:li:tag:Transformed"
      - type: "simple_add_dataset_domain"
        config:
          replace_existing: true  # false is default behaviour
          domains:
            - "urn:li:domain:1ef9fa01-a415-46e2-93ad-f8ce3bf84537" # domein 'Bedrijfsvoering'
    ✅ 1
    g
    • 2
    • 3
  • a

    adorable-forest-52600

    06/21/2023, 11:25 AM
    Hi all, I successfully ingested two JSON schemas, but for both I see "no data" when I want to view the schema. I can only see the raw JSON that I ingested when I click on Raw; it didn't extract the properties, types, descriptions, etc. for the schema. With another JSON schema I was successful before. Does anyone know what can cause this, where the ingestion is successful but no metadata is retrieved?
    ✅ 1
    g
    m
    • 3
    • 11
  • l

    lively-thailand-64294

    06/21/2023, 2:58 PM
    Hello Team!! I am new to DataHub. I would like to know where the CSV files for ingestion are supposed to be placed, and where the recipe for ingestion should be placed. I am running DataHub on Windows using Docker and WSL2. Also, can the CSV file be any dataset, or does it need specific columns like resource?
    ✅ 1
    g
    • 2
    • 1
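    A hedged sketch for the question above (not from the thread): the CSV and recipe just need to be readable by whatever process runs the recipe (the host shell if using the CLI, the actions/executor container if triggered from the UI), and the csv-enricher source expects its own header (columns along the lines of resource, subresource, glossary_terms, tags, owners, ownership_type, description, domain; check the csv-enricher docs for the exact layout) rather than an arbitrary data CSV. The paths below are placeholders; on WSL2 a Windows file is typically visible under /mnt/c/...
    Copy code
    # Hedged example of running the csv-enricher source against a local CSV path.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "csv-enricher",
                "config": {"filename": "/mnt/c/datahub/enrichment.csv"},  # placeholder
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()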
  • r

    rich-restaurant-61261

    06/21/2023, 8:46 PM
    Hi Team, I am trying to ingest data from Superset based on the documentation at https://datahubproject.io/docs/generated/ingestion/sources/superset/. The recipe uses a variable called 'provider'; does anyone know what this variable is and what I should put there? My Superset and DataHub are deployed through Kubernetes.
    ✅ 1
    g
    • 2
    • 3
  • c

    calm-helmet-89243

    06/21/2023, 10:41 PM
    Hi folks. When I use the Hive source, on the UI I see “Lineage” and “Queries” tabs are enabled even though there’s no data there. AFAIK I don’t emit any lineage or queries MCP events. Is there a way to disable these tabs? I’m thinking seeing these tabs would give users false hope that there’s something valuable there when there never will be (yet).
    d
    g
    • 3
    • 11
  • g

    gifted-diamond-19544

    06/22/2023, 7:08 AM
    Hello! I am getting a rate-limit (throttling) error on my Athena ingestion. Any idea on how to deal with this?
    Copy code
    "Ingestion error: An error occurred (MetadataException) when calling the GetTableMetadata operation: Rate exceeded (Service: AmazonDataCatalog; Status Code: 400; Error Code: ThrottlingException
    g
    a
    • 3
    • 11
  • p

    proud-dusk-671

    06/22/2023, 7:41 AM
    For Snowflake ingestion, I have the following questions - 1. According to the diagram here, it seems that data is pulled into Metadata Ingestion, which pushes it into the gms service. Does that mean there is no involvement of Kafka here? 2. Secondly, I would also like to know which component of DataHub the Metadata Ingestion service belongs to.
    ✅ 1
    g
    • 2
    • 1
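    A hedged note on the questions above (not from the thread): the recipe's sink section decides the path, and the ingestion framework itself is the Python datahub CLI/library that runs the recipe, separate from GMS. The endpoints below are placeholders.
    Copy code
    # Hedged sketch of the two sink options that decide whether Kafka is involved.

    # Option 1: push metadata directly to GMS over HTTP (no Kafka on this path).
    rest_sink = {
        "type": "datahub-rest",
        "config": {"server": "http://datahub-gms:8080"},  # placeholder
    }

    # Option 2: write metadata change proposals to Kafka and let GMS consume them.
    kafka_sink = {
        "type": "datahub-kafka",
        "config": {
            "connection": {
                "bootstrap": "broker:9092",                             # placeholder
                "schema_registry_url": "http://schema-registry:8081",   # placeholder
            }
        },
    }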
  • c

    creamy-pizza-80433

    06/22/2023, 10:10 AM
    Hello everyone, recently we upgraded our DataHub version from 0.10.2 to 0.10.4 and we hit a new problem regarding permissions and policies for users: the permissions suddenly stopped working for every entity except Data Products. Does anyone know how I can solve this problem? Thanks!
    g
    • 2
    • 2
  • m

    modern-hospital-90979

    06/22/2023, 2:03 PM
    We have a question related to ingestion of Looker data. We've configured both the looker and lookml ingestion patterns and they appear to be pulling in most, if not all, of our assets in the platform. However, I'm unable to locate certain specific views that are defined in Looker as Persistent Derived Tables (PDTs). Some PDTs show up, but others do not. It's unclear if there's a pattern to which ones show up and which do not. Have other users experienced challenges ingesting Looker PDTs?
    g
    • 2
    • 6
  • s

    strong-diamond-4751

    06/22/2023, 3:41 PM
    Hey there, I'm using programmatic_pipeline.py to configure and run a pipeline from within my Python script. What is the proper syntax to add options? For example, if my type is redshift, how do I add include_table_lineage, include_views, etc.?
    ✅ 1
    g
    • 2
    • 1
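    A hedged sketch for the question above (not from the thread): source options go under the source's "config" key in the dict passed to Pipeline.create, mirroring the config: block of a YAML recipe. The connection values are placeholders.
    Copy code
    # Hedged example of passing redshift source options in a programmatic pipeline.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "redshift",
                "config": {
                    "host_port": "my-cluster.example.com:5439",  # placeholder
                    "database": "analytics",                     # placeholder
                    "username": "datahub",                       # placeholder
                    "password": "xxx",                           # placeholder
                    "include_table_lineage": True,
                    "include_views": True,
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()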
  • g

    great-notebook-53658

    06/23/2023, 7:59 AM
    Hi, is it possible to define access policies that prevent certain users from accessing metadata by platform (e.g. Snowflake)? I do not see in https://datahubproject.io/docs/authorization/access-policies-guide/ that resource type = Platform or platform instance is available, and I do not see any dropdown in the resource field when I select resource type = Container.
    ✅ 1
    d
    • 2
    • 1
  • g

    great-notebook-53658

    06/23/2023, 8:50 AM
    Hi, any idea why I am getting the following error when trying to delete Snowflake metadata using the --platform option? --urn is working, but it's tedious to delete by urn. Thanks!
    g
    • 2
    • 7
  • b

    billions-journalist-13819

    06/23/2023, 8:57 AM
    @famous-waitress-64616 A while ago, "only_ingest_assigned_metastore" was added to the Databricks ingest options and I used it. By the way, has this option disappeared? Can I use it again? I need this option.
    ✅ 1
    f
    a
    • 3
    • 37