# ingestion
  • a

    acceptable-restaurant-2734

    01/25/2023, 4:53 PM
👋 Hello, team! Is there any way to ingest only one dataset (and all tables/views within that dataset) in BigQuery? My project has many datasets and I am only interested in one.
    d
    • 2
    • 1
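For reference, a minimal recipe sketch for restricting a BigQuery ingestion to a single dataset, assuming the source's dataset_pattern option (the project and dataset names below are placeholders, and the exact pattern-matching behavior can vary by connector version):
Copy code
source:
  type: bigquery
  config:
    project_id: my-gcp-project          # placeholder project
    # Allow only the one dataset of interest; all tables/views inside it are included.
    # Depending on the connector version, the pattern may need to include the project id.
    dataset_pattern:
      allow:
        - "my_dataset"
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"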
  • b

    bland-appointment-45659

    01/25/2023, 7:40 PM
Hi team, we are running our Snowflake ingestions on a schedule, but the UI is not showing the ingestion runs. Any pointers on what to check further?
    b
    • 2
    • 7
  • p

    plain-france-42647

    01/25/2023, 7:47 PM
Why does the BigQuery ingestion require a jobs.create permission? Can’t we extract the data with only “metadata viewer” permissions?
    ✅ 1
    d
    • 2
    • 6
  • b

    bland-lighter-26751

    01/25/2023, 8:46 PM
    hello, quick question hopefully! How come I have thousands of these assets? Anyone know what they are and is that normal?
    ✅ 1
    d
    • 2
    • 12
  • a

    able-evening-90828

    01/25/2023, 11:59 PM
    How can I delete tags stored inside
    Dataset.SchemaMetadata.fields.GlobalTags
    ? We added these tags during ingestion. Now, there is no
    X
button next to the tags, and none of the GraphQL APIs for updating tags or datasets affect
    Dataset.SchemaMetadata
    . They all seem to only affect
    Dataset.EditableSchemaMetadata
    .
    ✅ 1
    b
    • 2
    • 2
  • l

    little-lunch-35136

    01/26/2023, 6:04 AM
Hi all, not sure if this has happened to anyone else: we are running Airflow and GX (Great Expectations) on Snowflake tables, following the GX doc for the Snowflake connection string:
    Copy code
    snowflake://<USER_NAME>:<PASSWORD>@<ACCOUNT_NAME>/<DATABASE_NAME>/<SCHEMA_NAME>?warehouse=<WAREHOUSE_NAME>&role=<ROLE_NAME>&application=great_expectations_oss
    CONNECTION_STRING = f"snowflake://{sfUser}:{sfPswd}@{sfAccount}/DEV_ODS_DB/CBS_ODS?warehouse={wh}&role={role}&application=great_expectations_oss"
The database name is DEV_ODS_DB and the schema is CBS_ODS. Everything runs and the GX DataHub action succeeds, but NO assertion is attached to the table. Investigating further, the URN that GX sends to DataHub is the following:
    Copy code
    urn:li:dataPlatform:snowflake,dev_ods_db/cbs_ods.cbs_ods.building_info,PROD)
instead of this, which is what DataHub shows:
    Copy code
    urn:li:dataPlatform:snowflake,dev_ods_db.cbs_ods.building_info,PROD
So it seems the DataHub action is mistaking the two parts DEV_ODS_DB/CBS_ODS for the database name. Is this a bug or some config I missed? Thanks.
    g
    h
    • 3
    • 4
  • t

    thousands-bird-50049

    01/26/2023, 8:02 AM
    hi guys, I’m trying to use sqlalchemy with a custom dialect - this is what’s written in the documentation:
    Copy code
    The sqlalchemy source is useful if we don't have a pre-built source for your chosen database system, but there is an SQLAlchemy dialect defined elsewhere. In order to use this, you must pip install the required dialect packages yourself.
pip install where? I tried it in the actions container, but the dialect doesn’t seem to be recognized.
    d
    g
    • 3
    • 4
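For reference, a minimal sketch of the generic sqlalchemy source, assuming the dialect package is pip-installed into the same Python environment that actually executes the recipe (for UI ingestion, the environment the executor builds; for CLI ingestion, the venv where acryl-datahub lives). The URI and platform below are placeholders:
Copy code
source:
  type: sqlalchemy
  config:
    # The dialect referenced here must be importable where `datahub ingest` runs.
    connect_uri: "customdialect://user:password@host:1234/dbname"   # placeholder
    platform: "customdb"                                            # placeholder platform name
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"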
  • g

    gentle-xylophone-90287

    01/26/2023, 8:06 AM
Hi Team, I am trying to ingest Hive metadata from Databricks. I followed the documentation, but for some reason I am getting an ingestion error. Below is the YAML I am using:
Copy code
source:
  type: hive
  config:
    username: token
    stateful_ingestion:
      enabled: true
    host_port: 'https://adb-77777777777.17.azuredatabricks.net:443'
    profiling:
      profile_table_level_only: true
      enabled: true
    password: '${datahub-databricks}'
    scheme: databricks+pyhive
    options:
      connect_args:
        http_path: sql/protocolv1/o/7777777777777/1001-111143-7776fmgw
    d
    h
    • 3
    • 6
  • g

    gentle-xylophone-90287

    01/26/2023, 8:06 AM
    Can someone help me out here?
  • b

    better-orange-49102

    01/26/2023, 10:50 AM
Hello, I need some advice on a regex pattern, because I'm bad at writing them: in my datahub DB I created a table, say temp1, with columns 1. aspect and 2. version, and added some values to it. Now I want to profile both the metadata_aspect_v2 table and temp1, but I want to exclude the column "aspect" from profiling. I used:
    Copy code
    profile_pattern:
      deny:
        - datahub.*.aspect
But this has the effect of causing ingestion to skip the metadata_aspect_v2 table entirely when profiling; temp1 is profiled correctly. How should this be written to exclude just that one column?
    ✅ 1
    • 1
    • 1
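One possible fix, sketched under the assumption that the deny regex is applied as a partial match (so an unanchored pattern also matches the table name metadata_aspect_v2): anchoring the pattern to the column suffix should leave the table itself profiled.
Copy code
profile_pattern:
  deny:
    # Deny only `<db>.<table>.aspect` columns; tables such as
    # datahub.metadata_aspect_v2 no longer match because they do not end in ".aspect".
    - 'datahub\..*\.aspect$'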
  • q

    quiet-jelly-11365

    01/26/2023, 12:28 PM
Hi all, I am trying to ingest a Delta Lake table from S3 and am getting the error below: PipelineInitError: Failed to find a registered source for type delta-lake: code() argument 13 must be str, not int. My recipe:
    Copy code
    source:
      type: "delta-lake"
      config:
        env: "DEV"
    base_path: "s3://data-loc/test-delta-table"
        s3:
          aws_config:
            aws_access_key_id: "**"
            aws_secret_access_key: "**"
            aws_session_token: "**"
            aws_region: "**"
    sink:
      type: "datahub-rest"
      config:
    server: "http://localhost:8080"
    b
    d
    +2
    • 5
    • 11
  • a

    acceptable-restaurant-2734

    01/26/2023, 4:46 PM
    Hello, for BigQuery ingestion, is it possible to also ingest the first couple rows of a table to provide a preview of the entries?
    ✅ 1
    d
    • 2
    • 1
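A sketch of the closest built-in option, assuming the GE-based profiler's include_field_sample_values setting; this surfaces a few sample values per column on the Stats tab rather than a literal row preview (project_id is a placeholder):
Copy code
source:
  type: bigquery
  config:
    project_id: my-gcp-project   # placeholder
    profiling:
      enabled: true
      include_field_sample_values: true   # capture example values per column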
  • b

    blue-rainbow-97669

    01/26/2023, 8:48 PM
Hi Team, we are emitting Great Expectations validation results to DataHub using AssertionRunEventClass, but when searching for those result values in the "metadata_aspect_v2" table, we are not able to find the information passed via AssertionRunEventClass under the aspect column. Can you help me understand where the AssertionRunEvent aspect information is stored?
    ✅ 1
    h
    • 2
    • 1
  • r

    red-waitress-53338

    01/26/2023, 11:54 PM
Hi Team, my team and I are trying to ingest some tables from a Postgres database. The ingestion job ran fine, but we cannot see the tables in the UI. There is one more strange behavior: initially, when we started to ingest data from Postgres, we were able to ingest one table; we then created a few more tables, but those specific tables are not being ingested. We restarted GMS and ran the ingestion job again, but we still only see the first table we ingested initially in the UI, and not the ones we created later. Any help please?
    ✅ 1
    d
    g
    +4
    • 7
    • 30
  • r

    rhythmic-glass-37647

    01/27/2023, 1:31 AM
    cross posting here in case this is the appropriate place to ask for ingestion help 😅
    ✅ 1
  • b

    bright-beard-86474

    01/27/2023, 4:25 AM
Hi Team. I’m testing ingestion from the Glue Catalog. I wonder what the row_count and column_count parameters are. Are they related to the Glue Crawler? Could someone please explain and add more info to this page? The same columns also appear in Profiling. What’s the difference?
    ✅ 1
    d
    • 2
    • 3
  • a

    astonishing-animal-7168

    01/27/2023, 9:13 AM
Hey team, we are testing ingestion from BigQuery. Specifically, we've set up ingestion through the UI for two GCP projects using two different service accounts (these have the same permissions in both projects). Ingestion completes successfully for both sources, but for one of them no data shows up in the 'Queries' tab for any ingested table. Both sources filter on specific BQ datasets. Do you have any ideas what may be causing this?
    d
    • 2
    • 3
  • s

    swift-diamond-21495

    01/27/2023, 10:11 AM
Hello, guys. I see in the documentation that DataHub is able to ingest dbt
    sources.json
as input for data freshness, and the ingestion itself works well. But where can I see the results of the ingestion? There is no Stats tab for source tables. Could you point me to where it is, please? Does it need to be additionally configured somehow? Thank you.
    ✅ 1
    d
    • 2
    • 3
  • a

    alert-fall-82501

    01/27/2023, 11:17 AM
Hi Team - I am working on ingesting BigQuery into DataHub, and the ingestion has failed with errors. Can anybody advise on this? Please check the thread for logs.
    d
    • 2
    • 4
  • i

    important-night-50346

    01/27/2023, 1:04 PM
Hello. I have a silly question. We are going to allow different teams within the company to ingest metadata into a centralized DataHub using the GMS REST API. What level of permissions is required to ingest metadata from, let’s say, a MySQL database? Does it need to be an admin-level token, or can we avoid giving that level of permission?
    ✅ 1
    a
    • 2
    • 2
  • w

    witty-butcher-82399

    01/27/2023, 2:07 PM
Hi all! Question about ownership and how to consistently manage updates from connectors and the UI (or the API). We are currently setting ownership at the connector level using the simple add dataset ownership transform. This ensures that all datasets for a given connector have some ownership. However, we are planning to enable access to the API (or the UI) so users can provide fine-grained ownership (or fix the ownership if it is wrong). In that scenario, the ownership from the ingestor becomes a sort of fallback value if no better ownership is given for a particular dataset. The problem is that every execution of the connector will overwrite (or patch, it doesn’t matter, same effect) the ownership that was fixed via the API (or UI). So, how do we prevent this? Some new semantic in the transform so it skips overriding any update made from the API/UI? In other scenarios this has been solved with the separation of aspects, e.g.
    DatasetProperties
    and
    EditableDatasetProperties
    , should this approach be extended to
    Ownership
too? If so, should this be scaled to any aspect, as well as to support more than two authorities (beyond connector and UI)? Thanks
    a
    • 2
    • 3
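For reference, a sketch of the transform configuration being discussed, assuming the simple_add_dataset_ownership transformer's semantics switch (PATCH merges with ownership already stored instead of overwriting it, though it does not by itself distinguish UI-added owners from connector-added ones); the owner URN below is a placeholder:
Copy code
transformers:
  - type: simple_add_dataset_ownership
    config:
      semantics: PATCH          # merge with existing ownership rather than overwrite
      owner_urns:
        - "urn:li:corpGroup:data-platform-team"   # placeholder owner
      ownership_type: "DATAOWNER"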
  • r

    rhythmic-glass-37647

    01/27/2023, 5:25 PM
Hello all, this is a pretty noob question, but I'm still trying to understand all the patterns for setting up ingestions. Is there a programmatic way to create the scheduled pipelines that you'd create in the UI? It looks like the CLI approach is for running a recipe, not necessarily creating one that can run on its own on the DataHub server. Is the general architecture meant to be that ingestion tasks run in a distributed manner?
    g
    • 2
    • 3
  • v

    victorious-evening-88418

    01/27/2023, 8:31 PM
Hi, I'm trying to ingest PowerBI metadata using this YAML file:
Copy code
source:
  type: "powerbi"
  config:
    tenant_id: ##########################
    workspace_id: '##########################'
    env: DEV
    client_id: ##########################
    client_secret: ##########################
    scan_timeout: 600
    extract_ownership: true
    dataset_type_mapping:
      PostgreSql: postgres
      Oracle: oracle
      SqlServer: mssql
# see https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for complete documentation
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
but I receive the error:
Copy code
raise ValueError(f"PowerBI DataPlatform {key} is not supported")
ValueError: PowerBI DataPlatform SqlServer is not supported
Copy code
$ datahub --version
acryl-datahub, version 0.9.6.1
Any suggestion? Thanks in advance for your help.
    ✅ 1
    g
    g
    • 3
    • 3
  • p

    plain-cricket-83456

    01/28/2023, 6:45 AM
    When I add
    profiling: enabled:true
, an ingestion error occurs. Otherwise, the ingestion succeeds.
    r
    h
    • 3
    • 22
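For reference, a sketch of how the profiling block is normally nested in a recipe (a single-line `profiling: enabled:true` is not valid YAML); whether this is the actual cause depends on the error shown in the thread. The source type and host below are placeholders:
Copy code
source:
  type: mysql                       # placeholder source type
  config:
    host_port: "localhost:3306"     # placeholder
    profiling:
      enabled: true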
  • t

    thousands-bird-50049

    01/29/2023, 8:31 AM
    is there any way to add a custom managed ingestion source to the UI without forking datahub?
    ✅ 1
    p
    h
    +2
    • 5
    • 18
  • p

    polite-activity-25364

    01/30/2023, 2:35 AM
Hi team, I have a question about BigQuery ingestion. In the image below, a CTE appears in the table lineage of view-type datasets. I don’t want this CTE (
    renamed;
in the image) to appear, but I couldn’t find an option for that in the config details. Is there any way to exclude CTEs from lineage? If not, I’m curious about the reason or intention for including them in lineage.
    h
    • 2
    • 4
  • s

    shy-dog-84302

    01/30/2023, 6:00 AM
    Hi! Do we have support for GCP workload identities in Datahub metadata ingestion from BigQuery?
    ✅ 1
    h
    • 2
    • 4
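A sketch of what a credential-free recipe can look like, assuming the BigQuery client libraries fall back to Application Default Credentials (which is how workload identity is typically picked up on GKE) when no credential block is supplied; project_id is a placeholder:
Copy code
source:
  type: bigquery
  config:
    project_id: my-gcp-project   # placeholder
    # No `credential` section: credentials are resolved from the environment,
    # e.g. the pod's workload identity via Application Default Credentials.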
  • f

    fresh-zoo-34934

    01/30/2023, 7:47 AM
Hi team, I tried to ingest LookML files from our repo. I’ve done all the steps, like • adding the deploy key to the GitHub repo based on this tutorial • adding the ingestion to DataHub using the UI, based on this tutorial but every time I run the ingestion it results in
    0 assets
; looking at the log, it skips everything:
    Copy code
    [2023-01-27 04:48:06,072] DEBUG    {datahub.ingestion.source.looker.lookml_source:1631} - Considering ProjectInclude(project='__BASE', include='/tmp/tmpdo_92aeflookml_tmp/044cd409-3941-4a73-abbf-da0edf74bc91/eg1.view.lkml') for model xendit-presto
    [2023-01-27 04:48:06,072] DEBUG    {datahub.ingestion.source.looker.lookml_source:1635} - Attempting to load view file: ProjectInclude(project='__BASE', include='/tmp/tmpdo_92aeflookml_tmp/044cd409-3941-4a73-abbf-da0edf74bc91/eg1.view.lkml')
    [2023-01-27 04:48:06,072] DEBUG    {datahub.ingestion.source.looker.lookml_source:1649} - view eg1 is not reachable from an explore, skipping..
    [2023-01-27 04:48:06,072] DEBUG    {datahub.ingestion.source.looker.lookml_source:1631} - Considering ProjectInclude(project='__BASE', include='/tmp/tmpdo_92aeflookml_tmp/044cd409-3941-4a73-abbf-da0edf74bc91/eg2.view.lkml') for model model1
    [2023-01-27 04:48:06,072] DEBUG    {datahub.ingestion.source.looker.lookml_source:1635} - Attempting to load view file: ProjectInclude(project='__BASE', include='/tmp/tmpdo_92aeflookml_tmp/044cd409-3941-4a73-abbf-da0edf74bc91/eg2.view.lkml')
    [2023-01-27 04:48:06,073] DEBUG    {datahub.ingestion.source.looker.lookml_source:1649} - view eg2 is not reachable from an explore, skipping..
    [2023-01-27 04:48:06,073] DEBUG    {datahub.ingestion.source.looker.lookml_source:1631} - Considering ProjectInclude(project='__BASE', include='/tmp/tmpdo_92aeflookml_tmp/044cd409-3941-4a73-abbf-da0edf74bc91/eg3.view.lkml') for model model1
    ...
Is there anything that I perhaps missed in this process?
    ✅ 1
    h
    b
    • 3
    • 9
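The "not reachable from an explore, skipping" log lines point at the view filter; a sketch assuming the lookml source's emit_reachable_views_only flag, which, when set to false, also emits views that no explore references (the rest of the connection config is omitted, and base_folder is a placeholder):
Copy code
source:
  type: lookml
  config:
    base_folder: "./lookml"              # placeholder
    emit_reachable_views_only: false     # also emit views not reached by any explore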
  • a

    alert-fall-82501

    01/30/2023, 10:17 AM
Hi Team - I am working on importing Airflow DAG jobs into DataHub. My pipelines are running successfully, but the DAG job info is not showing up in the UI. Can anybody advise on this? Please check the logs in the thread.
    h
    • 2
    • 37
  • r

    rich-policeman-92383

    01/30/2023, 1:41 PM
Hello, how can we migrate DataHub entities/assets from one user to another? Use case: User A has left the organisation and User B is A's replacement. Here is how I have thought of doing it using the DataHub GraphQL endpoints: 1. Fetch the entities/assets of user A using getSearchResultsForMultiple. 2. Use the add-owner mutation to assign ownership to user B. 3. If step 2 is successful, remove user A from DataHub. Problems with this approach: 1. How do I know the exact value for the count key in the input JSON?
    Copy code
    {
      "input": {
        "types": [],
        "query": "*",
        "start": 0,
        "count": 1000000,
        "filters": [],
        "orFilters": [
          {
            "and": [
              {
                "field": "owners",
                "condition": "EQUAL",
                "values": [
                  "urn:li:corpuser:userA"
                ],
                "negated": false
              }
            ]
          }
        ]
      }
    }
2. Depending on the number of entities/assets returned, we will have to send many requests to GMS. Is there a better way to achieve this? DataHub version: v0.9.5
    h
    • 2
    • 2