# ingestion
  • r

    rapid-room-3050

    09/08/2021, 8:42 PM
    I am new to DataHub... looking to ingest metadata using the hive plugin; we currently have Presto + a Hive metastore. What is the best way to ingest the metadata in this case? Appreciate any help!
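    For reference, a minimal hive recipe for this kind of setup might look like the sketch below (host, port, and credentials are placeholders, and it assumes HiveServer2 is running in front of the metastore):
    Copy code
    source:
      type: hive
      config:
        # placeholder HiveServer2 endpoint and credentials -- adjust to your environment
        host_port: my-hiveserver2:10000
        username: user
        password: password
    sink:
      type: datahub-rest
      config:
        server: http://localhost:8080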
    l
    • 2
    • 7
  • a

    aloof-gigabyte-305

    09/09/2021, 5:21 PM
    Hi all, I'm new to using DataHub and trying to ingest some files as part of a POC. I'm trying to modify the example/demo_data using my own files, but I'm having trouble with it: "generate_demo_data.sh" depends on the file "all_covid19_datasets.json", which is not available. If I try to run "bigquery_covid19_to_file.yml", I don't have credentials for BigQuery. I assume that if I want to prepare my files using "enrich.py" I should follow the same (JSON) layout as the "all_covid19_datasets.json" file, but because that file is not available I can't figure out the layout. To recap, I'm trying to run "enrich.py" using my files as the input but can't figure out what the layout of the input file should be. I understand it has to be JSON, but I also understand there are different JSON layouts like "split", "table", etc. Any help or direction is much appreciated. Thanks!
    m
    • 2
    • 10
  • c

    curved-daybreak-29035

    09/10/2021, 10:35 AM
    Hi! I'm new to DataHub and trying some ingestion before implementing it for our company. How do I manage schemas and tables that are deprecated or renamed? I tried to follow some suggestions here, but it seems the links are gone, and I'm not exactly sure how to delete or soft-delete tables that are deprecated. Thanks in advance!
    w
    m
    • 3
    • 7
  • b

    brave-market-65632

    09/13/2021, 6:43 AM
    On the topic of lineage, our workloads are mostly Snowflake to Snowflake, the use cases being CTAS and merges. These are orchestrated by non-Airflow tools, including a custom-designed one. For such cases, I think the preferred way to capture lineage is via the REST API (https://github.com/linkedin/datahub/blob/master/metadata-ingestion/examples/library/lineage_emitter_rest.py)? Is there a way for the metadata-ingestion framework to parse SQL queries and pick up table-to-table lineage? Thanks!
    l
    • 2
    • 1
  • c

    calm-river-44367

    09/13/2021, 8:49 AM
    Hey everyone, I noticed an HDFS dataset in the demo, even though Hadoop is not among DataHub's supported sources. Does anyone have any idea how it was ingested, and can this be done for other unsupported sources like MinIO? I would be grateful for your help.
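    For what it's worth, the demo's sample data (including the HDFS dataset) appears to be loaded from hand-written MetadataChangeEvent JSON via the generic file source, so the same approach could work for MinIO; a rough sketch, with the filename as a placeholder:
    Copy code
    source:
      type: file
      config:
        # hypothetical path to a JSON file of hand-written MetadataChangeEvents
        filename: ./minio_datasets.json
    sink:
      type: datahub-rest
      config:
        server: http://localhost:8080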
    w
    f
    • 3
    • 7
  • f

    faint-hair-91313

    09/13/2021, 9:36 AM
    Guys, I've ingested a business glossary recipe and it does not show up in the output of
    datahub ingest list-runs
    , so I cannot roll back if needed. Does this happen on your side, too? The other recipes show up.
    g
    • 2
    • 6
  • r

    red-smartphone-15526

    09/13/2021, 11:32 AM
    Hey! In BigQuery we have a couple of sharded tables, where one new table is created each day, i.e. project.dataset.table_20210901, project.dataset.table_20210902, etc. Does anyone know how to avoid ingesting metadata for all of these tables? We would like to ingest metadata only for the latest shard (the one with today's date).
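    The allow/deny patterns are plain regexes and are evaluated statically, so a recipe can either deny the date-suffixed shards entirely, or be regenerated each day with an allow rule for today's shard; a hedged sketch of the deny variant, assuming the pattern is matched against dataset.table-style names:
    Copy code
    source:
      type: bigquery
      config:
        project_id: my-project          # placeholder
        table_pattern:
          deny:
            # drop every date-suffixed shard; to keep only today's shard you would
            # instead template an allow rule like 'db1\.table_<today>$' into the
            # recipe when it is generated
            - '.*_20\d{6}$'
    sink:
      type: datahub-rest
      config:
        server: http://localhost:8080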
    b
    g
    w
    • 4
    • 7
  • f

    faint-hair-91313

    09/13/2021, 1:47 PM
    Hi again, I am testing Business Glossary ingestion and lineage towards a dataset. I am able to see the dataset pointing to the business glossary, but not the other way around: Related Entities looks empty. I am running 0.8.12 and have also restored indices.
    p
    w
    m
    • 4
    • 14
  • c

    curved-jordan-15657

    09/13/2021, 2:08 PM
    Hello guys! I'm trying to use Airflow with a DataHub Kafka-based connection. I've configured the connection from the Airflow UI and set the Kafka broker to <my-broker>:9092. Since I'm using a schema registry, I put that in the "extra" field in JSON format. It's something like:
    Copy code
    {
      "sink": {
        "config": {
          "connection": {
            "schema_registry_url": "<my-schema-registry>:8081"
          }
        }
      }
    }
    But it gives me an error like:
    Copy code
    sink
      extra fields not permitted (type=value_error.extra)
    Also, above those lines in the log, I see:
    Copy code
    Host: <my-broker>:9092, Port: None, Schema: , Login: , Password: None, extra: {'sink': {'config': {'connection': {'schema_registry_url': '<my-schema-registry>:8081'}}}}
    So it gets my host but cannot get the port? I don't get it. Also, how can I resolve the "extra fields not permitted" problem? Thanks!
    w
    • 2
    • 4
  • c

    calm-river-44367

    09/14/2021, 8:31 AM
    Please, somebody help me out: how can I create users in DataHub in order to set policies?
    b
    • 2
    • 2
  • b

    blue-megabyte-68048

    09/15/2021, 3:20 PM
    Is there an ingester for Collibra, by chance?
    m
    c
    • 3
    • 7
  • t

    thousands-tailor-5575

    09/15/2021, 3:31 PM
    Hi, we are testing out the Redash metadata ingestion library. Is there a possibility to extract chart -> dataset lineage? It looks like only chart -> data platform lineage is being retrieved.
    m
    a
    s
    • 4
    • 15
  • l

    lively-laptop-26966

    09/15/2021, 11:06 PM
    Any news for Elasticsearch???
    m
    • 2
    • 3
  • m

    microscopic-musician-99632

    09/16/2021, 1:28 PM
    SAP HANA could not be found on the data sources page. Can you please let me know if it is supported or planned?
    w
    l
    • 3
    • 2
  • c

    curved-sandwich-81699

    09/16/2021, 2:45 PM
    Hi all! Is it possible to ingest Glossary Terms at the root level, without using nodes/folders as was done here for AccountBalance/CustomerAccount/SavingAccount? https://demo.datahubproject.io/browse/glossary
    m
    • 2
    • 2
  • b

    best-toddler-40650

    09/16/2021, 3:37 PM
    Hi, I put together a recipe to connect to our Oracle server, as below:
    Copy code
    source:
      type: oracle
      config:
        username: user
        password: password
        host_port: myserver:1522
        
        service_name: servicename
    which translates into the following DSN passed to the SQLAlchemy Oracle dialect:
    Copy code
    'dsn': '(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=myhost)(PORT=1522))(CONNECT_DATA=(SERVICE_NAME=myservicename)))'
    However, the Oracle server requires further parameters in the DSN in order to authenticate, as below:
    Copy code
    'dsn': '(description= (retry_count=20)(retry_delay=3)(address=(protocol=tcps)(port=1522)(host=myhost))(connect_data=(service_name=myservicename))(security=(ssl_server_cert_dn="CN=certsever, OU=Oracle, O=Oracle Corporation, L=City, ST=State, C=Country")))'
    Any idea how to insert these extra parameters in the Oracle recipe?
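    One possible workaround, sketched below on the assumption that the SQLAlchemy-based sources honour a sqlalchemy_uri override: define the full descriptor (retry, ssl_server_cert_dn, etc.) as an alias in tnsnames.ora and point the URI at that alias instead of host_port/service_name:
    Copy code
    source:
      type: oracle
      config:
        # assumption: sqlalchemy_uri takes precedence over host_port/service_name
        # and is handed straight to SQLAlchemy; "mydb_secure" would be a
        # tnsnames.ora alias carrying the full (description=...) string above
        sqlalchemy_uri: oracle+cx_oracle://user:password@mydb_secure
    sink:
      type: datahub-rest
      config:
        server: http://localhost:8080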
    w
    a
    • 3
    • 2
  • a

    aloof-gigabyte-305

    09/17/2021, 1:09 AM
    Hi! I'm ingesting a BusinessGlossary using yml files, as shown in "DataHub Product Updates: Aug 27 2021 Community Town Hall" (

    https://www.youtube.com/watch?v=dVqSZdN64sQ&t=310s

    ). I used the samples examples/bootstrap_data/business_glossary.yml and examples/recipes/business_glossary_to_datahub.yml. The ingestion seems to work fine; I get the green message "Pipeline finished successfully" with 0 failures, 0 warnings, and 68 records_written. But when I go to the UI, I cannot find any Glossary anywhere. Any idea?
    l
    m
    • 3
    • 7
  • e

    eager-answer-71364

    09/17/2021, 4:52 AM
    How do I get an MCE entity by specific aspects using the API? Currently I use the format <host>:8080/entities/<urn>, but it returns all aspects. Are there any API params to get a specific aspect?
    b
    m
    • 3
    • 4
  • h

    happy-orange-34018

    09/20/2021, 9:19 AM
    Hi team, I am trying to catalog MySQL in DataHub, and I want the datasets in DataHub to have some properties such as the table's creation time and a property indicating whether a dataset is a view or a table. How can I enable this in DataHub?
    l
    • 2
    • 3
  • s

    stocky-noon-61140

    09/20/2021, 9:47 AM
    Hi everyone - is it possible to restrict / manage ingestion privileges? In other words: with my locally installed version of DataHub, I don't need to authenticate to ingest data into DataHub. Is there a way to require username/password authentication before metadata can be ingested?
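    Not sure whether it applies to this deployment, but for setups where the metadata service (or a proxy in front of it) enforces authentication, the datahub-rest sink appears to accept a token field; a sketch with placeholder values:
    Copy code
    sink:
      type: datahub-rest
      config:
        server: http://datahub-gms:8080   # placeholder
        # assumption: an access token that GMS (with authentication enabled)
        # or a gateway in front of it will validate
        token: <your-access-token>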
    b
    b
    b
    • 4
    • 10
  • a

    average-bear-318

    09/20/2021, 10:33 AM
    Hi team, I store some metadata about our ETL jobs in Postgres. Is it possible to ingest the data, and not the metadata, for the Postgres table via DataHub?
    👍 1
    s
    m
    • 3
    • 2
  • q

    quiet-kilobyte-82304

    09/20/2021, 8:02 PM
    Does anyone have an example MCE for
    KeyValueSchema
    ? I'm looking into ingesting key/value pairs from Couchbase into DataHub. I'm more interested in seeing how fields that have
    isPartOfKey
    set to
    true
    are exposed on the UI https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/schema/SchemaField.pdl#L84
    l
    g
    m
    • 4
    • 11
  • p

    polite-flower-25924

    09/20/2021, 8:56 PM
    I'm facing the following error when running Hive ingestion with the latest linkedin/datahub-ingestion image
    (97bed71)
    . It works with older versions like
    c9c1ba4
    . I guess some libraries have changed and this needs to be fixed.
    l
    m
    g
    • 4
    • 12
  • m

    microscopic-musician-99632

    09/21/2021, 5:45 AM
    From two threads here, https://datahubspace.slack.com/archives/CUMUWQU66/p1629897623476100 and https://datahubspace.slack.com/archives/CUMUWQU66/p1620136722311400, I understand that DataHub supports OpenAPI-based metadata ingestion. Could you please give some guidance on how this is supported and how API metadata can be ingested? Also, is OData supported?
    s
    b
    • 3
    • 11
  • h

    high-hospital-85984

    09/21/2021, 10:00 AM
    Hi! I was playing around with
    datahub ingest list-runs
    and got presented with something unexpected. Most IDs are random GUIDs, and one suspiciously large run (by row count) is called
    no-run-id-provided
    . We primarily use the Kafka sink; is there a way of providing a human-readable name for the runs, for easier rollback?
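    For what it's worth, the recipe format appears to accept a top-level run_id field, which would give the run a readable name; a sketch, assuming the id survives the Kafka sink path end to end:
    Copy code
    run_id: nightly-warehouse-ingest      # hypothetical human-readable name
    source:
      type: bigquery                      # whichever source the run already uses
      config:
        project_id: my-project            # placeholder
    sink:
      type: datahub-kafka
      config:
        connection:
          bootstrap: broker:9092          # placeholder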
    m
    l
    • 3
    • 9
  • c

    calm-river-44367

    09/22/2021, 7:01 AM
    I want to delete a specific dataset. I tried doing so with the command in https://datahubproject.io/docs/how/delete-metadata#delete-by-urn, but it only deletes the rows that my dataset contains and not the dataset itself. Is there any way to hard delete in DataHub? Please let me know if you have any ideas.
    s
    g
    • 3
    • 4
  • c

    calm-sunset-28996

    09/22/2021, 3:20 PM
    Hey, I'm currently building an API gateway in front of the GMS endpoint. However, the RestEmitter does not accept custom/additional headers, so to get it to work I've monkey-patched it with my own addition. Would it be of interest to allow a param
    extra_headers
    or something similar, taking key/value pairs to add to the emitter? Then for future use cases people could add whatever they need.
    m
    b
    • 3
    • 16
  • l

    loud-vase-59377

    09/23/2021, 12:43 AM
    Has anyone tried ingesting from BigQuery? I was testing it out and can't make the patterns work. For starters, I am trying to ingest and profile one specific table. I tried this, but it ingests all tables in db1.
    Copy code
    source:
      type: bigquery
      config:
        project_id: "su-project1"
        schema_pattern: 
            allow: 
               - "db1"
        table_pattern: 
            allow:
               - "db1.table1"   
        profiling: {
          enabled: true,
        }   
        profile_pattern:
            allow:
               - "db1.table1" 
          
    sink:
      type: "datahub-rest"
      config:
        server: "<http://localhost:8080>"
    I tried removing the schema pattern and keeping the table pattern, but then it ingests all databases in the project. Are there other options I need to set?
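    In case it helps anyone landing here: the patterns are regular expressions, so an unescaped dot and a missing end anchor make them match more than intended; a hedged rewrite of the recipe above, assuming the table and profile patterns are matched against dataset.table-style names:
    Copy code
    source:
      type: bigquery
      config:
        project_id: "su-project1"
        schema_pattern:
          allow:
            - '^db1$'
        table_pattern:
          allow:
            - '^db1\.table1$'
        profiling:
          enabled: true
        profile_pattern:
          allow:
            - '^db1\.table1$'
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"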
    w
    a
    • 3
    • 4
  • r

    ripe-furniture-93265

    09/23/2021, 8:33 AM
    Hi all! Is there a way to configure a custom "cluster" name for Airflow metadata ingestion? I've successfully set up Airflow to send metadata and lineage information following the docs here with the default config, https://datahubproject.io/docs/metadata-ingestion#setting-up-airflow-to-use-datahub-as-lineage-backend, but it looks like the "cluster" part of the URN defaults to "prod", and we would like to configure this to be unique per Airflow environment.
    h
    b
    l
    • 4
    • 7
  • e

    eager-answer-71364

    09/23/2021, 11:03 AM
    I got an error when accessing the UI at <host>/browse/dataset. Looking at the logs in Docker, I see the error in the screenshot. Does anyone know which error this is and how to resolve it? Thanks.
    b
    • 2
    • 2