# ingestion
  • brief-ability-41819

    10/07/2022, 1:00 PM
    Hello, is it possible to ingest multiple S3 buckets under one DBT source? I’m trying to achieve something like:
    source:
        type: dbt
        config:
            aws_connection:
                aws_region: us-east-1
                aws_role: 'arn:aws:iam::************:role/DataHub-role'
            target_platform: s3
            manifest_path:
                - 's3://bucket1/manifest.json'
                - 's3://bucket2/manifest.json'
            test_results_path:
                - 's3://bucket1/run_results.json'
                - 's3://bucket2/run_results.json'
            sources_path:
                - 's3://bucket1/sources.json'
                - 's3://bucket2/sources.json'
            catalog_path:
                - 's3://bucket1/catalog.json'
                - 's3://bucket2/catalog.json'
    Whenever I try to create a list of them (in [] brackets), ingestion stops working and throws:
    Failed to configure source (dbt) due to
        4 validation errors for DBTConfig
        manifest_path
          str type expected (type=type_error.str)
    but it’s perfectly fine with only one bucket configured. Am I missing something in my recipe?
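    (A minimal workaround sketch, assuming the validation error means each *_path field only accepts a single string: one dbt source per bucket, each run as its own recipe.)
    source:
        type: dbt
        config:
            aws_connection:
                aws_region: us-east-1
                aws_role: 'arn:aws:iam::************:role/DataHub-role'
            target_platform: s3
            manifest_path: 's3://bucket1/manifest.json'
            test_results_path: 's3://bucket1/run_results.json'
            sources_path: 's3://bucket1/sources.json'
            catalog_path: 's3://bucket1/catalog.json'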
  • alert-fall-82501

    10/07/2022, 1:51 PM
    owners: List[OwnerClass] = self.maybe_extract_owners(
      File "/home/kiranto@cybage.com/.local/lib/python3.8/site-packages/datahub/ingestion/source/csv_enricher.py", line 594, in maybe_extract_owners
        row["ownership_type"] if row["ownership_type"] else OwnershipTypeClass.NONE
    KeyError: 'ownership_type'
    [2022-10-07 19:05:26,767] ERROR    {datahub.entrypoints:195} - Command failed: 
    	'ownership_type'.
    	Run with --debug to get full stacktrace.
    	e.g. 'datahub --debug ingest -c csv.yaml'
  • alert-fall-82501

    10/07/2022, 1:51 PM
    Can anybody advise on this?
  • alert-fall-82501

    10/07/2022, 1:53 PM
    I am working with the csv-enricher source to add tags, glossary terms, ownership, and other metadata. I have created the CSV file and filled in the required details.
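    (For reference, a hedged sketch of the CSV layout the csv-enricher expects; the KeyError above suggests the ownership_type column is missing from the file. The header follows the documented format; the data row is purely illustrative.)
    resource,subresource,glossary_terms,tags,owners,ownership_type,description,domain
    "urn:li:dataset:(urn:li:dataPlatform:hive,SampleTable,PROD)",,[urn:li:glossaryTerm:Term1],[urn:li:tag:PII],[urn:li:corpuser:jdoe],TECHNICAL_OWNER,An example description,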
  • adamant-furniture-37835

    10/07/2022, 6:09 PM
    Hi, we are trying the newly added PATCH functionality for Owners and Tags, but it doesn't seem to be working for us, or maybe we are missing something. We have added this transformers section to the recipe:
    transformers:
      - type: "simple_add_dataset_tags"
        config:
          tag_urns:
            - "urn:li:tag:TestTagToBeAppliedAutomatically"
          replace_existing: false
          semantics: PATCH
      - type: "simple_add_dataset_ownership"
        config:
          semantics: PATCH
          owner_urns:
            - "urn:li:corpuser:USER_ID"
          ownership_type: "PRODUCER"
    The mentioned tag and owner (USER_ID) already exist in datahub-gms before we run the ingestion process (tag availability doesn't affect the result). The token used in the recipe is a personal token created by a user who has admin access to DataHub. The DataHub CLI and server are both on version v0.8.45 (the elasticsearch-setup-job is on v0.8.44; version 45 seems to have a bug). Please advise if we are missing something here.
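    (One hedged thing to check: PATCH semantics has to read the current aspect from the server before merging, so the recipe needs a datahub-rest sink, or an explicit datahub_api section, for the transformers to use. A minimal sketch with assumed values:)
    sink:
      type: datahub-rest
      config:
        server: 'http://datahub-gms:8080'   # assumed GMS address
        token: '${DATAHUB_TOKEN}'           # assumed token variable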
  • many-rainbow-50695

    10/09/2022, 6:24 AM
    Hi! A few questions about business glossary ingestion:
    1. I see that it's possible to add links to glossary term documentation using the UI, but I can't find the relevant field in the DataHub documentation or in the business glossary ingestion code.
    2. Are the properties 'inherits' and 'contains' reversible? If one glossary term inherits another, does that mean the other term should contain this one? If so, I think it should be done automatically during business glossary ingestion (see the sketch after this list).
    3. Is it possible to filter the 'working set' of business glossary items using the UI, or are there any plans for it? I have a registry of semantic data types with 304 data types (business glossary terms), and their profiles include language and country information. Sometimes I may want to prevent a certain database from using certain data categories, or semantic types linked to certain countries. Is that possible?
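    (A minimal business glossary sketch showing the two relationship fields from question 2; all term names here are illustrative.)
    version: 1
    source: DataHub
    owners:
      users:
        - datahub
    nodes:
      - name: Semantic Types
        description: Registry of semantic data types.
        terms:
          - name: EmailAddress
            description: A string holding an email address.
            inherits:
              - Classification.PII    # assumed pre-existing term
          - name: ContactRecord
            description: A record bundling contact details.
            contains:
              - Semantic Types.EmailAddress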
  • breezy-camera-11182

    10/10/2022, 2:48 AM
    Hi Team, I have a question regarding Looker-DataHub ingestion. Is it possible to ingest Looks that aren't attached to (don't belong to) a dashboard? From what I understand, the recipe only ingests dashboards and their elements (e.g. Looks attached to a dashboard).
  • limited-cricket-18852

    10/10/2022, 6:02 PM
    Hi all! I have been implementing lineage with Spark on Databricks, but even after setting the appName on the SparkSession, the pipeline is still produced as "Databricks Shell". Has anyone had more success?
  • limited-forest-73733

    10/10/2022, 2:12 PM
    Hey team, I'm unable to ingest dbt metadata via the datahub-kafka sink. Can anyone please help me out?
  • future-hair-23690

    10/11/2022, 5:17 AM
    Hi guys, I am experiencing an issue where profiling does not start. Does anybody have an idea what might be wrong? There is no error or debug message; nothing related to profiling happens at all. I am using MSSQL (pyodbc) on CLI version 0.8.45.2. My config:
    source:
      type: mssql
      config:
        password: ---------
        database: sandbox_validation
        host_port: 'az-uk-mssql-accept-01.logex.cloud:1433'
        username: ------
        use_odbc: 'true'
        uri_args:
            driver: 'ODBC Driver 17 for SQL Server'
            Encrypt: 'Yes'
            TrustServerCertificate: 'Yes'
            ssl: 'True'
        env: STG
        profiling:
          enabled: true
          limit: 10000
          report_dropped_profiles: false
          profile_table_level_only: false
    
          include_field_null_count: true
          include_field_min_value: true
          include_field_max_value: true
          include_field_mean_value: true
          include_field_median_value: true
          include_field_stddev_value: true
          include_field_quantiles: true
          include_field_distinct_value_frequencies: true
          include_field_sample_values: true
          turn_off_expensive_profiling_metrics: false
          include_field_histogram: true
          catch_exceptions: false
          max_workers: 4
          query_combiner_enabled: true
          max_number_of_fields_to_profile: 100
          profile_if_updated_since_days: null
          partition_profiling_enabled: false
        schema_pattern:
          deny:
            - DS\\oleksii
            - ds*
            - Logex*
          allow:
            - dbo.*
            - dbo
    cheers!
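    (A hedged guess at the silent no-op: profiling in the SQL sources is also gated by its own profile_pattern, separate from schema_pattern; if that pattern matches nothing, profiling quietly produces nothing. A sketch with an assumed pattern:)
    profile_pattern:
      allow:
        - 'sandbox_validation\.dbo\..*'   # assumed database.schema.table regex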
  • little-spring-72943

    10/11/2022, 8:30 AM
    We are trying to ingest Azure SQL Server databases using managed ingestion and getting the error: "Error: Client does not have encryption enabled but it is required by server, enable encryption and try connecting again". In Azure SQL there is no way to disable SSL/TLS; that is by design, given the security and vulnerability posture of the database. Does anyone know how we can overcome this with DataHub? When we try the ODBC option we get "ODBC SQL type -150 is not yet supported. column-index=3 type=-150', 'HY106'".
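    (A hedged sketch mirroring the ODBC settings from the MSSQL config earlier in this channel, assuming the Microsoft ODBC driver is installed in the ingestion environment; untested against Azure SQL:)
    source:
      type: mssql
      config:
        host_port: 'yourserver.database.windows.net:1433'   # assumed host
        use_odbc: 'true'
        uri_args:
          driver: 'ODBC Driver 17 for SQL Server'
          Encrypt: 'Yes'
          TrustServerCertificate: 'Yes'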
  • damp-ambulance-34232

    10/11/2022, 9:59 AM
    What version of DataHub supports Spark (urn:li:dataPlatform:spark)?
  • famous-florist-7218

    10/11/2022, 10:02 AM
    Hi folks, it seems like the bigquery-beta connector doesn't support nested arrays in BigQuery. Any workaround?
  • alert-fall-82501

    10/11/2022, 10:18 AM
    Hi Team - I have ingested metadata from Hive and tried to add a glossary term, tag, and ownership to it, but I get an error: "Failed to create: Unauthorized to perform this action. Please contact your DataHub administrator." Can anybody advise on this?
  • mammoth-apple-56011

    10/11/2022, 10:56 AM
    Greetings to you all. I have a question about ingesting data from Tableau into DataHub. When a workbook in Tableau has the slash character ("/") in its name, the ingestion behaves strangely. For example: in Tableau I have a workbook named "ADM/ACM" (that is the whole name of one workbook; the slash is part of the name). When this workbook is ingested by DataHub, it is transformed into separate ADM and ACM folders, with the ACM folder located inside the ADM folder. So the "/" character is interpreted as a folder separator. Is there some kind of escape character I can use in the ingestion code to tell DataHub not to interpret the slash as something special, but just as an ordinary part of the workbook name?
  • ripe-tailor-61058

    10/11/2022, 4:56 PM
    Hello, is there a way via the recipe to automatically include the S3 URL of the file as a field?
  • ripe-tailor-61058

    10/11/2022, 4:58 PM
    here is my recipe file thus far:
    source:
      type: "s3"
      config:
        platform: s3
        env: prod
        path_spec:
          include: "s3://loom/workbench/data/drone/Images/02cb04112b264cfaab32b7eea3c65f2c/*.*"
        aws_config:
          aws_access_key_id: <redacted>
          aws_secret_access_key: <redacted>
          aws_region: us-gov-west-1
    # see https://datahubproject.io/docs/metadata-ingestion/sink_docs/file for complete documentation
    # authentication token is enabled so include in this config. See Access Token in docs.
    sink:
      type: datahub-rest
      config:
        server: http://localhost:8080
        token: <redacted>
  • ripe-tailor-61058

    10/11/2022, 5:00 PM
    anywhere I could find the link to the file after ingestion for further analysis would be great
  • salmon-jackal-36326

    10/11/2022, 6:51 PM
    Hello guys @witty-plumber-82249! I'm getting this message from the Snowflake connector while ingesting, but only for some tables:
    [2022-10-11 18:20:04,294] ERROR {datahub.ingestion.source.ge_data_profiler:934} - Encountered exception while profiling: KeyError: 'partial_unexpected_list'
    [2022-10-11 18:20:26,841] ERROR {datahub.ingestion.source.ge_data_profiler:315} - Failed to get unique count for column
    Some of my tables have columns with spaces in their names and no primary key; I don't know if this is relevant. As it's my first time running it in Docker on an EC2 instance, I don't really know the best practices yet. DataHub version: v0.8.45
    source:
        type: snowflake
        config:
            include_table_lineage: true
            password: '${SNOWFLAKE_PASSWORD}'
            account_id: ACCOUNT
            role: accountadmin
            profiling:
                enabled: true
            include_view_lineage: true
            warehouse: DEV
            stateful_ingestion:
                enabled: true
            schema_pattern:
                deny:
                    - '.*DEV'
                    - '.*INFORMATION_SCHEMA'
                    - '.*PUBLIC'
            database_pattern:
                allow:
                    - ^DATABASE_A$
                    - ^DATABASE_B$
                    - ^DATABASE_C$
                    - ^DATABASE_D$
                    - ^DATABASE_E$
                    - ^DATABASE_F$
                    - ^DATABASE_G$
                    - ^DATABASE_H$
            username: '${SNOWFLAKE_USER}'
    pipeline_name: 'urn:li:dataHubIngestionSource:5a8d58a3-dc4e-43b4-a59e-05c6ef9e0bce'
    I found what looks like the same problem here, and I tested the params suggested in https://www.linen.dev/s/datahubspace/t/439789/hi-all-i-enable-profiling-but-got-an-error-called-partial-un
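    (A hedged mitigation sketch rather than a fix: both flags below appear in the profiling config elsewhere in this channel; which column-level metric actually trips partial_unexpected_list is an assumption.)
    profiling:
        enabled: true
        profile_table_level_only: true   # coarse: skips all column-level GE metrics
        # or keep column profiling but drop one suspect metric:
        # include_field_distinct_value_frequencies: false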
  • cool-translator-98249

    10/11/2022, 7:15 PM
    Hello, I have an ingestion source set up for Snowflake, and existing tables have worked great. But I just added a new table and view and ran the ingestion for that source, and the new objects aren't showing in DataHub. I verified that the proper database and schema are included, and didn't see any errors in the log file. Any suggestions on how to debug from here?
  • narrow-toddler-80534

    10/12/2022, 7:35 AM
    Hello, I want to delete the lineage between these data tasks using acryl-datahub (via REST). Any suggestions for me? Thanks!!!
  • billowy-pager-44683

    10/12/2022, 9:35 AM
    Hello Team, due to access rights on the database, metadata exported as a JSON file needs to be ingested into DataHub. The source is currently set to mssql, and I am looking at the JSON that is sinked to a file; I know how to write the code to inject it into DataHub.
    - Source is mssql.
    - Create a metadata file exported from mssql using a recipe YAML. (ref. )
    - I am trying to ingest the metadata file after parsing it with Python; is there any documentation I can refer to?
    - e.g. the file format when the source is a file, ingesting it as mssql metadata
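    (A minimal sketch of the round trip, assuming the documented file sink and file source; filenames are illustrative.)
    # recipe 1: export mssql metadata to a file
    source:
      type: mssql
      config: {}   # connection settings as in the existing mssql recipe
    sink:
      type: file
      config:
        filename: ./mssql_mces.json
    # recipe 2: ingest the (optionally post-processed) file
    source:
      type: file
      config:
        filename: ./mssql_mces.json
    sink:
      type: datahub-rest
      config:
        server: 'http://localhost:8080'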
  • careful-action-61962

    10/12/2022, 10:14 AM
    hey folks, I want to create lineage between Tableau reports and the Databricks tables they use under the hood. Has anyone worked on this?
  • colossal-hairdresser-6799

    10/12/2022, 12:20 PM
    Hello, I’m trying to figure out how to install the DataHubGraph client but can’t find any plugin for it. I would have thought it would be an acryl-datahub[datahub-graph] extra, or that it would be contained in the rest plugin.
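    (If it helps: the second guess seems right; as far as I know the DataHubGraph client ships with the REST emitter, so installing the datahub-rest plugin, e.g. pip install 'acryl-datahub[datahub-rest]', should pull it in. Treat that as an assumption to verify against the docs.)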
  • delightful-barista-90363

    10/12/2022, 3:28 PM
    @mammoth-bear-12532 @careful-pilot-86309 I was wondering what the status of this PR is: https://github.com/datahub-project/datahub/pull/5687. It would be a big add for my team; right now we don't have the full lineage link between S3, Athena, Spark, etc.
  • brainy-crayon-53549

    10/12/2022, 4:19 PM
    Can someone help me with creating lineage in Postgres and pulling those lineages into DataHub?
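    (One hedged option, since Postgres itself doesn't record lineage to extract: declare it by hand with the lineage-file source. Source type and file layout follow the documented lineage-file format; all table names are illustrative.)
    source:
      type: datahub-lineage-file
      config:
        file: ./lineage.yml
    # lineage.yml:
    version: 1
    lineage:
      - entity:
          name: public.orders_summary
          type: dataset
          env: PROD
          platform: postgres
        upstream:
          - entity:
              name: public.orders
              type: dataset
              env: PROD
              platform: postgres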
  • rich-state-73859

    10/12/2022, 6:12 PM
    Got an error when setting owner information in protobuf. I followed the documentation and created the group test-group. Does anyone have a suggestion on how I can resolve this?
    Invalid urn format for aspect: {owners=[{owner=urn:li:corpgroup:test-group, type=PRODUCER, source={type=MANUAL}}], lastModified={actor=urn:li:corpuser:datahub, time=1665598238050}} for entity: urn:li:dataset:(urn:li:dataPlatform:athena,<table name here>,DEV)
    Cause: ERROR :: /owners/0/owner :: "Provided urn urn:li:corpgroup:test-group" is invalid: Entity type for urn: urn:li:corpgroup:test-group is not a valid destination for field path: /owners/*/owner
  • brainy-table-99728

    10/12/2022, 6:27 PM
    Hey there, quick question regarding tags: we have tags applied in Snowflake; do these get imported into DataHub?
  • quiet-wolf-56299

    10/12/2022, 8:23 PM
    Doing a bit of testing with a local install. I have a separate MySQL database up and running. I did a quickstart nuke and then restarted DataHub, using the same compose file, so I am definitely connecting to the static DB instance that was not cleared. I no longer have metadata in DataHub. I verified the metadata still exists in metadata_aspect_v2 in MySQL and ran the rebuild-index command, but it still shows me no metadata. Advice?
  • quiet-wolf-56299

    10/12/2022, 9:03 PM
    I was able to drop the DataHub table and start over; since it's just a test it's not a huge deal, but I'm curious whether this is something I'd run into if I had to reset one of the containers in production.