# ingestion
  • brief-ability-41819

    10/07/2022, 1:00 PM
    Hello, is it possible to ingest multiple S3 buckets under one DBT source? I’m trying to achieve something like:
    source:
        type: dbt
        config:
            aws_connection:
                aws_region: us-east-1
                aws_role: 'arn:aws:iam::************:role/DataHub-role'
            target_platform: s3
            manifest_path:
                - 's3://bucket1/manifest.json'
                - 's3://bucket2/manifest.json'
            test_results_path:
                - 's3://bucket1/run_results.json'
                - 's3://bucket2/run_results.json'
            sources_path:
                - 's3://bucket1/sources.json'
                - 's3://bucket2/sources.json'
            catalog_path:
                - 's3://bucket1/catalog.json'
                - 's3://bucket2/catalog.json'
    Whenever I try to create a list of them (in [] brackets), ingestion stops working and throws:
    Failed to configure source (dbt) due to
        4 validation errors for DBTConfig
        manifest_path
          str type expected (type=type_error.str)
    but it’s perfectly fine with only one bucket configured. Am I missing something in my recipe?
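    (A minimal workaround sketch, assuming the validation error means each *_path field only accepts a single string: one dbt source per bucket, each run as its own recipe.)
    source:
        type: dbt
        config:
            aws_connection:
                aws_region: us-east-1
                aws_role: 'arn:aws:iam::************:role/DataHub-role'
            target_platform: s3
            manifest_path: 's3://bucket1/manifest.json'
            test_results_path: 's3://bucket1/run_results.json'
            sources_path: 's3://bucket1/sources.json'
            catalog_path: 's3://bucket1/catalog.json'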
  • alert-fall-82501

    10/07/2022, 1:51 PM
    owners: List[OwnerClass] = self.maybe_extract_owners(
      File "/home/kiranto@cybage.com/.local/lib/python3.8/site-packages/datahub/ingestion/source/csv_enricher.py", line 594, in maybe_extract_owners
        row["ownership_type"] if row["ownership_type"] else OwnershipTypeClass.NONE
    KeyError: 'ownership_type'
    [2022-10-07 19:05:26,767] ERROR    {datahub.entrypoints:195} - Command failed: 
    	'ownership_type'.
    	Run with --debug to get full stacktrace.
    	e.g. 'datahub --debug ingest -c csv.yaml'
  • alert-fall-82501

    10/07/2022, 1:51 PM
    Can anybody advise on this?
  • alert-fall-82501

    10/07/2022, 1:53 PM
    I am working with the csv-enricher source to add tags, glossary terms, ownership, and other metadata. I have created the CSV file and filled in the required details.
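    (For reference, a hedged sketch of the CSV layout the csv-enricher expects; the KeyError above suggests the ownership_type column is missing from the file. The header follows the documented format; the data row is purely illustrative.)
    resource,subresource,glossary_terms,tags,owners,ownership_type,description,domain
    "urn:li:dataset:(urn:li:dataPlatform:hive,SampleTable,PROD)",,[urn:li:glossaryTerm:Term1],[urn:li:tag:PII],[urn:li:corpuser:jdoe],TECHNICAL_OWNER,An example description,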
  • adamant-furniture-37835

    10/07/2022, 6:09 PM
    Hi, we are trying the newly added PATCH functionality for Owners and Tags, but it doesn't seem to be working for us, or maybe we are missing something. We have added this transformers section to the recipe:
    transformers:
      - type: "simple_add_dataset_tags"
        config:
          tag_urns:
            - "urn:li:tag:TestTagToBeAppliedAutomatically"
          replace_existing: false
          semantics: PATCH
      - type: "simple_add_dataset_ownership"
        config:
          semantics: PATCH
          owner_urns:
            - "urn:li:corpuser:USER_ID"
          ownership_type: "PRODUCER"
    The mentioned tag and owner (USER_ID) already exist in datahub-gms before we run the ingestion process (tag availability doesn't affect the result). The token used in the recipe is a personal token created by a user who has admin access to DataHub. The DataHub CLI and server are both on version v0.8.45 (the elasticsearch-setup-job is on v0.8.44; version 45 seems to have a bug). Please advise if we are missing something here.
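    (One hedged thing to check: PATCH semantics has to read the current aspect from the server before merging, so the recipe needs a datahub-rest sink, or an explicit datahub_api section, for the transformers to use. A minimal sketch with assumed values:)
    sink:
      type: datahub-rest
      config:
        server: 'http://datahub-gms:8080'   # assumed GMS address
        token: '${DATAHUB_TOKEN}'           # assumed token variable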
  • many-rainbow-50695

    10/09/2022, 6:24 AM
    Hi! A few questions about business glossary ingestion:
    1. I see that it's possible to add links to glossary term documentation using the UI, but I can't find the relevant field in the DataHub documentation or in the business glossary ingestion code.
    2. Are the properties 'inherits' and 'contains' reversible? If one glossary term inherits another, does that mean the other term should contain this one? If so, I think it should be done automatically during business glossary ingestion (see the sketch after this list).
    3. Is it possible to filter the 'working set' of business glossary items using the UI, or are there any plans for it? I have a registry of semantic data types with 304 data types (business glossary terms), and their profiles include language and country information. Sometimes I may want to prevent a certain database from using certain data categories, or semantic types linked to certain countries. Is that possible?
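    (A minimal business glossary sketch showing the two relationship fields from question 2; all term names here are illustrative.)
    version: 1
    source: DataHub
    owners:
      users:
        - datahub
    nodes:
      - name: Semantic Types
        description: Registry of semantic data types.
        terms:
          - name: EmailAddress
            description: A string holding an email address.
            inherits:
              - Classification.PII    # assumed pre-existing term
          - name: ContactRecord
            description: A record bundling contact details.
            contains:
              - Semantic Types.EmailAddress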
  • breezy-camera-11182

    10/10/2022, 2:48 AM
    Hi Team, I have a question regarding Looker-DataHub ingestion. Is it possible to ingest Looks that aren't attached to (don't belong to) a dashboard? From what I understand, the recipe only ingests dashboards and their elements (e.g. Looks attached to a dashboard).
  • limited-cricket-18852

    10/10/2022, 6:02 PM
    Hi all! I have been implementing lineage with Spark on Databricks, but even after setting the appName on the SparkSession, the pipeline is still produced as "Databricks Shell". Has anyone had more success?
  • limited-forest-73733

    10/10/2022, 2:12 PM
    Hey team, I'm unable to ingest dbt metadata via the datahub-kafka sink. Can anyone please help me out?
  • future-hair-23690

    10/11/2022, 5:17 AM
    Hi guys, I am experiencing an issue where profiling does not start. Does anybody have an idea what might be wrong? There is no error or debug message; nothing related to profiling happens at all. I am using MSSQL (pyodbc) on CLI version 0.8.45.2. My config:
    source:
      type: mssql
      config:
        password: ---------
        database: sandbox_validation
        host_port: 'az-uk-mssql-accept-01.logex.cloud:1433'
        username: ------
        use_odbc: 'true'
        uri_args:
            driver: 'ODBC Driver 17 for SQL Server'
            Encrypt: 'Yes'
            TrustServerCertificate: 'Yes'
            ssl: 'True'
        env: STG
        profiling:
          enabled: true
          limit: 10000
          report_dropped_profiles: false
          profile_table_level_only: false
    
          include_field_null_count: true
          include_field_min_value: true
          include_field_max_value: true
          include_field_mean_value: true
          include_field_median_value: true
          include_field_stddev_value: true
          include_field_quantiles: true
          include_field_distinct_value_frequencies: true
          include_field_sample_values: true
          turn_off_expensive_profiling_metrics: false
          include_field_histogram: true
          catch_exceptions: false
          max_workers: 4
          query_combiner_enabled: true
          max_number_of_fields_to_profile: 100
          profile_if_updated_since_days: null
          partition_profiling_enabled: false
        schema_pattern:
          deny:
            - DS\\oleksii
            - ds*
            - Logex*
          allow:
            - dbo.*
            - dbo
    cheers!
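    (A hedged guess at the silent no-op: profiling in the SQL sources is also gated by its own profile_pattern, separate from schema_pattern; if that pattern matches nothing, profiling quietly produces nothing. A sketch with an assumed pattern:)
    profile_pattern:
      allow:
        - 'sandbox_validation\.dbo\..*'   # assumed database.schema.table regex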
  • little-spring-72943

    10/11/2022, 8:30 AM
    We are trying to ingest Azure SQL Server databases using managed ingestion and getting the error: "Error: Client does not have encryption enabled but it is required by server, enable encryption and try connecting again". In Azure SQL there is no way to disable SSL/TLS; that is by design, given the security and vulnerability posture of the database. Does anyone know how we can overcome this with DataHub? When we try the ODBC option we get "ODBC SQL type -150 is not yet supported. column-index=3 type=-150', 'HY106'".
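    (A hedged sketch mirroring the ODBC settings from the MSSQL config earlier in this channel, assuming the Microsoft ODBC driver is installed in the ingestion environment; untested against Azure SQL:)
    source:
      type: mssql
      config:
        host_port: 'yourserver.database.windows.net:1433'   # assumed host
        use_odbc: 'true'
        uri_args:
          driver: 'ODBC Driver 17 for SQL Server'
          Encrypt: 'Yes'
          TrustServerCertificate: 'Yes'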
  • damp-ambulance-34232

    10/11/2022, 9:59 AM
    What version of DataHub supports Spark (urn:li:dataPlatform:spark)?
  • famous-florist-7218

    10/11/2022, 10:02 AM
    Hi folks, it seems like the bigquery-beta connector doesn't support nested arrays in BigQuery. Any workaround?
  • alert-fall-82501

    10/11/2022, 10:18 AM
    Hi Team - I have ingested metadata from Hive and tried to add a glossary term, tag, and ownership to it, but I get an error: "Failed to create: Unauthorized to perform this action. Please contact your DataHub administrator." Can anybody advise on this?
  • mammoth-apple-56011

    10/11/2022, 10:56 AM
    Greetings to you all. I have a question about ingesting data from Tableau into DataHub. When a workbook in Tableau has the slash character ("/") in its name, the ingestion behaves strangely. For example: in Tableau I have a workbook named "ADM/ACM" (that is the whole name of one workbook; the slash is part of the name). When this workbook is ingested by DataHub, it is transformed into separate ADM and ACM folders, with the ACM folder located inside the ADM folder. So the "/" character is interpreted as a folder separator. Is there some kind of escape character I can use in the ingestion code to tell DataHub not to interpret the slash as something special, but just as an ordinary part of the workbook name?
  • ripe-tailor-61058

    10/11/2022, 4:56 PM
    Hello, is there a way via the recipe to automatically include the S3 URL of the file as a field?
  • ripe-tailor-61058

    10/11/2022, 4:58 PM
    here is my recipe file thus far:
    source:
      type: "s3"
      config:
        platform: s3
        env: prod
        path_spec:
          include: "s3://loom/workbench/data/drone/Images/02cb04112b264cfaab32b7eea3c65f2c/*.*"
        aws_config:
          aws_access_key_id: <redacted>
          aws_secret_access_key: <redacted>
          aws_region: us-gov-west-1
    # see https://datahubproject.io/docs/metadata-ingestion/sink_docs/file for complete documentation
    # authentication token is enabled so include in this config. See Access Token in docs.
    sink:
      type: datahub-rest
      config:
        server: http://localhost:8080
        token: <redacted>
  • ripe-tailor-61058

    10/11/2022, 5:00 PM
    anywhere I could find the link to the file after ingestion for further analysis would be great
  • salmon-jackal-36326

    10/11/2022, 6:51 PM
    Hello guys @witty-plumber-82249! I'm getting this message from the Snowflake connector while ingesting, but only for some tables:
    [2022-10-11 18:20:04,294] ERROR {datahub.ingestion.source.ge_data_profiler:934} - Encountered exception while profiling: KeyError: 'partial_unexpected_list'
    [2022-10-11 18:20:26,841] ERROR {datahub.ingestion.source.ge_data_profiler:315} - Failed to get unique count for column
    Some of my tables have columns with spaces in their names and no primary key; I don't know if this is relevant. As it's my first time running it in Docker on an EC2 instance, I don't really know the best practices yet. DataHub version: v0.8.45
    source:
        type: snowflake
        config:
            include_table_lineage: true
            password: '${SNOWFLAKE_PASSWORD}'
            account_id: ACCOUNT
            role: accountadmin
            profiling:
                enabled: true
            include_view_lineage: true
            warehouse: DEV
            stateful_ingestion:
                enabled: true
            schema_pattern:
                deny:
                    - '.*DEV'
                    - '.*INFORMATION_SCHEMA'
                    - '.*PUBLIC'
            database_pattern:
                allow:
                    - ^DATABASE_A$
                    - ^DATABASE_B$
                    - ^DATABASE_C$
                    - ^DATABASE_D$
                    - ^DATABASE_E$
                    - ^DATABASE_F$
                    - ^DATABASE_G$
                    - ^DATABASE_H$
            username: '${SNOWFLAKE_USER}'
    pipeline_name: 'urn:li:dataHubIngestionSource:5a8d58a3-dc4e-43b4-a59e-05c6ef9e0bce'
    I found what looks like the same problem here, and I tested the params suggested in https://www.linen.dev/s/datahubspace/t/439789/hi-all-i-enable-profiling-but-got-an-error-called-partial-un
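    (A hedged mitigation sketch rather than a fix: both flags below appear in the profiling config elsewhere in this channel; which column-level metric actually trips partial_unexpected_list is an assumption.)
    profiling:
        enabled: true
        profile_table_level_only: true   # coarse: skips all column-level GE metrics
        # or keep column profiling but drop one suspect metric:
        # include_field_distinct_value_frequencies: false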
  • cool-translator-98249

    10/11/2022, 7:15 PM
    Hello, I have an ingestion source set up for Snowflake, and existing tables have worked great. But I just added a new table and view and ran the ingestion for that source, and the new objects aren't showing in DataHub. I verified that the proper database and schema are included, and didn't see any errors in the log file. Any suggestions on how to debug from here?
  • narrow-toddler-80534

    10/12/2022, 7:35 AM
    Hello, I want to delete the lineage between these data tasks using acryl-datahub (via REST). Any suggestions for me? Thanks!!!
  • billowy-pager-44683

    10/12/2022, 9:35 AM
    Hello Team, due to access rights on the database, metadata exported as a JSON file needs to be ingested into DataHub. The source is currently set to mssql, and I am looking at the JSON that is sinked to a file; I know how to write the code to inject it into DataHub.
    - Source is mssql.
    - Create a metadata file exported from mssql using a recipe YAML. (ref. )
    - I am trying to ingest the metadata file after parsing it with Python; is there any documentation I can refer to?
    - e.g. the file format when the source is a file, ingesting it as mssql metadata
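    (A minimal sketch of the round trip, assuming the documented file sink and file source; filenames are illustrative.)
    # recipe 1: export mssql metadata to a file
    source:
      type: mssql
      config: {}   # connection settings as in the existing mssql recipe
    sink:
      type: file
      config:
        filename: ./mssql_mces.json
    # recipe 2: ingest the (optionally post-processed) file
    source:
      type: file
      config:
        filename: ./mssql_mces.json
    sink:
      type: datahub-rest
      config:
        server: 'http://localhost:8080'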
  • careful-action-61962

    10/12/2022, 10:14 AM
    hey folks, I want to create lineage between Tableau reports and the Databricks tables they use under the hood. Has anyone worked on this?
  • colossal-hairdresser-6799

    10/12/2022, 12:20 PM
    Hello, I’m trying to figure out how to install the DataHubGraph client but can’t find any plugin for it. I would have thought it would be an acryl-datahub[datahub-graph] extra, or that it would be contained in the rest plugin.
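    (If it helps: the second guess seems right; as far as I know the DataHubGraph client ships with the REST emitter, so installing the datahub-rest plugin, e.g. pip install 'acryl-datahub[datahub-rest]', should pull it in. Treat that as an assumption to verify against the docs.)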
  • delightful-barista-90363

    10/12/2022, 3:28 PM
    @mammoth-bear-12532 @careful-pilot-86309 I was wondering what the status of this PR is: https://github.com/datahub-project/datahub/pull/5687. It would be a big add for my team; right now we don't have the full lineage link between S3, Athena, Spark, etc.
  • brainy-crayon-53549

    10/12/2022, 4:19 PM
    Can someone help me with creating lineage in Postgres and pulling those lineages into DataHub?
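    (One hedged option, since Postgres itself doesn't record lineage to extract: declare it by hand with the lineage-file source. Source type and file layout follow the documented lineage-file format; all table names are illustrative.)
    source:
      type: datahub-lineage-file
      config:
        file: ./lineage.yml
    # lineage.yml:
    version: 1
    lineage:
      - entity:
          name: public.orders_summary
          type: dataset
          env: PROD
          platform: postgres
        upstream:
          - entity:
              name: public.orders
              type: dataset
              env: PROD
              platform: postgres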
  • rich-state-73859

    10/12/2022, 6:12 PM
    Got an error when setting owner information in protobuf. I followed the documentation and created the group test-group. Does anyone have a suggestion on how I can resolve this?
    Invalid urn format for aspect: {owners=[{owner=urn:li:corpgroup:test-group, type=PRODUCER, source={type=MANUAL}}], lastModified={actor=urn:li:corpuser:datahub, time=1665598238050}} for entity: urn:li:dataset:(urn:li:dataPlatform:athena,<table name here>,DEV)
    Cause: ERROR :: /owners/0/owner :: "Provided urn urn:li:corpgroup:test-group" is invalid: Entity type for urn: urn:li:corpgroup:test-group is not a valid destination for field path: /owners/*/owner
  • brainy-table-99728

    10/12/2022, 6:27 PM
    Hey there, quick question regarding tags: we have tags applied in Snowflake; do these get imported into DataHub?
  • quiet-wolf-56299

    10/12/2022, 8:23 PM
    Doing a bit of testing with a local install. I have a separate MySQL database up and running. I did a quickstart nuke and then restarted DataHub, using the same compose file, so I am definitely connecting to the static DB instance that was not cleared. I no longer have metadata in DataHub. I verified the metadata still exists in metadata_aspect_v2 in MySQL and ran the rebuild-index command, but it still shows me no metadata. Advice?
  • quiet-wolf-56299

    10/12/2022, 9:03 PM
    I was able to drop the DataHub table and start over; since it's just a test it's not a huge deal, but I'm curious whether this is something I'd run into if I had to reset one of the containers in production.