# ingestion
  • p

    polite-actor-701

    01/03/2023, 12:27 AM
    Hello! I'm running an ingestion test right now, but because a large amount of data is being ingested, indexing in Elasticsearch is taking too long. I need to ingest more than 80,000 tables, yet only about 2,000 were indexed in roughly two hours. Is there any way to reduce the time it takes to index? Please help me. Thank you.
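    One general Elasticsearch approach for large bulk loads (not DataHub-specific, and assuming you can reach the cluster directly; the index name below is a placeholder) is to relax refresh and replica settings for the duration of the load and restore them afterwards:
    # Turn off near-real-time refresh and replicas on the index being loaded (placeholder index name).
    curl -X PUT "http://localhost:9200/datasetindex_v2/_settings" -H 'Content-Type: application/json' \
      -d '{"index": {"refresh_interval": "-1", "number_of_replicas": 0}}'
    # Restore normal settings once the ingestion has caught up.
    curl -X PUT "http://localhost:9200/datasetindex_v2/_settings" -H 'Content-Type: application/json' \
      -d '{"index": {"refresh_interval": "1s", "number_of_replicas": 1}}'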
  • a

    ancient-petabyte-52399

    01/03/2023, 10:41 AM
    Hi everyone, I'm doing an ingestion with Delta Lake on a local Mac M1, following this sample: https://datahubproject.io/docs/generated/ingestion/sources/delta-lake. But when I ran datahub ingest -c delta.dhub.yaml, this problem happened (see image). I don't know what the root cause is.
  • h

    hallowed-shampoo-52722

    01/03/2023, 6:51 PM
    AKS is giving me just an IP.
  • g

    green-lion-58215

    01/03/2023, 6:59 PM
    Hi Team, while running the data pipeline ingestion for Redshift I have started receiving this error message. This process was running successfully until yesterday. It seems like it could be stemming from this dependency issue reported for OpenSSL. Has anyone else faced this issue? Any solution for this? For context, I am using the following packages: requirements=["apache-airflow==1.10.15", "apache-airflow-backport-providers-amazon==2021.3.3", "acryl-datahub==0.9.0", "acryl-datahub[redshift]==0.9.0"]
    [2023-01-03 18:03:43,123] {{python_operator.py:323}} INFO - Got error output
    b'Traceback (most recent call last):\n  File "/tmp/venveo6od6mv/script.py", line 113, in <module>\n    res = generate_redshift_metadata(*args, **kwargs)\n  File "/tmp/venveo6od6mv/script.py", line 9, in generate_redshift_metadata\n    from datahub.ingestion.run.pipeline import Pipeline\n  File "/tmp/venveo6od6mv/lib/python3.7/site-packages/datahub/ingestion/run/pipeline.py", line 16, in <module>\n    from datahub.ingestion.api.committable import CommitPolicy\n  File "/tmp/venveo6od6mv/lib/python3.7/site-packages/datahub/ingestion/api/__init__.py", line 1, in <module>\n    from datahub.ingestion.api.common import RecordEnvelope\n  File "/tmp/venveo6od6mv/lib/python3.7/site-packages/datahub/ingestion/api/common.py", line 5, in <module>\n    import requests\n  File "/usr/local/lib/python3.7/site-packages/requests/__init__.py", line 95, in <module>\n    from urllib3.contrib import pyopenssl\n  File "/usr/local/lib/python3.7/site-packages/urllib3/contrib/pyopenssl.py", line 46, in <module>\n    import OpenSSL.SSL\n  File "/usr/local/lib/python3.7/site-packages/OpenSSL/__init__.py", line 8, in <module>\n    from OpenSSL import crypto, SSL\n  File "/usr/local/lib/python3.7/site-packages/OpenSSL/crypto.py", line 3268, in <module>\n    _lib.OpenSSL_add_all_algorithms()\nAttributeError: module \'lib\' has no attribute \'OpenSSL_add_all_algorithms\'\n'
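    For what it's worth, this AttributeError is the well-known mismatch between an older pyOpenSSL and a newer cryptography release in the worker's virtualenv; a common workaround (assuming you can adjust the environment's pins) is to bring the pair back in step:
    # Either upgrade pyOpenSSL to a release that knows about the newer cryptography bindings...
    pip install --upgrade 'pyOpenSSL>=22.1.0'
    # ...or pin cryptography back to a version the old pyOpenSSL still works with.
    pip install 'cryptography<38'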
  • h

    hallowed-shampoo-52722

    01/03/2023, 8:30 PM
    Hi team, what's the best practice for posting a YAML recipe to DataHub from Terraform?
  • b

    bland-lighter-26751

    01/03/2023, 9:23 PM
    Hello, I'm trying to get UI ingestions to work with metadata service authentication. I set the METADATA_SERVICE_AUTH_ENABLED environment variable to "true" for both datahub-gms and datahub-frontend. I then generated a token in the UI. After saving the token, I changed my recipe through the UI to look like
    source:
        type: metabase
        config:
            connect_uri: 'https://metabase.domain.com'
            username: data@domain.com
            password: password
    sink:
        type: datahub-rest
        config:
            token: eyJhbGciOiJIUzxxxxxxxxxxxxxxxxxxxxxx
    When I try to run an ingestion, I get
    requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: http://datahub-gms:8080/aspects?action=ingestProposal
    Am I missing something?
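    For comparison, a sink block that works with metadata service auth usually carries both the GMS address and the token; a minimal sketch (server URL is a placeholder, token is the personal access token generated in the UI):
    sink:
        type: datahub-rest
        config:
            server: 'http://datahub-gms:8080'      # placeholder GMS address
            token: '${DATAHUB_TOKEN}'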
  • b

    brave-waitress-14748

    01/04/2023, 3:15 AM
    Hi all, I'm evaluating DataHub for use in a GCP (BQ) based data mesh implementation. I'm currently looking for advice regarding the best way to ingest additional metadata from Google Data Catalog (GDC), to populate a range of aspects (tags, domain, owner, documentation etc.) which aren't natively populated by the BQ ingestion process. I'm thinking my current options are:
    • Add a transformer for each aspect that I would like to populate, and have each transformer make redundant API calls out to GDC... this gets noisy quickly!
    • Add a DataHub action to respond to metadata change request events, which makes a single call to GDC and updates the desired aspects for the container / dataset
    • Add a custom ingestion source, maybe extending the existing BQ ingestion source
    • Use the REST emitter directly, push data in from an external process
    I'm leaning towards adding an action, but this feels like a fairly complex solution to implement what should be a pretty straightforward piece of code. Am I on the right track? Any other ideas? Cheers!
  • a

    acoustic-winter-85299

    01/04/2023, 9:33 AM
    Hi everyone. We started our journey with DataHub a while ago, and obviously we want to add more and more into it as both our team and the product grow more mature. We're currently using Google Data Studio (newly rebranded to Looker Studio) as our main visualisation solution, but we cannot ingest info about the dashboards automatically or create any lineage between datasets and dashboards. I found a feature request that was opened about a year ago (https://feature-requests.datahubproject.io/p/data-studio-support) and it is still "In Review". Are there any updates on this topic by any chance?
    Also, this integration hasn't been critical for us until now, as Data Studio was working in a slightly decentralised manner (each user could create their own reports/dashboards/datasets) and we wouldn't have full visibility over it anyway. With the rebranding to Looker Studio, Google also introduced a new version, Looker Studio Pro, which provides integration with GCP projects, and you can associate users to the Looker Studio Pro "instance". This enables us to have much better governance over all the assets produced in Data Studio/Looker Studio and would also pave the road for a much leaner integration with DataHub, since we could potentially have one single point of entry to ingest all the available data. Considering there's no official DataHub connector and it doesn't seem to be on the roadmap yet, is there anyone in the community who has developed anything custom for Data Studio? Anyone who is thinking of a similar setup for the future? Any other ideas worth considering? (We're probably going to start developing a connector ourselves in the next couple of months if this doesn't make it onto the official roadmap soon.) Cheers!
  • f

    faint-actor-78390

    01/04/2023, 10:41 AM
    Hi all, how do I ingest data into a folder other than the "Prod" folder? How do I manage different environments (dev / test / prod)? Thanks, and happy new year to all!
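    If this is about the environment (fabric) label that defaults to PROD, most sources accept an env setting in the recipe; a sketch with placeholder connection values:
    source:
        type: postgres
        config:
            env: DEV          # emitted URNs land in the DEV fabric instead of PROD
            host_port: 'localhost:5432'
            database: analytics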
  • m

    melodic-dress-7431

    01/04/2023, 12:48 PM
    Hi All, not sure if this has already been addressed. I have a few data sources with API endpoints. Can the output of an endpoint be ingested into DataHub? (We have a custom pipeline tool with an API interface, and we need to display the details and status it returns as JSON output.)
  • c

    cuddly-butcher-39945

    01/04/2023, 3:41 PM
    Hey Team! I have deployed our DataHub POC environment onto AWS using the standard charts (MySQL, Neo4j, etc.). Version = 0.9.5. I am now trying to ingest LookML and am getting the following Git error...
    "[2023-01-04 15:35:22,763] ERROR    {datahub.entrypoints:213} - Command failed: Cmd('git') failed due to: exit code(128)\n"
    ' cmdline: git clone -v -- git@github.com:git@github.com:blahblahblah/blahblahblah.git ' '/tmp/tmpohcqgm1elookml_tmp/e8911598-aea0-4cf1-a5bd-beafafc35814/checkout\n' " stderr: 'Cloning into '/tmp/tmpohcqgm1elookml_tmp/e8911598-aea0-4cf1-a5bd-beafafc35814/checkout'...\n" "Warning: Permanently added 'github.com,444.44.444.444' (ECDSA) to the list of known hosts.\n" 'fatal: remote error: \n' ' is not a valid repository name\n' ' Visit https://support.github.com/ for help\n' "'\n" 'Traceback (most recent call last):\n' ' File "/tmp/datahub/ingest/venv-lookml-0.9.5/lib/python3.10/site-packages/datahub/entrypoints.py", line 171, in main\n' ' sys.exit(datahub(standalone_mode=False, **kwargs))\n' ' File "/tmp/datahub/ingest/venv-lookml-0.9.5/lib/python3.10/site-packages/click/core.py", line 1130, in __call__\n' I have verified the repo is correct, as well as my ingestion source yaml...
    source:
        type: lookml
        config:
            parse_table_names_from_sql: true
            stateful_ingestion:
                enabled: true
            github_info:
                repo: 'git@github.com:blahblahblah/blahblahblah.git'
                deploy_key: "-----BEGIN RSA PRIVATE KEY-----\blahblahblahblah=\n-----END RSA PRIVATE KEY-----\n"
            project_name: clinical
            api:
                base_url: 'https://sharphealblahblahblah.cloud.looker.com/'
                client_id: blahblahblah
                client_secret: blahblahblah
            connection_to_platform_map:
                snowflake_o:
                    platform: snowflake
                    default_db: dev_mart_db
    Any help on this would be greatly appreciated!
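    One observation on the traceback above: the clone command shows the remote as git@github.com:git@github.com:..., i.e. the full SSH URL has been prefixed a second time. As far as I understand the LookML source docs, github_info.repo expects just the org/repo slug rather than a full git URL, roughly:
    github_info:
        repo: 'blahblahblah/blahblahblah'      # org/repo slug; placeholders kept from the recipe above
        deploy_key: "-----BEGIN RSA PRIVATE KEY-----\n...\n-----END RSA PRIVATE KEY-----\n"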
  • m

    miniature-librarian-48611

    01/04/2023, 7:22 PM
    When data is ingested and successfully imported, assuming it gets stored in the metadata_aspect_v2 table in MySQL, under which aspect name would I find the relevant info for that specific metadata?
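    One way to answer this empirically is to query the table for a single entity and list its aspect rows; a sketch with a placeholder URN:
    -- Aspects stored for one entity in the local metadata store (URN is a placeholder).
    SELECT aspect, version, createdon
    FROM metadata_aspect_v2
    WHERE urn = 'urn:li:dataset:(urn:li:dataPlatform:mysql,mydb.mytable,PROD)'
    ORDER BY aspect, version;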
  • m

    miniature-plastic-43224

    01/05/2023, 12:35 AM
    All, my company wants to extend the existing corpUser entity by adding 2 more fields to it: accountType and employeId. We are also planning to extend the LDAP functionality to populate these 2 fields. I have already created a feature request for accountType, but I don't know if those are really reviewed on a regular basis, so I wanted to discuss it here. We can implement all of this and then contribute it to the community.
  • m

    miniature-plastic-43224

    01/05/2023, 12:47 AM
    All, my company wants to extend the existing corpUser entity by adding 2 more fields to it: "employeId" and "accountType". We are also planning to extend the LDAP functionality to populate these 2 fields during ingestion. There will be no default LDAP attribute mapping for these fields in the code (so they will have the value None), but you can always update the configuration YAML file and provide these mappings based on your corporate LDAP attributes, so they will be populated. I think accountType should be of type String, but I am not quite sure about employeId. Long will work for my company, but String appears to be more flexible (you can find a discussion on GitHub (#4682) about a similar issue related to departmentNumber). Thus, I have 2 questions: 1) What do you think about adding employeId and accountType to the corpUser entity along with the LDAP ingestion updates? 2) What type do you think would better suit "employeId", String or Long? All updates will be made by us, so we can contribute to the community as well. Thank you.
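    For reference, the shape of such a change would be two optional fields added to the corpUser aspect's PDL model, along these lines (a sketch only; field names are taken from this proposal and the exact aspect file is still to be decided):
    // Sketch: hypothetical additions to the corpUser info aspect (.pdl)
    /** Type of the account, e.g. SERVICE or HUMAN; filled from a configurable LDAP attribute mapping */
    accountType: optional string
    /** Employee identifier; filled from a configurable LDAP attribute mapping */
    employeId: optional string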
  • p

    plain-cricket-83456

    01/05/2023, 1:38 AM
    Good morning, everyone. I want to know which configuration options set the Hive database login account and password. Can you give me an example?
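    A minimal Hive recipe with username/password authentication, based on the documented hive source fields (host, credentials and database below are placeholders):
    source:
        type: hive
        config:
            host_port: 'hive-server:10000'   # placeholder host
            database: default
            username: my_hive_user
            password: my_hive_password
    sink:
        type: datahub-rest
        config:
            server: 'http://localhost:8080'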
  • l

    lively-dusk-19162

    01/05/2023, 3:00 AM
    Hello all, could anyone please help me with how to delete an Airflow pipeline and its tasks in DataHub?
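    A sketch of how this can be done with the CLI's delete command, using placeholder URNs for the Airflow DAG (dataFlow) and one of its tasks (dataJob):
    # Soft-delete the pipeline ingested from Airflow (URN is a placeholder).
    datahub delete --urn "urn:li:dataFlow:(airflow,my_dag_id,prod)"
    # Delete one of its tasks; add --hard to remove the rows from the metadata store entirely.
    datahub delete --urn "urn:li:dataJob:(urn:li:dataFlow:(airflow,my_dag_id,prod),my_task_id)" --hard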
  • r

    rapid-city-92351

    01/05/2023, 11:08 AM
    Good day everyone, I am ingesting Snowflake metadata into DataHub and I am wondering whether it's possible to ingest the date and time when a table/view or even a column was added/updated in Snowflake. Not when it was last synced or added into DataHub, but when the data was added to Snowflake. Is that possible? I didn't find anything in the documentation, or maybe I am searching for the wrong term. Thank you 🙂
  • a

    alert-fall-82501

    01/05/2023, 11:19 AM
    Hi Team - can you please help me with a Hive source error during ingestion? Check the log file in the thread.
  • p

    proud-waitress-17589

    01/05/2023, 5:29 PM
    Hi Team, we have two datahub instances running, one for development and one for production. Given there are two separate environments and entities have different URNs between the two environments, I am attempting to parameterize recipe transformers to support passing in URNs as environment variables. It seems that the variable expansion works correctly for the base recipe (connection settings), but my transformers are not being applied. Is this expected, or do transformers also support variable parameterization?
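    For context, the pattern being attempted looks roughly like the following (transformer type and variable name are placeholders); the question is whether ${...} expansion is applied inside the transformers block the same way it is for connection settings:
    transformers:
        - type: simple_add_dataset_ownership
          config:
            owner_urns:
                - '${OWNER_URN}'      # e.g. urn:li:corpuser:data_platform, set per environment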
  • f

    future-iron-16086

    01/05/2023, 4:04 PM
    Hi. I'm trying to ingest Power BI metadata, but I'm getting an error. We followed these steps (https://learn.microsoft.com/en-us/power-bi/developer/embedded/embed-service-principal), but we didn't succeed. Any help?
    sink:
        type: datahub-rest
        config:
            server: 'http://datahub-datahub-gms:8080'
            token: '${TOKEN}'
    source:
        type: powerbi
        config:
            workspace_id_pattern:
                allow:
                    - hidden
            tenant_id: hidden
            dataset_type_mapping:
                PostgreSql: postgres
                Oracle: oracle
                SqlServer: mssql
            client_secret: hidden
            extract_ownership: false
            env: qa
            client_id: hidden
  • h

    hallowed-shampoo-52722

    01/05/2023, 6:14 PM
    Hi Team, I haven't created a DNS entry yet to show our team a demo of the DataHub instance. Is it possible for me to run datahub-frontend on port 443? My org is very restrictive and only a few ports are enabled!
  • r

    rapid-fall-7147

    01/06/2023, 6:05 AM
    Hi Team, is there a YAML configuration to ingest an ML model and a feature table?
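    As far as I know there are ML-platform connectors (Feast, SageMaker, MLflow) rather than a generic YAML source for this; as a fallback, a short Python script can emit an MLModel aspect directly. A minimal sketch with placeholder names and server address:
    # Sketch: push an MLModel entity through the REST emitter (URN and server are placeholders).
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import MLModelPropertiesClass

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
    model_urn = "urn:li:mlModel:(urn:li:dataPlatform:mlflow,my_churn_model,PROD)"
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=model_urn,
            aspect=MLModelPropertiesClass(description="Churn model registered by hand"),
        )
    )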
  • m

    magnificent-notebook-88304

    01/06/2023, 7:49 AM
    Hi All, we are facing several issues while profiling the data (stats) as part of ingestion. One of the issues is that profiling is taking multiple days to complete, primarily because the tables we are trying to profile are indeed very big: billions of rows and many columns. Is there a config somewhere by which I can give more juice to my profiling job by allocating more cores and more memory?
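    The profiler's parallelism is mostly driven from the recipe rather than by raw cores/memory; the knobs I would look at first (GE-based profiling; exact field support varies by source and version, and all values below are illustrative) look roughly like:
    source:
        type: snowflake                      # placeholder source
        config:
            profile_pattern:
                allow:
                    - 'db\.schema\.big_table_we_care_about'   # profile only what is needed
            profiling:
                enabled: true
                max_workers: 20                               # more concurrent profiling queries
                turn_off_expensive_profiling_metrics: true
                profile_table_level_only: false
                limit: 100000                                 # cap rows considered per table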
  • f

    faint-actor-78390

    01/06/2023, 2:25 PM
    Hi all, I'm trying to set up a business glossary import following the documentation and using the CLI, v0.9.1. I got an entry in the ingestion panel but no result.
    GMS config:
    [2023-01-06 15:14:35,802] DEBUG {datahub.entrypoints:204} - GMS config {'models': {}, 'patchCapable': True, 'versions': {'linkedin/datahub': {'version': 'v0.9.1', 'commit': 'c01aa53aa7151c089494c39ce34b0377ab8d0e1b'}}, 'managedIngestion': {'defaultCliVersion': '0.9.1', 'enabled': True}, 'statefulIngestionCapable': True, 'supportsImpactAnalysis': True, 'telemetry': {'enabledCli': True, 'enabledIngestion': False}, 'datasetUrnNameCasing': False, 'retention': 'true', 'datahub': {'serverType': 'quickstart'}, 'noCode': 'true'}
    CLI recipe file:
    source:
        type: datahub-business-glossary
        config:
            # Coordinates
            file: ./business_glossary.yml
            enable_auto_id: true
    After removing the key enable_auto_id and going further, it now seems there is some validation error in the attached sample file. I finally managed this ingestion, but the sample file seems to have a lot of errors. A classical PostgreSQL import poses no issue.
  • a

    adorable-activity-7956

    01/06/2023, 2:42 PM
    If I want to make EditableMLFeatureProperties description searchable, do I need to fork and modify the EditableMLFeatureProperties.pdl? Or is there a way to override the existing definition at deploy/runtime?
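    For reference, searchability is controlled by the @Searchable annotation on the field in the PDL model, along the lines of the snippet below (pattern copied from the editable dataset properties aspect; treat it as a sketch), and as far as I know changing an existing aspect's annotations means rebuilding the model artifacts rather than a pure runtime override:
    // Sketch: add @Searchable to the description field in EditableMLFeatureProperties.pdl
    @Searchable = {
      "fieldType": "TEXT",
      "fieldName": "editedDescription"
    }
    description: optional string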
  • t

    thankful-fireman-70616

    01/06/2023, 3:19 PM
    Column lineage is not supported for PostgreSQL out of the box?
  • p

    polite-art-12182

    01/08/2023, 7:39 PM
    Hi all, I'm looking for help/suggestions on the best way to import event datasets that are defined in an XSD. I have an XSD that defines several complexTypes along with the elements that are the dataset events. Example complexTypes would be "commonMetadataType", "DoubleOrEmptyType" and "DateTimeOrEmptyType", and example events/datasets would be "NewItemEvent", "ItemUpdateEvent" and "MissedUpdateEvent". I'm trying to get the event definitions into DataHub so I can then show:
    • the definition/schema/etc. of the event (I know this could, and maybe should, be done with a schema registry, but I'm hoping not to have to introduce extra tools/technologies if not needed);
    • lineage of data fields from a given event/dataset tied in with other datasets.
    I thought I remembered seeing something in the documentation about using DataHub for datasets that are event-based rather than database-based, but I can't seem to find it anymore. Some of the questions I have are:
    • Are there any general guidelines, suggestions, best practices, etc. for using DataHub in an event-driven architecture to capture the interrelationships between producing and consuming services? For example, showing that service A produces event FOO with data it consumed from events BAR and BAZ.
    • How should the complexTypes that are shared between the events be accounted for? I could make each of them its own dataset that is an upstream dependency of the events, or just break them down within the event-specific dataset, since the complexTypes aren't really datasets but just data structure definitions.
    • Is there a way to automatically build a dataset from an XSD? Right now I'm planning a custom Python script and emitting the dataset to DataHub (see the sketch after this message).
    • Are there ways to ingest datasets from a schema registry such as Apicurio? Again, I'm expecting to have to do a custom recipe for this.
    Any suggestions, thoughts or input would be greatly appreciated. Thanks.
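    A minimal sketch of the custom-script option mentioned above: emit one XSD-derived event as a dataset with a schema aspect via the REST emitter (platform, names and field content are placeholders; the XSD parsing itself is left out):
    # Sketch: register an event type parsed from an XSD as a DataHub dataset.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        OtherSchemaClass,
        SchemaFieldClass,
        SchemaFieldDataTypeClass,
        SchemaMetadataClass,
        StringTypeClass,
    )

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
    urn = make_dataset_urn(platform="events", name="NewItemEvent", env="PROD")

    # Schema aspect built from whatever was parsed out of the XSD.
    schema = SchemaMetadataClass(
        schemaName="NewItemEvent",
        platform="urn:li:dataPlatform:events",
        version=0,
        hash="",
        platformSchema=OtherSchemaClass(rawSchema="<xs:element name='NewItemEvent'>...</xs:element>"),
        fields=[
            SchemaFieldClass(
                fieldPath="commonMetadata.eventTime",
                type=SchemaFieldDataTypeClass(type=StringTypeClass()),
                nativeDataType="DateTimeOrEmptyType",
                description="Example field lifted from a shared complexType",
            )
        ],
    )
    emitter.emit(MetadataChangeProposalWrapper(entityUrn=urn, aspect=schema))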
  • a

    alert-fall-82501

    01/09/2023, 7:04 AM
    Hi Team - I want to exclude a specific schema from ingestion for the Hive source. I tried to do it with schema_pattern.deny but am getting a validation error. Can you please help me with this?
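    For reference, deny takes a list of regexes nested under schema_pattern, and a validation error is often just mis-nesting; the shape I would expect (host and schema names are placeholders):
    source:
        type: hive
        config:
            host_port: 'hive-server:10000'
            schema_pattern:
                deny:
                    - '^tmp_.*'        # regexes, not bare schema names
                    - '^staging$'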