numerous-account-62719
02/16/2023, 4:35 AM
important-afternoon-19755
02/16/2023, 5:54 AM
best-napkin-60434
02/16/2023, 7:04 AM
%6|1676528381.357|FAIL|rdkafka#consumer-1| [thrd:{server}.]: {server}/bootstrap: Disconnected while requesting ApiVersion: might be caused by incorrect security.protocol configuration (connecting to a SSL listener?) or broker version is < 0.10 (see api.version.request) (after 0ms in state APIVERSION_QUERY, 3 identical error(s) suppressed)
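If the listener really is SSL, the clients have to be told so explicitly. A minimal sketch of doing that through the DataHub Helm chart's Kafka configuration overrides, assuming the chart's springKafkaConfigurationOverrides block (verify the exact key names against your chart version):
```yaml
# Sketch only: point DataHub's Kafka clients at an SSL listener.
# The override block below is assumed from the DataHub Helm chart's values;
# check the exact key names for your chart version.
global:
  springKafkaConfigurationOverrides:
    security.protocol: SSL
```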
late-bear-87552
02/16/2023, 7:09 AM
Caused by:
org.elasticsearch.client.ResponseException: method [HEAD], host [http://******:9200], URI [/graph_service_v1?ignore_throttled=false&ignore_unavailable=false&expand_wildcards=open%2Cclosed&allow_no_indices=false], status line [HTTP/1.1 503 Service Unavailable]
Warnings: [Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See <https://www.elastic.co/guide/en/elasticsearch/reference/7.16/security-minimal-setup.html> to enable security., [ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices.]
at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:302)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:272)
fierce-baker-1392
02/16/2023, 7:56 AM
Error: UPGRADE FAILED: YAML parse error on datahub/charts/datahub-ingestion-cron/templates/cron.yaml: error converting YAML to JSON: yaml: line 36: found unexpected end of stream
helm.go:84: [debug] error converting YAML to JSON: yaml: line 36: found unexpected end of stream
YAML parse error on datahub/charts/datahub-ingestion-cron/templates/cron.yaml
image:
  repository: linkedin/datahub-ingestion
  tag:
  pullPolicy: IfNotPresent
imagePullSecrets: []
crons:
  glossary:
    schedule: "0 1 * * *"
    recipe:
      configmapName: recipe-conf
      fileName: business_glossary.recipe.yaml
    command: ["/bin/sh", "-c", "pip install 'acryl-datahub[datahub-business-glossary]'; datahub ingest -c business_glossary.recipe.yaml"]
    extraVolumes:
      - name: recipe-conf-volume
        configMap:
          name: recipe-conf
    extraVolumeMounts:
      - name: recipe-conf-volume
        mountPath: /etc/recipe/data/business_glossary_dimension.yaml
        subPath: business_glossary_dimension.yaml
        readOnly: true
global:
  datahub:
    version: head
enough-bear-93481
02/16/2023, 11:29 AM
billowy-flag-4217
02/16/2023, 11:38 AM
many-solstice-66904
02/16/2023, 3:54 PM
avro_schema_to_mce_fields
exists in Python, but I see no corresponding functionality in the Java library.
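For context, a minimal sketch of the Python helper in question; the import path is the one recent acryl-datahub releases use, so verify it against your installed version:
```python
import json

# Assumed import path for recent acryl-datahub releases.
from datahub.ingestion.extractor.schema_util import avro_schema_to_mce_fields

avro_schema = json.dumps(
    {
        "type": "record",
        "name": "User",
        "fields": [
            {"name": "id", "type": "long"},
            {"name": "email", "type": ["null", "string"], "default": None},
        ],
    }
)

# Converts the Avro schema into a list of SchemaField objects
# suitable for a SchemaMetadata aspect.
fields = avro_schema_to_mce_fields(avro_schema)
for field in fields:
    print(field.fieldPath, field.nativeDataType)
```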
Thanks in advance!
purple-printer-15193
02/16/2023, 4:20 PM
cuddly-kite-88848
02/16/2023, 5:29 PM
dazzling-microphone-98929
02/16/2023, 6:48 PM
lively-dusk-19162
02/16/2023, 7:52 PM
lively-dusk-19162
02/16/2023, 9:08 PM
best-wire-59738
02/17/2023, 5:58 AM
numerous-computer-7054
02/17/2023, 8:06 AM
source:
  type: mssql
  config:
    database: AdventureWorksLT2019
    username: datahub_test
    password: '${mssql_password}'
    host_port: '172.17.189.15:1433'
I've tried using the host's IP address, localhost, and the SQL server name, but nothing works.
I keep getting this error when using an IP address:
PipelineInitError: Failed to configure the source (mssql): (pytds.tds_base.LoginError) ("Cannot connect to server '172.17.189.15': timed out", TimeoutError('timed out'))
(Background on this error at: <https://sqlalche.me/e/14/e3q8>)
or this if using the server name:
PipelineInitError: Failed to configure the source (mssql): (pytds.tds_base.LoginError) ("Cannot connect to server 'DESKTOP-9LGDO7K': [Errno 22] Invalid argument", OSError(22, 'Invalid argument'))
(Background on this error at: <https://sqlalche.me/e/14/e3q8>)
The SQL Server instance is configured correctly: port 1433 is open, and I can connect to it successfully from clients other than DataHub, such as from Python.
What am I missing / doing wrong?
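One thing that might help narrow it down: a minimal connectivity check using the same driver stack the mssql source uses by default (SQLAlchemy + python-tds), run from the exact host or container that executes the ingestion, since that is where the timeout happens. The password below is a placeholder:
```python
# Sketch only: test connectivity with the same SQLAlchemy/pytds stack the
# DataHub mssql source uses by default. Replace the placeholder password.
from sqlalchemy import create_engine

engine = create_engine(
    "mssql+pytds://datahub_test:<password>@172.17.189.15:1433/AdventureWorksLT2019"
)
with engine.connect() as conn:
    print(conn.execute("SELECT @@VERSION").scalar())
```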
Thanks!
creamy-van-28626
02/17/2023, 10:17 AM
red-waitress-53338
02/17/2023, 12:35 PM
dazzling-microphone-98929
02/17/2023, 12:51 PM
blue-crowd-84759
02/17/2023, 4:37 PM
red-waitress-53338
02/17/2023, 5:13 PM
blue-agency-87812
02/17/2023, 6:38 PM
adorable-river-99503
02/17/2023, 7:19 PM
gifted-bear-4760
02/19/2023, 10:43 AM
acceptable-rain-30599
02/20/2023, 1:08 AM
acceptable-rain-30599
02/20/2023, 1:10 AM
numerous-account-62719
02/20/2023, 5:27 AM
``` 'OperationalError: (pyhive.exc.OperationalError) TOpenSessionResp(status=TStatus(statusCode=3, '
"infoMessages=['*org.apache.hive.service.cli.HiveSQLException:Failed to open new session: org.apache.hadoop.fs.s3a.AWSS3IOException: "
'getFileStatus on s3a://hivemr3-6054f65b-976f-4d25-8a0e-ea0a33898569/workdirtest: com.amazonaws.services.s3.model.AmazonS3Exception: '
'Gateway Time-out (Service: Amazon S3; Status Code: 504; Error Code: 504 Gateway Time-out; Request ID: null; S3 Extended Request ID: '
'null; Proxy: null), S3 Extended Request ID: null:504 Gateway Time-out: Gateway Time-out (Service: Amazon S3; Status Code: 504; Error '
"Code: 504 Gateway Time-out; Request ID: null; S3 Extended Request ID: null; Proxy: null)1413', "
"'org.apache.hive.service.cli.session.SessionManagercreateSessionSessionManager.java:434', "
"'org.apache.hive.service.cli.session.SessionManageropenSessionSessionManager.java:373', "
"'org.apache.hive.service.cli.CLIServiceopenSessionCLIService.java:187', "
"'org.apache.hive.service.cli.thrift.ThriftCLIServicegetSessionHandleThriftCLIService.java:480', "
"'org.apache.hive.service.cli.thrift.ThriftCLIServiceOpenSessionThriftCLIService.java:322', "
"'org.apache.hive.service.rpc.thrift.TCLIService$Processor$OpenSessiongetResultTCLIService.java:1497', "
"'org.apache.hive.service.rpc.thrift.TCLIService$Processor$OpenSessiongetResultTCLIService.java:1482', "
"'org.apache.thrift.ProcessFunctionprocessProcessFunction.java:39', 'org.apache.thrift.TBaseProcessorprocessTBaseProcessor.java:39', "
"'org.apache.hive.service.auth.TSetIpAddressProcessorprocessTSetIpAddressProcessor.java:56', "
"'org.apache.thrift.server.TThreadPoolServer$WorkerProcessrunTThreadPoolServer.java:286', "
"'java.util.concurrent.ThreadPoolExecutorrunWorkerThreadPoolExecutor.java:1149', "
"'java.util.concurrent.ThreadPoolExecutor$WorkerrunThreadPoolExecutor.java:624', 'java.lang.ThreadrunThread.java:748', "
"'*java.lang.RuntimeExceptionorg.apache.hadoop.fs.s3a.AWSS3IOException getFileStatus on "
's3a://hivemr3-6054f65b-976f-4d25-8a0e-ea0a33898569/workdirtest: com.amazonaws.services.s3.model.AmazonS3Exception: Gateway Time-out '
'(Service: Amazon S3; Status Code: 504; Error Code: 504 Gateway Time-out; Request ID: null; S3 Extended Request ID: null; Proxy: null), '
'S3 Extended Request ID: null:504 Gateway Time-out: Gateway Time-out (Service: Amazon S3; Status Code: 504; Error Code: 504 Gateway '
"Time-out; Request ID: null; S3 Extended Request ID: null; Proxy: null)173', "
"'org.apache.hadoop.hive.ql.session.SessionStatestartSessionState.java:652', "
"'org.apache.hadoop.hive.ql.session.SessionStatestartSessionState.java:593', "
"'org.apache.hive.service.cli.session.HiveSessionImplopenHiveSessionImpl.java:171', "
"'org.apache.hive.service.cli.session.SessionManagercreateSessionSessionManager.java:425', "
"'*org.apache.hadoop.fs.s3a.AWSS3IOException:getFileStatus on s3a://hivemr3-6054f65b-976f-4d25-8a0e-ea0a33898569/workdirtest: "
'com.amazonaws.services.s3.model.AmazonS3Exception: Gateway Time-out (Service: Amazon S3; Status Code: 504; Error Code: 504 Gateway '
'Time-out; Request ID: null; S3 Extended Request ID: null; Proxy: null), S3 Extended Request ID: null:504 Gateway Time-out: Gateway '
'Time-out (Service: Amazon S3; Status Code: 504; Error Code: 504 Gateway Time-out; Request ID: null; S3 Extended Request ID: null; Proxy: '
"null)2710', 'org.apache.hadoop.fs.s3a.S3AUtilstranslateExceptionS3AUtils.java:265', "
"'org.apache.hadoop.fs.s3a.S3AUtilstranslateExceptionS3AUtils.java:145', "
"'org.apache.hadoop.fs.s3a.S3AFileSystems3GetFileStatusS3AFileSystem.java:2248', "
"'org.apache.hadoop.fs.s3a.S3AFileSysteminnerGetFileStatusS3AFileSystem.java:2149', "
"'org.apache.hadoop.fs.s3a.S3AFileSystemgetFileStatusS3AFileSystem.java:2088', "
"'org.apache.hadoop.fs.FileSystemexistsFileSystem.java:1683', 'org.apache.hadoop.fs.s3a.S3AFileSystemexistsS3AFileSystem.java:2976', "
"'org.apache.hadoop.hive.ql.exec.UtilitiesensurePathIsWritableUtilities.java:4484', "
"'org.apache.hadoop.hive.ql.session.SessionStatecreateRootHDFSDirSessionState.java:731', "
"'org.apache.hadoop.hive.ql.session.SessionStatecreateSessionDirsSessionState.java:672', "
"'org.apache.hadoop.hive.ql.session.SessionStatestartSessionState.java:628', "
"'*com.amazonaws.services.s3.model.AmazonS3Exception:Gateway Time-out (Service: Amazon S3; Status Code: 504; Error Code: 504 Gateway "
"Time-out; Request ID: null; S3 Extended Request ID: null; Proxy: null)4419', "
"'com.amazonaws.http.AmazonHttpClient$RequestExecutorhandleErrorResponseAmazonHttpClient.java:1811', "
"'com.amazonaws.http.AmazonHttpClient$RequestExecutorhandleServiceErrorResponseAmazonHttpClient.java:1395', "
"'com.amazonaws.http.AmazonHttpClient$RequestExecutorexecuteOneRequestAmazonHttpClient.java:1371', "
"'com.amazonaws.http.AmazonHttpClient$RequestExecutorexecuteHelperAmazonHttpClient.java:1145', "
"'com.amazonaws.http.AmazonHttpClient$RequestExecutordoExecuteAmazonHttpClient.java:802', "
"'com.amazonaws.http.AmazonHttpClient$RequestExecutorexecuteWithTimerAmazonHttpClient.java:770', "
"'com.amazonaws.http.AmazonHttpClient$RequestExecutorexecuteAmazonHttpClient.java:744', "
"'com.amazonaws.http.AmazonHttpClient$RequestExecutor:access$500AmazonHttpClient.java704', "
"'com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImplexecuteAmazonHttpClient.java:686', "
"'com.amazonaws.http.AmazonHttpClientexecuteAmazonHttpClient.java:550', "
"'com.amazonaws.http.AmazonHttpClientexecuteAmazonHttpClient.java:530', "
"'com.amazonaws.services.s3.AmazonS3ClientinvokeAmazonS3Client.java:5062', "
"'com.amazonaws.services.s3.AmazonS3ClientinvokeAmazonS3Client.java:5008', "
"'com.amazonaws.services.s3.AmazonS3ClientinvokeAmazonS3Client.java:5002', "
"'com.amazonaws.services.s3.AmazonS3ClientlistObjectsV2AmazonS3Client.java:941', "
"'org.apache.hadoop.fs.s3a.S3AFileSystem:lambda$listObjects$5S3AFileSystem.java1262', "
"'org.apache.hadoop.fs.s3a.InvokerretryUntranslatedInvoker.java:322', "
"'org.apache.hadoop.fs.s3a.InvokerretryUntranslatedInvoker.java:285', "
"'org.apache.hadoop.fs.s3a.S3AFileSystemlistObjectsS3AFileSystem.java:1255', "
"'org.apache.hadoop.fs.s3a.S3AFileSystems3GetFileStatusS3AFileSystem.java:2223'], sqlState=None, errorCode=0, errorMessage='Failed to "
'open new session: org.apache.hadoop.fs.s3a.AWSS3IOException: getFileStatus on '
's3a://hivemr3-6054f65b-976f-4d25-8a0e-ea0a33898569/workdirtest: com.amazonaws.services.s3.model.AmazonS3Exception: Gateway Time-out '
'(Service: Amazon S3; Status Code: 504; Error Code: 504 Gateway Time-out; Request ID: null; S3 Extended Request ID: null; Proxy: null), '
'S3 Extended Request ID: null:504 Gateway Time-out: Gateway Time-out (Service: Amazon S3; Status Code: 504; Error Code: 504 Gateway '
"Time-out; Request ID: null; S3 Extended Request ID: null; Proxy: null)'), serverProtocolVersion=9, sessionHandle=None, "
'configuration=None)\n'
'(Background on this error at: http://sqlalche.me/e/13/e3q8)\n'
'[2023-02-17 135342,619] INFO {datahub.entrypoints:187} - DataHub CLI version: 0.8.41 at '
'/tmp/datahub/ingest/venv-8fa49a4b-8775-4d62-874f-cbe22e5a07c8/lib/python3.9/site-packages/datahub/__init__.py\n'
'[2023-02-17 135342,619] INFO {datahub.entrypoints:190} - Python version: 3.9.9 (main, Dec 21 2021, 100334) \n'
'[GCC 10.2.1 20210110] at /tmp/datahub/ingest/venv-8fa49a4b-8775-4d62-874f-cbe22e5a07c8/bin/python3 on '
'Linux-4.18.0-305.49.1.el8_4.x86_64-x86_64-with-glibc2.31\n'
"[2023-02-17 135342,619] INFO {datahub.entrypoints:193} - GMS config {'models': {}, 'versions': {'linkedin/datahub': {'version': "
"'v0.8.41', 'commit': '6e07ec59242abf53e237183319a01ef3b1f708a9'}}, 'managedIngestion': {'defaultCliVersion': '0.8.41', 'enabled': True}, "
"'statefulIngestionCapable': True, 'supportsImpactAnalysis': False, 'telemetry': {'enabledCli': True, 'enabledIngestion': False}, "
"'datasetUrnNameCasing': False, 'retention': 'true', 'datahub': {'serverType': 'prod'}, 'noCode': 'true'}\n",
"2023-02-17 135343.873169 [exec_id=8fa49a4b-8775-4d62-874f-cbe22e5a07c8] INFO: Failed to execute 'datahub ingest'",
'2023-02-17 135343.877338 [exec_id=8fa49a4b-8775-4d62-874f-cbe22e5a07c8] INFO: Caught exception EXECUTING '
'task_id=8fa49a4b-8775-4d62-874f-cbe22e5a07c8, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
' File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 121, in execute_task\n'
' self.event_loop.run_until_complete(task_future)\n'
' File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete\n'
' return f.result()\n'
' File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
' raise self._exception\n'
' File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
' result = coro.send(None)\n'
' File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 115, in execute\n'
' raise TaskError("Failed to execute \'datahub ingest\'")\n'
"acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}```
better-table-69560
02/20/2023, 12:31 PM
quaint-appointment-83049
02/20/2023, 12:36 PM
def __init__(self, **data: Any):
    super().__init__(**data)
    if self.credential:
        self._credentials_path = self.credential.create_credential_temp_file()
        logger.debug(
            f"Creating temporary credential file at {self._credentials_path}"
        )
        os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = self._credentials_path
Please avoid replacing the GOOGLE_APPLICATION_CREDENTIALS environment variable, as it is used internally and propagated to other services; overwriting it is not a good approach.
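For illustration, a minimal sketch of the alternative being asked for here: passing the credentials object to the client explicitly instead of mutating the process-wide environment variable. The temp-file path is hypothetical:
```python
# Sketch only: build a BigQuery client from an explicit credentials object
# rather than overwriting GOOGLE_APPLICATION_CREDENTIALS for the whole process.
from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    "/tmp/datahub-bigquery-credential.json"  # hypothetical temp-file path
)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)
```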
Can someone help us to resolve this issue ASAP? Thank you in advance.
ambitious-shoe-92590
02/20/2023, 8:08 PM
Struct
type with "pull"-based ingestion. I have a nested field called data
which contains a number of child key:value pairs. When I ingest this data with DataHub, the resulting dataset has the data field, but it cannot be expanded.
I've read up on Field Paths and the differences between v1 and v2 paths, but I am a bit confused about how to actually get to the point of being able to "expand" the nested struct. It seems like emitters are used in some examples, but my understanding is that those are for manually adding fields to the schema?
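In case it helps, a minimal sketch (using the Python SDK's schema classes, not necessarily what the S3 source emits internally) of how a nested struct shows up as separate, dot-delimited fieldPath entries; it is the presence of the child entries alongside the parent that lets the UI expand the field:
```python
# Sketch only: a parent struct field plus a child field, expressed as separate
# SchemaField entries with dot-delimited (v1-style) field paths.
from datahub.metadata.schema_classes import (
    RecordTypeClass,
    SchemaFieldClass,
    SchemaFieldDataTypeClass,
    StringTypeClass,
)

fields = [
    SchemaFieldClass(
        fieldPath="data",
        type=SchemaFieldDataTypeClass(type=RecordTypeClass()),
        nativeDataType="struct",
    ),
    SchemaFieldClass(
        fieldPath="data.child_key",  # hypothetical child key name
        type=SchemaFieldDataTypeClass(type=StringTypeClass()),
        nativeDataType="string",
    ),
]
```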
Any help would be appreciated; the data is coming from an S3 source, if that makes a difference.
lively-dusk-19162
02/20/2023, 8:15 PM