# ingestion
  • a

    alert-electrician-67912

    01/30/2023, 4:12 PM
    Hi team, I tried to ingest data lineage using the REST emitter, following https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/lineage_dataset_job_dataset.py
    Copy code
    from typing import List
    
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.com.linkedin.pegasus2avro.datajob import DataJobInputOutputClass
    
    # Construct the DataJobInputOutput aspect.
    input_datasets: List[str] = [
        builder.make_dataset_urn(platform="mysql", name="librarydb.member", env="PROD"),
        builder.make_dataset_urn(platform="mysql", name="librarydb.checkout", env="PROD"),
    ]
    
    output_datasets: List[str] = [
        builder.make_dataset_urn(
            platform="kafka", name="debezium.topics.librarydb.member_checkout", env="PROD"
        )
    ]
    
    input_data_jobs: List[str] = [
        builder.make_data_job_urn(
            orchestrator="airflow", flow_id="flow1", job_id="job0", cluster="PROD"
        )
    ]
    
    datajob_input_output = DataJobInputOutputClass(
        inputDatasets=input_datasets,
        outputDatasets=output_datasets,
        inputDatajobs=input_data_jobs,
    )
    
    # Construct a MetadataChangeProposalWrapper object.
    # NOTE: This will overwrite all of the existing lineage information associated with this job.
    datajob_input_output_mcp = MetadataChangeProposalWrapper(
        entityUrn=builder.make_data_job_urn(
            orchestrator="airflow", flow_id="flow1", job_id="job1", cluster="PROD"
        ),
        aspect=datajob_input_output,
    )
    
    # Create an emitter to the GMS REST API.
    emitter = DatahubRestEmitter("http://localhost:8080")
    
    # Emit metadata!
    emitter.emit_mcp(datajob_input_output_mcp)
    but I encountered the following error:
    Copy code
    OperationalError(
    datahub.configuration.common.OperationalError: ('Unable to emit metadata to DataHub GMS', {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:500]: INTERNAL SERVER ERROR\n\tat com.linkedin.restli.internal.server.RestLiMethodInvoker.doInvoke(RestLiMethodInvoker.java:210)\n\tat com.linkedin.restli.internal.server.RestLiMethodInvoker.invoke(RestLiMethodInvoker.java:333)\n\tat com.linkedin.restli.internal.server.filter.FilterChainDispatcherImpl.onRequestSuccess(FilterChainDispatcherImpl.java:47)\n\tat com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.onRequest(RestLiFilterChainIterator.java:86)\n\tat com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.lambda$onRequest$0(RestLiFilterChainIterator.java:73)\n\tat java.base/java.util.concurrent.CompletableFuture.uniAcceptNow(CompletableFuture.java:753)\n\tat java.base/java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:731)\n\tat java.base/java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2108)\n\tat com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.onRequest(RestLiFilterChainIterator.java:72)\n\tat com.linkedin.restli.internal.server.filter.RestLiFilterChain.onRequest(RestLiFilterChain.java:55)\n\tat com.linkedin.restli.server.BaseRestLiServer.handleResourceRequest(BaseRestLiServer.java:262)\n\tat com.linkedin.restli.server.RestRestLiServer.handleResourceRequestWithRestLiResponse(RestRestLiServer.java:294)\n\tat com.linkedin.restli.server.RestRestLiServer.handleResourceRequest(RestRestLiServer.java:262)\n\tat com.linkedin.restli.server.RestRestLiServer.handleResourceRequest(RestRestLiServer.java:232)\n\tat com.linkedin.restli.server.RestRestLiServer.doHandleRequest(RestRestLiServer.java:215)\n\tat com.linkedin.restli.server.RestRestLiServer.handleRequest(RestRestLiServer.java:171)\n\tat com.linkedin.restli.server.RestLiServer.handleRequest(RestLiServer.java:130)\n\tat com.linkedin.restli.server.DelegatingTransportDispatcher.handleRestRequest(DelegatingTransportDispatcher.java:70)\n\tat com.linkedin.r2.filter.transport.DispatcherRequestFilter.onRestRequest(DispatcherRequestFilter.java:70)\n\tat com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:76)\n\tat com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146)\n\tat com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132)\n\tat com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62)\n\tat com.linkedin.r2.filter.TimedNextFilter.onRequest(TimedNextFilter.java:55)\n\tat com.linkedin.r2.filter.transport.ServerQueryTunnelFilter.onRestRequest(ServerQueryTunnelFilter.java:58)\n\tat com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:76)\n\tat com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146)\n\tat com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132)\n\tat com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62)\n\tat com.linkedin.r2.filter.TimedNextFilter.onRequest(TimedNextFilter.java:55)\n\tat com.linkedin.r2.filter.message.rest.RestFilter.onRestRequest(RestFilter.java:50)\n\tat com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:76)\n\tat 
com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146)\n\tat com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132)\n\tat com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62)\n\tat com.linkedin.r2.filter.FilterChainImpl.onRestRequest(FilterChainImpl.java:106)\n\tat com.linkedin.r2.filter.transport.FilterChainDispatcher.handleRestRequest(FilterChainDispatcher.java:75)\n\tat com.linkedin.r2.util.finalizer.RequestFinalizerDispatcher.handleRestRequest(RequestFinalizerDispatcher.java:61)\n\tat com.linkedin.r2.transport.http.server.HttpDispatcher.handleRequest(HttpDispatcher.java:101)\n\tat com.linkedin.r2.transport.http.server.AbstractR2Servlet.service(AbstractR2Servlet.java:105)\n\tat javax.servlet.http.HttpServlet.service(HttpServlet.java:790)\n\tat com.linkedin.restli.server.spring.ParallelRestliHttpRequestHandler.handleRequest(ParallelRestliHttpRequestHandler.java:63)\n\tat org.springframework.web.context.support.HttpRequestHandlerServlet.service(HttpRequestHandlerServlet.java:73)\n\tat javax.servlet.http.HttpServlet.service(HttpServlet.java:790)\n\tat org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)\n\tat org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1631)\n\tat com.datahub.authentication.filter.AuthenticationFilter.doFilter(AuthenticationFilter.java:88)\n\tat org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)\n\tat org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)\n\tat org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:600)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)\n\tat org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)\n\tat org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)\n\tat org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)\n\tat org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat org.eclipse.jetty.server.Server.handle(Server.java:516)\n\tat org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)\n\tat org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)\n\tat org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)\n\tat org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)\n\tat 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)\n\tat org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)\n\tat org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)\n\tat org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)\n\tat java.base/java.lang.Thread.run(Thread.java:829)\nCaused by: java.lang.RuntimeException: Failed to validate entity URN urn:li:dataPlatform:TestPlatform\n\tat com.linkedin.metadata.utils.EntityKeyUtils.getUrnFromProposal(EntityKeyUtils.java:37)\n\tat com.linkedin.metadata.entity.AspectUtils.getAdditionalChanges(AspectUtils.java:39)\n\tat com.linkedin.metadata.resources.entity.AspectResource.ingestProposal(AspectResource.java:145)\n\tat jdk.internal.reflect.GeneratedMethodAccessor63.invoke(Unknown Source)\n\tat java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.base/java.lang.reflect.Method.invoke(Method.java:566)\n\tat com.linkedin.restli.internal.server.RestLiMethodInvoker.doInvoke(RestLiMethodInvoker.java:177)\n\t... 81 more\n, 'message': 'INTERNAL SERVER ERROR', 'status': 500})
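    The caused-by line is the key part: GMS rejects urn:li:dataPlatform:TestPlatform, which is a platform URN rather than a dataset or data job URN, so the failing run most likely passed a bare platform name or platform URN where a full entity URN was expected. For reference, a minimal sketch (reusing the example values above) of the entity URNs the builder helpers produce:
    Copy code
    import datahub.emitter.mce_builder as builder
    
    # Dataset URNs wrap the platform URN together with the dataset name and environment.
    print(builder.make_dataset_urn(platform="mysql", name="librarydb.member", env="PROD"))
    # -> urn:li:dataset:(urn:li:dataPlatform:mysql,librarydb.member,PROD)
    
    # Data job URNs wrap the data flow URN together with the job id.
    print(builder.make_data_job_urn(orchestrator="airflow", flow_id="flow1", job_id="job1", cluster="PROD"))
    # -> urn:li:dataJob:(urn:li:dataFlow:(airflow,flow1,PROD),job1)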
    h
    • 2
    • 1
  • b

    bright-beard-86474

    01/30/2023, 10:50 PM
    Hi Team! I’m exploring the S3 Data Lake module. I noticed that the extensions are hardcoded here. What if I have
    .tab
    for tab-separated files instead of
    tsv
    ? Is there a workaround? My other question is about parquet file ingestion: in my case the files have no extension, so how can they be processed? The
    Header=True
    is also hardcoded; what if my files do not have headers? Thanks!
    h
    • 2
    • 2
  • f

    future-analyst-98466

    01/31/2023, 2:55 AM
    I'm ingesting the sample business_glossary.yml file into DataHub, but I'm getting the error "file or directory at path "/tmp/business_glossary.yml" does not exist", even though the file is present in the /tmp directory. Can you help me find the cause?
    ✅ 1
    g
    f
    • 3
    • 8
  • f

    future-analyst-98466

    01/31/2023, 2:55 AM
    I'm following this guide: https://datahubproject.io/docs/generated/ingestion/sources/business-glossary#config-details
    plus1 1
  • p

    plain-france-42647

    01/31/2023, 4:42 AM
    Hi Team - I want to play with DataHub and run an extractor directly from Python code (i.e. the equivalent of creating a recipe and running it via the datahub CLI). Are there any examples of how to do this? (I looked at the emitter examples, but those are just for sending data to DataHub, and I’m looking more into just collecting the data and parsing it as a starting point.)
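    A minimal sketch of what this can look like, using the Pipeline class that the datahub CLI itself drives (the MySQL source config is a hypothetical placeholder, and the file sink simply writes the collected metadata to local JSON so it can be inspected without a DataHub instance):
    Copy code
    from datahub.ingestion.run.pipeline import Pipeline
    
    # The same recipe you would normally write in YAML, expressed as a dict.
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {
                    "host_port": "localhost:3306",
                    "username": "datahub",
                    "password": "datahub",
                },
            },
            # Write events to a local file instead of sending them to DataHub.
            "sink": {"type": "file", "config": {"filename": "./mysql_metadata.json"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()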
    ✅ 2
    b
    g
    • 3
    • 13
  • l

    limited-forest-73733

    01/31/2023, 6:52 AM
    Hey team! We are facing JAR package vulnerabilities in the datahub-actions, datahub-upgrade, datahub-ingestion, datahub-frontend, datahub-mae-consumer and datahub-mce-consumer images, e.g. snakeyaml, jackson-mapper-asl, jackson-databind, log4j, hadoop-common and many others. We are about to release to production but are stuck due to these vulnerabilities, and we can’t fix them ourselves either. Can anyone please help me out? Thanks in advance
    b
    m
    e
    • 4
    • 7
  • m

    microscopic-machine-90437

    01/31/2023, 9:00 AM
    Hi Team, I'm trying to delete the metadata for dbt & Snowflake. I removed the ingestion sources from the UI and then deleted the metadata using the CLI (by containers, entities, env, rollback, and every other way I could find). However, I can still see the dbt and Snowflake datasets in the UI. When I open them, I don't see the schema or other details, but I can still see the lineage. Can someone suggest how I can get rid of this metadata in the UI? Thanks...!
    ✅ 1
    h
    • 2
    • 13
  • h

    high-nail-23255

    01/31/2023, 9:17 AM
    I'm trying to ingest Hive metadata into DataHub, but my Hive cluster has Kerberos enabled and I don't know how to configure the connection.
    h
    • 2
    • 1
  • b

    best-umbrella-88325

    01/31/2023, 9:43 AM
    Hey Community! We are trying to ingest metadata from our Postgres source with profiling enabled. However, the ingestion keeps failing with a 'too many connections' error. Does the ingestion job not close its connections while it is profiling data? When we disable profiling, the ingestion succeeds. Can someone help out with this please? Thanks in advance!
    Copy code
    connection to server at "XXXX.rds.amazonaws.com" (XX.XXX.XXX.XXX), port 5432 failed: FATAL:  too many connections for role "XXX"
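    A hedged pointer (field names taken from the GE-based profiling config; verify them against the docs for your CLI version): profiling runs many queries in parallel, so capping the worker count and limiting what gets profiled usually keeps the connection count under the Postgres role limit. In dict/recipe form, under source.config:
    Copy code
    # Hypothetical fragment of the Postgres source config (same keys apply in a YAML recipe).
    postgres_source_config = {
        "host_port": "XXXX.rds.amazonaws.com:5432",  # placeholder
        "profiling": {
            "enabled": True,
            "max_workers": 5,                  # fewer parallel profiling queries, fewer open connections
            "profile_table_level_only": True,  # row counts only; drop per-column stats if acceptable
        },
    }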
    ✅ 1
    h
    • 2
    • 3
  • s

    shy-dog-84302

    01/31/2023, 12:13 PM
    Hi! I am using DataHub 0.9.6.1. While trying to ingest Kafka metadata from the UI I am getting the following error in 🧵. It looks like the ingestion executor is not able to write the recipe file to the
    /tmp
    folder due to insufficient permissions. How can I fix this?
    ✅ 1
    h
    b
    a
    • 4
    • 17
  • a

    adventurous-angle-99988

    01/31/2023, 3:05 PM
    Hi Team, I have deployed datahub version
    v0.9.6.1
    using docker quick start. I'm facing issues while triggering ingestion for Looker. Getting
    ERROR: No matching distribution found for acryl-datahub[datahub-kafka,datahub-rest,looker]==0.9.6-1
    (Full log attached) Deployment details.
    Copy code
    OS: Ubuntu 18.04
    DATAHUB_VERSION=v0.9.6.1 
    ACTIONS_VERSION=v0.0.11
    Would anyone be able to point me towards the right solution?
    exec-urn_li_dataHubExecutionRequest_5d045605-4718-40df-964b-956906823c2a.log
    ✅ 1
    • 1
    • 1
  • b

    best-umbrella-88325

    01/31/2023, 3:59 PM
    Hello Community! Got a question around S3 ingestion here. We're trying to ingest metadata from an S3 bucket which contains partitioned data, just like we have in Hive. We referred to this link https://datahubproject.io/docs/generated/ingestion/sources/s3 where it's mentioned to use the include path like this: s3://<bucket>/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/*.parquet. The directory structure looks something like this for our bucket
    Copy code
    table=test_tbl1
       - year=2022
         - month=01
           - day=15
             - file1.parquet
    table=test_tbl2
       - year=2022
         - month=01
           - day=01
             - test1.parquet
           - day=02
             - test2.parquet
         - month=02
           - day=02
             - test3.parquet
    But here's the question: we have 2 files in the month=01 directory for test_tbl2 and 1 file in the month=02 directory for test_tbl2, but only the contents of test1.parquet are loaded; the ingestion seems to ignore the other files in the bucket (test2.parquet and test3.parquet). We only get metadata for test1.parquet and file1.parquet. Does anyone have any idea what we are missing here? The S3 recipe being used is in the 🧵
    ✅ 1
    h
    • 2
    • 20
  • h

    hallowed-farmer-50988

    01/31/2023, 5:15 PM
    Hello! I also have a question related to S3 ingestion. I’m currently trying to ingest a path where the files have no extension in the name. Those files are the output of an Athena CTAS query. The
    path_spec.include
    would be something like this
    s3://bucket-name/{table}/*
    , but that is not a valid path for the S3 source. Has anyone come across the same thing before? Any suggestions for ingesting this?
    h
    • 2
    • 6
  • b

    bland-barista-59197

    01/31/2023, 7:28 PM
    Hi Team, when I ingest table metadata from BigQuery with profiling enabled, I do not see stats like min and max. Could you please help me with this?
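    For reference, a hedged sketch of the profiling flags that control those column statistics (names from the GE-based profiling config; worth double-checking against the BigQuery source docs for your version). If profile_table_level_only is set, or the min/max flags are disabled, only table-level stats show up:
    Copy code
    # Hypothetical fragment of the BigQuery source config (same keys apply in a YAML recipe).
    bigquery_source_config = {
        "profiling": {
            "enabled": True,
            "profile_table_level_only": False,  # must be False to get per-column stats
            "include_field_min_value": True,
            "include_field_max_value": True,
        },
    }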
    ✅ 1
    h
    • 2
    • 14
  • l

    lively-dusk-19162

    01/31/2023, 8:20 PM
    Hello team, I am trying to ingest Glue job metadata into DataHub. I tried ingesting, but I end up with just the raw URN showing up in the UI. Is there any other way to overcome that?
    ✅ 1
    h
    • 2
    • 10
  • l

    lively-dusk-19162

    01/31/2023, 8:38 PM
    Can anyone show me how it looks in the UI when Glue metadata is ingested using a recipe?
    ✅ 1
  • e

    elegant-state-4

    01/31/2023, 8:58 PM
    Hey folks! I am trying to publish metadata to DataHub using the OpenAPI interface. I understand I need to generate the API and model classes using a codegen tool of my choosing. Can someone recommend a good codegen tool they have used for this purpose and, if possible, sample code to demonstrate? I tried using the openapi-generator but I am running into some compilation issues. Any help would be appreciated
    ✅ 1
    a
    b
    • 3
    • 2
  • n

    numerous-ram-92457

    01/31/2023, 9:34 PM
    Hey all 👋🏼, having issues seeing column lineage flow from Snowflake through to Looker. We can see lineage for tables contained within Snowflake and lineage for Looker views/explores for things contained within Looker; however, no connection between the two. For the LookML ingestion, we’re using the UI and have a user account created with admin privileges. Is there anything else that could be impacting the full lineage connection from showing? All of our current ingestions are successfully running (Snowflake, Looker, and LookML).
    ✅ 1
    h
    • 2
    • 15
  • l

    limited-forest-73733

    02/01/2023, 8:08 AM
    Hey team, we raised issues on GitHub regarding vulnerabilities. GitHub links: GHSA-92v9-rh86-wgrv (spring-web, coming in a lot of images, 1 C), GHSA-x8vw-cfpp-5wxg (Apache Airflow, critical vulnerability), https://github.com/datahub-project/datahub/security/advisories/ GHSA-3p2c-f3j7-cxjm (jackson-databind, coming from the pyspark site-packages jar folder). Can someone please help us out by prioritising our issue?
    b
    • 2
    • 2
  • q

    quiet-jelly-11365

    02/01/2023, 9:10 AM
    Hi all, is there a way to use "Pipeline Lineage" if we are not using Airflow?
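    A minimal sketch, following the same pattern as the emitter example earlier on this page: the DataFlow/DataJob entities are not tied to Airflow, and the orchestrator field is just a string, so lineage can be emitted for any pipeline tool (all names below are placeholders):
    Copy code
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.com.linkedin.pegasus2avro.datajob import DataJobInputOutputClass
    
    # Identify the pipeline step; "my_scheduler" can be any orchestrator name.
    job_urn = builder.make_data_job_urn(
        orchestrator="my_scheduler", flow_id="nightly_etl", job_id="load_customers", cluster="PROD"
    )
    
    # Declare the datasets this step reads from and writes to.
    lineage = DataJobInputOutputClass(
        inputDatasets=[builder.make_dataset_urn("postgres", "public.customers", "PROD")],
        outputDatasets=[builder.make_dataset_urn("snowflake", "analytics.customers", "PROD")],
    )
    
    DatahubRestEmitter("http://localhost:8080").emit_mcp(
        MetadataChangeProposalWrapper(entityUrn=job_urn, aspect=lineage)
    )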
    ✅ 1
    d
    • 2
    • 1
  • p

    plain-cricket-83456

    02/01/2023, 9:43 AM
    @hundreds-photographer-13496 Why did the MySQL source scan all databases when I specified that it should only scan the test database? As you can see from the screenshot below, I only need to scan the test database, but it still scanned 7 MySQL databases, including test.
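    A hedged sketch of the filter that usually handles this (the key name may vary by CLI version; older versions used schema_pattern for two-tier sources like MySQL):
    Copy code
    # Hypothetical fragment of the MySQL source config (same keys apply in a YAML recipe).
    mysql_source_config = {
        "host_port": "localhost:3306",  # placeholder
        "database_pattern": {
            "allow": ["^test$"],  # only scan the database literally named "test"
        },
    }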
    ✅ 1
    h
    • 2
    • 9
  • a

    adventurous-angle-99988

    02/01/2023, 10:12 AM
    Hi, could anyone give me a pointer on how to enable
    debug logs
    in the datahub-actions container? I'm trying to get more information on an ingestion that I have configured from the UI. I'm running DataHub using Docker, and I can only see INFO-level logs and above in
    /tmp/datahub/logs
    ✅ 1
    s
    • 2
    • 2
  • b

    bitter-evening-61050

    02/01/2023, 11:48 AM
    Hi, I am trying to ingest metadata from mysql to datahub. Mysql is running locally while the datahub is running in kubernates. While ingesting i am facing unauthorized error even when i have given the new token. error: datahub ingest -c mysql.dhub.yml --preview [2023-02-01 161523,692] INFO {datahub.cli.ingest_cli:165} - DataHub CLI version: 0.9.5 [2023-02-01 161525,178] INFO {datahub.ingestion.run.pipeline:179} - Sink configured successfully. DataHubRestEmitter: configured to talk to http://xx.xx.xxx.xxx:8080 with token: Bear**********R8ec [2023-02-01 161526,867] INFO {datahub.ingestion.run.pipeline:196} - Source configured successfully. [2023-02-01 161526,875] INFO {datahub.cli.ingest_cli:120} - Starting metadata ingestion |[2023-02-01 161527,435] ERROR {datahub.ingestion.run.pipeline:62} - failed to write record with workunit container-info-testdb-urnlicontainer:42195cf489264871079d12df84b9ce33 with ('Unable to emit metadata to DataHub GMS', {'message': '401 Client Error: Unauthorized for url: http://xx.xx.xxx.xxx:8080/aspects?action=ingestProposal', 'id': 'urnlicontainer:42195cf489264871079d12df84b9ce33'}) and info {'message': '401 Client Error: Unauthorized for url: http://xx.xx.xxx.xxx:8080/aspects?action=ingestProposal', 'id': 'urnlicontainer:42195cf489264871079d12df84b9ce33'} \[2023-02-01 161527,689] ERROR {datahub.ingestion.run.pipeline:62} - failed to write record with workunit status-for-urnlicontainer:42195cf489264871079d12df84b9ce33 with ('Unable to emit metadata to DataHub GMS', {'message': '401 Client Error: Unauthorized for url: http://xx.xx.xxx.xxx:8080/aspects?action=ingestProposal', 'id': 'urnlicontainer:42195cf489264871079d12df84b9ce33'}) and info {'message': '401 Client Error: Unauthorized for url: http://xx.xx.xxx.xxx:8080/aspects?action=ingestProposal', 'id': 'urnlicontainer:42195cf489264871079d12df84b9ce33'} -[2023-02-01 161527,940] ERROR {datahub.ingestion.run.pipeline:62} - failed to write record with workunit container-platforminstance-testdb-urnlicontainer:42195cf489264871079d12df84b9ce33 with ('Unable to emit metadata to DataHub GMS', {'message': '401 Client Error: Unauthorized for url: http://xx.xx.xxx.xxx:8080/aspects?action=ingestProposal', 'id': 'urnlicontainer:42195cf489264871079d12df84b9ce33'}) and info {'message': '401 Client Error: Unauthorized for url: http://xx.xx.xxx.xxx:8080/aspects?action=ingestProposal', 'id': 'urnlicontainer:42195cf489264871079d12df84b9ce33'} /[2023-02-01 161528,190] ERROR {datahub.ingestion.run.pipeline:62} - failed to write record with workunit container-subtypes-testdb-urnlicontainer:42195cf489264871079d12df84b9ce33 with ('Unable to emit metadata to DataHub GMS', {'message': '401 Client Error: Unauthorized for url: http://xx.xx.xxx.xxx:8080/aspects?action=ingestProposal', 'id': 'urnlicontainer:42195cf489264871079d12df84b9ce33'}) and info {'message': '401 Client Error: Unauthorized for url: http://xx.xx.xxx.xxx:8080/aspects?action=ingestProposal', 'id': 'urnlicontainer:42195cf489264871079d12df84b9ce33'} |[2023-02-01 161528,442] ERROR {datahub.ingestion.run.pipeline:62} - failed to write record with workunit container-urnlicontainer42195cf489264871079d12df84b9ce33 to urnlidataset(urnlidataPlatform:mysql,testdb.customer,PROD) with ('Unable to emit metadata to DataHub GMS', {'message': '401 Client Error: Unauthorized for url: http://xx.xx.xxx.xxx:8080/aspects?action=ingestProposal', 'id': 'urnlidataset:(urnlidataPlatform:mysql,testdb.customer,PROD)'}) and info {'message': '401 Client Error: Unauthorized for url: 
http://xx.xx.xxx.xxx:8080/aspects?action=ingestProposal', 'id': 'urnlidataset:(urnlidataPlatform:mysql,testdb.customer,PROD)'} \[2023-02-01 161528,692] ERROR {datahub.ingestion.run.pipeline:62} - failed to write record with workunit testdb.customer with ('Unable to emit metadata to DataHub GMS', {'message': '401 Client Error: Unauthorized for url: http://xx.xx.xxx.xxx:8080/entities?action=ingest', 'id': 'urnlidataset:(urnlidataPlatform:mysql,testdb.customer,PROD)'}) and info {'message': '401 Client Error: Unauthorized for url: http://xx.xx.xxx.xxx:8080/entities?action=ingest', 'id': 'urnlidataset:(urnlidataPlatform:mysql,testdb.customer,PROD)'} -[2023-02-01 161528,939] ERROR {datahub.ingestion.run.pipeline:62} - failed to write record with workunit testdb.customer-subtypes with ('Unable to emit metadata to DataHub GMS', {'message': '401 Client Error: Unauthorized for url: http://xx.xx.xxx.xxx:8080/aspects?action=ingestProposal', 'id': 'urnlidataset:(urnlidataPlatform:mysql,testdb.customer,PROD)'}) and info {'message': '401 Client Error: Unauthorized for url: http://xx.xx.xxx.xxx:8080/aspects?action=ingestProposal', 'id': 'urnlidataset:(urnlidataPlatform:mysql,testdb.customer,PROD)'} /[2023-02-01 161529,188] INFO {datahub.cli.ingest_cli:133} - Finished metadata ingestion \ Cli report: {'cli_version': '0.9.5', 'cli_entry_location': 'xxxxx', 'py_version': '3.11.1 (tags/v3.11.1:a7a450f, Dec 6 2022, 195839) [MSC v.1934 64 bit (AMD64)]', 'py_exec_path': 'xxxxxx', 'os_details': 'xxxxxx', 'mem_info': '98.45 MB'} Source (mysql) report: {'events_produced': '7', 'events_produced_per_sec': '2', 'event_ids': ['container-info-testdb-urnlicontainer:42195cf489264871079d12df84b9ce33', 'status-for-urnlicontainer:42195cf489264871079d12df84b9ce33', 'container-platforminstance-testdb-urnlicontainer:42195cf489264871079d12df84b9ce33', 'container-subtypes-testdb-urnlicontainer:42195cf489264871079d12df84b9ce33', 'container-urnlicontainer42195cf489264871079d12df84b9ce33 to urnlidataset(urnlidataPlatform:mysql,testdb.customer,PROD)', 'testdb.customer', 'testdb.customer-subtypes'], 'warnings': {}, 'failures': {}, 'soft_deleted_stale_entities': [], 'tables_scanned': '1', 'views_scanned': '0', 'entities_profiled': '0', 'filtered': [], 'start_time': '2023-02-01 161526.441152 (3.23 seconds ago).', 'running_time': '3.23 seconds'} Sink (datahub-rest) report: {'total_records_written': '0', 'records_written_per_second': '0', 'warnings': [], 'failures': [{'error': 'Unable to emit metadata to DataHub GMS', 'info': {'message': '401 Client Error: Unauthorized for url: http://xx.xx.xxx.xxx:8080/aspects?action=ingestProposal', 'id': 'urnlicontainer:42195cf489264871079d12df84b9ce33'}}, {'error': 'Unable to emit metadata to DataHub GMS', 'info': {'message': '401 Client Error: Unauthorized for url: http://xx.xx.xxx.xxx:8080/aspects?action=ingestProposal', 'id': 'urnlicontainer:42195cf489264871079d12df84b9ce33'}}, {'error': 'Unable to emit metadata to DataHub GMS', 'info': {'message': '401 Client Error: Unauthorized for url: http://xx.xx.xxx.xxx:8080/aspects?action=ingestProposal', 'id': 'urnlicontainer:42195cf489264871079d12df84b9ce33'}}, {'error': 'Unable to emit metadata to DataHub GMS', 'info': {'message': '401 Client Error: Unauthorized for url: http://xx.xx.xxx.xxx:8080/aspects?action=ingestProposal', 'id': 'urnlicontainer:42195cf489264871079d12df84b9ce33'}}, {'error': 'Unable to emit metadata to DataHub GMS', 'info': {'message': '401 Client Error: Unauthorized for url: http://xx.xx.xxx.xxx:8080/aspects?action=ingestProposal', 
'id': 'urnlidataset:(urnlidataPlatform:mysql,testdb.customer,PROD)'}}, {'error': 'Unable to emit metadata to DataHub GMS', 'info': {'message': '401 Client Error: Unauthorized for url: http://xx.xx.xxx.xxx:8080/entities?action=ingest', 'id': 'urnlidataset:(urnlidataPlatform:mysql,testdb.customer,PROD)'}}, {'error': 'Unable to emit metadata to DataHub GMS', 'info': {'message': '401 Client Error: Unauthorized for url: http://xx.xx.xxx.xxx:8080/aspects?action=ingestProposal', 'id': 'urnlidataset:(urnlidataPlatform:mysql,testdb.customer,PROD)'}}], 'start_time': '2023-02-01 161524.683178 (4.99 seconds ago).', 'current_time': '2023-02-01 161529.668782 (now).', 'total_duration_in_seconds': '4.99', 'gms_version': 'v0.9.5', 'pending_requests': '0'} Pipeline finished with at least 7 failures; produced 7 events in 3.23 seconds. can anyone please help me in solving this issue
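    A hedged sketch of where the token normally goes for CLI ingestion: the datahub-rest sink config in the recipe takes both the GMS server and the personal access token, and a 401 at /aspects?action=ingestProposal usually means the token the sink is sending is missing, expired, or not accepted by that GMS instance (the values below are placeholders):
    Copy code
    # Hypothetical sink section of the recipe (dict form; same keys apply in YAML).
    sink_config = {
        "type": "datahub-rest",
        "config": {
            "server": "http://xx.xx.xxx.xxx:8080",
            "token": "<personal access token generated in the DataHub UI>",
        },
    }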
    h
    • 2
    • 6
  • a

    agreeable-cricket-61480

    02/01/2023, 11:57 AM
    I am unable to enable the mysql plugin. I have tried pip install 'acryl-datahub[mysql]'
    ✅ 1
    h
    • 2
    • 13
  • l

    lemon-scooter-69730

    02/01/2023, 12:09 PM
    I think I am missing a library somewhere. (
    No JVM shared library file (libjvm.so)
    )
    Copy code
    'http://<redacted>:8083 is ok\n'
               '[2023-02-01 11:59:38,228] ERROR    {datahub.entrypoints:213} - Command failed: Failed to configure the source (kafka-connect): No JVM '
               'shared library file (libjvm.so) found. Try setting up the JAVA_HOME environment variable properly.\n'
               'Traceback (most recent call last):\n'
               '  File "/tmp/datahub/ingest/venv-kafka-connect-0.9.6/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 114, in '
               '_add_init_error_context\n'
               '    yield\n'
               '  File "/tmp/datahub/ingest/venv-kafka-connect-0.9.6/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 192, in '
               '__init__\n'
               '    self.source = source_class.create(\n'
               '  File "/tmp/datahub/ingest/venv-kafka-connect-0.9.6/lib/python3.10/site-packages/datahub/ingestion/source/kafka_connect.py", line 944, '
               'in create\n'
               '    return cls(config, ctx)\n'
               '  File "/tmp/datahub/ingest/venv-kafka-connect-0.9.6/lib/python3.10/site-packages/datahub/ingestion/source/kafka_connect.py", line 939, '
               'in __init__\n'
               '    jpype.startJVM()\n'
               '  File "/tmp/datahub/ingest/venv-kafka-connect-0.9.6/lib/python3.10/site-packages/jpype/_core.py", line 184, in startJVM\n'
               '    jvmpath = getDefaultJVMPath()\n'
               '  File "/tmp/datahub/ingest/venv-kafka-connect-0.9.6/lib/python3.10/site-packages/jpype/_jvmfinder.py", line 74, in getDefaultJVMPath\n'
               '    return finder.get_jvm_path()\n'
               '  File "/tmp/datahub/ingest/venv-kafka-connect-0.9.6/lib/python3.10/site-packages/jpype/_jvmfinder.py", line 212, in get_jvm_path\n'
               '    raise JVMNotFoundException("No JVM shared library file ({0}) "\n'
               'jpype._jvmfinder.JVMNotFoundException: No JVM shared library file (libjvm.so) found. Try setting up the JAVA_HOME environment variable '
               'properly.\n'
               '\n'
               'The above exception was the direct cause of the following exception:\n'
               '\n'
               'Traceback (most recent call last):\n'
               '  File "/tmp/datahub/ingest/venv-kafka-connect-0.9.6/lib/python3.10/site-packages/datahub/entrypoints.py", line 171, in main\n'
               '    sys.exit(datahub(standalone_mode=False, **kwargs))\n'
               '  File "/tmp/datahub/ingest/venv-kafka-connect-0.9.6/lib/python3.10/site-packages/click/core.py", line 1130, in __call__\n'
               '    return self.main(*args, **kwargs)\n'
               '  File "/tmp/datahub/ingest/venv-kafka-connect-0.9.6/lib/python3.10/site-packages/click/core.py", line 1055, in main\n'
               '    rv = self.invoke(ctx)\n'
               '  File "/tmp/datahub/ingest/venv-kafka-connect-0.9.6/lib/python3.10/site-packages/click/core.py", line 1657, in invoke\n'
               '    return _process_result(sub_ctx.command.invoke(sub_ctx))\n'
               '  File "/tmp/datahub/ingest/venv-kafka-connect-0.9.6/lib/python3.10/site-packages/click/core.py", line 1657, in invoke\n'
               '    return _process_result(sub_ctx.command.invoke(sub_ctx))\n'
               '  File "/tmp/datahub/ingest/venv-kafka-connect-0.9.6/lib/python3.10/site-packages/click/core.py", line 1404, in invoke\n'
               '    return ctx.invoke(self.callback, **ctx.params)\n'
               '  File "/tmp/datahub/ingest/venv-kafka-connect-0.9.6/lib/python3.10/site-packages/click/core.py", line 760, in invoke\n'
               '    return __callback(*args, **kwargs)\n'
               '  File "/tmp/datahub/ingest/venv-kafka-connect-0.9.6/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func\n'
               '    return f(get_current_context(), *args, **kwargs)\n'
               '  File "/tmp/datahub/ingest/venv-kafka-connect-0.9.6/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 350, in wrapper\n'
               '    raise e\n'
               '  File "/tmp/datahub/ingest/venv-kafka-connect-0.9.6/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 302, in wrapper\n'
               '    res = func(*args, **kwargs)\n'
               '  File "/tmp/datahub/ingest/venv-kafka-connect-0.9.6/lib/python3.10/site-packages/datahub/utilities/memory_leak_detector.py", line 95, '
               'in wrapper\n'
               '    return func(ctx, *args, **kwargs)\n'
               '  File "/tmp/datahub/ingest/venv-kafka-connect-0.9.6/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 179, in run\n'
               '    pipeline = Pipeline.create(\n'
               '  File "/tmp/datahub/ingest/venv-kafka-connect-0.9.6/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 303, in '
               'create\n'
               '    return cls(\n'
               '  File "/tmp/datahub/ingest/venv-kafka-connect-0.9.6/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 191, in '
               '__init__\n'
               '    with _add_init_error_context(f"configure the source ({source_type})"):\n'
               '  File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__\n'
               '    self.gen.throw(typ, value, traceback)\n'
               '  File "/tmp/datahub/ingest/venv-kafka-connect-0.9.6/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 116, in '
               '_add_init_error_context\n'
               '    raise PipelineInitError(f"Failed to {step}: {e}") from e\n'
               'datahub.ingestion.run.pipeline.PipelineInitError: Failed to configure the source (kafka-connect): No JVM shared library file (libjvm.so) '
               'found. Try setting up the JAVA_HOME environment variable properly.\n',
               "2023-02-01 11:59:38.618950 [exec_id=8e4beb30-8808-40f3-aae5-3122b48f689c] INFO: Failed to execute 'datahub ingest'",
               '2023-02-01 11:59:38.626712 [exec_id=8e4beb30-8808-40f3-aae5-3122b48f689c] INFO: Caught exception EXECUTING '
               'task_id=8e4beb30-8808-40f3-aae5-3122b48f689c, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 123, in execute_task\n'
               '    task_event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete\n'
               '    return future.result()\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 168, in execute\n'
               '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
               "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
    Execution finished with errors.
    ✅ 1
    h
    • 2
    • 21
  • a

    aloof-dentist-85908

    02/01/2023, 2:05 PM
    Hi all, does anyone know whether it is possible to filter the Superset dashboards and charts during ingestion? E.g. to only ingest dashboards that are published? Thanks! :-)
    ✅ 1
    h
    • 2
    • 1
  • c

    crooked-dinner-59545

    02/01/2023, 2:37 PM
    I’m having trouble connecting DataHub to our Delta lake. First we ran into the issue with regions that I believe someone else here had experienced as well, but after sorting that out we got stuck once again on another issue. Now we are receiving the following, seemingly authentication-related, error when trying to ingest data from the Delta lake:
    Copy code
    Command failed: Failed to load checkpoint: Failed to read checkpoint content: Generic S3 error: Error performing get request application2/delta/_delta_log/_last_checkpoint: response error "<?xml version="1.0" encoding="UTF-8"?>
    <Error><Code>InvalidAccessKeyId</Code><Message>The AWS Access Key Id you provided does not exist in our records.</Message>
    We are using IAM roles for authentication, and S3 ingestion (from these same exact folders) works without a hitch, so I’m having trouble understanding how this could be an authentication issue. The complete recipe looks like this:
    Copy code
    source:
        type: delta-lake
        config:
            env: dev
            base_path: 's3://dataengineersandbox-public-bucket/application2/delta/'
            s3:
                aws_config:
                    aws_region: eu-west-1
    Has anyone experienced similar issues? Any suggestions where to look to resolve this?
    ✅ 1
    h
    • 2
    • 7
  • k

    kind-sunset-55628

    02/01/2023, 3:46 PM
    Hi all, I want to ingest Iceberg tables stored on S3 into DataHub. The doc says "The current implementation of the Iceberg source plugin will only discover tables stored in a local file system or in ADLS. Support for S3 could be added fairly easily." How can we add support for S3?
    ✅ 1
    h
    h
    • 3
    • 5
  • l

    lemon-daybreak-58504

    02/01/2023, 5:06 PM
    Hi everyone, I'm trying to ingest from a Postgres database. DataHub is running on Kubernetes and I connect to the database instance through cloud_sql_proxy. What should the YAML recipe look like for this?
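    A hedged sketch of what that recipe might look like, assuming the proxy is reachable from the pod running the ingestion (the service name and port are placeholders; the Postgres source just needs a host_port it can reach, which in this setup is the cloud_sql_proxy endpoint rather than the Cloud SQL instance itself):
    Copy code
    # Hypothetical recipe in dict form (same keys apply in YAML).
    recipe = {
        "source": {
            "type": "postgres",
            "config": {
                "host_port": "cloud-sql-proxy.default.svc.cluster.local:5432",  # placeholder proxy endpoint
                "database": "mydb",
                "username": "datahub",
                "password": "...",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://datahub-gms:8080"},
        },
    }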
    h
    • 2
    • 3
  • r

    rich-state-73859

    02/01/2023, 7:12 PM
    Got this error when running
    ../gradlew :metadata-ingestion:installDev
    , it seems feast couldn’t be installed.
    Copy code
    Collecting feast~=0.26.0
      Using cached feast-0.26.0.tar.gz (3.6 MB)
      Installing build dependencies: started
      Installing build dependencies: finished with status 'done'
      Getting requirements to build wheel: started
      Getting requirements to build wheel: finished with status 'done'
      Installing backend dependencies: started
      Installing backend dependencies: finished with status 'done'
      Preparing metadata (pyproject.toml): started
      Preparing metadata (pyproject.toml): finished with status 'error'
      error: subprocess-exited-with-error
      
      × Preparing metadata (pyproject.toml) did not run successfully.
      │ exit code: 1
      ╰─> [1 lines of output]
          error in feast setup command: 'extras_require' must be a dictionary whose values are strings or lists of strings containing valid project/version requirement specifiers.
          [end of output]
      
      note: This error originates from a subprocess, and is likely not a problem with pip.
    error: metadata-generation-failed
    
    × Encountered error while generating package metadata.
    ╰─> See above for output.
    ✅ 1
    g
    • 2
    • 2