colossal-easter-99672
11/19/2021, 7:23 AM

full-area-6720
11/19/2021, 8:29 AM

freezing-teacher-87574
11/19/2021, 9:26 AM

high-hospital-85984
11/19/2021, 11:41 AM

quiet-kilobyte-82304
11/19/2021, 6:13 PM

red-pizza-28006
11/19/2021, 8:45 PM

limited-cricket-18852
11/23/2021, 12:16 AM
Is there a way to run datahub ingest with the recipe YAML files, but from a Python script rather than the CLI? In the documentation I found this example, but I would like to input YAML files, which I find easier to version in git.
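A minimal sketch of doing this from Python, assuming the recipe lives in a local recipe.yml (Pipeline is the class the datahub ingest CLI drives; exact behaviour may vary between versions):

import yaml
from datahub.ingestion.run.pipeline import Pipeline

# Load the same recipe YAML that is kept under version control.
with open("recipe.yml") as f:
    config = yaml.safe_load(f)

# Build and run the ingestion pipeline programmatically.
pipeline = Pipeline.create(config)
pipeline.run()
pipeline.raise_from_status()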
agreeable-thailand-43234
11/23/2021, 2:18 AM
docker run --network=datahub_network -v /Users/edgar.valdez/Documents/datahub:/home/datahub linkedin/datahub-ingestion:head --debug delete --env DEV --entity_type pipeline -f
For entity_type I tried pipeline, dataflow, and datajob, but none of them worked.
Which is the right entity_type value I need in order to delete the entries?
TIA

wooden-gpu-7761
11/23/2021, 5:20 AM

rhythmic-sundown-12093
11/23/2021, 6:22 AM
AUTH_OIDC_SCOPE=openid
AUTH_OIDC_CLIENT_ID=xxxx
AUTH_OIDC_CLIENT_SECRET=xxx
AUTH_OIDC_DISCOVERY_URI=<http://xxx>:xxx/.well-known/openid-configuration
AUTH_OIDC_BASE_URL=<http://xxx:9002>
log:
! @7ln75h6n8 - Internal server error, for (GET) [/authenticate?redirect_uri=%2F] ->
play.api.UnexpectedException: Unexpected exception[CryptoException: Unable to execute 'doFinal' with cipher instance [javax.crypto.Cipher@2b4ff65a].]
at play.api.http.HttpErrorHandlerExceptions$.throwableToUsefulException(HttpErrorHandler.scala:247)
at play.api.http.DefaultHttpErrorHandler.onServerError(HttpErrorHandler.scala:176)
at play.core.server.AkkaHttpServer$$anonfun$2.applyOrElse(AkkaHttpServer.scala:363)
at play.core.server.AkkaHttpServer$$anonfun$2.applyOrElse(AkkaHttpServer.scala:361)
at scala.concurrent.Future$$anonfun$recoverWith$1.apply(Future.scala:346)
at scala.concurrent.Future$$anonfun$recoverWith$1.apply(Future.scala:345)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:92)
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:92)
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:92)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:91)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:49)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.shiro.crypto.CryptoException: Unable to execute 'doFinal' with cipher instance [javax.crypto.Cipher@2b4ff65a].
at org.apache.shiro.crypto.JcaCipherService.crypt(JcaCipherService.java:462)
at org.apache.shiro.crypto.JcaCipherService.crypt(JcaCipherService.java:445)
at org.apache.shiro.crypto.JcaCipherService.decrypt(JcaCipherService.java:390)
at org.apache.shiro.crypto.JcaCipherService.decrypt(JcaCipherService.java:382)
at org.pac4j.play.store.ShiroAesDataEncrypter.decrypt(ShiroAesDataEncrypter.java:42)
at org.pac4j.play.store.PlayCookieSessionStore.get(PlayCookieSessionStore.java:60)
at org.pac4j.play.store.PlayCookieSessionStore.get(PlayCookieSessionStore.java:29)
at org.pac4j.core.client.IndirectClient.getRedirectAction(IndirectClient.java:102)
at org.pac4j.core.client.IndirectClient.redirect(IndirectClient.java:79)
at controllers.AuthenticationController.redirectToIdentityProvider(AuthenticationController.java:151)
at controllers.AuthenticationController.authenticate(AuthenticationController.java:85)
at router.Routes$$anonfun$routes$1$$anonfun$applyOrElse$4$$anonfun$apply$4.apply(Routes.scala:374)
at router.Routes$$anonfun$routes$1$$anonfun$applyOrElse$4$$anonfun$apply$4.apply(Routes.scala:374)
at play.core.routing.HandlerInvokerFactory$$anon$3.resultCall(HandlerInvoker.scala:134)
at play.core.routing.HandlerInvokerFactory$$anon$3.resultCall(HandlerInvoker.scala:133)
at play.core.routing.HandlerInvokerFactory$JavaActionInvokerFactory$$anon$8$$anon$2$$anon$1.invocation(HandlerInvoker.scala:108)
at play.core.j.JavaAction$$anon$1.call(JavaAction.scala:88)
at play.http.DefaultActionCreator$1.call(DefaultActionCreator.java:31)
at play.core.j.JavaAction$$anonfun$9.apply(JavaAction.scala:138)
at play.core.j.JavaAction$$anonfun$9.apply(JavaAction.scala:138)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at play.core.j.HttpExecutionContext$$anon$2.run(HttpExecutionContext.scala:56)
at play.api.libs.streams.Execution$trampoline$.execute(Execution.scala:70)
at play.core.j.HttpExecutionContext.execute(HttpExecutionContext.scala:48)
at scala.concurrent.impl.Future$.apply(Future.scala:31)
at scala.concurrent.Future$.apply(Future.scala:494)
at play.core.j.JavaAction.apply(JavaAction.scala:138)
at play.api.mvc.Action$$anonfun$apply$2.apply(Action.scala:96)
at play.api.mvc.Action$$anonfun$apply$2.apply(Action.scala:89)
at play.api.libs.streams.StrictAccumulator$$anonfun$mapFuture$2$$anonfun$1.apply(Accumulator.scala:174)
at play.api.libs.streams.StrictAccumulator$$anonfun$mapFuture$2$$anonfun$1.apply(Accumulator.scala:174)
at scala.util.Try$.apply(Try.scala:192)
at play.api.libs.streams.StrictAccumulator$$anonfun$mapFuture$2.apply(Accumulator.scala:174)
at play.api.libs.streams.StrictAccumulator$$anonfun$mapFuture$2.apply(Accumulator.scala:170)
at scala.Function1$$anonfun$andThen$1.apply(Function1.scala:52)
at play.api.libs.streams.StrictAccumulator.run(Accumulator.scala:207)
at play.core.server.AkkaHttpServer$$anonfun$14.apply(AkkaHttpServer.scala:357)
at play.core.server.AkkaHttpServer$$anonfun$14.apply(AkkaHttpServer.scala:355)
at akka.http.scaladsl.util.FastFuture$.akka$http$scaladsl$util$FastFuture$$strictTransform$1(FastFuture.scala:41)
at akka.http.scaladsl.util.FastFuture$$anonfun$transformWith$extension1$1.apply(FastFuture.scala:51)
at akka.http.scaladsl.util.FastFuture$$anonfun$transformWith$extension1$1.apply(FastFuture.scala:50)
... 13 common frames omitted
Caused by: javax.crypto.AEADBadTagException: Tag mismatch!
at com.sun.crypto.provider.GaloisCounterMode.decryptFinal(GaloisCounterMode.java:620)
at com.sun.crypto.provider.CipherCore.finalNoPadding(CipherCore.java:1116)
at com.sun.crypto.provider.CipherCore.fillOutputBuffer(CipherCore.java:1053)
at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:853)
at com.sun.crypto.provider.AESCipher.engineDoFinal(AESCipher.java:446)
at javax.crypto.Cipher.doFinal(Cipher.java:2168)
at org.apache.shiro.crypto.JcaCipherService.crypt(JcaCipherService.java:459)
... 54 common frames omitted
damp-minister-31834
11/23/2021, 8:44 AM
I want to expand SchemaFieldDataType. What steps should I take? I found that the code related to schema is generated by avro_codegen.py, so I think the best way to expand SchemaFieldDataType is to add a new .pdl file to metadata-models/src/main/pegasus/com/linkedin/schema/ and then rebuild the project, right?
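For context, a sketch of how the generated Python classes expose SchemaFieldDataType today (class names taken from datahub.metadata.schema_classes; a new .pdl type record in com.linkedin.schema would surface here after regenerating the code — field values below are illustrative):

from datahub.metadata.schema_classes import (
    SchemaFieldClass,
    SchemaFieldDataTypeClass,
    StringTypeClass,
)

# A schema field whose logical type is one of the existing *Type records.
field = SchemaFieldClass(
    fieldPath="user_name",
    nativeDataType="VARCHAR(100)",
    type=SchemaFieldDataTypeClass(type=StringTypeClass()),
)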
boundless-student-48844
11/23/2021, 9:25 AM
Looking at sql_common.py, it first gets the list of all schemas (L344). Next, for each schema, it gets the list of tables (L408). Lastly, for each table, it gets the column info (L322). That means, for each run of ingestion, it triggers at least N + M statements against the SQL source (e.g. DESCRIBE <table> in Hive), where N is the number of tables and M is the number of schemas.
In our case, we have over 80K tables in the Hive metastore. Empirically, we tried to ingest one big Hive schema with over 8K tables, and it took 2 hours to finish. If we scale this duration linearly to 80K tables, each Hive ingestion would take about 20 hours, which is not acceptable.
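A rough illustration of that query pattern, using the generic SQLAlchemy inspector API rather than the actual sql_common.py code (the connection string is just a placeholder):

from sqlalchemy import create_engine, inspect

# Placeholder connection string; the real source builds this from the recipe.
engine = create_engine("hive://localhost:10000/default")
inspector = inspect(engine)

for schema in inspector.get_schema_names():                 # 1 statement, returns M schemas
    for table in inspector.get_table_names(schema=schema):  # 1 statement per schema
        # 1 statement per table (e.g. DESCRIBE <table> on Hive),
        # so a full run issues roughly N + M statements overall.
        columns = inspector.get_columns(table, schema=schema)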
What’s your thought or advice on this?

orange-flag-48535
11/23/2021, 10:38 AM

full-area-6720
11/23/2021, 11:32 AM

boundless-scientist-520
11/23/2021, 1:41 PM
source:
  type: feast
  config:
    core_url: <s3://featurestore/dev/feature-store-dev.db>
    env: "DEV"
    use_local_build: False
sink:
  type: "datahub-rest"
  config:
    server: "<http://datahub-datahub-gms.datahub.svc.cluster.local:8080>"
When I run the recipe, I get the following "ConnectionError":
docker.errors.ContainerError: Command 'python3 ingest.py --core_url=<s3://featurestore/dev/feature-store-dev.db> --output_path=/out.json' in image 'acryldata/datahub-ingestion-feast-wrapper' returned non-zero exit status 1: b'Traceback (most recent call last):\n File "/usr/local/lib/python3.8/site-packages/feast/grpc/grpc.py", line 48, in create_grpc_channel\n grpc.channel_ready_future(channel).result(timeout=timeout)\n File "/usr/local/lib/python3.8/site-packages/grpc/_utilities.py", line 140, in result\n self._block(timeout)\n File "/usr/local/lib/python3.8/site-packages/grpc/_utilities.py", line 86, in _block\n raise grpc.FutureTimeoutError()\ngrpc.FutureTimeoutError\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File "ingest.py", line 138, in <module>\n cli()\n File "/usr/local/lib/python3.8/site-packages/click/core.py", line 829, in __call__\n return self.main(*args, **kwargs)\n File "/usr/local/lib/python3.8/site-packages/click/core.py", line 782, in main\n rv = self.invoke(ctx)\n File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File "/usr/local/lib/python3.8/site-packages/click/core.py", line 610, in invoke\n return callback(*args, **kwargs)\n File "ingest.py", line 26, in cli\n tables = client.list_feature_tables()\n File "/usr/local/lib/python3.8/site-packages/feast/client.py", line 683, in list_feature_tables\n feature_table_protos = self._core_service.ListFeatureTables(\n File "/usr/local/lib/python3.8/site-packages/feast/client.py", line 134, in _core_service\n channel = create_grpc_channel(\n File "/usr/local/lib/python3.8/site-packages/feast/grpc/grpc.py", line 51, in create_grpc_channel\n raise ConnectionError(\nConnectionError: Connection timed out while attempting to connect to <s3://featurestore/dev/feature-store-dev.db>\n'
I'm using datahub 0.8.17 and feast 0.14.1.
In "core_url" I configured the feast featurestore .db file. I've seen in the Feast documentation that from version +0.10 there have been changes in Feast core:
"Feast Core was replaced by a file-based (S3, GCS) registry: Feast Core is a metadata server that maintains and exposes an API of feature definitions. With Feast 0.10, we've moved this entire service into a single flat file that can be stored on either the local disk or in a central object store like S3 or GCS. The benefit of this change is that users don't need to maintain a database and a registry service, yet they can still access all the metadata they had before."
I don't know whether I am setting core_url correctly, or whether I need a Feast version lower than 0.10 that still includes Feast Core.
Does anyone have any ideas?

rich-policeman-92383
11/23/2021, 2:25 PM
pip3 install acryl-datahub
git clone <https://github.com/linkedin/datahub.git>
git checkout tags/v0.8.17 -b datahub_v0.8.17
cd metadata-ingestion/examples/library
$ python3 lineage_emitter_rest.py
Traceback (most recent call last):
File "lineage_emitter_rest.py", line 1, in <module>
import datahub.emitter.mce_builder as builder
ModuleNotFoundError: No module named 'datahub'
$ datahub version
DataHub CLI version: 0.8.17.0
Python version: 3.9.2 (default, Feb 24 2021, 13:26:01)
[Clang 11.0.0 (clang-1100.0.33.17)]
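For context, the library example being run is roughly of this shape in the 0.8.x tree (paraphrased; dataset names are illustrative and the exact file contents may differ by version):

import datahub.emitter.mce_builder as builder
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Build a lineage MCE: one downstream dataset fed by two upstreams.
lineage_mce = builder.make_lineage_mce(
    [
        builder.make_dataset_urn("bigquery", "upstream_table_1"),
        builder.make_dataset_urn("bigquery", "upstream_table_2"),
    ],
    builder.make_dataset_urn("bigquery", "downstream_table"),
)

# Emit it to the GMS REST endpoint.
emitter = DatahubRestEmitter("http://localhost:8080")
emitter.emit_mce(lineage_mce)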
fancy-fireman-15263
11/23/2021, 9:31 PM
Is there a way to use dag.docs_md as the description?

some-crayon-90964
11/24/2021, 4:28 PM
EntityClient to ingest again, right?

red-pizza-28006
11/24/2021, 4:31 PM

agreeable-hamburger-38305
11/24/2021, 6:37 PM
I want to profile only computational.calzone.sample_table, so I used
profile_pattern:
  allow:
    - ^computational\.calzone\.sample_table$
and got this error:
UnboundVariable: ': unbound variable'
Without the $ it works fine. Anyone know what might be the problem?

magnificent-camera-71872
11/26/2021, 5:16 AM

billions-twilight-48559
11/27/2021, 9:19 AM

bright-egg-4386
11/27/2021, 12:17 PM

limited-cricket-18852
11/27/2021, 12:35 PM

bright-egg-4386
11/27/2021, 5:32 PM

rich-policeman-92383
11/29/2021, 6:15 AM

orange-flag-48535
11/29/2021, 6:42 AM

microscopic-elephant-47912
11/29/2021, 10:55 AM

damp-minister-31834
11/29/2021, 11:53 AM

melodic-mouse-72847
11/30/2021, 8:47 AM
[2021-11-30 11:32:03,180] INFO {datahub.ingestion.source.sql.sql_common:512} - Profiling example_db.table_example (this may take a while)
[2021-11-30 11:32:06,581] WARNING {great_expectations.dataset.sqlalchemy_dataset:1715} - No recognized sqlalchemy types in type_list for current dialect.
[2021-11-30 11:32:09,996] WARNING {great_expectations.dataset.sqlalchemy_dataset:1715} - No recognized sqlalchemy types in type_list for current dialect.
[2021-11-30 11:32:13,366] WARNING {great_expectations.dataset.sqlalchemy_dataset:1715} - No recognized sqlalchemy types in type_list for current dialect.
[2021-11-30 11:32:16,821] WARNING {great_expectations.dataset.sqlalchemy_dataset:1715} - No recognized sqlalchemy types in type_list for current dialect.
[2021-11-30 11:32:20,337] WARNING {great_expectations.dataset.sqlalchemy_dataset:1715} - No recognized sqlalchemy types in type_list for current dialect.
[2021-11-30 11:32:23,738] WARNING {great_expectations.dataset.sqlalchemy_dataset:1715} - No recognized sqlalchemy types in type_list for current dialect.
[2021-11-30 11:33:26,261] INFO {datahub.ingestion.run.pipeline:61} - sink wrote workunit profile-example_db.table_example
The stats & queries tabs are still inactive.
Here's my ingestion yml file:
source:
  type: athena
  config:
    aws_region: us-east-2
    work_group: primary
    username: ...
    password: ...
    database: example_db
    s3_staging_dir: "..."
    profiling:
      enabled: true
sink:
  type: "datahub-rest"
  config:
    server: "<http://localhost:8080>"
Am I doing something wrong?