# ingestion
  • colossal-easter-99672

    11/19/2021, 7:23 AM
    Hello team. Is there any way to use non-English characters?
  • full-area-6720

    11/19/2021, 8:29 AM
    Hi, what data is accessible to DataHub? Let's say I use my Redshift credentials: what data does DataHub have access to?
  • freezing-teacher-87574

    11/19/2021, 9:26 AM
    Hi, does DataHub work with Feast 0.14.1? Thanks.
  • high-hospital-85984

    11/19/2021, 11:41 AM
    👋 We're seeing some new errors from the Airflow lineage backend, using Airflow 2.2.2, originating from this line for DAGs where we set some value in the params dict of the DAG. There seems to have been some change to how that argument is handled internally. I'm trying to look into it.
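    For context, a minimal sketch of the change this likely refers to (an assumption on my part: in Airflow 2.2+ dag.params is a ParamsDict of Param objects rather than a plain dict, so code expecting raw values can break); the names below are illustrative only:
    Copy code
    # Hedged sketch: resolve an Airflow 2.2+ ParamsDict back into a plain dict
    # before serializing it (assumes ParamsDict item access returns resolved values).
    from airflow.models.param import Param, ParamsDict

    params = ParamsDict({"run_date": Param("2021-11-19", type="string")})
    plain = {key: params[key] for key in params}   # item access returns the resolved value
    print(plain)                                   # {'run_date': '2021-11-19'}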
  • quiet-kilobyte-82304

    11/19/2021, 6:13 PM
    @polite-flower-25924 @loud-island-88694 Following up on our discussion in the townhall Zoom chat: we’ve built a custom ingestion module to map table->table, table->view, or basically object->object dependencies in Redshift. This is the only rule where we have a dataset-to-dataset mapping; in most cases it's dataset -> airflow/databricks -> dataset. I can write something up, but it's fairly simple. We have our own SQL parser to capture view->view dependencies, but if the upstream dataset is a table then you can use Redshift's stl_scan system table to get the dependent objects (tables).
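    A minimal sketch of the stl_scan approach described above, assuming the standard Redshift system-table columns; the connection details, helper name, and downstream object are placeholders:
    Copy code
    # Hedged sketch: find permanent tables scanned by queries that touched a given
    # downstream object, using Redshift's stl_scan/stl_query system tables.
    # Matching on querytxt is only a rough proxy for "queries that wrote the object".
    import psycopg2  # assumes a Redshift-compatible Postgres driver is installed

    UPSTREAMS_SQL = """
    SELECT DISTINCT TRIM(s.perm_table_name) AS upstream_table
    FROM stl_scan  s
    JOIN stl_query q ON q.query = s.query
    WHERE q.querytxt ILIKE %s
      AND s.perm_table_name NOT LIKE 'Internal Worktable%%'
    """

    def upstream_tables(conn, downstream_object: str) -> list:
        """Return permanent tables scanned by queries that mention `downstream_object`."""
        with conn.cursor() as cur:
            cur.execute(UPSTREAMS_SQL, (f"%{downstream_object}%",))
            return [row[0] for row in cur.fetchall()]

    if __name__ == "__main__":
        conn = psycopg2.connect("host=<redshift-host> dbname=<db> user=<user> password=<password> port=5439")
        print(upstream_tables(conn, "analytics.my_view"))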
  • red-pizza-28006

    11/19/2021, 8:45 PM
    I have a question regarding Airflow lineage - https://datahubproject.io/docs/metadata-ingestion/#lineage-with-airflow. Is this the right way to do it?
  • limited-cricket-18852

    11/23/2021, 12:16 AM
    Hello! Is there a way to do datahub ingest, using the recipe YAML files, but in a Python script, not using the CLI? In the documentation I found this example, but I would like to input YAML files, which I find easier to version in git.
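    A minimal sketch of one way to do this, assuming the acryl-datahub package is installed: Pipeline.create takes the same dictionary structure as a recipe file, so the YAML can simply be loaded first (the file name is a placeholder):
    Copy code
    # Hedged sketch: run a recipe YAML through the ingestion pipeline from Python
    # instead of the CLI.
    import yaml
    from datahub.ingestion.run.pipeline import Pipeline

    with open("recipe.yml") as f:           # the recipe file versioned in git
        config = yaml.safe_load(f)

    pipeline = Pipeline.create(config)      # same config dict the CLI builds from the YAML
    pipeline.run()
    pipeline.raise_from_status()            # raise if the run reported failures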
  • agreeable-thailand-43234

    11/23/2021, 2:18 AM
    Hello!! 👋 I successfully imported pipelines from AWS Glue (they're all custom scripts). I'd like to delete some entries (as I changed the PATH), so now I'd like to either delete all of the objects and re-run the ingestion, or delete the PATH that I don't need. I ran this command:
    Copy code
    docker run --network=datahub_network -v /Users/edgar.valdez/Documents/datahub:/home/datahub linkedin/datahub-ingestion:head --debug delete --env DEV --entity_type pipeline -f
    For entity_type I tried pipeline, dataflow, and datajob, but none of them worked. Which is the right entity_type value I need in order to delete the entries? TIA
  • wooden-gpu-7761

    11/23/2021, 5:20 AM
    Hi team! I’ve recently been meddling with the BigQuery ingestion module; I'm mostly interested in being able to parse lineage info from GCP audit logs and tables. In most cases I'm really happy with DataHub overall, but I have a couple of issues regarding the current state of lineage parsing.
    1. Parsing lineage from a huge GCP project fails due to 503 errors on https://logging.googleapis.com/v2/entries:list, due to an internal timeout issue according to Google Support. It seems like fetching logs and aggregating them takes too long internally within GCP. I've tried to circumvent these issues by limiting page size, but with no success. -> What I think could be a good workaround would be to ingest from exported audit logs, which would in many cases be stored in the form of BigQuery tables.
    2. It is required that audit logs are in the same project as the datasets/tables to be ingested. I think separating dataset projects and query execution projects is a pretty common use case, and adding an option to define the projects from which audit logs are extracted would be pretty useful.
    Let me know what you think! I would be happy to contribute if possible 🙏. Thanks!
  • rhythmic-sundown-12093

    11/23/2021, 6:22 AM
    Hi team, I got an error when I configured the integrated OIDC service.
    Copy code
    AUTH_OIDC_SCOPE=openid
    AUTH_OIDC_CLIENT_ID=xxxx
    AUTH_OIDC_CLIENT_SECRET=xxx
    AUTH_OIDC_DISCOVERY_URI=<http://xxx>:xxx/.well-known/openid-configuration
    AUTH_OIDC_BASE_URL=<http://xxx:9002>
    log:
    Copy code
    ! @7ln75h6n8 - Internal server error, for (GET) [/authenticate?redirect_uri=%2F] ->
     
    play.api.UnexpectedException: Unexpected exception[CryptoException: Unable to execute 'doFinal' with cipher instance [javax.crypto.Cipher@2b4ff65a].]
    	at play.api.http.HttpErrorHandlerExceptions$.throwableToUsefulException(HttpErrorHandler.scala:247)
    	at play.api.http.DefaultHttpErrorHandler.onServerError(HttpErrorHandler.scala:176)
    	at play.core.server.AkkaHttpServer$$anonfun$2.applyOrElse(AkkaHttpServer.scala:363)
    	at play.core.server.AkkaHttpServer$$anonfun$2.applyOrElse(AkkaHttpServer.scala:361)
    	at scala.concurrent.Future$$anonfun$recoverWith$1.apply(Future.scala:346)
    	at scala.concurrent.Future$$anonfun$recoverWith$1.apply(Future.scala:345)
    	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
    	at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
    	at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:92)
    	at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:92)
    	at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:92)
    	at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
    	at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:91)
    	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
    	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:49)
    	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
    Caused by: org.apache.shiro.crypto.CryptoException: Unable to execute 'doFinal' with cipher instance [javax.crypto.Cipher@2b4ff65a].
    	at org.apache.shiro.crypto.JcaCipherService.crypt(JcaCipherService.java:462)
    	at org.apache.shiro.crypto.JcaCipherService.crypt(JcaCipherService.java:445)
    	at org.apache.shiro.crypto.JcaCipherService.decrypt(JcaCipherService.java:390)
    	at org.apache.shiro.crypto.JcaCipherService.decrypt(JcaCipherService.java:382)
    	at org.pac4j.play.store.ShiroAesDataEncrypter.decrypt(ShiroAesDataEncrypter.java:42)
    	at org.pac4j.play.store.PlayCookieSessionStore.get(PlayCookieSessionStore.java:60)
    	at org.pac4j.play.store.PlayCookieSessionStore.get(PlayCookieSessionStore.java:29)
    	at org.pac4j.core.client.IndirectClient.getRedirectAction(IndirectClient.java:102)
    	at org.pac4j.core.client.IndirectClient.redirect(IndirectClient.java:79)
    	at controllers.AuthenticationController.redirectToIdentityProvider(AuthenticationController.java:151)
    	at controllers.AuthenticationController.authenticate(AuthenticationController.java:85)
    	at router.Routes$$anonfun$routes$1$$anonfun$applyOrElse$4$$anonfun$apply$4.apply(Routes.scala:374)
    	at router.Routes$$anonfun$routes$1$$anonfun$applyOrElse$4$$anonfun$apply$4.apply(Routes.scala:374)
    	at play.core.routing.HandlerInvokerFactory$$anon$3.resultCall(HandlerInvoker.scala:134)
    	at play.core.routing.HandlerInvokerFactory$$anon$3.resultCall(HandlerInvoker.scala:133)
    	at play.core.routing.HandlerInvokerFactory$JavaActionInvokerFactory$$anon$8$$anon$2$$anon$1.invocation(HandlerInvoker.scala:108)
    	at play.core.j.JavaAction$$anon$1.call(JavaAction.scala:88)
    	at play.http.DefaultActionCreator$1.call(DefaultActionCreator.java:31)
    	at play.core.j.JavaAction$$anonfun$9.apply(JavaAction.scala:138)
    	at play.core.j.JavaAction$$anonfun$9.apply(JavaAction.scala:138)
    	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    	at play.core.j.HttpExecutionContext$$anon$2.run(HttpExecutionContext.scala:56)
    	at play.api.libs.streams.Execution$trampoline$.execute(Execution.scala:70)
    	at play.core.j.HttpExecutionContext.execute(HttpExecutionContext.scala:48)
    	at scala.concurrent.impl.Future$.apply(Future.scala:31)
    	at scala.concurrent.Future$.apply(Future.scala:494)
    	at play.core.j.JavaAction.apply(JavaAction.scala:138)
    	at play.api.mvc.Action$$anonfun$apply$2.apply(Action.scala:96)
    	at play.api.mvc.Action$$anonfun$apply$2.apply(Action.scala:89)
    	at play.api.libs.streams.StrictAccumulator$$anonfun$mapFuture$2$$anonfun$1.apply(Accumulator.scala:174)
    	at play.api.libs.streams.StrictAccumulator$$anonfun$mapFuture$2$$anonfun$1.apply(Accumulator.scala:174)
    	at scala.util.Try$.apply(Try.scala:192)
    	at play.api.libs.streams.StrictAccumulator$$anonfun$mapFuture$2.apply(Accumulator.scala:174)
    	at play.api.libs.streams.StrictAccumulator$$anonfun$mapFuture$2.apply(Accumulator.scala:170)
    	at scala.Function1$$anonfun$andThen$1.apply(Function1.scala:52)
    	at play.api.libs.streams.StrictAccumulator.run(Accumulator.scala:207)
    	at play.core.server.AkkaHttpServer$$anonfun$14.apply(AkkaHttpServer.scala:357)
    	at play.core.server.AkkaHttpServer$$anonfun$14.apply(AkkaHttpServer.scala:355)
    	at akka.http.scaladsl.util.FastFuture$.akka$http$scaladsl$util$FastFuture$$strictTransform$1(FastFuture.scala:41)
    	at akka.http.scaladsl.util.FastFuture$$anonfun$transformWith$extension1$1.apply(FastFuture.scala:51)
    	at akka.http.scaladsl.util.FastFuture$$anonfun$transformWith$extension1$1.apply(FastFuture.scala:50)
    	... 13 common frames omitted
    Caused by: javax.crypto.AEADBadTagException: Tag mismatch!
    	at com.sun.crypto.provider.GaloisCounterMode.decryptFinal(GaloisCounterMode.java:620)
    	at com.sun.crypto.provider.CipherCore.finalNoPadding(CipherCore.java:1116)
    	at com.sun.crypto.provider.CipherCore.fillOutputBuffer(CipherCore.java:1053)
    	at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:853)
    	at com.sun.crypto.provider.AESCipher.engineDoFinal(AESCipher.java:446)
    	at javax.crypto.Cipher.doFinal(Cipher.java:2168)
    	at org.apache.shiro.crypto.JcaCipherService.crypt(JcaCipherService.java:459)
    	... 54 common frames omitted
  • damp-minister-31834

    11/23/2021, 8:44 AM
    Hi, all! I added a source to DataHub, and now I want to expand SchemaFieldDataType. What steps should I take? I found that the schema-related code is generated by avro_codegen.py, so I think the best way to expand SchemaFieldDataType is to add a new .pdl file to metadata-models/src/main/pegasus/com/linkedin/schema/ and then rebuild the project, right?
  • boundless-student-48844

    11/23/2021, 9:25 AM
    Hi team, we have some scalability concerns about the logic used to ingest from SQL-based sources, e.g. Hive. From what I read in the code in sql_common.py, it first gets the list of all schemas (L344). Next, for each schema, it gets the list of tables (L408). Lastly, for each table, it gets column info (L322). That means, for each run of ingestion, it triggers at least N + M statements against the SQL source (e.g. DESCRIBE <table> in Hive), where N is the number of tables and M is the number of schemas. In our case, we have over 80K tables in the Hive metastore. Empirically, we tried to ingest one big Hive schema with over 8K tables, and it took 2 hours to finish. If we scale this duration linearly to 80K tables, each Hive ingestion would take about 20 hours to finish, which is not acceptable. What's your thought or advice on this?
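    For reference, a trivial sketch of the linear extrapolation quoted above, using the numbers from the message:
    Copy code
    # Back-of-envelope check of the scaling estimate: ~8K tables took ~2 hours,
    # so 80K tables at the same per-table cost would take ~20 hours.
    tables_sampled = 8_000
    hours_sampled = 2.0
    total_tables = 80_000

    est_hours = hours_sampled * total_tables / tables_sampled
    print(f"Estimated full ingestion time: {est_hours:.0f} hours")  # -> 20 hours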
  • orange-flag-48535

    11/23/2021, 10:38 AM
    Hi, I'm trying to push an object graph into Datahub (a class hierarchy to be precise) and have run into a basic question about Datahub's support for custom types: Is it possible for a field within a schema to be of a custom type, so that when I click on the type's name it takes me to that type's definition? I've attached a screenshot so that it's more clear. In that pic, let's say AppFeatures is my class, and displayProperties is a field within that class. I want to represent that field as having the type DisplayProperties, and then be able to navigate to that definition of DisplayProperties by clicking on it. Is that possible in Datahub?
  • full-area-6720

    11/23/2021, 11:32 AM
    Hi, what is the procedure to use datahub in production?
  • boundless-scientist-520

    11/23/2021, 1:41 PM
    Hi, I'm trying to run a recipe for ingesting metadata from Feast:
    Copy code
    source:
      type: feast
      config:
        core_url: <s3://featurestore/dev/feature-store-dev.db> 
        env: "DEV"
        use_local_build: False
    sink:
      type: "datahub-rest"
      config:
        server: "<http://datahub-datahub-gms.datahub.svc.cluster.local:8080>"
    When I run the recipe, I get the following "ConnectionError":
    Copy code
    docker.errors.ContainerError: Command 'python3 ingest.py --core_url=<s3://featurestore/dev/feature-store-dev.db> --output_path=/out.json' in image 'acryldata/datahub-ingestion-feast-wrapper' returned non-zero exit status 1: b'Traceback (most recent call last):\n  File "/usr/local/lib/python3.8/site-packages/feast/grpc/grpc.py", line 48, in create_grpc_channel\n    grpc.channel_ready_future(channel).result(timeout=timeout)\n  File "/usr/local/lib/python3.8/site-packages/grpc/_utilities.py", line 140, in result\n    self._block(timeout)\n  File "/usr/local/lib/python3.8/site-packages/grpc/_utilities.py", line 86, in _block\n    raise grpc.FutureTimeoutError()\ngrpc.FutureTimeoutError\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File "ingest.py", line 138, in <module>\n    cli()\n  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 829, in __call__\n    return self.main(*args, **kwargs)\n  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 782, in main\n    rv = self.invoke(ctx)\n  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke\n    return ctx.invoke(self.callback, **ctx.params)\n  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 610, in invoke\n    return callback(*args, **kwargs)\n  File "ingest.py", line 26, in cli\n    tables = client.list_feature_tables()\n  File "/usr/local/lib/python3.8/site-packages/feast/client.py", line 683, in list_feature_tables\n    feature_table_protos = self._core_service.ListFeatureTables(\n  File "/usr/local/lib/python3.8/site-packages/feast/client.py", line 134, in _core_service\n    channel = create_grpc_channel(\n  File "/usr/local/lib/python3.8/site-packages/feast/grpc/grpc.py", line 51, in create_grpc_channel\n    raise ConnectionError(\nConnectionError: Connection timed out while attempting to connect to <s3://featurestore/dev/feature-store-dev.db>\n'
    I'm using datahub 0.8.17 and feast 0.14.1. In "core_url" I configured the Feast feature store .db file. I've seen in the Feast documentation that from version 0.10 onwards there have been changes to Feast Core:
    Copy code
    "Feast Core was replaced by a file-based (S3, GCS) registry: Feast Core is a metadata server that maintains and exposes an API of feature definitions. With Feast 0.10, we've moved this entire service into a single flat file that can be stored on either the local disk or in a central object store like S3 or GCS. The benefit of this change is that users don't need to maintain a database and a registry service, yet they can still access all the metadata they had before."
    I don't know if I am setting core_url correctly, or whether I need a Feast version lower than 0.10 with Feast Core. Does anyone have any ideas?
  • rich-policeman-92383

    11/23/2021, 2:25 PM
    Hello, I need help testing https://github.com/linkedin/datahub/blob/master/metadata-ingestion/examples/library/lineage_emitter_rest.py. Here's what I have tried:
    Copy code
    pip3 install acryl-datahub
    git clone <https://github.com/linkedin/datahub.git>
    git checkout tags/v0.8.17 -b datahub_v0.8.17
    cd metadata-ingestion/examples/library
    
    $ python3 lineage_emitter_rest.py
    Traceback (most recent call last):
      File "lineage_emitter_rest.py", line 1, in <module>
        import datahub.emitter.mce_builder as builder
    ModuleNotFoundError: No module named 'datahub'
    $ datahub version
    DataHub CLI version: 0.8.17.0
    Python version: 3.9.2 (default, Feb 24 2021, 13:26:01)
    [Clang 11.0.0 (clang-1100.0.33.17)]
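    One common cause of this pattern (the CLI works, but the script's import fails) is that pip3 and python3 point at different Python environments; a small sketch to check, with nothing DataHub-specific assumed:
    Copy code
    # Hedged sketch: confirm the interpreter running the script actually has
    # acryl-datahub installed.
    import sys

    print("Running under:", sys.executable)
    try:
        import datahub
        print("datahub package found at:", datahub.__file__)
    except ModuleNotFoundError:
        print("acryl-datahub is not installed for this interpreter; try:")
        print(f"  {sys.executable} -m pip install acryl-datahub")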
  • fancy-fireman-15263

    11/23/2021, 9:31 PM
    Quick question - does the airflow lineage backend send dag.docs_md as the description?
  • some-crayon-90964

    11/24/2021, 4:28 PM
    Question: If some metadata already exists in DataHub and we try to ingest exactly the same metadata, ideally GMS should ignore the new version when we use the EntityClient to ingest again, right?
  • red-pizza-28006

    11/24/2021, 4:31 PM
    When I am building lineage, it seems to me that if the dataset does not exist, we create an empty dataset and then build the lineage. Is it possible to optionally turn off this feature and just fail lineage building instead?
  • agreeable-hamburger-38305

    11/24/2021, 6:37 PM
    Hi team! I am struggling with regex here 😂 I added the $ at the end because I want an exact match of computational.calzone.sample_table.
    Copy code
    profile_pattern:
          allow:
            - ^computational\.calzone\.sample_table$
    and got this error
    Copy code
    UnboundVariable: ': unbound variable'
    Without the $ it works fine. Anyone know what might be the problem?
  • magnificent-camera-71872

    11/26/2021, 5:16 AM
    Hi folks, are there any plans to support AWS LakeFormation as a DataHub source? Our org is considering using LakeFormation chiefly for its centralised control of permissions and would like to feed this into DataHub if possible.
  • billions-twilight-48559

    11/27/2021, 9:19 AM
    Hi team! Can I ingest a generic dataset from a schema defined in a YAML file, like you can do with glossary terms? We want to document an existing corporate self-service portal where anyone can schedule periodic exports from APIs to data lakes, so we want to catalog which data this platform offers. I say generic dataset because it comes from a custom technology and not a specific database.
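    Whatever the answer on YAML support, a hedged sketch of pushing a "generic" dataset for a custom platform directly from Python; the platform name, dataset name, description, and GMS URL are placeholders, not an official recipe:
    Copy code
    # Hedged sketch: emit a dataset for a custom (non-database) platform via the REST emitter.
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DatasetPropertiesClass,
        DatasetSnapshotClass,
        MetadataChangeEventClass,
    )

    dataset_urn = builder.make_dataset_urn(
        platform="selfserviceportal",        # custom technology, not a specific database
        name="exports.customer_api_dump",
        env="PROD",
    )

    mce = MetadataChangeEventClass(
        proposedSnapshot=DatasetSnapshotClass(
            urn=dataset_urn,
            aspects=[DatasetPropertiesClass(description="Periodic API export catalogued from the self-service portal")],
        )
    )

    DatahubRestEmitter("http://localhost:8080").emit_mce(mce)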
  • bright-egg-4386

    11/27/2021, 12:17 PM
    Hello! I have a problem with metadata ingestion from the kafka-connect source. If my connector connects to SQL Server via JDBC, then ingestion fails because the SQL Server JDBC connection string does not conform to the RFC 1738 spec. 😞 Does anyone have a workaround for this?
  • limited-cricket-18852

    11/27/2021, 12:35 PM
    Hi there! I am using the Hive+Databricks ingest source plugin, but it doesn’t extract the column comments. Am I doing something wrong?
  • bright-egg-4386

    11/27/2021, 5:32 PM
    Hello! I also have problems with ingesting kafka-connect metadata. I get a constant error: Unable to emit metadata to DataHub GMS: Failed to validate record with class /platform :: “Provided urn kafka” is invalid
  • rich-policeman-92383

    11/29/2021, 6:15 AM
    Hi, is there any flag to ignore errors while ingesting Hive metadata using the datahub CLI?
  • orange-flag-48535

    11/29/2021, 6:42 AM
    Is it possible to include more than one MCE object in a single MCE JSON file? I'm currently looking at the source for file.py in the metadata ingestion module and I have my doubts about this - https://github.com/linkedin/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/file.py#L62
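    For what it's worth, the bootstrap metadata files shipped with the quickstart are JSON arrays, so a file holding several MCEs should be readable by the file source; a hedged sketch of producing such a file from Python (dataset names are placeholders, and to_obj() is assumed to produce the expected serialized form):
    Copy code
    # Hedged sketch: write two MCEs into a single JSON array for the file source.
    import json
    import datahub.emitter.mce_builder as builder
    from datahub.metadata.schema_classes import (
        DatasetPropertiesClass,
        DatasetSnapshotClass,
        MetadataChangeEventClass,
    )

    def demo_mce(name: str) -> MetadataChangeEventClass:
        return MetadataChangeEventClass(
            proposedSnapshot=DatasetSnapshotClass(
                urn=builder.make_dataset_urn("hive", name, "PROD"),
                aspects=[DatasetPropertiesClass(description=f"demo dataset {name}")],
            )
        )

    mces = [demo_mce("db.table_a"), demo_mce("db.table_b")]
    with open("two_mces.json", "w") as f:
        json.dump([mce.to_obj() for mce in mces], f, indent=2)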
  • microscopic-elephant-47912

    11/29/2021, 10:55 AM
    Hi team, I'm trying to ingest BigQuery metadata and lineage information. I imported the metadata, but I could not make the lineage part work. We use separate projects for data and executions. In that case, is it possible to ingest lineage information? If yes, how can I make it work? Thanks a lot.
  • damp-minister-31834

    11/29/2021, 11:53 AM
    Hi all! Now datahub is integrated well with airflow. However, my company use dolphinscheduler not airflow. So I want to ask about the approximate steps to integrate datahub with a new scheduling framework.
  • melodic-mouse-72847

    11/30/2021, 8:47 AM
    Hello, I'm trying to ingest stats from AWS Athena to Datahub. I enabled profiling and get the following warning:
    Copy code
    [2021-11-30 11:32:03,180] INFO     {datahub.ingestion.source.sql.sql_common:512} - Profiling example_db.table_example (this may take a while)
    [2021-11-30 11:32:06,581] WARNING  {great_expectations.dataset.sqlalchemy_dataset:1715} - No recognized sqlalchemy types in type_list for current dialect.
    [2021-11-30 11:32:09,996] WARNING  {great_expectations.dataset.sqlalchemy_dataset:1715} - No recognized sqlalchemy types in type_list for current dialect.
    [2021-11-30 11:32:13,366] WARNING  {great_expectations.dataset.sqlalchemy_dataset:1715} - No recognized sqlalchemy types in type_list for current dialect.
    [2021-11-30 11:32:16,821] WARNING  {great_expectations.dataset.sqlalchemy_dataset:1715} - No recognized sqlalchemy types in type_list for current dialect.
    [2021-11-30 11:32:20,337] WARNING  {great_expectations.dataset.sqlalchemy_dataset:1715} - No recognized sqlalchemy types in type_list for current dialect.
    [2021-11-30 11:32:23,738] WARNING  {great_expectations.dataset.sqlalchemy_dataset:1715} - No recognized sqlalchemy types in type_list for current dialect.
    [2021-11-30 11:33:26,261] INFO     {datahub.ingestion.run.pipeline:61} - sink wrote workunit profile-example_db.table_example
    The stats & queries tabs are still inactive. Here's my ingestion yml file
    Copy code
    source:
      type: athena
      config:
        aws_region: us-east-2
        work_group: primary
    
        username: ...
        password: ...
        database: example_db
    
        s3_staging_dir: "..."
        
        profiling:
          enabled: true
    
    sink:
      type: "datahub-rest"
      config:
        server: "<http://localhost:8080>"
    Am I doing something wrong?