# ingestion
  • colossal-easter-99672

    11/19/2021, 7:23 AM
    Hello team. Is there any way to use non-English characters?
  • full-area-6720

    11/19/2021, 8:29 AM
    Hi, what data is accessible to DataHub? Let's say I use my Redshift credentials: what data does DataHub have access to?
  • freezing-teacher-87574

    11/19/2021, 9:26 AM
    Hi, does DataHub work with Feast 0.14.1? Thanks.
  • high-hospital-85984

    11/19/2021, 11:41 AM
    👋 We're seeing some new errors from the Airflow lineage backend, using Airflow 2.2.2, originating from this line for DAGs where we set some value in the params dict of the DAG. There seems to have been some change to how that argument is handled internally. I'm trying to look into it.
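    For context, a minimal sketch of the change this likely refers to (an assumption on my part: in Airflow 2.2+ dag.params is a ParamsDict of Param objects rather than a plain dict, so code expecting raw values can break); the names below are illustrative only:
    Copy code
    # Hedged sketch: resolve an Airflow 2.2+ ParamsDict back into a plain dict
    # before serializing it (assumes ParamsDict item access returns resolved values).
    from airflow.models.param import Param, ParamsDict

    params = ParamsDict({"run_date": Param("2021-11-19", type="string")})
    plain = {key: params[key] for key in params}   # item access returns the resolved value
    print(plain)                                   # {'run_date': '2021-11-19'}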
  • quiet-kilobyte-82304

    11/19/2021, 6:13 PM
    @polite-flower-25924 @loud-island-88694 Following up on our discussion in the townhall Zoom chat: we’ve built a custom ingestion module to map table->table, table->view, or basically object->object dependencies in Redshift. This is the only rule where we have a dataset-to-dataset mapping; in most cases it's dataset -> airflow/databricks -> dataset. I can write something up, but it's fairly simple. We have our own SQL parser to capture view->view dependencies, but if the upstream dataset is a table then you can use Redshift's stl_scan system table to get the dependent objects (tables).
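    A minimal sketch of the stl_scan approach described above, assuming the standard Redshift system-table columns; the connection details, helper name, and downstream object are placeholders:
    Copy code
    # Hedged sketch: find permanent tables scanned by queries that touched a given
    # downstream object, using Redshift's stl_scan/stl_query system tables.
    # Matching on querytxt is only a rough proxy for "queries that wrote the object".
    import psycopg2  # assumes a Redshift-compatible Postgres driver is installed

    UPSTREAMS_SQL = """
    SELECT DISTINCT TRIM(s.perm_table_name) AS upstream_table
    FROM stl_scan  s
    JOIN stl_query q ON q.query = s.query
    WHERE q.querytxt ILIKE %s
      AND s.perm_table_name NOT LIKE 'Internal Worktable%%'
    """

    def upstream_tables(conn, downstream_object: str) -> list:
        """Return permanent tables scanned by queries that mention `downstream_object`."""
        with conn.cursor() as cur:
            cur.execute(UPSTREAMS_SQL, (f"%{downstream_object}%",))
            return [row[0] for row in cur.fetchall()]

    if __name__ == "__main__":
        conn = psycopg2.connect("host=<redshift-host> dbname=<db> user=<user> password=<password> port=5439")
        print(upstream_tables(conn, "analytics.my_view"))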
  • red-pizza-28006

    11/19/2021, 8:45 PM
    I have a question regarding Airflow lineage - https://datahubproject.io/docs/metadata-ingestion/#lineage-with-airflow. Is this the right way to do it?
  • limited-cricket-18852

    11/23/2021, 12:16 AM
    Hello! Is there a way to do datahub ingest, using the recipe YAML files, but in a Python script, not using the CLI? In the documentation I found this example, but I would like to input YAML files, which I find easier to version in git.
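    A minimal sketch of one way to do this, assuming the acryl-datahub package is installed: Pipeline.create takes the same dictionary structure as a recipe file, so the YAML can simply be loaded first (the file name is a placeholder):
    Copy code
    # Hedged sketch: run a recipe YAML through the ingestion pipeline from Python
    # instead of the CLI.
    import yaml
    from datahub.ingestion.run.pipeline import Pipeline

    with open("recipe.yml") as f:           # the recipe file versioned in git
        config = yaml.safe_load(f)

    pipeline = Pipeline.create(config)      # same config dict the CLI builds from the YAML
    pipeline.run()
    pipeline.raise_from_status()            # raise if the run reported failures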
  • agreeable-thailand-43234

    11/23/2021, 2:18 AM
    Hello!! 👋 I successfully imported pipelines from AWS Glue (they're all custom scripts). I'd like to delete some entries (as I changed the PATH), so now I'd like to either delete all of the objects and re-run the ingestion, or delete the PATH that I don't need. I ran this command:
    Copy code
    docker run --network=datahub_network -v /Users/edgar.valdez/Documents/datahub:/home/datahub linkedin/datahub-ingestion:head --debug delete --env DEV --entity_type pipeline -f
    For entity_type I tried pipeline, dataflow, and datajob, but none of them worked. Which is the right entity_type value I need in order to delete the entries? TIA
  • wooden-gpu-7761

    11/23/2021, 5:20 AM
    Hi team! I’ve recently been meddling with the BigQuery ingestion module; I'm mostly interested in being able to parse lineage info from GCP audit logs and tables. In most cases I'm really happy with DataHub overall, but I have a couple of issues regarding the current state of lineage parsing.
    1. Parsing lineage from a huge GCP project fails due to 503 errors on https://logging.googleapis.com/v2/entries:list, due to an internal timeout issue according to Google Support. It seems like fetching logs and aggregating them takes too long internally within GCP. I've tried to circumvent these issues by limiting page size, but with no success. -> What I think could be a good workaround would be to ingest from exported audit logs, which would in many cases be stored in the form of BigQuery tables.
    2. It is required that audit logs are in the same project as the datasets/tables to be ingested. I think separating dataset projects and query execution projects is a pretty common use case, and adding an option to define the projects from which audit logs are extracted would be pretty useful.
    Let me know what you think! I would be happy to contribute if possible 🙏. Thanks!
  • rhythmic-sundown-12093

    11/23/2021, 6:22 AM
    Hi team, I got an error when I configured the integrated OIDC service.
    Copy code
    AUTH_OIDC_SCOPE=openid
    AUTH_OIDC_CLIENT_ID=xxxx
    AUTH_OIDC_CLIENT_SECRET=xxx
    AUTH_OIDC_DISCOVERY_URI=<http://xxx>:xxx/.well-known/openid-configuration
    AUTH_OIDC_BASE_URL=<http://xxx:9002>
    log:
    Copy code
    ! @7ln75h6n8 - Internal server error, for (GET) [/authenticate?redirect_uri=%2F] ->
     
    play.api.UnexpectedException: Unexpected exception[CryptoException: Unable to execute 'doFinal' with cipher instance [javax.crypto.Cipher@2b4ff65a].]
    	at play.api.http.HttpErrorHandlerExceptions$.throwableToUsefulException(HttpErrorHandler.scala:247)
    	at play.api.http.DefaultHttpErrorHandler.onServerError(HttpErrorHandler.scala:176)
    	at play.core.server.AkkaHttpServer$$anonfun$2.applyOrElse(AkkaHttpServer.scala:363)
    	at play.core.server.AkkaHttpServer$$anonfun$2.applyOrElse(AkkaHttpServer.scala:361)
    	at scala.concurrent.Future$$anonfun$recoverWith$1.apply(Future.scala:346)
    	at scala.concurrent.Future$$anonfun$recoverWith$1.apply(Future.scala:345)
    	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
    	at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
    	at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:92)
    	at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:92)
    	at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:92)
    	at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
    	at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:91)
    	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
    	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:49)
    	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
    Caused by: org.apache.shiro.crypto.CryptoException: Unable to execute 'doFinal' with cipher instance [javax.crypto.Cipher@2b4ff65a].
    	at org.apache.shiro.crypto.JcaCipherService.crypt(JcaCipherService.java:462)
    	at org.apache.shiro.crypto.JcaCipherService.crypt(JcaCipherService.java:445)
    	at org.apache.shiro.crypto.JcaCipherService.decrypt(JcaCipherService.java:390)
    	at org.apache.shiro.crypto.JcaCipherService.decrypt(JcaCipherService.java:382)
    	at org.pac4j.play.store.ShiroAesDataEncrypter.decrypt(ShiroAesDataEncrypter.java:42)
    	at org.pac4j.play.store.PlayCookieSessionStore.get(PlayCookieSessionStore.java:60)
    	at org.pac4j.play.store.PlayCookieSessionStore.get(PlayCookieSessionStore.java:29)
    	at org.pac4j.core.client.IndirectClient.getRedirectAction(IndirectClient.java:102)
    	at org.pac4j.core.client.IndirectClient.redirect(IndirectClient.java:79)
    	at controllers.AuthenticationController.redirectToIdentityProvider(AuthenticationController.java:151)
    	at controllers.AuthenticationController.authenticate(AuthenticationController.java:85)
    	at router.Routes$$anonfun$routes$1$$anonfun$applyOrElse$4$$anonfun$apply$4.apply(Routes.scala:374)
    	at router.Routes$$anonfun$routes$1$$anonfun$applyOrElse$4$$anonfun$apply$4.apply(Routes.scala:374)
    	at play.core.routing.HandlerInvokerFactory$$anon$3.resultCall(HandlerInvoker.scala:134)
    	at play.core.routing.HandlerInvokerFactory$$anon$3.resultCall(HandlerInvoker.scala:133)
    	at play.core.routing.HandlerInvokerFactory$JavaActionInvokerFactory$$anon$8$$anon$2$$anon$1.invocation(HandlerInvoker.scala:108)
    	at play.core.j.JavaAction$$anon$1.call(JavaAction.scala:88)
    	at play.http.DefaultActionCreator$1.call(DefaultActionCreator.java:31)
    	at play.core.j.JavaAction$$anonfun$9.apply(JavaAction.scala:138)
    	at play.core.j.JavaAction$$anonfun$9.apply(JavaAction.scala:138)
    	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    	at play.core.j.HttpExecutionContext$$anon$2.run(HttpExecutionContext.scala:56)
    	at play.api.libs.streams.Execution$trampoline$.execute(Execution.scala:70)
    	at play.core.j.HttpExecutionContext.execute(HttpExecutionContext.scala:48)
    	at scala.concurrent.impl.Future$.apply(Future.scala:31)
    	at scala.concurrent.Future$.apply(Future.scala:494)
    	at play.core.j.JavaAction.apply(JavaAction.scala:138)
    	at play.api.mvc.Action$$anonfun$apply$2.apply(Action.scala:96)
    	at play.api.mvc.Action$$anonfun$apply$2.apply(Action.scala:89)
    	at play.api.libs.streams.StrictAccumulator$$anonfun$mapFuture$2$$anonfun$1.apply(Accumulator.scala:174)
    	at play.api.libs.streams.StrictAccumulator$$anonfun$mapFuture$2$$anonfun$1.apply(Accumulator.scala:174)
    	at scala.util.Try$.apply(Try.scala:192)
    	at play.api.libs.streams.StrictAccumulator$$anonfun$mapFuture$2.apply(Accumulator.scala:174)
    	at play.api.libs.streams.StrictAccumulator$$anonfun$mapFuture$2.apply(Accumulator.scala:170)
    	at scala.Function1$$anonfun$andThen$1.apply(Function1.scala:52)
    	at play.api.libs.streams.StrictAccumulator.run(Accumulator.scala:207)
    	at play.core.server.AkkaHttpServer$$anonfun$14.apply(AkkaHttpServer.scala:357)
    	at play.core.server.AkkaHttpServer$$anonfun$14.apply(AkkaHttpServer.scala:355)
    	at akka.http.scaladsl.util.FastFuture$.akka$http$scaladsl$util$FastFuture$$strictTransform$1(FastFuture.scala:41)
    	at akka.http.scaladsl.util.FastFuture$$anonfun$transformWith$extension1$1.apply(FastFuture.scala:51)
    	at akka.http.scaladsl.util.FastFuture$$anonfun$transformWith$extension1$1.apply(FastFuture.scala:50)
    	... 13 common frames omitted
    Caused by: javax.crypto.AEADBadTagException: Tag mismatch!
    	at com.sun.crypto.provider.GaloisCounterMode.decryptFinal(GaloisCounterMode.java:620)
    	at com.sun.crypto.provider.CipherCore.finalNoPadding(CipherCore.java:1116)
    	at com.sun.crypto.provider.CipherCore.fillOutputBuffer(CipherCore.java:1053)
    	at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:853)
    	at com.sun.crypto.provider.AESCipher.engineDoFinal(AESCipher.java:446)
    	at javax.crypto.Cipher.doFinal(Cipher.java:2168)
    	at org.apache.shiro.crypto.JcaCipherService.crypt(JcaCipherService.java:459)
    	... 54 common frames omitted
  • damp-minister-31834

    11/23/2021, 8:44 AM
    Hi, all! I added a source to DataHub, and now I want to expand SchemaFieldDataType. What steps should I take? I found that the schema-related code is generated by avro_codegen.py, so I think the best way to expand SchemaFieldDataType is to add a new .pdl file to metadata-models/src/main/pegasus/com/linkedin/schema/ and then rebuild the project, right?
  • boundless-student-48844

    11/23/2021, 9:25 AM
    Hi team, we have some scalability concerns about the logic used to ingest from SQL-based sources, e.g. Hive. From what I read in the code in sql_common.py, it first gets the list of all schemas (L344). Next, for each schema, it gets the list of tables (L408). Lastly, for each table, it gets column info (L322). That means, for each run of ingestion, it triggers at least N + M statements against the SQL source (e.g. DESCRIBE <table> in Hive), where N is the number of tables and M is the number of schemas. In our case, we have over 80K tables in the Hive metastore. Empirically, we tried to ingest one big Hive schema with over 8K tables, and it took 2 hours to finish. If we scale this duration linearly to 80K tables, each Hive ingestion would take about 20 hours to finish, which is not acceptable. What's your thought or advice on this?
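    For reference, a trivial sketch of the linear extrapolation quoted above, using the numbers from the message:
    Copy code
    # Back-of-envelope check of the scaling estimate: ~8K tables took ~2 hours,
    # so 80K tables at the same per-table cost would take ~20 hours.
    tables_sampled = 8_000
    hours_sampled = 2.0
    total_tables = 80_000

    est_hours = hours_sampled * total_tables / tables_sampled
    print(f"Estimated full ingestion time: {est_hours:.0f} hours")  # -> 20 hours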
  • orange-flag-48535

    11/23/2021, 10:38 AM
    Hi, I'm trying to push an object graph into Datahub (a class hierarchy to be precise) and have run into a basic question about Datahub's support for custom types: Is it possible for a field within a schema to be of a custom type, so that when I click on the type's name it takes me to that type's definition? I've attached a screenshot so that it's more clear. In that pic, let's say AppFeatures is my class, and displayProperties is a field within that class. I want to represent that field as having the type DisplayProperties, and then be able to navigate to that definition of DisplayProperties by clicking on it. Is that possible in Datahub?
  • full-area-6720

    11/23/2021, 11:32 AM
    Hi, what is the procedure to use datahub in production?
  • boundless-scientist-520

    11/23/2021, 1:41 PM
    Hi, I'm trying to run a recipe for ingesting metadata from Feast:
    Copy code
    source:
      type: feast
      config:
        core_url: <s3://featurestore/dev/feature-store-dev.db> 
        env: "DEV"
        use_local_build: False
    sink:
      type: "datahub-rest"
      config:
        server: "<http://datahub-datahub-gms.datahub.svc.cluster.local:8080>"
    When I run the recipe, I get the following "ConnectionError":
    Copy code
    docker.errors.ContainerError: Command 'python3 ingest.py --core_url=<s3://featurestore/dev/feature-store-dev.db> --output_path=/out.json' in image 'acryldata/datahub-ingestion-feast-wrapper' returned non-zero exit status 1: b'Traceback (most recent call last):\n  File "/usr/local/lib/python3.8/site-packages/feast/grpc/grpc.py", line 48, in create_grpc_channel\n    grpc.channel_ready_future(channel).result(timeout=timeout)\n  File "/usr/local/lib/python3.8/site-packages/grpc/_utilities.py", line 140, in result\n    self._block(timeout)\n  File "/usr/local/lib/python3.8/site-packages/grpc/_utilities.py", line 86, in _block\n    raise grpc.FutureTimeoutError()\ngrpc.FutureTimeoutError\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File "ingest.py", line 138, in <module>\n    cli()\n  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 829, in __call__\n    return self.main(*args, **kwargs)\n  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 782, in main\n    rv = self.invoke(ctx)\n  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke\n    return ctx.invoke(self.callback, **ctx.params)\n  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 610, in invoke\n    return callback(*args, **kwargs)\n  File "ingest.py", line 26, in cli\n    tables = client.list_feature_tables()\n  File "/usr/local/lib/python3.8/site-packages/feast/client.py", line 683, in list_feature_tables\n    feature_table_protos = self._core_service.ListFeatureTables(\n  File "/usr/local/lib/python3.8/site-packages/feast/client.py", line 134, in _core_service\n    channel = create_grpc_channel(\n  File "/usr/local/lib/python3.8/site-packages/feast/grpc/grpc.py", line 51, in create_grpc_channel\n    raise ConnectionError(\nConnectionError: Connection timed out while attempting to connect to <s3://featurestore/dev/feature-store-dev.db>\n'
    I'm using datahub 0.8.17 and feast 0.14.1. In "core_url" I configured the Feast feature store .db file. I've seen in the Feast documentation that from version 0.10 onwards there have been changes to Feast Core:
    Copy code
    "Feast Core was replaced by a file-based (S3, GCS) registry: Feast Core is a metadata server that maintains and exposes an API of feature definitions. With Feast 0.10, we've moved this entire service into a single flat file that can be stored on either the local disk or in a central object store like S3 or GCS. The benefit of this change is that users don't need to maintain a database and a registry service, yet they can still access all the metadata they had before."
    I don't know if I am setting core_url correctly, or whether I need a Feast version lower than 0.10 with Feast Core. Does anyone have any ideas?
  • rich-policeman-92383

    11/23/2021, 2:25 PM
    Hello, I need help testing https://github.com/linkedin/datahub/blob/master/metadata-ingestion/examples/library/lineage_emitter_rest.py. Here's what I have tried:
    Copy code
    pip3 install acryl-datahub
    git clone <https://github.com/linkedin/datahub.git>
    git checkout tags/v0.8.17 -b datahub_v0.8.17
    cd metadata-ingestion/examples/library
    
    $ python3 lineage_emitter_rest.py
    Traceback (most recent call last):
      File "lineage_emitter_rest.py", line 1, in <module>
        import datahub.emitter.mce_builder as builder
    ModuleNotFoundError: No module named 'datahub'
    $ datahub version
    DataHub CLI version: 0.8.17.0
    Python version: 3.9.2 (default, Feb 24 2021, 13:26:01)
    [Clang 11.0.0 (clang-1100.0.33.17)]
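    One common cause of this pattern (the CLI works, but the script's import fails) is that pip3 and python3 point at different Python environments; a small sketch to check, with nothing DataHub-specific assumed:
    Copy code
    # Hedged sketch: confirm the interpreter running the script actually has
    # acryl-datahub installed.
    import sys

    print("Running under:", sys.executable)
    try:
        import datahub
        print("datahub package found at:", datahub.__file__)
    except ModuleNotFoundError:
        print("acryl-datahub is not installed for this interpreter; try:")
        print(f"  {sys.executable} -m pip install acryl-datahub")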
  • fancy-fireman-15263

    11/23/2021, 9:31 PM
    Quick question - does the airflow lineage backend send dag.docs_md as the description?
  • some-crayon-90964

    11/24/2021, 4:28 PM
    Question: If some metadata already exists in DataHub and we try to ingest exactly the same metadata, ideally GMS should ignore the new version when we use the EntityClient to ingest again, right?
  • red-pizza-28006

    11/24/2021, 4:31 PM
    When I am building lineage, it seems to me that if the dataset does not exist, we create an empty dataset and then build the lineage. Is it possible to optionally turn off this feature and just fail lineage building instead?
  • agreeable-hamburger-38305

    11/24/2021, 6:37 PM
    Hi team! I am struggling with regex here 😂 I added the $ at the end because I want an exact match of computational.calzone.sample_table.
    Copy code
    profile_pattern:
          allow:
            - ^computational\.calzone\.sample_table$
    and got this error
    Copy code
    UnboundVariable: ': unbound variable'
    Without the $ it works fine. Anyone know what might be the problem?
  • magnificent-camera-71872

    11/26/2021, 5:16 AM
    Hi folks, are there any plans to support AWS LakeFormation as a DataHub source? Our org is considering using LakeFormation chiefly for its centralised control of permissions and would like to feed this into DataHub if possible.
  • billions-twilight-48559

    11/27/2021, 9:19 AM
    Hi team! Can I ingest a generic dataset from a schema defined in a YAML file, like you can do with glossary terms? We want to document an existing corporate self-service portal where anyone can schedule periodic exports from APIs to data lakes, so we want to catalog which data this platform offers. I say generic dataset because it comes from a custom technology and not a specific database.
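    Whatever the answer on YAML support, a hedged sketch of pushing a "generic" dataset for a custom platform directly from Python; the platform name, dataset name, description, and GMS URL are placeholders, not an official recipe:
    Copy code
    # Hedged sketch: emit a dataset for a custom (non-database) platform via the REST emitter.
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DatasetPropertiesClass,
        DatasetSnapshotClass,
        MetadataChangeEventClass,
    )

    dataset_urn = builder.make_dataset_urn(
        platform="selfserviceportal",        # custom technology, not a specific database
        name="exports.customer_api_dump",
        env="PROD",
    )

    mce = MetadataChangeEventClass(
        proposedSnapshot=DatasetSnapshotClass(
            urn=dataset_urn,
            aspects=[DatasetPropertiesClass(description="Periodic API export catalogued from the self-service portal")],
        )
    )

    DatahubRestEmitter("http://localhost:8080").emit_mce(mce)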
  • bright-egg-4386

    11/27/2021, 12:17 PM
    Hello! I have a problem with metadata ingestion from the kafka-connect source. If my connector connects to SQL Server via JDBC, then ingestion fails because the SQL Server JDBC connection string does not conform to the RFC 1738 spec. 😞 Does anyone have a workaround for this?
  • limited-cricket-18852

    11/27/2021, 12:35 PM
    Hi there! I am using the Hive+Databricks ingest source plugin, but it doesn’t extract the column comments. Am I doing something wrong?
  • bright-egg-4386

    11/27/2021, 5:32 PM
    Hello! I also have problems with ingesting kafka-connect metadata. I get a constant error: Unable to emit metadata to DataHub GMS: Failed to validate record with class /platform :: “Provided urn kafka” is invalid
  • rich-policeman-92383

    11/29/2021, 6:15 AM
    Hi, is there any flag to ignore errors while ingesting Hive metadata using the datahub CLI?
  • orange-flag-48535

    11/29/2021, 6:42 AM
    Is it possible to include more than one MCE object in a single MCE JSON file? I'm currently looking at the source for file.py in the metadata ingestion module and I have my doubts about this - https://github.com/linkedin/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/file.py#L62
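    For what it's worth, the bootstrap metadata files shipped with the quickstart are JSON arrays, so a file holding several MCEs should be readable by the file source; a hedged sketch of producing such a file from Python (dataset names are placeholders, and to_obj() is assumed to produce the expected serialized form):
    Copy code
    # Hedged sketch: write two MCEs into a single JSON array for the file source.
    import json
    import datahub.emitter.mce_builder as builder
    from datahub.metadata.schema_classes import (
        DatasetPropertiesClass,
        DatasetSnapshotClass,
        MetadataChangeEventClass,
    )

    def demo_mce(name: str) -> MetadataChangeEventClass:
        return MetadataChangeEventClass(
            proposedSnapshot=DatasetSnapshotClass(
                urn=builder.make_dataset_urn("hive", name, "PROD"),
                aspects=[DatasetPropertiesClass(description=f"demo dataset {name}")],
            )
        )

    mces = [demo_mce("db.table_a"), demo_mce("db.table_b")]
    with open("two_mces.json", "w") as f:
        json.dump([mce.to_obj() for mce in mces], f, indent=2)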
  • microscopic-elephant-47912

    11/29/2021, 10:55 AM
    Hi team, I'm trying to ingest BigQuery metadata and lineage information. I imported the metadata, but I could not make the lineage part work. We use separate projects for data and executions. In that case, is it possible to ingest lineage information? If yes, how can I make it work? Thanks a lot.
  • damp-minister-31834

    11/29/2021, 11:53 AM
    Hi all! Now datahub is integrated well with airflow. However, my company use dolphinscheduler not airflow. So I want to ask about the approximate steps to integrate datahub with a new scheduling framework.
  • melodic-mouse-72847

    11/30/2021, 8:47 AM
    Hello, I'm trying to ingest stats from AWS Athena to Datahub. I enabled profiling and get the following warning:
    Copy code
    [2021-11-30 11:32:03,180] INFO     {datahub.ingestion.source.sql.sql_common:512} - Profiling example_db.table_example (this may take a while)
    [2021-11-30 11:32:06,581] WARNING  {great_expectations.dataset.sqlalchemy_dataset:1715} - No recognized sqlalchemy types in type_list for current dialect.
    [2021-11-30 11:32:09,996] WARNING  {great_expectations.dataset.sqlalchemy_dataset:1715} - No recognized sqlalchemy types in type_list for current dialect.
    [2021-11-30 11:32:13,366] WARNING  {great_expectations.dataset.sqlalchemy_dataset:1715} - No recognized sqlalchemy types in type_list for current dialect.
    [2021-11-30 11:32:16,821] WARNING  {great_expectations.dataset.sqlalchemy_dataset:1715} - No recognized sqlalchemy types in type_list for current dialect.
    [2021-11-30 11:32:20,337] WARNING  {great_expectations.dataset.sqlalchemy_dataset:1715} - No recognized sqlalchemy types in type_list for current dialect.
    [2021-11-30 11:32:23,738] WARNING  {great_expectations.dataset.sqlalchemy_dataset:1715} - No recognized sqlalchemy types in type_list for current dialect.
    [2021-11-30 11:33:26,261] INFO     {datahub.ingestion.run.pipeline:61} - sink wrote workunit profile-example_db.table_example
    The stats & queries tabs are still inactive. Here's my ingestion yml file
    Copy code
    source:
      type: athena
      config:
        aws_region: us-east-2
        work_group: primary
    
        username: ...
        password: ...
        database: example_db
    
        s3_staging_dir: "..."
        
        profiling:
          enabled: true
    
    sink:
      type: "datahub-rest"
      config:
        server: "<http://localhost:8080>"
    Am I doing something wrong?