# troubleshoot
  • b

    bitter-lawyer-49179

    02/17/2023, 8:24 AM
    Hi everyone, I am facing a weird (and annoying) issue, which I'm not sure is related to DataHub or Airflow (MWAA). I am trying to set up the Airflow integration using a Kafka connection id. I followed all the steps in the Airflow integration documentation, but made the mistake of setting the connection id as
    datahub_rest_default
    while using
    datahub_kafka
    connection type. Obviously this caused issues while emitting metadata towards the end of my job. I noticed this and corrected the connection id to
    datahub_kafka_default
    . However, either airflow or the datahub plugin is not picking up on this change and is continuously looking for
    datahub_rest_default
    connection id instead of the corrected one. I deleted the connection and set it up again from scratch, but no luck.
    Copy code
    [2023-02-17, 08:12:03 UTC] {{logging_mixin.py:109}} INFO - Exception: Traceback (most recent call last):
      File "/usr/local/airflow/.local/lib/python3.7/site-packages/datahub_provider/_plugin.py", line 284, in custom_on_success_callback
        datahub_task_status_callback(context, status=InstanceRunResult.SUCCESS)
      File "/usr/local/airflow/.local/lib/python3.7/site-packages/datahub_provider/_plugin.py", line 136, in datahub_task_status_callback
        DatahubGenericHook(context["_datahub_config"].datahub_conn_id)
      File "/usr/local/airflow/.local/lib/python3.7/site-packages/datahub_provider/hooks/datahub.py", line 189, in get_underlying_hook
        conn = self.get_connection(self.datahub_conn_id)
      File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/hooks/base.py", line 68, in get_connection
        conn = Connection.get_connection_from_secrets(conn_id)
      File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/models/connection.py", line 410, in get_connection_from_secrets
        raise AirflowNotFoundException(f"The conn_id `{conn_id}` isn't defined")
    airflow.exceptions.AirflowNotFoundException: The conn_id `datahub_rest_default` isn't defined
    Any help with this is much appreciated. PS: I tested this whole setup on a local instance of Airflow before setting it up on AWS MWAA, and I did not face this issue locally; it picked up the Kafka default connection and emitted metadata. (But I also did not make the connection id configuration mistake on my local setup.)
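    For reference, the plugin reads its connection id from the Airflow config rather than from the DAG, and falls back to datahub_rest_default when nothing is set, which would match what the traceback shows. A minimal sketch assuming the documented [datahub] options; on MWAA these are usually supplied as "Airflow configuration options" (e.g. datahub.conn_id), which is an assumption about your environment:
    Copy code
    [datahub]
    enabled = true
    conn_id = datahub_kafka_default
    cluster = prod
    capture_ownership_info = true
    capture_tags_info = true
    graceful_exceptions = true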
    👀 1
    d
    • 2
    • 17
  • p

    powerful-telephone-2424

    02/17/2023, 9:40 AM
    Hi DataHub team, I’m seeing this weird error when trying to ingest Snowflake datasets:
    Copy code
    {snowflake.connector.CArrowIterator:372} - [Snowflake Exception] unknown arrow internal data type(0) for TIMESTAMP_LTZ data
    Weirdly it was working before this and then started failing on a second run a few mins later. Can anyone help me understand this issue? More detailed log:
    Copy code
    > datahub ingest -c ../../zwa-snowflake.recipe.yaml
    .
    .
    .
    [2023-02-17 01:35:46,224] INFO     {datahub.cli.ingest_cli:120} - Starting metadata ingestion
    [2023-02-17 01:35:47,083] INFO     {datahub.ingestion.source.snowflake.snowflake_v2:1389} - Checking current version
    [2023-02-17 01:35:47,247] INFO     {datahub.ingestion.source.snowflake.snowflake_v2:1395} - Checking current role
    [2023-02-17 01:35:47,328] INFO     {datahub.ingestion.source.snowflake.snowflake_v2:1401} - Checking current warehouse
    [2023-02-17 01:35:47,424] INFO     {datahub.ingestion.source.snowflake.snowflake_v2:1408} - Checking current edition
    [2023-02-17 01:35:49,021] ERROR    {snowflake.connector.CArrowIterator:372} - [Snowflake Exception] unknown arrow internal data type(0) for TIMESTAMP_LTZ data
    [2023-02-17 01:35:49,032] INFO     {datahub.ingestion.source.snowflake.snowflake_v2:627} - The role REVEFI_ROLE has `MANAGE GRANTS` privilege. This is not advisable and also not required.
    [2023-02-17 01:35:49,565] ERROR    {snowflake.connector.CArrowIterator:372} - [Snowflake Exception] unknown arrow internal data type(0) for TIMESTAMP_LTZ data
    [2023-02-17 01:35:49,566] INFO     {datahub.ingestion.source.snowflake.snowflake_v2:627} - The role REVEFI_ROLE has `MANAGE GRANTS` privilege. This is not advisable and also not required.
    [2023-02-17 01:35:49,566] ERROR    {datahub.ingestion.source.snowflake.snowflake_v2:224} - permission-error => No databases found. Please check permissions.
    [2023-02-17 01:35:49,763] INFO     {datahub.cli.ingest_cli:133} - Finished metadata ingestion
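    For reference, a minimal sketch of a Snowflake recipe using the documented source keys (all values are placeholders); the permission-error line at the end usually means the configured role cannot see any databases, independent of the Arrow warning:
    Copy code
    source:
      type: snowflake
      config:
        account_id: "abc12345.us-east-1"   # placeholder
        warehouse: "COMPUTE_WH"            # placeholder
        role: "REVEFI_ROLE"
        username: "${SNOWFLAKE_USER}"
        password: "${SNOWFLAKE_PASS}"
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"    # placeholder GMS address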
    ✅ 1
    d
    • 2
    • 2
  • r

    rich-pager-68736

    02/17/2023, 9:44 AM
    Hi DataHub-Team, I just encountered a few Snowflake ingestion errors, namely 413 client errors. Is there any way to tweak this, so bigger requests are allowed?
    Copy code
    'failures': [{'error': 'Unable to emit metadata to DataHub GMS',
                   'info': {'message': '413 Client Error: Request Entity Too Large for url: '
                                       '<https://xxxxxxxxx/api/gms/aspects?action=ingestProposal>',
                            'id': 'urn:li:dataset:(urn:li:dataPlatform:snowflake,YYYYYYYYYYYYYYYY,PROD)'}},
                  {'error': 'Unable to emit metadata to DataHub GMS',
                   'info': {'message': '413 Client Error: Request Entity Too Large for url: '
                                       '<https://xxxxxxxxx/api/gms/aspects?action=ingestProposal>',
                            'id': 'urn:li:dataset:(urn:li:dataPlatform:snowflake,YYYYYYYYYYYYYYYY,PROD)'}},
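    If GMS sits behind an nginx ingress (an assumption here, based on the 413 coming back from the proxy), the usual knob is the ingress body-size annotation rather than anything in DataHub itself; a hedged sketch:
    Copy code
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: datahub-gms          # placeholder ingress name
      annotations:
        # raise nginx's request-body limit (the nginx default is 1m)
        nginx.ingress.kubernetes.io/proxy-body-size: "50m"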
    ✅ 1
    • 1
    • 2
  • b

    bitter-lawyer-49179

    02/17/2023, 11:01 AM
    airflow_mwaa_kafka_error
    ✅ 1
    a
    • 2
    • 2
  • r

    rhythmic-ram-79027

    02/17/2023, 1:03 PM
    Hi Team, currently I am facing a slowness issue in the DataHub UI, even in the login process, which takes about 7 seconds. It also happens in other operations like table search and opening glossary terms: very slow response in the UI, and sometimes it throws an error in the UI [Failed to load results! An unexpected error occurred]. However, there is no error in the datahub-gms log; the only message I found in the log when the error is shown in the UI is: WARN o.s.w.s.m.s.DefaultHandlerExceptionResolver:208 - Resolved [org.springframework.web.context.request.async.AsyncRequestTimeoutException]. I tried both version v0.9.6.1 and v0.10.0. Here are the details of the version and spec I use: • DataHub v0.10.0 • In total there are 3.9K datasets, 43 glossary terms • docker compose file https://github.com/datahub-project/datahub/blob/v0.10.0/docker/quickstart/docker-compose-without-neo4j.quickstart.yml • Memory info (free -m): Mem total 7780, used 7447, free 109, shared 0, buff/cache 223, available 186; Swap total 20391, used 4458, free 15933. Could you please advise? Thanks in advance!
    ✅ 1
    👀 1
    i
    • 2
    • 10
  • g

    glamorous-elephant-17130

    02/17/2023, 1:14 PM
    Guys, I'm getting this error when trying to install datahub-action[slack].
    ✅ 1
    r
    • 2
    • 1
  • e

    elegant-salesmen-99143

    02/17/2023, 3:54 PM
    Hi. We tried to enable token-based authentication today, so we needed to add a token to our ingestion recipes, if I understand the documentation correctly. I generated a token and added this to the recipe:
    sink:
      type: datahub-rest
      config:
        server: '<https://datahub-stage.XXX.XX/gms>'
        token: '<[my token]!>'
    But I still get an authorization error in the ingest logs:
    Copy code
    '[2023-02-17 14:50:02,607] INFO    {datahub.ingestion.run.pipeline:196} - Source configured successfully.\n'
               '[2023-02-17 14:50:02,609] INFO    {datahub.cli.ingest_cli:120} - Starting metadata ingestion\n'
               '[2023-02-17 14:50:02,724] ERROR   {datahub.ingestion.run.pipeline:62} - failed to write record with workunit '
               "urn:li:container:ba0a76bc8027a651a2a4c28d02d447d8-containerProperties with ('Unable to emit metadata to DataHub GMS', {'message': '401 "
               "Client Error: Unauthorized for url: <https://datahub-stage.XXX.XX/gms/aspects?action=ingestProposal>', 'id': "
               "'urn:li:container:ba0a76bc8027a651a2a4c28d02d447d8'}) and info {'message': '401 Client Error: Unauthorized for url: "
               "<https://datahub-stage.XXX.XX/gms/aspects?action=ingestProposal>', 'id': 'urn:li:container:ba0a76bc8027a651a2a4c28d02d447d8'}\n"
               '[2023-02-17 14:50:02,730] ERROR   {datahub.ingestion.run.pipeline:62} - failed to write record with workunit '
               "urn:li:container:ba0a76bc8027a651a2a4c28d02d447d8-status with ('Unable to emit metadata to DataHub GMS', {'message': '401 Client Error: "
               "Unauthorized for url: <https://datahub-stage.XXX.XX/gms/aspects?action=ingestProposal>', 'id': "
               "'urn:li:container:ba0a76bc8027a651a2a4c28d02d447d8'}) and info {'message': '401 Client Error: Unauthorized for url: "
    What am I doing wrong? PS. We're on 0.9.6.1 if it's relevant
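    For comparison, a sketch of the same sink with the token pulled from an environment variable via the recipe's ${VAR} expansion, which rules out stray characters sneaking into the Authorization header (the variable name is just an example):
    Copy code
    sink:
      type: datahub-rest
      config:
        server: 'https://datahub-stage.XXX.XX/gms'
        token: '${DATAHUB_GMS_TOKEN}'   # export DATAHUB_GMS_TOKEN before running the ingestion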
    s
    • 2
    • 3
  • a

    adorable-river-99503

    02/17/2023, 7:36 PM
    I'm having issues with ingestion. I have two Docker setups, one for DataHub and one for my dbt instance. I'm wondering how to get the manifest and catalog JSON from my dbt into the DataHub Docker quickstart?
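    A minimal sketch of a dbt recipe with the documented keys; the catch is that the paths must be visible to whichever process runs the ingestion (the datahub-actions container for UI ingestion, or the host for CLI ingestion), so the dbt target/ directory typically has to be mounted or copied there first:
    Copy code
    source:
      type: dbt
      config:
        manifest_path: "/data/dbt/target/manifest.json"   # placeholder path
        catalog_path: "/data/dbt/target/catalog.json"     # placeholder path
        target_platform: "postgres"                       # whichever warehouse dbt runs against
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"                   # placeholder GMS address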
    a
    • 2
    • 2
  • e

    eager-lunch-3897

    02/17/2023, 8:36 PM
    Hi - when deploying a fresh instance of DataHub v0.10.0 via the helm charts (not upgrading), do we need to set
    datahubUpgrade.enabled: false
    ? It defaults to true and we are running into an issue with the auth-secrets not existing yet and the job never running due to the pre-install helm hooks. Any ideas?
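    For reference, the values.yaml toggle in question (a sketch; whether disabling it is the right fix depends on the chart version and on how the auth secrets are created):
    Copy code
    datahubUpgrade:
      enabled: false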
    i
    o
    +3
    • 6
    • 22
  • g

    glamorous-elephant-17130

    02/18/2023, 3:45 PM
    Following the guide Deploying to AWS | DataHub (datahubproject.io). I've tried multiple times, but my prerequisites-cp-schema-registry pod keeps failing.
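    A hedged diagnostic sketch for narrowing this down (the pod name is whatever kubectl reports for the schema registry):
    Copy code
    kubectl get pods | grep cp-schema-registry
    kubectl describe pod <cp-schema-registry-pod>   # check Events for scheduling / image-pull / probe failures
    kubectl logs <cp-schema-registry-pod>           # check for broker-connectivity errors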
    o
    b
    +2
    • 5
    • 19
  • b

    bitter-lawyer-49179

    02/18/2023, 4:36 PM
    Hi everyone! Is there any other configuration that I need to set up, in addition to following the steps in this documentation for Airflow? https://datahubproject.io/docs/lineage/airflow In my local setup, where I'm simply running the
    lineage_backend_demo.py
    DAG, I don't see the Airflow task & icon between my upstream table (inlet) and the downstream one (outlet). I thought that I'd see Airflow as a platform and see it as part of my lineage (like the lineage with the Redshift tables).
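    For reference, lineage_backend_demo.py is meant to exercise the lineage backend, which (per the docs of that era) is registered in airflow.cfg in addition to installing the package; a sketch assuming the documented keys, only relevant if you're on the lineage-backend route rather than the plugin:
    Copy code
    [lineage]
    backend = datahub_provider.lineage.datahub.DatahubLineageBackend
    datahub_kwargs = {
        "enabled": true,
        "datahub_conn_id": "datahub_rest_default",
        "cluster": "prod",
        "capture_ownership_info": true,
        "capture_tags_info": true,
        "graceful_exceptions": true }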
  • g

    gentle-lifeguard-88494

    02/18/2023, 4:39 PM
    Trying to figure out how to add a custom metadata model following these instructions: A Custom Metadata Model. I've cloned the datahub repo and used ./gradlew build to build the project. Now that it's built, I have been trying to get the custom aspect "customDataQualityRules" built. When I check localhost:8080/config, I don't have any models currently. What I'm not sure about is how the model gets built. I cd into the metadata-models-custom folder and then run ../gradlew build, which gives me the output in the first screenshot. But the model doesn't update when I check localhost:8080/config. I was watching

    Upcoming Feature Alert!! No-Code Extensions in Datahub

    for help, and saw that Shirshanka took a generated folder and put it in ~/.datahub/plugins/models/. I think I found the generated file in the dist folder, but when I copied it into ~/.datahub/plugins/models/, my localhost:8080/config still looks the same. I am not familiar with gradle or Java, so there could be something obvious that I'm missing. I also noticed the src folder updated with some new files; not sure if there's anything I'm supposed to do with those either. I also confirmed that the aspect doesn't exist by running insert_custom_aspect.py. Anyway, I tried to explain as much as possible, so hopefully it's easy to see what I'm doing wrong. Any help would be appreciated to get this test sample working, thanks!
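    For what it's worth, a sketch of the deploy flow described in the custom-model guide (assuming the modelDeploy task exists in your checkout); GMS generally needs a restart afterwards before /config reflects the new aspect:
    Copy code
    cd metadata-models-custom
    ../gradlew build
    ../gradlew modelDeploy        # copies the built artifact into ~/.datahub/plugins/models/
    # then restart datahub-gms so it loads the plugin models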
    ✅ 2
    o
    • 2
    • 32
  • b

    best-wire-59738

    02/20/2023, 3:47 AM
    Hello Team, could you please point me to the folder where I can find all the ingestion logs in the actions pod? We need to pull those logs out to a separate platform for easier debugging.
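    A hedged sketch for poking around in the pod; the /tmp/datahub/ingest location is an assumption about where UI-triggered runs write their artifacts, so treat it as a starting point rather than a documented path:
    Copy code
    kubectl exec -it <datahub-actions-pod> -- sh
    ls -R /tmp/datahub/ingest    # assumed location of per-run ingestion logs/artifacts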
    o
    • 2
    • 3
  • c

    creamy-tent-10151

    02/20/2023, 5:06 AM
    Hi all, we are trying to set permissions for S3 entities and figured the way to do this was through domains. However, we have 48k+ entities and were wondering if there is a way to quickly add all S3 entities to one domain.
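    One documented route is attaching a transformer to the S3 ingestion recipe so that every dataset it emits is assigned to the domain; a sketch assuming the simple_add_dataset_domain transformer (the domain URN is a placeholder):
    Copy code
    transformers:
      - type: simple_add_dataset_domain
        config:
          domains:
            - "urn:li:domain:s3-data"   # placeholder domain urn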
    s
    • 2
    • 2
  • f

    flat-painter-78331

    02/20/2023, 7:42 AM
    Hi team, good day!! I'm getting this error when trying to log in to DataHub: "Failed to log in! An unexpected error occurred". And these are the logs shown in my datahub-gms and datahub-frontend pods, in order:
    Copy code
    AuthenticatorChain:80 - Authentication chain failed to resolve a valid authentication. Errors: [(com.datahub.authentication.authenticator.DataHubSystemAuthenticator,Failed to authenticate inbound request: Provided credentials do not match known system client id & client secret. Check your configuration values...), (com.datahub.authentication.authenticator.DataHubTokenAuthenticator,Failed to authenticate inbound request: Authorization header missing 'Bearer' prefix.)]
    Copy code
    ! @80jbm284n - Internal server error, for (POST) [/logIn] ->
    2/17/2023 3:08:25 PM 
    2/17/2023 3:08:25 PM play.api.UnexpectedException: Unexpected exception[RuntimeException: Failed to generate session token for user]
    2/17/2023 3:08:25 PM 	at play.api.http.HttpErrorHandlerExceptions$.throwableToUsefulException(HttpErrorHandler.scala:358)
    2/17/2023 3:08:25 PM 	at play.api.http.DefaultHttpErrorHandler.onServerError(HttpErrorHandler.scala:264)
    2/17/2023 3:08:25 PM 	at play.core.server.AkkaHttpServer$$anonfun$2.applyOrElse(AkkaHttpServer.scala:436)
    2/17/2023 3:08:25 PM 	at play.core.server.AkkaHttpServer$$anonfun$2.applyOrElse(AkkaHttpServer.scala:428)
    2/17/2023 3:08:25 PM 	at scala.concurrent.Future.$anonfun$recoverWith$1(Future.scala:417)
    2/17/2023 3:08:25 PM 	at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41)
    2/17/2023 3:08:25 PM 	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
    2/17/2023 3:08:25 PM 	at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:63)
    2/17/2023 3:08:25 PM 	at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:100)
    2/17/2023 3:08:25 PM 	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    2/17/2023 3:08:25 PM 	at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85)
    2/17/2023 3:08:25 PM 	at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:100)
    2/17/2023 3:08:25 PM 	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:49)
    2/17/2023 3:08:25 PM 	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:48)
    2/17/2023 3:08:25 PM 	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
    2/17/2023 3:08:25 PM 	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
    2/17/2023 3:08:25 PM 	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
    2/17/2023 3:08:25 PM 	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
    2/17/2023 3:08:25 PM 	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
    2/17/2023 3:08:25 PM Caused by: java.lang.RuntimeException: Failed to generate session token for user
    2/17/2023 3:08:25 PM 	at client.AuthServiceClient.generateSessionTokenForUser(AuthServiceClient.java:101)
    2/17/2023 3:08:25 PM 	at controllers.AuthenticationController.logIn(AuthenticationController.java:182)
    2/17/2023 3:08:25 PM 	at router.Routes$$anonfun$routes$1.$anonfun$applyOrElse$17(Routes.scala:581)
    2/17/2023 3:08:25 PM 	at play.core.routing.HandlerInvokerFactory$$anon$8.resultCall(HandlerInvoker.scala:150)
    2/17/2023 3:08:25 PM 	at play.core.routing.HandlerInvokerFactory$$anon$8.resultCall(HandlerInvoker.scala:149)
    2/17/2023 3:08:25 PM 	at play.core.routing.HandlerInvokerFactory$JavaActionInvokerFactory$$anon$3$$anon$4$$anon$5.invocation(HandlerInvoker.scala:115)
    2/17/2023 3:08:25 PM 	at play.core.j.JavaAction$$anon$1.call(JavaAction.scala:119)
    2/17/2023 3:08:25 PM 	at play.http.DefaultActionCreator$1.call(DefaultActionCreator.java:33)
    2/17/2023 3:08:25 PM 	at play.core.j.JavaAction.$anonfun$apply$8(JavaAction.scala:175)
    2/17/2023 3:08:25 PM 	at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
    2/17/2023 3:08:25 PM 	at scala.util.Success.$anonfun$map$1(Try.scala:255)
    2/17/2023 3:08:25 PM 	at scala.util.Success.map(Try.scala:213)
    2/17/2023 3:08:25 PM 	at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
    2/17/2023 3:08:25 PM 	at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
    2/17/2023 3:08:25 PM 	at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
    2/17/2023 3:08:25 PM 	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
    2/17/2023 3:08:25 PM 	at play.core.j.HttpExecutionContext.$anonfun$execute$1(HttpExecutionContext.scala:64)
    2/17/2023 3:08:25 PM 	at play.api.libs.streams.Execution$trampoline$.execute(Execution.scala:70)
    2/17/2023 3:08:25 PM 	at play.core.j.HttpExecutionContext.execute(HttpExecutionContext.scala:59)
    2/17/2023 3:08:25 PM 	at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72)
    2/17/2023 3:08:25 PM 	at scala.concurrent.impl.Promise$KeptPromise$Kept.onComplete(Promise.scala:372)
    2/17/2023 3:08:25 PM 	at scala.concurrent.impl.Promise$KeptPromise$Kept.onComplete$(Promise.scala:371)
    2/17/2023 3:08:25 PM 	at scala.concurrent.impl.Promise$KeptPromise$Successful.onComplete(Promise.scala:379)
    2/17/2023 3:08:25 PM 	at scala.concurrent.impl.Promise.transform(Promise.scala:33)
    2/17/2023 3:08:25 PM 	at scala.concurrent.impl.Promise.transform$(Promise.scala:31)
    2/17/2023 3:08:25 PM 	at scala.concurrent.impl.Promise$KeptPromise$Successful.transform(Promise.scala:379)
    2/17/2023 3:08:25 PM 	at scala.concurrent.Future.map(Future.scala:292)
    2/17/2023 3:08:25 PM 	at scala.concurrent.Future.map$(Future.scala:292)
    2/17/2023 3:08:25 PM 	at scala.concurrent.impl.Promise$KeptPromise$Successful.map(Promise.scala:379)
    2/17/2023 3:08:25 PM 	at scala.concurrent.Future$.apply(Future.scala:659)
    2/17/2023 3:08:25 PM 	at play.core.j.JavaAction.apply(JavaAction.scala:176)
    2/17/2023 3:08:25 PM 	at play.api.mvc.Action.$anonfun$apply$4(Action.scala:82)
    2/17/2023 3:08:25 PM 	at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307)
    2/17/2023 3:08:25 PM 	... 14 common frames omitted
    2/17/2023 3:08:25 PM Caused by: java.lang.RuntimeException: Bad response from the Metadata Service: HTTP/1.1 401 Unauthorized ResponseEntityProxy{[Content-Type: text/html;charset=iso-8859-1,Content-Length: 567,Chunked: false]}
    2/17/2023 3:08:25 PM 	at client.AuthServiceClient.generateSessionTokenForUser(AuthServiceClient.java:97)
    2/17/2023 3:08:25 PM 	... 46 common frames omitted
    Can anyone help me figure out what I should do here? Thanks
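    The first GMS log line points at the frontend's system-client credentials not matching what GMS expects; a sketch of the env-var pair that has to be identical on both datahub-frontend and datahub-gms (values are placeholders):
    Copy code
    DATAHUB_SYSTEM_CLIENT_ID=__datahub_system
    DATAHUB_SYSTEM_CLIENT_SECRET=<same-secret-on-both-containers>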
    ✅ 1
    s
    g
    +2
    • 5
    • 7
  • a

    acceptable-rain-30599

    02/20/2023, 8:17 AM
    Hey guys! I'm having an issue right now. I'm using dbt core and trying to add the source to DataHub; I have installed DataHub on Docker and dbt core locally, both on the same virtual machine. Then this error came out. I'm very confused about why the folder can't be found, because they are literally on the same server... Has anyone else tried to ingest from dbt?
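    If the ingestion is kicked off from the DataHub UI, it runs inside the datahub-actions container, which cannot see the host filesystem; a hedged sketch of copying the dbt artifacts into that container (the container name varies, so check docker ps), or alternatively run datahub ingest from the host where the files live:
    Copy code
    docker ps --format '{{.Names}}' | grep actions                  # find the actions container name
    docker exec <actions-container> mkdir -p /tmp/dbt
    docker cp target/manifest.json <actions-container>:/tmp/dbt/
    docker cp target/catalog.json  <actions-container>:/tmp/dbt/
    # then point manifest_path / catalog_path in the recipe at /tmp/dbt/...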
    ✅ 1
    😭 1
    s
    • 2
    • 11
  • p

    powerful-cat-68806

    02/20/2023, 9:15 AM
    Hi team, I'm getting an nginx
    404
    error when trying to access DataHub publicly. I've recreated the ingress controller & re-assigned R53 to the new endpoint, but it's the same. What's the root cause here? I presume the blocker is between the ingress endpoint & the app itself.
    ✅ 1
    o
    • 2
    • 2
  • r

    rich-policeman-92383

    02/20/2023, 10:08 AM
    cli/datahub version: v0.9.5. Deleting platforms using the datahub CLI fails: 1. Bigquery (70.5k datasets): fails with "AssertionError: Did not delete all entities, please try running the command again". 2. Spark (19 entities): the CLI is unable to find the spark platform. Command used:
    Copy code
    datahub delete --hard -f -p "spark"
    datahub delete --hard -f -p "bigquery"
    a
    • 2
    • 3
  • a

    alert-printer-93847

    02/20/2023, 10:30 AM
    cross-posting https://datahubspace.slack.com/archives/CV2KB471C/p1676849695275449
    b
    • 2
    • 1
  • w

    wide-afternoon-79955

    02/20/2023, 2:41 PM
    Hey, I want to disable the
    Invite Users
    feature and disable user login using username and password; I just want OKTA-authenticated users. Is it possible? I tried disabling JAAS auth by setting
    AUTH_JAAS_ENABLED=false
    for
    datahub-frontend
    but it's not working.
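    For reference, a sketch of the datahub-frontend environment for OIDC-only login, using the variable names from the OIDC docs (values are placeholders; whether the username/password form disappears entirely also depends on the frontend version):
    Copy code
    AUTH_OIDC_ENABLED=true
    AUTH_OIDC_CLIENT_ID=<okta-client-id>
    AUTH_OIDC_CLIENT_SECRET=<okta-client-secret>
    AUTH_OIDC_DISCOVERY_URI=https://<okta-domain>/.well-known/openid-configuration
    AUTH_OIDC_BASE_URL=https://<your-datahub-url>
    AUTH_JAAS_ENABLED=false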
    ✅ 1
    b
    b
    • 3
    • 4
  • a

    alert-traffic-45034

    02/20/2023, 5:07 PM
    Hi all, may I know if anyone has encountered this after upgrading to v0.10.0?
    Copy code
    Validation error (FieldUndefined@[searchResultFields/datasetProfiles/sizeInBytes]) : Field 'sizeInBytes' in type 'DatasetProfile' is undefined (code undefined)
    thanks
    b
    b
    • 3
    • 9
  • a

    ambitious-shoe-92590

    02/21/2023, 1:33 AM
    Posting this here because it is probably a better channel for it. This might have been answered in the past, but I have a quick question regarding the
    Struct
    type with "pull" based ingestion. I have a nested field called
    data
    which contains a number of child key:values. When I ingest this data with DataHub, the resulting dataset has the data field, but it is not expandable. I've read up on field paths and the differences between v1 and v2 paths, but I am a bit confused about how to actually get to the point of being able to "expand" the nested struct. It seems like emitters are used in some examples, but from my understanding that is only if you want to manually add fields to the schema? Any help would be appreciated; the data is coming from an S3 source, if that makes a difference.
    b
    • 2
    • 16
  • g

    gray-ocean-32209

    02/21/2023, 8:09 AM
    Hi Team, we use the DataHub Airflow lineage plugin for the Airflow-DataHub integration. We are trying to emit lineage using DatahubEmitterOperator (see the Python script below), but it looks like it is not working: we do not see any lineage info in the DataJob MCE event, and the inlet and outlet fields are empty. Airflow ingestion log:
    Copy code
    [2023-02-21, 07:13:46 UTC] {_plugin.py:179} INFO - Emitted Start Datahub Dataprocess Instance: DataProcessInstance(id='redshift_lineage_emission_dag_emit_lineage_manual__2023-02-21T07:13:31.971338+00:00', urn=<datahub.utilities.urns.data_process_instance_urn.DataProcessInstanceUrn object at 0xffff9a989ac0>, orchestrator='***', cluster='prod', type='BATCH_AD_HOC', template_urn=<datahub.utilities.urns.data_job_urn.DataJobUrn object at 0xffff864bb790>, parent_instance=None, properties={'run_id': 'manual__2023-02-21T07:13:31.971338+00:00', 'duration': '0.375603', 'start_date': '2023-02-21 07:13:45.274840+00:00', 'end_date': '2023-02-21 07:13:45.650443+00:00', 'execution_date': '2023-02-21 07:13:31.971338+00:00', 'try_number': '1', 'hostname': '05badb3885d0', 'max_tries': '1', 'external_executor_id': 'None', 'pid': '3840758', 'state': 'success', 'operator': 'DatahubEmitterOperator', 'priority_weight': '1', 'unixname': 'default', 'log_url': '<http://localhost:8080/log?execution_date=2023-02-21T07%3A13%3A31.971338%2B00%3A00&task_id=emit_lineage&dag_id=redshift_lineage_emission_dag&map_index=-1>'}, url='<http://localhost:8080/log?execution_date=2023-02-21T07%3A13%3A31.971338%2B00%3A00&task_id=emit_lineage&dag_id=redshift_lineage_emission_dag&map_index=-1>', inlets=[], outlets=[], upstream_urns=[])
    It works with BashOperator when setting inlets and outlets; we can see the lineage in DataHub. lineage_emission_dag.py:
    Copy code
    """Lineage Emission
    
    This example demonstrates how to emit lineage to DataHub within an Airflow DAG.
    """
    
    from datetime import timedelta
    
    import datahub.emitter.mce_builder as builder
    from airflow import DAG
    #from airflow.providers.amazon.aws.operators.redshift import RedshiftSQLOperator
    from contrib.operators.redshift_sql_operator import RedshiftSQLOperator
    from airflow.utils.dates import days_ago
    
    from datahub_provider.operators.datahub import DatahubEmitterOperator
    global_vars = {
        'REDSHIFT_CONN_ID': 'redshift_prod'
    }
    
    default_args = {
        "owner": "de",
        "depends_on_past": False,
        "email": ["<mailto:de-team@zynga.com|de-team@zynga.com>"],
        "email_on_failure": False,
        "email_on_retry": False,
        "retries": 1,
        "retry_delay": timedelta(minutes=5),
        "execution_timeout": timedelta(minutes=120),
    }
    
    with DAG(
            "lineage_emission_dag",
            default_args=default_args,
            description="An example DAG demonstrating lineage emission within an Airflow DAG.",
            schedule_interval=timedelta(days=1),
            start_date=days_ago(2),
            catchup=False,
    ) as dag:
        # This example shows a RedshiftSQLOperator followed by a lineage emission with DatahubEmitterOperator
    
        sql = """select count(*) as tables
                from information_schema.tables
                where table_type = 'BASE TABLE'"""
        redshift_transformation_task = RedshiftSQLOperator(
            task_id="redshift_transformation",
            dag=dag,
            redshift_conn_id=global_vars['REDSHIFT_CONN_ID'],
            sql=sql
        )
        # same DatahubEmitterOperator can be used to emit lineage in any context.
        datahub_emit_lineage_task = DatahubEmitterOperator(
            task_id="emit_lineage",
            datahub_conn_id="datahub_rest_default",
            mces=[
                builder.make_lineage_mce(
                    upstream_urns=[
                        builder.make_dataset_urn("redshift", "warehouse.zbatch.tableA"),
                        builder.make_dataset_urn("redshift", "warehouse.zbatch.tableB"),
                    ],
                    downstream_urn=builder.make_dataset_urn(
                        "redshift", "warehouse.zbatch.tableC"
                    ),
                )
            ]
        )
    
        redshift_transformation_task >> datahub_emit_lineage_task
    It looks like DatahubEmitterOperator is not working with the latest release. Our setup: DataHub version 0.9.6, acryl-datahub 0.9.6, acryl-datahub-airflow-plugin 0.9.6, Airflow v2.3.4.
  • g

    gray-ocean-32209

    02/21/2023, 8:14 AM
    lineage_emission_dag_airflow.log
  • r

    rough-car-65301

    02/21/2023, 2:21 PM
    Hello team, good afternoon, hope everything is going fine. Quick question: I just upgraded the CLI to the latest version, and now every time I execute the emitter it shows me this error:
    Copy code
    ImportError: cannot import name 'make_avsc_object' from 'avro.schema'
    b
    • 2
    • 3
  • r

    rough-car-65301

    02/21/2023, 2:22 PM
    Could you please guide me on how I can solve this? Thanks in advance 🙂
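    A hedged diagnostic sketch; this particular ImportError is often traced to two conflicting Avro distributions (avro and avro-python3) ending up in the same environment after an upgrade, so checking and reinstalling can help:
    Copy code
    pip list | grep -i avro               # see which avro packages are installed
    pip uninstall -y avro avro-python3
    pip install --upgrade acryl-datahub   # re-pulls the avro dependency the CLI expects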
  • a

    acceptable-nest-20465

    02/21/2023, 4:17 PM
    Hello team, can someone guide me on how to ingest metadata from neo4j into DataHub, similar to Oracle?
    ✅ 1
    a
    o
    • 3
    • 3
  • a

    alert-traffic-45034

    02/21/2023, 6:10 PM
    May I know how to clean up the frontend metadata? The services are hosted on a k8s cluster with managed AWS RDS, MSK and OpenSearch. Even with a new RDS and MSK, and after deleting every
    datahub*
    index in OpenSearch, the front-page entries like
    domains
    ,
    platforms
    ,
    tags
    , etc., still exist. I am not sure where this info got cached / persisted. Thanks for guidance in advance.
    b
    • 2
    • 2
  • f

    fierce-electrician-85924

    02/22/2023, 10:16 AM
    I am trying to add fine-grained lineage at the datajob level. The lineage is something like dataset1.column1 ----------> datajob ---------------> dataset2.column2. This is the sample code for it:
    Copy code
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        DataJobInputOutputClass,
        FineGrainedLineageClass as FineGrainedLineage,
        FineGrainedLineageDownstreamTypeClass as FineGrainedLineageDownstreamType,
        FineGrainedLineageUpstreamTypeClass as FineGrainedLineageUpstreamType,
    )
    
    
    def datasetUrn(name):
        # Helper: build a dataset urn (platform/env chosen here for illustration).
        return builder.make_dataset_urn("spark", name, "DEV")
    
    
    def fldUrn(dataset, column):
        # Helper: build a schemaField urn for a column of a dataset.
        return builder.make_schema_field_urn(datasetUrn(dataset), column)
    
    
    fineGrainedLineages = [
        FineGrainedLineage(
            upstreamType=FineGrainedLineageUpstreamType.FIELD_SET,
            upstreams=[fldUrn("dataset1", "column1")],
            downstreamType=FineGrainedLineageDownstreamType.FIELD,
            downstreams=[fldUrn("dataset2", "column2")],
        )
    ]
    
    dataJobInputOutput = DataJobInputOutputClass(
        inputDatasets=[datasetUrn("dataset1")],
        outputDatasets=[datasetUrn("dataset2")],
        inputDatajobs=None,
        inputDatasetFields=[
            fldUrn("dataset1", "column1"),
        ],
        outputDatasetFields=[
            fldUrn("dataset2", "column2")
        ],
        fineGrainedLineages=fineGrainedLineages,
    )
    
    dataJobLineageMcp = MetadataChangeProposalWrapper(
        entityUrn=builder.make_data_job_urn("spark", "data-flow", "datajob1", "DEV"),
        aspect=dataJobInputOutput,
        changeType=ChangeTypeClass.UPSERT,
        entityType='datajob',
        aspectName='dataJobInputOutput'
    )
    
    # Create an emitter to the GMS REST API.
    emitter = DatahubRestEmitter("<http://localhost:8080>")
    
    # Emit metadata!
    result = emitter.emit_mcp(dataJobLineageMcp)
    but it's not showing the lineage in the UI; what could be the reason behind it? (I am using version 0.9.2)
    ✅ 1
    h
    • 2
    • 1
  • b

    big-fireman-42102

    02/22/2023, 12:20 PM
    Hi team, I have an error when deploying Datahub. I run
    datahub docker quickstart
    in the terminal and this error shows up
    Error response from daemon: driver failed programming external connectivity on endpoint datahub-gms (c7259513a4f59fb2e360eab2377b0ee38fce02b0f111b69e57eb22ad843d2411): Bind for 0.0.0.0:8080 failed: port is already allocated
    . I restarted Docker but I still get the same error. I am using a Mac M1. Does anyone have any idea why this happens and how I can fix it? Thanks in advance 🙂
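    A hedged sketch for finding what is already holding port 8080 before retrying (often a leftover container from an earlier quickstart, or another local service):
    Copy code
    lsof -i :8080                         # which local process owns the port
    docker ps --filter "publish=8080"     # any container already publishing 8080
    datahub docker nuke                   # optional: wipes the previous quickstart (and its data) before retrying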
    h
    • 2
    • 16