# troubleshoot
  • astonishing-dusk-99990 (04/10/2023, 9:13 AM)
    Hi, does anyone know how to assign a static IP to the DataHub frontend through an ingress when deploying with the Helm chart? Currently my YAML for the datahub-frontend looks like this:
    # Set up ingress to expose react front-end
      ingress:
        enabled: true
        podAnnotations:
          kubernetes.io/ingress.class: "gce-internal"
          kubernetes.io/ingress.regional-static-ip-name: "your-domain-name-internal-address"
        hosts:
        - host: your-domain-name
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: datahub-frontend
                    port:
                      name: http
          #path: /
          #redirectPaths: []
    
      service:
        type: NodePort # ClusterIP or NodePort
        port: 9002
        targetPort: http
        protocol: TCP
        name: http
        annotations:
          cloud.google.com/neg: '{"ingress": true}'
        # annotations:
        #   networking.gke.io/load-balancer-type: Internal
    Since we can't use the loadBalancerIP argument in the service section, is there any way to switch the DataHub frontend from a dynamic IP to a static IP when deploying with the Helm chart? Also, when I try to run a helm upgrade it always fails with an error like this:
    Error: UPGRADE FAILED: error validating "": error validating data: ValidationError(Ingress.spec.rules[0].http): missing required field "paths" in io.k8s.api.networking.v1.HTTPIngressRuleValue
    Does anyone know the problem and how to fix it? Note: image datahub v0.10.0.
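    For reference, a hedged sketch of values that fit the datahub-frontend chart's own ingress schema, where each hosts entry takes flat paths/redirectPaths lists and the chart renders the Ingress spec itself (the annotation key is annotations, not podAnnotations). The "missing required field paths" error is what you would expect when a raw Kubernetes Ingress spec is pasted under hosts. The static-IP name assumes an address reserved up front with gcloud compute addresses create:
    # Hedged sketch - assumes the standard datahub-frontend chart values layout
    datahub-frontend:
      ingress:
        enabled: true
        annotations:
          kubernetes.io/ingress.class: "gce-internal"
          # references a pre-reserved regional static IP, e.g.
          #   gcloud compute addresses create your-domain-name-internal-address --region <region> --subnet <subnet>
          kubernetes.io/ingress.regional-static-ip-name: "your-domain-name-internal-address"
        hosts:
          - host: your-domain-name
            paths:
              - /
    With the static IP bound at the ingress, the Service can stay NodePort and no loadBalancerIP argument is needed.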
    ๐Ÿ” 1
    ๐Ÿ“– 1
    โœ… 1
    l
    a
    +2
    • 5
    • 6
  • best-umbrella-88325 (04/10/2023, 12:03 PM)
    Hello Community! I'm trying to build the docker image for datahub-actions after making a few changes. I've created the image using the command
    docker build -f docker/datahub-actions/Dockerfile . --no-cache
    as mentioned in the documentation. Once I use this in my helm chart, I get the following error from the actions pod:
    2023/04/10 11:59:06 Waiting for: http://datahub-datahub-gms:8080/health
    2023/04/10 11:59:06 Received 200 from http://datahub-datahub-gms:8080/health
    2023/04/10 11:59:06 Error starting command: `/start_datahub_actions.sh` - fork/exec /start_datahub_actions.sh: no such file or directory
    Can someone help me with this? Thanks in advance.
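    A quick way to narrow this down, as a hedged sketch (the image tag is hypothetical): fork/exec ... no such file or directory usually means the file is missing from the image, declares a shebang interpreter that does not exist in the image, or has CRLF line endings, so inspect the script inside the built image directly:
    # Build with a tag, then inspect the script inside the image (tag is hypothetical)
    docker build -f docker/datahub-actions/Dockerfile -t my-datahub-actions:dev .
    docker run --rm --entrypoint sh my-datahub-actions:dev \
      -c 'ls -l /start_datahub_actions.sh && head -n 1 /start_datahub_actions.sh'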
  • victorious-planet-2053 (04/10/2023, 1:19 PM)
    Hi! Please tell me, how do I delete objects that were added by ingestion? On the filter page I see "This action is not supported for the selected types."
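    When the UI bulk action is unsupported for a type, the CLI delete usually still works; a hedged sketch (the platform is a placeholder), previewing with a dry run first:
    # Preview what would be removed, then drop --dry-run to soft-delete
    datahub delete --entity_type dataset --platform <platform> --soft --dry-run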
  • handsome-football-66174 (04/10/2023, 5:21 PM)
    Hi team, I'm trying to use the OpenAPI /entities endpoint to ingest metadata. Going through the documentation, it looks like we can ingest one metadata aspect at a time, like SchemaMetadata. If we need to add tags etc. to the datasets, will those need to be ingested separately?
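    Yes, each aspect is typically its own write. As a hedged sketch of what a separate tags write can look like, using the Python REST emitter rather than raw OpenAPI (the server address, platform, and dataset name are placeholders):
    from datahub.emitter.mce_builder import make_dataset_urn, make_tag_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import GlobalTagsClass, TagAssociationClass

    # Placeholder GMS address; point at your deployment
    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    # A globalTags aspect emitted on its own, after the SchemaMetadata write
    tags_aspect = GlobalTagsClass(tags=[TagAssociationClass(tag=make_tag_urn("pii"))])
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=make_dataset_urn("hive", "db.table", "PROD"),
            aspect=tags_aspect,
        )
    )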
  • proud-printer-88070 (04/11/2023, 3:15 AM)
    Hello DataHub, I am getting an error when I try to ingest a file into DataHub GMS via the CLI. It seems that the issue is related to configuration (it's the first time we are trying to do this). The command I am issuing is:
    python3 -m datahub ingest -c source.yml
    The log is attached as cli-error-log.txt.
    my .datahubenv looks something like this:
    gms:
      server: https://<<<gms-host>>>.us-east-1.elb.amazonaws.com:8080
      token: <<<token>>>
    And I can curl the following URL successfully:
    curl http://<<<gms-host>>>.us-east-1.elb.amazonaws.com:8080/config
    {
      "models" : { },
      "patchCapable" : true,
      "versions" : {
        "linkedin/datahub" : {
          "version" : "v0.10.0",
          "commit" : "cf1e627e55431fc69d72918b2bcc3c5f3a1d5002"
        }
      },
      "managedIngestion" : {
        "defaultCliVersion" : "0.10.0",
        "enabled" : true
      },
      "statefulIngestionCapable" : true,
      "supportsImpactAnalysis" : true,
      "telemetry" : {
        "enabledCli" : true,
        "enabledIngestion" : false
      },
      "datasetUrnNameCasing" : false,
      "retention" : "true",
      "datahub" : {
        "serverType" : "prod"
      },
      "noCode" : "true"
    }
    I looked at this post: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#your-proxy-appears-to-only-use-http-and-not-https In my setup, no HTTP_PROXY or HTTPS_PROXY env vars are set. The error happens when trying to access the /config endpoint and says
    try changing your proxy URL to be HTTP
    GMS is installed in a Kubernetes pod in a production environment, and we are on a VPN while running the above commands. Thanks!
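    One thing that stands out: the curl that succeeds uses http://, while the .datahubenv above points https:// at port 8080, and urllib3 raises exactly that proxy-sounding error when an https:// URL reaches a plain-HTTP endpoint. A hedged sketch of the matching config:
    # Hedged sketch - scheme matches the working curl (plain HTTP on 8080)
    gms:
      server: http://<<<gms-host>>>.us-east-1.elb.amazonaws.com:8080
      token: <<<token>>>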
  • mysterious-scooter-52411 (04/11/2023, 7:27 AM)
    ./gradlew quickstart takes more than 30 minutes to execute. Is this normal? Is there a way to make it faster?
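    A cold first build being slow is expected; repeat runs can be trimmed with standard Gradle switches, as a hedged sketch:
    # Skip tests and reuse cached outputs on repeat builds
    ./gradlew quickstart -x test --parallel --build-cache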
  • colossal-waitress-83487 (04/11/2023, 10:51 AM)
    Hello DataHub, how can I query all ingestion sources, using GraphQL or other means?
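    One option is the GraphQL query behind the managed-ingestion UI; a hedged sketch, with field names as in recent GraphQL schemas:
    query listIngestionSources {
      listIngestionSources(input: { start: 0, count: 100 }) {
        start
        count
        total
        ingestionSources {
          urn
          name
          type
          schedule {
            interval
            timezone
          }
        }
      }
    }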
    ๐Ÿ” 1
    ๐Ÿ“– 1
    l
    a
    • 3
    • 3
  • elegant-salesmen-99143 (04/11/2023, 12:12 PM)
    Hi all. We recently upgraded our stage environment from 0.9.6.1 to 0.10.1, and after that it seems like entities that had been soft-deleted are appearing again as if they'd never been deleted. Any idea what might have caused that, and how can we get them back to being soft-deleted? We're using Kubernetes and the DataHub Helm chart, and the restore-indices job has run successfully.
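    If they need to be re-soft-deleted in the meantime, the CLI can do it per urn; a hedged sketch with a placeholder urn:
    # Placeholder urn - repeat (or script) for each entity that should stay hidden
    datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:hive,db.table,PROD)" --soft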
    ๐Ÿ” 1
    ๐Ÿ“– 1
    l
    a
    +2
    • 5
    • 15
  • eager-animal-48107 (04/11/2023, 4:27 PM)
    Hi team, we are getting the following error when we try to ingest from Iceberg.
  • eager-animal-48107 (04/11/2023, 4:28 PM)
    ERROR: could not serialize access due to concurrent update  Call getNextException to see other errors in the batch.
    	at org.postgresql.jdbc.BatchResultHandler.handleError(BatchResultHandler.java:165)
    	at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2366)
    	at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:559)
    	at org.postgresql.jdbc.PgStatement.internalExecuteBatch(PgStatement.java:887)
    	at org.postgresql.jdbc.PgStatement.executeBatch(PgStatement.java:910)
    	at org.postgresql.jdbc.PgPreparedStatement.executeBatch(PgPreparedStatement.java:1649)
    	at io.ebean.datasource.delegate.PreparedStatementDelegator.executeBatch(PreparedStatementDelegator.java:357)
    	at io.ebeaninternal.server.persist.BatchedPstmt.executeAndCheckRowCounts(BatchedPstmt.java:130)
    	at io.ebeaninternal.server.persist.BatchedPstmt.executeBatch(BatchedPstmt.java:97)
    	at io.ebeaninternal.server.persist.BatchedPstmtHolder.flush(BatchedPstmtHolder.java:124)
    	at io.ebeaninternal.server.persist.BatchControl.flushPstmtHolder(BatchControl.java:206)
    	at io.ebeaninternal.server.persist.BatchControl.executeNow(BatchControl.java:220)
    	at io.ebeaninternal.server.persist.BatchedBeanHolder.executeNow(BatchedBeanHolder.java:100)
    	at io.ebeaninternal.server.persist.BatchControl.flush(BatchControl.java:271)
    	at io.ebeaninternal.server.persist.BatchControl.flush(BatchControl.java:227)
    	at io.ebeaninternal.server.transaction.JdbcTransaction.batchFlush(JdbcTransaction.java:678)
    	... 101 common frames omitted
    Caused by: org.postgresql.util.PSQLException: ERROR: could not serialize access due to concurrent update
    	at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2675)
    	at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2365)
    	... 115 common frames omitted
  • flat-engineer-75197 (04/11/2023, 5:26 PM)
    👋 Is there a way to pull all glossary terms via the Python SDK? The closest thing I've seen is this, but it's entity-specific; I want to grab ALL terms. https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L206
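    A hedged sketch of one way to get them all with the same client, assuming a CLI version that ships DataHubGraph.get_urns_by_filter (a search over all glossary term entities rather than one entity's terms; the server address is a placeholder):
    from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

    # Placeholder GMS address
    graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

    # Iterate every glossary term urn in the instance
    for urn in graph.get_urns_by_filter(entity_types=["glossaryTerm"]):
        print(urn)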
  • cuddly-butcher-39945 (04/11/2023, 7:10 PM)
    Hi team, I'm experiencing this issue in my Docker Desktop environment on a Mac M1. @brainy-tent-14503 I'm posting the image to see if this can help, based on our conversation this morning. Thanks!!
  • best-eve-12546 (04/11/2023, 9:24 PM)
    Hi y'all, not sure if I missed any documentation, but I'm trying to use datahub delete to delete datasets with a specific schema. Looking at https://datahubproject.io/docs/how/delete-metadata/ it looks like it supports a query operator, but I couldn't figure out exactly how to use it. I.e. I'm trying to do something like:
    datahub delete --entity_type dataset --env PROD --query "thisschema"
    To delete
    urn:li:dataset:(urn:li:dataPlatform:platform,thisschema.table1,PROD)
    urn:li:dataset:(urn:li:dataPlatform:platform,thisschema.table2,PROD)
    but NOT
    urn:li:dataset:(urn:li:dataPlatform:platform,wrong_schema.thisschema,PROD)
    The query operator seems to match all 3, since the target string appears in the table name too. Is this possible?
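    Since --query matches anywhere in the name rather than anchoring on the schema, a safer pattern is to preview with a dry run and then delete the exact urns (a hedged sketch):
    # Show what would match before deleting anything
    datahub delete --entity_type dataset --env PROD --query "thisschema" --dry-run
    # Then delete exact matches individually by urn
    datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:platform,thisschema.table1,PROD)"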
  • incalculable-zebra-69091 (04/12/2023, 3:55 AM)
    Hi team, I'm trying to run datahub docker quickstart --version=v0.10.1 (datahub version 0.10.1), but when I sign in (GUI) I get errors on /track and /login. Checking the logs of the datahub-frontend-react container, I see "[kafka-producer-network-thread | datahub-frontend] WARN o.apache.kafka.clients.NetworkClient - [Producer clientId=datahub-frontend] Connection to node -1 (broker/172.18.0.6:29092) could not be established. Broker may not be available", and datahub-gms has errors too. What do I need to do to be able to sign in?
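    Those /track and /login failures usually trace back to the Kafka broker container in the quickstart stack; a hedged first check before signing in again:
    # Is the broker container up, and what do its recent logs say?
    docker ps --filter name=broker
    docker logs broker 2>&1 | tail -n 50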
    ๐Ÿ” 1
    plus1 1
    ๐Ÿ“– 1
    ๐Ÿค’ 1
    l
    f
    +2
    • 5
    • 35
  • able-city-76673 (04/12/2023, 6:01 AM)
    https://datahubspace.slack.com/archives/CV2UVAPPG/p1681279205072399
  • microscopic-room-90690 (04/12/2023, 6:59 AM)
    Hi team, I use v0.9.6.1 in the dev environment and v0.8.43 in prod, with about 3000 tables in dev and 5000 tables in prod from a Hive source. Ingesting the metadata into DataHub takes about 1h in dev but more than 5 days in prod. I'm wondering what causes the huge difference. Does it have anything to do with the version, and how should I troubleshoot?
    [2023-03-31 14:47:57,830] INFO     {datahub.cli.ingest_cli:170} - DataHub CLI version: 0.8.43
    [2023-04-04 10:50:44,875] INFO     {datahub.cli.ingest_cli:137} - Finished metadata ingestion
    Command exiting with ret '0'
  • few-carpenter-93837 (04/12/2023, 9:54 AM)
    Hi, can anyone confirm that they have successfully gotten the new project_patterns to work with the DataHub Tableau integration (using the allow/deny configuration in the recipe)?
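    For comparison, a hedged sketch of the shape I'd expect the new pattern to take (an allow/deny list of regexes; connect_uri and project names are placeholders, and the exact key name should be checked against your CLI version's Tableau source docs):
    source:
      type: tableau
      config:
        connect_uri: https://tableau.example.com   # placeholder
        project_pattern:
          allow:
            - "^Analytics$"
          deny:
            - "^Sandbox.*"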
  • steep-fountain-54482 (04/12/2023, 11:04 AM)
    Hi, I'm getting this error when trying to capture lineage on a project ... it fails before my dispatcher is even called.
  • steep-fountain-54482 (04/12/2023, 11:04 AM)
    23/04/12 10:29:22 ERROR SplineAgent: Unexpected error occurred during lineage processing for application: launcher #00f9a8uvf3tjqt09
    java.lang.IllegalStateException: WithField.dataType should not be called.
  • bland-orange-13353 (04/12/2023, 12:16 PM)
    This message was deleted.
  • bland-orange-13353 (04/12/2023, 12:23 PM)
    This message was deleted.
  • wide-afternoon-79955 (04/12/2023, 4:25 PM)
    Hi all, I am trying to push GMS pod logs to a mounted location, hence:
    datahub-gms:
      extraEnvs:
        - name: LOG_DIR
          value: /tmp/datahub-gms/log/
    but the logback config does not seem to pick up the LOG_DIR env var.
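    If the logback config won't honor LOG_DIR, an alternative is to mount the pod's log directory out via the chart's generic volume hooks; a hedged sketch (the hostPath is a placeholder for whatever volume type you use):
    datahub-gms:
      extraEnvs:
        - name: LOG_DIR
          value: /tmp/datahub-gms/log
      extraVolumes:
        - name: gms-logs
          hostPath:
            path: /mnt/datahub-gms-logs   # placeholder mount source
      extraVolumeMounts:
        - name: gms-logs
          mountPath: /tmp/datahub-gms/log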
  • hallowed-lizard-92381 (04/12/2023, 6:20 PM)
    I'm seeing inconsistencies between the results returned by a GraphQL call initiated from the webapp and those returned when executing the same query from igraphql. For example, the web frontend shows 'no role' for these two users, but the GraphQL response shows the 'Admin' role. Anyone have a similar experience or a recommendation?
    ๐Ÿ” 1
    ๐Ÿ“– 1
    l
    a
    • 3
    • 4
  • cuddly-butcher-39945 (04/12/2023, 6:56 PM)
    Hello everyone. I've experienced an issue with Snowflake ingestion failing when it used to work. Here are the details:
    Environment: Kubernetes deployment on AWS; DataHub CLI version 0.9.5; Python 3.7.10 (default, Jun 3 2021) [GCC 7.3.1 20180712 (Red Hat 7.3.1-13)]
    Ingestion method: I am trying both CLI and UI ingestion of my Snowflake environment.
    Error: datahub.ingestion.run.pipeline.PipelineInitError: Failed to find a registered source for type snowflake: snowflake is disabled
    Debugging step: datahub check plugins --verbose -> snowflake (disabled) ModuleNotFoundError("No module named 'great_expectations.datasource.sqlalchemy_datasource'")
    Debugging step: pip3 list | grep SQLAlchemy -> Flask-SQLAlchemy 2.5.1, SQLAlchemy 1.4.40, SQLAlchemy-JSONField 1.0.0, SQLAlchemy-Utils 0.38.3
    Debugging step: pip3 install 'acryl-datahub[sqlalchemy]' -> Requirement already satisfied: acryl-datahub[sqlalchemy] in /home/joshua.garza/.local/lib/python3.7/site-packages (0.9.5)
    Debugging step: pip3 install --upgrade great_expectations -> Requirement already satisfied: great_expectations in /home/joshua.garza/.local/lib/python3.7/site-packages (0.16.6)
    Debugging step: datahub check plugins --verbose -> snowflake (disabled) ModuleNotFoundError("No module named 'great_expectations.datasource.sqlalchemy_datasource'")
    Not sure what else to do here. Thanks in advance!
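    One hedged avenue, assuming great_expectations 0.16 removed the great_expectations.datasource.sqlalchemy_datasource module path that the 0.9.5-era plugin imports (the version bound below is an assumption, not a confirmed pin):
    # Assumption: the module path still exists below GE 0.16
    pip3 install 'great_expectations<0.16'
    pip3 install 'acryl-datahub[snowflake]==0.9.5'
    datahub check plugins --verbose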
    ๐Ÿ” 1
    ๐Ÿ“– 1
    โœ… 1
    l
    a
    +2
    • 5
    • 18
  • elegant-salesmen-99143 (04/12/2023, 8:06 PM)
    I have a working API query that gets me the name of a container and the number of entities in it. But I also want to get the description of the container (aka its Documentation). How do I get it? I've tried putting description under name in the query, but it returns null, even though the documentation is not empty for this container. Is it called something different? A property named 'documentation' is not found.
    {
      container(urn: "urn:li:container:XXX") {
        properties {
          name
          description
        }
        entities {
          total
          start
        }
      }
    }
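    UI-authored documentation is stored separately from ingested properties, so it is worth also requesting editableProperties; a hedged sketch:
    {
      container(urn: "urn:li:container:XXX") {
        properties {
          name
          description
        }
        editableProperties {
          description
        }
        entities {
          total
          start
        }
      }
    }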
  • microscopic-room-90690 (04/13/2023, 3:37 AM)
    Hi team, when ingesting Hive metadata into DataHub (v0.9.6.1), the execution log confuses me. It shows 42 tables ingested in about 1 min, while it takes 8 min to ingest another 3 tables! Can anyone help?
    source:
      type: hive
      config:
        host_port: localhost:10000
        database_alias: hive
        schema_pattern:
          allow: ["^web_hudi$"]
            
    sink:
      type: "datahub-rest"
      config:
        server: ${datahub_server}
        token: ${token}
  • busy-analyst-35820 (04/13/2023, 3:57 AM)
    Hi team, can anyone help us with this? https://datahubspace.slack.com/archives/C029A3M079U/p1680678159704919
  • better-fireman-33387 (04/13/2023, 8:41 AM)
    Hi all, I am using DataHub with the Helm deployment and was moving it to use our own Elasticsearch instance (ver 7.17.3). Though it's working, I'm getting some errors (inside the thread). Also, I can't see that any index template was created, and my DataHub usage-event index name is datahub_datahub_usage_event (I set the datahub prefix for all indices). Could anyone assist please?
  • bland-orange-13353 (04/13/2023, 10:28 AM)
    This message was deleted.
  • future-holiday-32084 (04/13/2023, 10:30 AM)
    Hi folks, I'm new to DataHub. When using DataHub Spark Lineage (io.acryl:datahub-spark-lineage:0.10.1-1) with a Spark job, it ingests lineage perfectly. However, in the MySQL DataHub database the "createdby" field shows "urn:li:corpuser:__datahub_system", and as a result I cannot remove the lineage manually through the DataHub UI. Could anyone please provide a solution? Additionally, when executing this write command:
    spark.sql("select * from <database>.<table_source>").write.mode("append").format("parquet").saveAsTable("<database>.<table_sink>")
    The lineage, as shown in the image below, has been inferred perfectly for the sink table. However, the source table displays the location on my Hadoop Data Lake, even though I'm reading from a table, not a path.