ambitious-guitar-89068
02/17/2022, 5:07 AM
brave-market-65632
02/17/2022, 5:44 AM
[2022-02-17 10:54:02,004] ERROR {datahub.entrypoints:119} - Stackprinter failed while formatting <FrameInfo /usr/local/lib/python3.9/site-packages/datahub/ingestion/source/sql/sql_common.py, line 221, scope SQLAlchemyConfig>:
File "/usr/local/lib/python3.9/site-packages/stackprinter/frame_formatting.py", line 224, in select_scope
raise Exception("Picked an invalid source context: %s" % info)
Exception: Picked an invalid source context: [221], [192], dict_keys([192, 193])
So here is your original traceback at least:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/datahub/cli/ingest_cli.py", line 77, in run
pipeline = Pipeline.create(pipeline_config, dry_run, preview)
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 175, in create
return cls(config, dry_run=dry_run, preview_mode=preview_mode)
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 120, in __init__
source_class = source_registry.get(source_type)
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/api/registry.py", line 126, in get
tp = self._ensure_not_lazy(key)
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/api/registry.py", line 84, in _ensure_not_lazy
plugin_class = import_path(path)
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/api/registry.py", line 32, in import_path
item = importlib.import_module(module_name)
File "/usr/local/Cellar/python@3.9/3.9.9/Frameworks/Python.framework/Versions/3.9/lib/python3.9/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 850, in exec_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/source/sql/snowflake.py", line 29, in <module>
from datahub.ingestion.source.sql.sql_common import (
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/source/sql/sql_common.py", line 206, in <module>
class SQLAlchemyConfig(StatefulIngestionConfigBase):
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/source/sql/sql_common.py", line 221, in SQLAlchemyConfig
from datahub.ingestion.source.ge_data_profiler import GEProfilingConfig
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/source/ge_data_profiler.py", line 27, in <module>
from great_expectations.core.util import convert_to_json_serializable
File "/usr/local/lib/python3.9/site-packages/great_expectations/__init__.py", line 7, in <module>
from great_expectations.data_context import DataContext
File "/usr/local/lib/python3.9/site-packages/great_expectations/data_context/__init__.py", line 1, in <module>
from .data_context import BaseDataContext, DataContext, ExplorerDataContext
File "/usr/local/lib/python3.9/site-packages/great_expectations/data_context/data_context.py", line 23, in <module>
from great_expectations.rule_based_profiler.config.base import (
File "/usr/local/lib/python3.9/site-packages/great_expectations/rule_based_profiler/__init__.py", line 1, in <module>
from .rule_based_profiler import RuleBasedProfiler
File "/usr/local/lib/python3.9/site-packages/great_expectations/rule_based_profiler/rule_based_profiler.py", line 16, in <module>
from great_expectations.rule_based_profiler.domain_builder.domain_builder import (
File "/usr/local/lib/python3.9/site-packages/great_expectations/rule_based_profiler/domain_builder/__init__.py", line 1, in <module>
from .domain_builder import DomainBuilder # isort:skip
File "/usr/local/lib/python3.9/site-packages/great_expectations/rule_based_profiler/domain_builder/domain_builder.py", line 6, in <module>
from great_expectations.rule_based_profiler.types import (
File "/usr/local/lib/python3.9/site-packages/great_expectations/rule_based_profiler/types/__init__.py", line 3, in <module>
from .domain import ( # isort:skip
File "/usr/local/lib/python3.9/site-packages/great_expectations/rule_based_profiler/types/domain.py", line 8, in <module>
from great_expectations.execution_engine.execution_engine import MetricDomainTypes
File "/usr/local/lib/python3.9/site-packages/great_expectations/execution_engine/__init__.py", line 4, in <module>
from .sqlalchemy_execution_engine import SqlAlchemyExecutionEngine
File "/usr/local/lib/python3.9/site-packages/great_expectations/execution_engine/sqlalchemy_execution_engine.py", line 97, in <module>
import pybigquery.sqlalchemy_bigquery
File "/usr/local/lib/python3.9/site-packages/pybigquery/sqlalchemy_bigquery.py", line 32, in <module>
from google.cloud.bigquery import dbapi
File "/usr/local/lib/python3.9/site-packages/google/cloud/bigquery/__init__.py", line 35, in <module>
from google.cloud.bigquery.client import Client
File "/usr/local/lib/python3.9/site-packages/google/cloud/bigquery/client.py", line 70, in <module>
from google.cloud.bigquery import _pandas_helpers
File "/usr/local/lib/python3.9/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 67, in <module>
from google.cloud.bigquery import schema
File "/usr/local/lib/python3.9/site-packages/google/cloud/bigquery/schema.py", line 20, in <module>
from google.cloud.bigquery_v2 import types
File "/usr/local/lib/python3.9/site-packages/google/cloud/bigquery_v2/__init__.py", line 18, in <module>
from .types.encryption_config import EncryptionConfiguration
File "/usr/local/lib/python3.9/site-packages/google/cloud/bigquery_v2/types/__init__.py", line 16, in <module>
from .encryption_config import EncryptionConfiguration
File "/usr/local/lib/python3.9/site-packages/google/cloud/bigquery_v2/types/encryption_config.py", line 26, in <module>
class EncryptionConfiguration(proto.Message):
File "/usr/local/lib/python3.9/site-packages/proto/message.py", line 200, in __new__
file_info = _file_info._FileInfo.maybe_add_descriptor(filename, package)
File "/usr/local/lib/python3.9/site-packages/proto/_file_info.py", line 42, in maybe_add_descriptor
descriptor=descriptor_pb2.FileDescriptorProto(
TypeError: descriptor to field 'google.protobuf.FileDescriptorProto.name' doesn't apply to 'FileDescriptorProto' object
witty-butcher-82399
02/17/2022, 9:52 AM
table_pattern
does not prevent them from being profiled unless they are also filtered out (denied) in the profile_pattern
too. Is that the expected behaviour? Shouldn’t tables denied for ingestion also be denied for profiling?
If I’m not wrong, this is noted here https://github.com/linkedin/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/sql/sql_common.py#L1089-L1092 where table pattern is not included.
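To illustrate, a recipe sketch of the behaviour described above, where the same deny rule has to be repeated under profile_pattern for those tables to actually be skipped by profiling (the source type and table regex below are just for illustration):
source:
  type: redshift                 # hypothetical; any SQL source with profiling enabled
  config:
    table_pattern:
      deny:
        - '.*\.tmp_.*'           # hypothetical temp tables, excluded from ingestion
    profile_pattern:
      deny:
        - '.*\.tmp_.*'           # has to be repeated here, otherwise the tables are still profiled
    profiling:
      enabled: true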
Thanks!
abundant-lizard-52842
02/17/2022, 10:28 AM
brave-secretary-27487
02/17/2022, 4:40 PM
datahub-frontend:
extraEnvs:
- name: METADATA_SERVICE_AUTH_ENABLED
value: "true"
datahub:
metadata_service_authentication:
enabled: true
datahub-gms:
extraEnvs:
- name: METADATA_SERVICE_AUTH_ENABLED
value: "true"
datahub:
metadata_service_authentication:
enabled: true
global:
datahub:
metadata_service_authentication:
enabled: true
systemClientId: "__datahub_system"
systemClientSecret:
secretRef: "datahub-auth-secrets"
secretKey: "token_service_signing_key"
tokenService:
signingKey:
secretRef: "datahub-auth-secrets"
secretKey: "token_service_signing_key"
# Set to false if you'd like to provide your own auth secrets
provisionSecrets: true
I'm not sure which ENV variable to overwrite, as there is a reference in global and in both the front-end and GMS.
I'm also unsure how to retrieve systemClientId under the global key.
I have deployed this with helm but I was still able to make unauthenticated requests.
I have no clue what I am missing.
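For reference, a minimal sketch of the global-only block being discussed here; the assumption (worth verifying against your chart version) is that the subcharts derive METADATA_SERVICE_AUTH_ENABLED from this global value, which would make the per-component extraEnvs and datahub blocks above redundant:
global:
  datahub:
    metadata_service_authentication:
      enabled: true
      systemClientId: "__datahub_system"     # default system client id, as in the values pasted above
      systemClientSecret:
        secretRef: "datahub-auth-secrets"
        secretKey: "token_service_signing_key"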
numerous-eve-42142
02/17/2022, 7:20 PM
source:
  type: redshift
  config:
    # Coordinates
    host_port: *******
    database: *******
    # Credentials
    username: ******
    password: ******
    # Options
    include_tables: True
    include_table_lineage: True
    include_views: False
    # Authorization
    schema_pattern:
      allow:
        - "wine_stg"
    profiling:
      enabled: true
      include_field_null_count: true
      include_field_min_value: true
      include_field_max_value: true
      include_field_mean_value: true
      include_field_median_value: false
      include_field_histogram: false
sink:
  # sink configs
  type: "datahub-rest"
  config:
    server: "******"
glamorous-house-64036
02/18/2022, 12:07 AM
ERROR: for mysql Cannot start service mysql: error while creating mount source path '/mysql/init.sql': mkdir /mysql: read-only file system
ERROR: Encountered errors while bringing up the project.
damp-minister-31834
02/18/2022, 3:40 AM
few-air-56117
02/18/2022, 9:19 AM
Runs at 02:20 am (America/New_York)
Now, after 2 hours, the UI makes it look like it is still running, so I checked the logs and they say it's done:
(acryl-datahub-actions log, 2022-02-18 09:35:26.496 EET)
'tables_scanned': 323,
'views_scanned': 337,
'entities_profiled': 327,
'filtered': [],
'soft_deleted_stale_entities': [],
'query_combiner': {'total_queries': 10107,
'uncombined_queries_issued': 4765,
'combined_queries_issued': 603,
'queries_combined': 6298,
'query_exceptions': 11}}
Sink (datahub-rest) report:
{'records_written': 775,
'warnings': [],
'failures': [],
'downstream_start_time': datetime.datetime(2022, 2, 18, 7, 22, 37, 552032),
'downstream_end_time': datetime.datetime(2022, 2, 18, 7, 35, 23, 470652),
'downstream_total_latency_in_seconds': 765.91862}
{}
Pipeline finished with failures
brave-secretary-27487
02/18/2022, 2:36 PM
Failed to install PyPI packages. black 22.1.0 has requirement click>=8.0.0, but you have click 7.1.2.
Check the Cloud Build log at https://console.cloud.google.com/cloud-build/builds/0xxxxxxxx?project=xxxxxxxx for details. For detailed instructions see https://cloud.google.com/composer/docs/troubleshooting-package-installation
breezy-portugal-43538
02/18/2022, 2:53 PM
[2022-02-18 14:30:10,541] ERROR {logger:26} - Please set env variable SPARK_VERSION
JAVA_HOME is not set
[2022-02-18 14:30:10,896] ERROR {datahub.entrypoints:119} - File "/usr/local/lib/python3.8/site-packages/datahub/cli/ingest_cli.py", line 77, in run
67 def run(config: str, dry_run: bool, preview: bool, strict_warnings: bool) -> None:
(...)
73 pipeline_config = load_config_file(config_file)
74
75 try:
76 logger.debug(f"Using config: {pipeline_config}")
--> 77 pipeline = Pipeline.create(pipeline_config, dry_run, preview)
78 except ValidationError as e:
[...]
Because of this exception I am unable to use the ingestion process properly for my own metadata setup. Could you help resolve this issue?
1. In order to reproduce, please use the following yml (example.yml) file:
source:
  type: data-lake
  config:
    env: "PROD"
    platform: "dataLake"
    base_path: "load"
    profiling:
      enabled: true
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
2. Then simply run the following command from within the metadata-ingestion folder:
./metadata-ingestion/scripts/datahub_docker.sh ingest -c example.yml
If needed, I can also paste the whole error log from the failing operation.
numerous-application-54063
02/18/2022, 3:19 PM
protoPayload.serviceName="bigquery.googleapis.com"
I do get logs.
But if I add the second condition taken from the connector, no logs are found:
protoPayload.methodName="jobservice.jobcompleted"
Instead in my logs the method name is formatted like this:
methodName: "google.cloud.bigquery.v2.JobService.InsertJob"
Any advice on this one?
Thanks!
cuddly-engine-66252
02/20/2022, 11:44 AM
prompt=select_account
authentication URL parameter in docker.env for the frontend-react container? So that if a non-company account is selected, it would be possible to choose it and not face a 403. Thank you.
cuddly-engine-66252
02/20/2022, 12:11 PM
https://{frontend_link}/user/urn:li:corpuser:{user}/assets
(new ones do not appear, deleted ones are not deleted)
v0.8.26
UPD: After 10-15 minutes, the data is finally updated. But not immediately.
damp-minister-31834
02/21/2022, 7:01 AM
gifted-piano-21322
02/21/2022, 8:30 AM
high-hospital-85984
02/21/2022, 9:33 AM
high-hospital-85984
02/21/2022, 9:55 AM
07:10:21.041 [qtp544724190-10] INFO c.l.metadata.entity.EntityService:681 - INGEST urn urn:li:chart:(looker,dashboard_elements.21845) with system metadata {lastObserved=1645427421035, runId=looker-2022_02_21-07_10_08}
07:10:21.053 [qtp544724190-10] INFO c.l.m.filter.RestliLoggingFilter:56 - POST /entities?action=ingest - ingest - 500 - 12ms
07:10:21.053 [qtp544724190-10] ERROR c.l.m.filter.RestliLoggingFilter:38 - com.datahub.util.exception.RetryLimitReached: Failed to add after 3 retries
Is this a problem with the GMS database?
boundless-student-48844
02/21/2022, 12:10 PM
MLFeatureV2
to support our ML discoverability use cases). I’ve done the changes on Pegasus, GraphQL, Java and React and it has successfully built with ./gradlew build.
I tried to ingest to gms and it worked. The aspect values are successfully stored in MySQL (as seen in screenshot).
However, when I navigate to the UI, the getSearchResults
GraphQL request (second screenshot) for the new entity returns the error below.
{
"errors": [
{
"message": "The field at path '/search/searchResults[0]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value. The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'",
"path": [
"search",
"searchResults",
0,
"entity"
],
"extensions": {
"classification": "NullValueInNonNullableField"
}
}
],
"data": {
"search": null
}
}
Do you know what’s missing here?
strong-iron-17184
02/21/2022, 2:46 PM
alert-teacher-6920
02/21/2022, 9:34 PM
bland-orange-13353
02/22/2022, 11:04 AM
hallowed-gpu-49827
02/22/2022, 1:44 PM
METADATA_SERVICE_AUTH_ENABLED=true
to frontend AND gms. I restarted them and logged out/in again.
I receive this error in gms logs showing that it’s not sending the token over as it should:
com.datahub.authentication.AuthenticationException: Failed to authenticate inbound request: Authorization header is missing 'Basic' prefix.
strong-iron-17184
02/22/2022, 2:50 PM
alert-teacher-6920
02/22/2022, 9:53 PM
ancient-pillow-45716
02/23/2022, 7:35 AM
few-air-56117
02/23/2022, 7:53 AM
helm install datahub datahub/datahub -f helm_custom_settings_custom_helm.yaml
but I have this error:
Error: INSTALLATION FAILED: failed pre-install: timed out waiting for the condition
salmon-area-51650
02/23/2022, 12:24 PM
values.yaml?
Thanks in advance!
mysterious-butcher-86719
02/23/2022, 2:25 PM
mysterious-butcher-86719
02/23/2022, 2:51 PM
.
. e.g. relational datasets are usually named as <db>.<schema>.<table>
, except for platforms like MySQL which do not have the concept of a `schema`; as a result MySQL datasets are named <db>.<table>
. In cases where the specific platform can have multiple instances (e.g. there are multiple different instances of MySQL databases that have different data assets in them), names can also include instance ids, making the general pattern for a name <platform_instance>.<db>.<schema>.<table>
."