# troubleshoot
  • p

    proud-baker-56489

    07/14/2022, 9:48 AM
    hi team, for a normal dag in airflow I did not import any datahub package, but the task log shows errors like this. Is there any update for the new version of datahub?
    c
    m
    • 3
    • 3
  • i

    icy-portugal-26250

    07/14/2022, 10:44 AM
    I’m trying to query some metadata from the
    /api/graphiql
    endpoint. A query returned a response about an hour ago, but now when rerunning the query I get a
    Copy code
    {
      "errors": {
        "message": "Response.text: Body has already been consumed.",
        "stack": "graphQLFetcher/</</<@https://datahub.wolt.com/api/graphiql:57:33\n"
      }
    }
    Is there a way to fetch this response again?
    b
    • 2
    • 2
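    For the GraphiQL question above: the "Body has already been consumed" error appears to come from the browser-side GraphiQL fetcher (per the JS stack trace), so one option is simply to re-issue the same query programmatically. A minimal sketch using Python requests against the GraphQL endpoint; the host, token, and example query below are placeholders, not taken from the thread:
```python
import requests

# Placeholders: adjust the host, and pass a personal access token only if
# Metadata Service Authentication is enabled.
GRAPHQL_URL = "https://datahub.example.com/api/graphql"
TOKEN = "<personal-access-token>"

# Example query; substitute the original query text here.
QUERY = """
{
  search(input: {type: DATASET, query: "*", start: 0, count: 10}) {
    total
  }
}
"""

resp = requests.post(
    GRAPHQL_URL,
    json={"query": QUERY},
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```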
  • q

    quick-pizza-8906

    07/14/2022, 2:06 PM
    Hello, I found some issues when running the 0.8.40 version of the dbt connector. To give some context: we have dbt workflows for Snowflake tables. Snowflake tables are ingested independently by the Snowflake connector. We use only the catalog and manifest yaml files for the dbt connector. The issues: 1. If I run with
    disable_dbt_node_creation
    set to True, I can see nice lineage between the pre-ingested Snowflake tables, but on the main page where all platforms are shown I can see a DBT platform with a count of several thousand elements. If I click on this platform to see its entities, I get an exception. After some examination of the MySQL database I could see there are objects with URNs like
    urn:li:assertion:2c8a2605354d9b924c0f1b5d9f0dffd5
    with a dataPlatformInstance aspect having
    dbt
    as the platform but nothing as an instance (I believe the exception was coming from that aspect missing the platform instance). 2. If I run with
    disable_dbt_node_creation
    set to False, I can see lineage and dbt objects combined with the Snowflake tables (very cool). It seems I still have the above assertions, but they don't cause problems in the platform search anymore. In either case, if I run the connector with
    stateful_ingestion
    enabled, I end up with the connector ingesting data but then throwing an exception ending with the code below:
    Copy code
    File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/source/state/sql_common_state.py", line 35, in _get_lightweight_repr
        31   def _get_lightweight_repr(dataset_urn: str) -> str:
        32       """Reduces the amount of text in the URNs for smaller state footprint."""
        33       SEP = BaseSQLAlchemyCheckpointState._get_separator()
        34       key = dataset_urn_to_key(dataset_urn)
    --> 35       assert key is not None
        36       return f"{key.platform}{SEP}{key.name}{SEP}{key.origin}"
        ..................................................
         dataset_urn = 'urn:li:assertion:2c8aaaa5354d9b924c0f1b5c9f09bf75'
         SEP = '||'
         key = None
    Which makes me think the URN representation function fails for assertion objects, which are somehow being treated as datasets? Is anyone else having similar problems?
    g
    m
    +2
    • 5
    • 11
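    A small sketch of why the stateful-ingestion assert above trips, assuming dataset_urn_to_key (importable from datahub.emitter.mce_builder in recent versions, and referenced in the traceback) only parses dataset URNs and returns None for anything else, such as the assertion URNs emitted by the dbt source:
```python
from datahub.emitter.mce_builder import dataset_urn_to_key

# A dataset URN parses into a key object...
print(dataset_urn_to_key(
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table,PROD)"
))

# ...but an assertion URN does not, so checkpoint code that assumes every
# checkpointed URN is a dataset fails at `assert key is not None`.
print(dataset_urn_to_key("urn:li:assertion:2c8aaaa5354d9b924c0f1b5c9f09bf75"))  # None
```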
  • b

    best-lamp-53937

    07/14/2022, 2:22 PM
    Is there a query that would return the entire schema of the GraphQL API? Or one that would return all entities in DataHub? Perhaps one that would return all entities for a given Domain?
    b
    p
    • 3
    • 8
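    For the three questions above, a rough sketch (Python requests against the GraphQL endpoint; host and token are placeholders): the full schema can be pulled with a standard GraphQL introspection query, and a wildcard searchAcrossEntities paged via start/count approximates "all entities". Narrowing to a Domain is shown with an orFilters entry on a "domains" field, but that field name and the URN placeholder are assumptions to verify against your DataHub version:
```python
import requests

GMS = "https://datahub.example.com/api/graphql"  # hypothetical host
HEADERS = {"Authorization": "Bearer <personal-access-token>"}

# 1) Standard GraphQL introspection returns the schema (types, fields).
introspection = "{ __schema { types { name kind } } }"

# 2) A wildcard search paged through start/count approximates "all entities";
#    the 'domains' filter field is an assumption to verify.
by_domain = """
{
  searchAcrossEntities(input: {
    query: "*", start: 0, count: 100,
    orFilters: [{and: [{field: "domains", values: ["urn:li:domain:<id>"]}]}]
  }) {
    total
    searchResults { entity { urn type } }
  }
}
"""

for q in (introspection, by_domain):
    r = requests.post(GMS, json={"query": q}, headers=HEADERS, timeout=30)
    r.raise_for_status()
    print(r.json())
```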
  • p

    prehistoric-yak-55672

    07/14/2022, 8:41 PM
    Hello everyone, first time here! I'm trying to initialize datahub locally on a Windows machine, but when I run
    datahub docker quickstart
    it returns the following error:
    Copy code
    ---- (full traceback above) ----
    File "c:\users\wohar\appdata\local\programs\python\python37-32\lib\site-packages\datahub\entrypoints.py", line 149, in main
        sys.exit(datahub(standalone_mode=False, **kwargs))
    File "c:\users\wohar\appdata\local\programs\python\python37-32\lib\site-packages\click\core.py", line 1130, in __call__
        return self.main(*args, **kwargs)
    File "c:\users\wohar\appdata\local\programs\python\python37-32\lib\site-packages\click\core.py", line 1055, in main
        rv = self.invoke(ctx)
    File "c:\users\wohar\appdata\local\programs\python\python37-32\lib\site-packages\click\core.py", line 1657, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
    File "c:\users\wohar\appdata\local\programs\python\python37-32\lib\site-packages\click\core.py", line 1657, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
    File "c:\users\wohar\appdata\local\programs\python\python37-32\lib\site-packages\click\core.py", line 1404, in invoke
        return ctx.invoke(self.callback, **ctx.params)
    File "c:\users\wohar\appdata\local\programs\python\python37-32\lib\site-packages\click\core.py", line 760, in invoke
        return __callback(*args, **kwargs)
    File "c:\users\wohar\appdata\local\programs\python\python37-32\lib\site-packages\datahub\upgrade\upgrade.py", line 322, in wrapper
        res = func(*args, **kwargs)
    File "c:\users\wohar\appdata\local\programs\python\python37-32\lib\site-packages\datahub\telemetry\telemetry.py", line 338, in wrapper
        raise e
    File "c:\users\wohar\appdata\local\programs\python\python37-32\lib\site-packages\datahub\telemetry\telemetry.py", line 290, in wrapper
        res = func(*args, **kwargs)
    File "c:\users\wohar\appdata\local\programs\python\python37-32\lib\site-packages\datahub\cli\docker.py", line 322, in quickstart
        default_quickstart_compose_file = _get_default_quickstart_compose_file()
    File "c:\users\wohar\appdata\local\programs\python\python37-32\lib\site-packages\datahub\cli\docker.py", line 162, in _get_default_quickstart_compose_file
        home = os.environ["HOME"]
    File "c:\users\wohar\appdata\local\programs\python\python37-32\lib\os.py", line 681, in __getitem__
        raise KeyError(key) from None
    
    KeyError: 'HOME'
    [2022-07-14 17:36:08,451] INFO     {datahub.entrypoints:188} - DataHub CLI version: 0.8.40.3 at c:\users\wohar\appdata\local\programs\python\python37-32\lib\site-packages\datahub\__init__.py
    [2022-07-14 17:36:08,451] INFO     {datahub.entrypoints:191} - Python version: 3.7.9 (tags/v3.7.9:13c94747c7, Aug 17 2020, 18:01:55) [MSC v.1900 32 bit (Intel)] at c:\users\wohar\appdata\local\programs\python\python37-32\python.exe on Windows-10-10.0.22000-SP0
    [2022-07-14 17:36:08,451] INFO     {datahub.entrypoints:193} - GMS config {}
    Does anyone know what might be happening?
    b
    s
    • 3
    • 2
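    The traceback above ends in the CLI reading os.environ["HOME"], which is typically not set on Windows. A possible workaround sketch (an assumption on my part, not an official fix) is to point HOME at the user profile directory before invoking the CLI:
```python
import os
import subprocess

# Assumption: setting HOME to the user profile directory lets the quickstart
# code path that reads os.environ["HOME"] resolve a home path on Windows.
env = dict(os.environ, HOME=os.path.expanduser("~"))
subprocess.run(["datahub", "docker", "quickstart"], env=env, check=True)
```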
  • f

    flat-window-44654

    07/14/2022, 10:51 PM
    Hi there, I'm querying the
    SearchAcrossEntities
    endpoint, trying to return only results for
    DASHBOARDS
    and
    DATASETS
    . However, when I submit the following query (see 🧵) with both types, I only get back
    DATASETS
    , even though I know there are
    DASHBOARDS
    that match my search query. Could there be a bug in the API or am I missing something? 🤔
    m
    • 2
    • 5
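    For reference, a minimal sketch of how multiple entity types are usually passed to searchAcrossEntities (the host, token, and search text are placeholders; the actual query is in the thread). Comparing this against a single-type query with types: [DASHBOARD] can help isolate whether the filter or the index is the problem:
```python
import requests

GMS = "https://datahub.example.com/api/graphql"  # hypothetical host
HEADERS = {"Authorization": "Bearer <personal-access-token>"}

# Both types are passed as a single list in the input.
query = """
{
  searchAcrossEntities(input: {types: [DATASET, DASHBOARD], query: "orders", start: 0, count: 20}) {
    total
    searchResults { entity { urn type } }
  }
}
"""

r = requests.post(GMS, json={"query": query}, headers=HEADERS, timeout=30)
r.raise_for_status()
print(r.json())
```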
  • a

    adamant-van-21355

    07/15/2022, 7:48 AM
    https://datahubspace.slack.com/archives/C029A3M079U/p1657698691218499
  • b

    better-spoon-77762

    07/15/2022, 8:05 PM
    Hello, can someone please share some examples of paginating through GraphQL results for a search query?
    • 1
    • 1
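    A minimal pagination sketch for the question above: searchAcrossEntities takes start and count, and the response reports total, so a loop can advance start until it reaches total (host and token are placeholders; field names should be checked against your DataHub version):
```python
import requests

GMS = "https://datahub.example.com/api/graphql"  # hypothetical host
HEADERS = {"Authorization": "Bearer <personal-access-token>"}

QUERY = """
query paged($start: Int!, $count: Int!) {
  searchAcrossEntities(input: {query: "*", start: $start, count: $count}) {
    total
    searchResults { entity { urn type } }
  }
}
"""

start, count = 0, 100
while True:
    r = requests.post(
        GMS,
        json={"query": QUERY, "variables": {"start": start, "count": count}},
        headers=HEADERS,
        timeout=30,
    )
    r.raise_for_status()
    page = r.json()["data"]["searchAcrossEntities"]
    for hit in page["searchResults"]:
        print(hit["entity"]["urn"])
    start += count
    if start >= page["total"]:
        break
```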
  • d

    delightful-barista-90363

    07/15/2022, 10:30 PM
    Hello, apologies for asking late on a Friday (and the answer can wait), but I am getting this error on a Spark job when trying to use the DatahubSparkListener:
    Copy code
    DatahubSparkListener: java.lang.NullPointerException: Cannot invoke "java.util.Map.put(Object, Object)" because the return value of "java.util.Map.get(Object)" is null
    Was wondering if I could get some assistance? Stacktrace(s) in thread. Thanks for the help in advance.
    c
    • 2
    • 13
  • m

    most-nightfall-36645

    07/18/2022, 8:51 AM
    Hi, when I try to upgrade to datahub
    v0.8.41
    my frontend and gms containers error with:
    Copy code
    Error: secret "datahub-auth-secrets" not found
    How do I create this secret from the datahub helm chart (e.g. which pod/container creates the secret)?
    i
    • 2
    • 6
  • p

    purple-analyst-83660

    07/18/2022, 10:21 AM
    Hi all, I am trying to ingest metadata corresponding to a project. I get a _NODE_LIMIT_EXCEEDED_ error at first; when I try to include _page_size: 5_, I get this error instead. Can anybody help? (I have attached the config yaml that I am using.)
    a
    m
    • 3
    • 4
  • a

    agreeable-belgium-70840

    07/18/2022, 10:27 AM
    hello, I recently updated to v0.8.40. The problem I am facing is that I can't create new groups and I can't add new users to existing groups. I get a message that the group was created, and the GraphQL call in the developer tools responds with 200. This is what I am getting:
    Copy code
    {data: {createGroup: "urn:li:corpGroup:4404d005-a2f6-491f-8b4d-931c7063ea0a"}, extensions: {}}
    data: {createGroup: "urn:li:corpGroup:4404d005-a2f6-491f-8b4d-931c7063ea0a"}
    createGroup: "urn:li:corpGroup:4404d005-a2f6-491f-8b4d-931c7063ea0a"
    extensions: {}
    Any ideas?
    b
    • 2
    • 3
  • s

    square-hair-99480

    07/18/2022, 4:08 PM
    Hello dear friends, I am ingesting data from Snowflake and my boss asked me if it was possible to use MFA with our datahub user. If I activate MFA, it keeps prompting me multiple times during the ingestion. I took a look here https://datahubproject.io/docs/generated/ingestion/sources/snowflake/#prerequisites and it seems the most reasonable solution would actually be to use key pair authentication. Nevertheless, if there is a way with MFA and you know it, could you please share?
  • f

    faint-translator-23365

    07/18/2022, 8:01 PM
    When I try to configure OIDC in datahub-frontend, I'm getting this error. Can someone please help? Slack Conversation
    b
    • 2
    • 2
  • r

    rhythmic-stone-77840

    07/19/2022, 12:42 AM
    Hey all - I'm using GraphQL and am having trouble setting up a filter for downstream/upstream lineage. I'd like to pull out all datasets that have an upstream lineage of 0, but I don't understand how to get the filter to work for this. Current query in 🧵
    b
    • 2
    • 4
  • c

    clean-tomato-22549

    07/19/2022, 4:59 AM
    Error in using lookml connector
    plus1 1
    m
    n
    • 3
    • 11
  • i

    icy-portugal-26250

    07/19/2022, 7:19 AM
    Validation errors pop up in DataHub's UI following the update to
    v0.8.41
    m
    b
    • 3
    • 8
  • w

    witty-butcher-82399

    07/19/2022, 12:52 PM
    I have a bigquery connector instance failing with the following error:
    Copy code
    │ PermissionDenied: 403 request failed: the user does not have 'bigquery.readsessions.create' permission for 'projects/XXXXXXXX'
    According to the docs, that permission is required only for lineage, so I tried disabling table lineage with:
    include_table_lineage: False
    However, I'm still getting the same error. Is there any other config setting for disabling table lineage? Or is this a bug in the config field? 🧵
    s
    • 2
    • 14
  • b

    bland-orange-13353

    07/19/2022, 4:04 PM
    This message was deleted.
  • d

    delightful-barista-90363

    07/19/2022, 10:12 PM
    gonna bump this for help https://datahubspace.slack.com/archives/C029A3M079U/p1657924216360099
  • h

    hallowed-dog-79615

    07/20/2022, 8:00 AM
    Greetings Team: We have been testing Glossary Terms ingestion a bit and have found some unexpected behavior. Let's go through the steps:
    1. It does not matter if we create a Glossary Term in the UI before adding it massively through a CSV ingestion, but let's say we create it. We create a term called "Active_users".
    2. We add some documentation to our just-created term. Again, this is not mandatory, but it helps identify the issue later.
    3. We proceed to add the term to several dataset objects. For this we leverage the CSV ingestion feature. We prepare our CSV following the guidelines in the documentation and ingest it. The term "Active_users" is added to our datasets! It seems it worked.
    4. But then we go to a dataset's entity page and click on the "Active_users" term badge to access its own entity page. There we see that the documentation we added is missing.
    5. Then we start playing around and realize that the term "Active_users" is duplicated. There are two different entity pages: the one for the term we created manually (urn:li:glossaryTerm:82a86728-087a-4232-bfbe-5a9a2790f6ce), and the one for the term we added through CSV (urn:li:glossaryTerm:Active_users). As you see, their ids are quite different.
    6. Not only that: in the Glossary Terms menu, the ingested term is not even visible; we can only access its page through other entities' badges. The manually created one is of course in the list, but nothing appears under "Related entities".
    7. Even more, we realized we are not able to delete the ingested term. We cannot even remove it from datasets. If we try to remove it, it says "Successfully removed", but the term is still there when you refresh.
    We understand this is a bug; even if we were missing some step in which we had to associate the ingested term with an already existing one with the same name, not being able to delete or access a term does not seem like desired behavior. I apologize if this has been reported elsewhere; I found Glossary Term bugs, but they didn't reach the "not being able to delete" part. Thanks!! Dani
    b
    e
    f
    • 4
    • 19
  • m

    microscopic-mechanic-13766

    07/20/2022, 8:17 AM
    Hi, I am deploying datahub v0.8.41 in docker 20.10.17 and have found one thing that I don't know if it is intended, but it doesn't make much sense as far as I can tell. In the 3 basic services needed for the deployment (gms, frontend and actions), the user the container runs as is not the same. For example, in datahub-gms it is
    uid=101(datahub) gid=101(datahub) groups=101(datahub)
    but in datahub-frontend it is
    uid=100(datahub) gid=101(datahub) groups=101(datahub)
    . Is this done on purpose, or is it just a mistake? Thanks in advance for the help!
  • s

    steep-soccer-91284

    07/20/2022, 9:24 AM
    datahub-gms is not running
  • b

    best-leather-7441

    07/20/2022, 1:03 PM
    hi! I hope this is the right thread for this question. After I ingest my metadata everything is fine, but if I shut down my console and restart datahub, all my data, ingestions, groups etc. vanish... am I missing something? Thank you for your time
    b
    • 2
    • 2
  • l

    lemon-engine-23512

    07/20/2022, 2:00 PM
    Hello team. I am trying to deploy datahub to AWS; in my org we cannot use Helm as we have no access to a k8s cluster. We can only build images and push them. Can anyone assist me with this?
  • p

    prehistoric-yak-55672

    07/20/2022, 4:49 PM
    Hello everyone. Is there a way to create a backup of all the progress I made on DataHub? As an example, I would like to back up all the documentation I wrote for each dataset I have, in case something happens
    b
    b
    • 3
    • 2
  • s

    shy-parrot-64120

    07/20/2022, 5:55 PM
    Hi all, we are trying to migrate the backend database from MySQL to Postgres. Are there any ways to preserve all data?
    l
    • 2
    • 12
  • b

    bland-orange-13353

    07/20/2022, 6:06 PM
    This message was deleted.
  • b

    big-ocean-9800

    07/20/2022, 6:12 PM
    Hey folks! We are currently running datahub @
    v0.8.38
    and we have about 7k data assets loaded. We are seeing a pattern where loading the home page is extremely slow (on the order of 5-10 seconds). I checked metrics around our datahub infrastructure and everything was running at about 10-20% utilization. Our Elasticsearch cluster is at low utilization, its disks are less than 10% utilized, and I don't see any IO throttling from our cloud provider. Same story with our Postgres instance. I took a look at the calls that hang the longest on the home page, and the consistently slow call is the graphql call
    searchAcrossEntities
    . From a cursory look through the code, it seems to interact with just Elasticsearch. I'm wondering if anyone has experienced similar behavior, has any troubleshooting tips, etc. Is this expected performance with the number of assets we have? Are there any changes we can make to our Elasticsearch cluster to help alleviate these problems? I looked through the Slack history of this channel and couldn't quite find any messages that seem similar (same with GitHub issues, both open and closed). Please let me know if any more information would be helpful. Cheers!
    o
    b
    • 3
    • 7
  • a

    ambitious-cartoon-15344

    07/21/2022, 8:16 AM
    Hi, I use Metadata Service Authentication. Does that mean the Airflow lineage plugin cannot be used? I don't see anything about a token in the Airflow plugin.
    d
    • 2
    • 2
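    Regarding the token question above: with Metadata Service Authentication enabled, REST clients authenticate with a personal access token, and for the Airflow integration the token is typically supplied on the DataHub REST connection (e.g. in its password field) rather than in the plugin config; treat those details as assumptions to verify against the docs. A minimal sketch showing the Python REST emitter accepting a token:
```python
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Placeholders: GMS address and a personal access token generated in the UI.
emitter = DatahubRestEmitter(
    gms_server="http://datahub-gms:8080",
    token="<personal-access-token>",
)
emitter.test_connection()  # verifies the endpoint and credentials
```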