# troubleshoot
b
Hey everyone, I recently created a fresh install of Datahub on Kubernetes. Everything started up fine and my team and I were able to ingest a bunch of Greenplum datasets, LDAP users & groups, and a number of glossary terms. However, now I'm trying to ingest an Oracle database and running into a weird issue. The ingestion job completes successfully, but the ingested datasets fail to show up in the UI. I've checked the MySQL database and the data is present there, so the entities are being created. I also ran the Elasticsearch re-indexing job, which completed without error, but that did not produce any changes. We tried querying Elasticsearch itself to check for Oracle data, but haven't been able to get it to play nice with the filters, so we haven't been able to verify its presence there. Has anyone run into something similar, or have ideas on things we can do to diagnose and correct this issue?
g
Does the table count increase after ingesting the DB tables? If yes, you're likely not setting the browse_paths aspect of your entity correctly
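For anyone hitting this later: one way to spot-check is to pull the browsePaths aspect for one of the ingested datasets directly from GMS. This is only a sketch - the GMS host and the dataset URN below are placeholders, and it assumes the standard Rest.li aspects endpoint is reachable in your deployment:
# URL-encode your real dataset URN; the one below is purely illustrative
curl -s -H 'X-RestLi-Protocol-Version: 2.0.0' \
  'http://datahub-gms:8080/aspects/urn%3Ali%3Adataset%3A(urn%3Ali%3AdataPlatform%3Aoracle,MYSCHEMA.MYTABLE,PROD)?aspect=browsePaths&version=0'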
b
The dataset count that's displayed on the Datahub home page does not increase when I ingest new tables; it only shows the count of the original Greenplum database I ingested.
I also don't believe it's an Oracle vs. Greenplum issue, since I've also tried ingesting a different Greenplum database and have had issues with that as well. I don't believe any configuration changes were made between the successful Greenplum ingestion and now.
@kind-dawn-17532 Tagging for easy discovery
g
can you see if your dataset_v2 index has documents in it?
also, do you have MAE consumer running?
it may be running as part of gms or standalone depending on your deployment model
does it have any logs?
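A quick way to answer both questions from the command line (a sketch; index names vary a bit between versions, so adjust to whatever _cat/indices actually lists, and the datagovernance namespace is taken from the Kafka logs further down this thread):
# List all indices with their document counts
curl -s 'http://elasticsearch-master:9200/_cat/indices?v'
# Count documents in the dataset index (index name assumed)
curl -s 'http://elasticsearch-master:9200/datasetindex_v2/_count?pretty'
# Check whether a standalone MAE consumer pod is deployed at all
kubectl -n datagovernance get pods | grep -i mae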
b
We used Kibana to look at dataset_v2 and could see our Greenplum tables, but not the Oracle ones. We do not have a standalone MAE consumer, but looking at the logs for the GMS pod there isn't much. It's just standard stuff with repeats of the following:
Successfully fed bulk request. Number of events: 5 Took time ms: -1
request [POST <http://elasticsearch-master:9200/_bulk?timeout=1m>]
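To pull more than the bulk-request chatter out of GMS, something like the following helps; the deployment name datahub-gms is an assumption, so substitute whatever kubectl get deploy shows:
kubectl -n datagovernance logs deploy/datahub-gms --since=1h | grep -iE 'error|exception|warn'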
g
hmm ok - what happens if you issue a * query?
do the oracle* tables show up in search?
what happens if you navigate directly to an oracle datasets entity page?
I'd also be curious to see the output of your logs while the reindexing job is running.
do you just see "Successfully fed bulk request"?
or are there any errors?
or are there no messages at all? that would be very helpful to understand
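For the wildcard/oracle check against Elasticsearch directly, a sketch (index name assumed, as above):
curl -s -H 'Content-Type: application/json' \
  'http://elasticsearch-master:9200/datasetindex_v2/_search?pretty' \
  -d '{"query": {"query_string": {"query": "*oracle*"}}, "_source": ["urn"], "size": 10}'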
b
Sorry for the delay. I've addressed your questions below:
- What happens if you issue a * query? Do the oracle* tables show up in search? Not sure exactly what you want me to query against, but searching in both the logs and Datahub itself turned up nothing. Searching in the dataset_v2 index pulled up some Greenplum items that used the word "oracle" in their properties. When I searched for dataPlatform\:oracle* in the index, I got no results.
- What happens if you navigate directly to an Oracle dataset's entity page? This is not possible, as no Oracle datasets appear in the UI.
- Output of your logs while the reindexing job is running. Do you just see "Successfully fed bulk request"? Are there any errors? Are there no messages at all? I ran the reindexing job again and did not see any errors in either the GMS pod or the reindex job pod, but there were a few warnings. In the reindexing pod, the logs look like the following:
2022-08-31T14:46:24.874918315Z Reading rows 335000 through 336000 from the aspects table.
2022-08-31T14:46:25.932621207Z Successfully sent MAEs for 336000 rows
In the GMS pod, while the job was running I continued to just see the "Successfully fed bulk request" messages, but after it finished I got the following MAE logs. After this, the logs went back to "Successfully fed bulk request."
2022-08-31T14:47:51.062Z | 14:47:51.061 [I/O dispatcher 1] INFO c.l.m.s.e.update.BulkListener:28 - Successfully fed bulk request. Number of events: 5 Took time ms: -1
2022-08-31T14:47:54.438Z | 14:47:54.437 [ThreadPoolTaskExecutor-1] INFO o.a.k.c.c.i.ConsumerCoordinator:1100 - [Consumer clientId=consumer-generic-mae-consumer-job-client-4, groupId=generic-mae-consumer-job-client] Failing OffsetCommit request since the consumer is not part of an active group
2022-08-31T14:47:54.438Z | 14:47:54.437 [ThreadPoolTaskExecutor-1] WARN o.a.k.c.c.i.ConsumerCoordinator:1041 - [Consumer clientId=consumer-generic-mae-consumer-job-client-4, groupId=generic-mae-consumer-job-client] Synchronous auto-commit of offsets {MetadataChangeLog_Timeseries_v1-0=OffsetAndMetadata{offset=15, leaderEpoch=0, metadata=''}, MetadataChangeLog_Versioned_v1-0=OffsetAndMetadata{offset=160458, leaderEpoch=0, metadata=''}} failed: Offset commit cannot be completed since the consumer is not part of an active group for auto partition assignment; it is likely that the consumer was kicked out of the group.
2022-08-31T14:47:54.438Z | 14:47:54.437 [ThreadPoolTaskExecutor-1] INFO o.a.k.c.c.i.ConsumerCoordinator:669 - [Consumer clientId=consumer-generic-mae-consumer-job-client-4, groupId=generic-mae-consumer-job-client] Giving away all assigned partitions as lost since generation has been reset,indicating that consumer is no longer part of the group
2022-08-31T14:47:54.438Z | 14:47:54.437 [ThreadPoolTaskExecutor-1] INFO o.a.k.c.c.i.ConsumerCoordinator:311 - [Consumer clientId=consumer-generic-mae-consumer-job-client-4, groupId=generic-mae-consumer-job-client] Lost previously assigned partitions MetadataChangeLog_Timeseries_v1-0, MetadataChangeLog_Versioned_v1-0
2022-08-31T14:47:54.438Z | 14:47:54.438 [ThreadPoolTaskExecutor-1] INFO o.s.k.l.KafkaMessageListenerContainer:292 - generic-mae-consumer-job-client: partitions lost: [MetadataChangeLog_Timeseries_v1-0, MetadataChangeLog_Versioned_v1-0]
2022-08-31T14:47:54.438Z | 14:47:54.438 [ThreadPoolTaskExecutor-1] INFO o.s.k.l.KafkaMessageListenerContainer:292 - generic-mae-consumer-job-client: partitions revoked: [MetadataChangeLog_Timeseries_v1-0, MetadataChangeLog_Versioned_v1-0]
2022-08-31T14:47:54.438Z | 14:47:54.438 [ThreadPoolTaskExecutor-1] INFO o.a.k.c.c.i.AbstractCoordinator:552 - [Consumer clientId=consumer-generic-mae-consumer-job-client-4, groupId=generic-mae-consumer-job-client] (Re-)joining group
2022-08-31T14:47:54.439Z | 14:47:54.439 [ThreadPoolTaskExecutor-1] INFO o.a.k.c.c.i.AbstractCoordinator:455 - [Consumer clientId=consumer-generic-mae-consumer-job-client-4, groupId=generic-mae-consumer-job-client] Join group failed with org.apache.kafka.common.errors.MemberIdRequiredException: The group member needs to have a valid member id before actually entering a consumer group
2022-08-31T14:47:54.439Z | 14:47:54.439 [ThreadPoolTaskExecutor-1] INFO o.a.k.c.c.i.AbstractCoordinator:552 - [Consumer clientId=consumer-generic-mae-consumer-job-client-4, groupId=generic-mae-consumer-job-client] (Re-)joining group
2022-08-31T14:47:54.441Z | 14:47:54.441 [ThreadPoolTaskExecutor-1] INFO o.a.k.c.c.i.ConsumerCoordinator:604 - [Consumer clientId=consumer-generic-mae-consumer-job-client-4, groupId=generic-mae-consumer-job-client] Finished assignment for group at generation 2367: {consumer-generic-mae-consumer-job-client-4-37ce43dd-14f5-4a94-8a15-17707cfede99=Assignment(partitions=[MetadataChangeLog_Timeseries_v1-0, MetadataChangeLog_Versioned_v1-0])}
2022-08-31T14:47:54.443Z | 14:47:54.442 [ThreadPoolTaskExecutor-1] INFO o.a.k.c.c.i.AbstractCoordinator:503 - [Consumer clientId=consumer-generic-mae-consumer-job-client-4, groupId=generic-mae-consumer-job-client] Successfully joined group with generation 2367
2022-08-31T14:47:54.443Z | 14:47:54.443 [ThreadPoolTaskExecutor-1] INFO o.a.k.c.c.i.ConsumerCoordinator:273 - [Consumer clientId=consumer-generic-mae-consumer-job-client-4, groupId=generic-mae-consumer-job-client] Adding newly assigned partitions: MetadataChangeLog_Timeseries_v1-0, MetadataChangeLog_Versioned_v1-0
2022-08-31T14:47:54.444Z | 14:47:54.444 [ThreadPoolTaskExecutor-1] INFO o.a.k.c.c.i.ConsumerCoordinator:792 - [Consumer clientId=consumer-generic-mae-consumer-job-client-4, groupId=generic-mae-consumer-job-client] Setting offset for partition MetadataChangeLog_Versioned_v1-0 to the committed offset FetchPosition{offset=159958, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[prerequisites-kafka-0.prerequisites-kafka-headless.datagovernance.svc.cluster.local:9092 (id: 0 rack: null)], epoch=0}}
2022-08-31T14:47:54.444Z | 14:47:54.444 [ThreadPoolTaskExecutor-1] INFO o.a.k.c.c.i.ConsumerCoordinator:792 - [Consumer clientId=consumer-generic-mae-consumer-job-client-4, groupId=generic-mae-consumer-job-client] Setting offset for partition MetadataChangeLog_Timeseries_v1-0 to the committed offset FetchPosition{offset=15, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[prerequisites-kafka-0.prerequisites-kafka-headless.datagovernance.svc.cluster.local:9092 (id: 0 rack: null)], epoch=0}}
2022-08-31T14:47:54.445Z | 14:47:54.445 [ThreadPoolTaskExecutor-1] INFO o.s.k.l.KafkaMessageListenerContainer:292 - generic-mae-consumer-job-client: partitions assigned: [MetadataChangeLog_Timeseries_v1-0, MetadataChangeLog_Versioned_v1-0]
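The "not part of an active group" / rebalance messages above suggest checking the MAE consumer group's lag directly. The pod name and group ID below are taken from the logs themselves; the script name matches the Bitnami Kafka image used by the prerequisites chart, so adjust if yours differs:
kubectl -n datagovernance exec -it prerequisites-kafka-0 -- \
  kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group generic-mae-consumer-job-client
# A growing LAG on MetadataChangeLog_Versioned_v1 would mean MCL events are piling up unconsumed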
An additional finding is that when I try to create a new policy, it completes and displays the "Successful" message, but the new policy does not show up in the UI either, and its permissions do not seem to be applied. We also tried creating some new PVCs and switching to those (effectively creating a new instance) in order to replicate the issue. We were able to ingest a bunch of data from multiple data sources; however, the same issue did reoccur while ingesting LDAP users and groups from a file source.
Here is the ingestion timeline I created by running datahub ingest list-runs. I added a description to give some more info on what each run is. There are two things which are weird about this:
1. The no-run-id-provided run. I believe this corresponds to a Greenplum ingestion run that was stopped partway through. It says 98 rows created, but I can only find 3 in MySQL that say no-run-id-provided and fall within the appropriate timeframe. However, these records have an incorrect platform ID, so other records may have been overwritten later that day.
2. Any ingestion runs after the LDAP ingestion do not show up in this table. However, if I go into MySQL I can find entries with "runId":"oracle-2022_08_24-19_24_06" (see the query sketch below the table).
Datahub deployment created: 2022-08-22 20:16:17 (UTC)
Ingestion runs from Datahub:
+-----------------------------------------------+--------+---------------------------+---------------------+
| runId                                         |   rows | created at                | Description         |
+===============================================+========+===========================+=====================+
| file-2022_08_24-18_49_19                      |  19593 | 2022-08-24 18:49:24 (UTC) | LDAP users & groups |
+-----------------------------------------------+--------+---------------------------+---------------------+
| no-run-id-provided                            |     98 | 2022-08-24 18:12:06 (UTC) | Canceled Greenplum  |
+-----------------------------------------------+--------+---------------------------+---------------------+
| datahub-business-glossary-2022_08_23-19_52_43 |  10857 | 2022-08-23 19:52:46 (UTC) | Glossary terms      |
+-----------------------------------------------+--------+---------------------------+---------------------+
| datahub-business-glossary-2022_08_23-19_51_43 |     45 | 2022-08-23 19:51:43 (UTC) | Glossary terms      |
+-----------------------------------------------+--------+---------------------------+---------------------+
| datahub-business-glossary-2022_08_23-19_51_20 |     36 | 2022-08-23 19:51:20 (UTC) | Glossary terms      |
+-----------------------------------------------+--------+---------------------------+---------------------+
| datahub-business-glossary-2022_08_23-19_51_02 |     48 | 2022-08-23 19:51:02 (UTC) | Glossary terms      |
+-----------------------------------------------+--------+---------------------------+---------------------+
| datahub-business-glossary-2022_08_23-19_50_30 |     86 | 2022-08-23 19:50:31 (UTC) | Glossary terms      |
+-----------------------------------------------+--------+---------------------------+---------------------+
| datahub-business-glossary-2022_08_23-19_49_43 |    208 | 2022-08-23 19:49:43 (UTC) | Glossary terms      |
+-----------------------------------------------+--------+---------------------------+---------------------+
| sqlalchemy-2022_08_23-19_26_59                |    178 | 2022-08-23 19:27:07 (UTC) | Greenplum schema    |
+-----------------------------------------------+--------+---------------------------+---------------------+
| sqlalchemy-2022_08_23-15_59_03                | 115493 | 2022-08-23 16:15:55 (UTC) | Greenplum db        |
+-----------------------------------------------+--------+---------------------------+---------------------+
First Oracle ingestion: 2022-08-24 19:24:02 (UTC)
Second Oracle ingestion: 2022-08-25 13:08:32 (UTC)
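For reference, the MySQL check for a given runId can look like the sketch below; the table and column names are assumed from the default Datahub MySQL schema, and <mysql-host> plus the credentials are placeholders:
# -p prompts for the password; the trailing 'datahub' is the database name (assumed)
mysql -h <mysql-host> -u datahub -p datahub -e \
  "SELECT urn, aspect, version, createdon
     FROM metadata_aspect_v2
    WHERE systemmetadata LIKE '%oracle-2022_08_24-19_24_06%'
    ORDER BY createdon LIMIT 20;"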
I can also tell, based on how many pages of users show up in Datahub, that there are more CorpUserSnapshot items present in the LDAP ingestion file than show up in the UI, and none of the CorpGroupSnapshot items appear. So the problem likely occurs at some point during the LDAP ingestion. There are about 6550 users present in the UI out of 7002 CorpUserSnapshots in the ingestion file. With a new set of PVCs the issue occurred again while ingesting LDAP; however, only about 1150 users can be seen in the UI and still no groups. Because the same error has occurred at different points in the file, I don't believe it's an issue with a particular entry and is likely something more systematic. There were two groups with malformed URNs that it logged warnings for and then skipped during ingestion.
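A rough way to compare those two counts (the file name is hypothetical, and grep -c only counts matching lines, but that is usually close enough for an MCE file with one snapshot per entry; the corpuserindex_v2 index name was confirmed via Kibana later in this thread):
grep -c 'CorpUserSnapshot' ldap_users_and_groups.json
curl -s 'http://elasticsearch-master:9200/corpuserindex_v2/_count?pretty'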
g
Hey @bland-balloon-48379 - it might help to jump on a quick call so I can have a look
are you free now to chat?
b
Hey @green-football-43791, sure that would be great. Now works for me
g
ok
b
Update on a couple things: I created new PVCs just for Elasticsearch, which effectively means I switched to a fresh Elasticsearch install. Logging in right after startup, I could tell that some of the ingested LDAP users were already present in the new ES, since my name was shown in the top left corner. Navigating to the users page, there were about 100 users shown; this eventually increased to 167 before stopping. Looking at the logs right after startup I can see a couple errors:
18:40:59.724 [pool-11-thread-1] ERROR c.l.d.g.a.service.AnalyticsService:264 - Search query failed: Elasticsearch exception [type=index_not_found_exception, reason=no such index [datahub_usage_event]]
18:40:59.725 [pool-11-thread-1] ERROR o.s.s.s.TaskUtils$LoggingErrorHandler:95 - Unexpected error occurred in scheduled task
java.lang.RuntimeException: Search query failed:
...
18:41:37.625 [pool-9-thread-1] ERROR c.d.m.ingestion.IngestionScheduler:243 - Failed to retrieve ingestion sources! Skipping updating schedule cache until next refresh. start: 0, count: 30
Using Kibana I verified that no datahub_usage_event index exists in ES. Also, only the system_metadata_service_v1, corpuserindex_v2, & datahubpolicyindex_v2 indexes have any documents at all, with 500, 167, & 11 respectively. I ran the reindexing job after this and it resulted in no change, with no errors or meaningful messages in the logs.
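The missing datahub_usage_event index is normally created by the elasticsearch-setup job, so it is worth confirming that job actually ran against the new volumes; the job name pattern here is an assumption based on the helm chart:
kubectl -n datagovernance get jobs | grep -i elasticsearch-setup
curl -s 'http://elasticsearch-master:9200/datahub_usage_event/_count?pretty'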
@green-football-43791 I wanted to give some resolution to this thread as we were able to fix the problem! It seems that restarting Neo4j was able to get things unclogged and flowing into Elasticsearch once again (although we were trying a bunch of things around that time, so it could have been a combination of factors). Here's a quick summary of the observations we made regarding this issue.
Environment: Kubernetes deployment of Datahub v0.8.41 with no standalone MAE consumer, using Neo4j as our graph DB.
Issue: New items which are ingested appear in MySQL, but do not get loaded into Elasticsearch.
Observations:
• Issue occurred twice while ingesting LDAP users and groups from a file source to the REST sink.
• Issue appears to have also occurred when ingesting an Oracle DB to the REST sink. (The original issue from the LDAP file was not entirely cleared up, so this needs to be verified.)
• The issue has never occurred while ingesting to the Kafka sink. (We never ingested LDAP via the Kafka sink, though.)
• The issue did not occur on Datahub v0.8.35 or prior.
• Reindexing Elasticsearch did nothing to resolve the issue.
• Restarting Neo4j appears to have resolved the issue and allows data to begin flowing into Elasticsearch again. (This needs to be replicated to verify.)
• Reindexing Elasticsearch also appears more effective after restarting Neo4j.
Results: It is unclear whether the issue was caused by LDAP, the REST sink, or a combination of the two. Our hypothesis is that ingesting a heavy workload (like LDAP) via the REST sink can cause things to freeze. However, restarting Neo4j (or another graph DB) and reindexing Elasticsearch should get things moving again.
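For anyone else landing on this thread, the recovery boils down to roughly the following; the StatefulSet name and the restore-indices job/manifest names are assumptions for a stock helm deployment, so check kubectl get statefulsets,jobs first:
# Bounce Neo4j
kubectl -n datagovernance rollout restart statefulset/prerequisites-neo4j-community
kubectl -n datagovernance rollout status statefulset/prerequisites-neo4j-community
# Then re-run the Elasticsearch restore-indices job (delete the old completed job first)
kubectl -n datagovernance delete job datahub-restore-indices-job --ignore-not-found
kubectl -n datagovernance apply -f restore-indices-job.yaml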
g
Wow! Ok. So it sounds like neo4j failures were affecting elasticsearch too
I’m glad things were resolved
we’ll look into additional redundancy here so that graph db failures don’t affect search
k
Apparently no failures were logged for any of the POST requests to Neo4j, so we are unsure whether a failure occurred on Neo4j and, if so, what it was.
g
ok.
I'm sorry to hear that
if this happens again, let me know and we can try to dig deeper
k
We were able to get to all the past Datahub-related logs, btw, but there wasn't much of a clue there either