# troubleshoot
g
Hi! I'm getting some errors while browsing the frontend, but when I open the network tab there is no problem with the calls to the backend. Does anyone have an idea what it could be?
The GMS server does not have any error log messages.
The problem persists even after running the restore indices job and refreshing the page. But when I do a hard refresh (Ctrl+F5), it works...
b
hey Patrick! so things are working now with ctrl + f5 for you? or are you still seeing issues?
even though the responses were all 200 in your network tab, that doesn't mean they were successful since graphql will return a 200 http response, but in the result payload it will contain errors and an error status
and it looks like it was a 500 error, so for some reason your server was down. had you just restarted or booted things up per chance?
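just to illustrate the payload-error point: a client-side check like this (only a sketch; the endpoint and query here are illustrative, not necessarily the exact calls the frontend makes) shows why a 200 status alone isn't enough:

```typescript
// Sketch: a GraphQL request can come back as HTTP 200 while the body still
// carries errors. Endpoint and query are illustrative only.
async function runGraphQLQuery(query: string): Promise<unknown> {
  const res = await fetch("/api/graphql", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query }),
  });

  const payload = await res.json();

  // res.ok is true even when the GraphQL layer failed; the real signal is
  // the `errors` array in the payload (each entry typically has a message,
  // a path, and an `extensions` object with more detail).
  if (payload.errors?.length) {
    console.error("GraphQL errors despite HTTP 200:", payload.errors);
    throw new Error(payload.errors[0].message);
  }
  return payload.data;
}
```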
g
After using Ctrl+F5 it works. However, when I leave the session open for a long time and try to use it again, these problems happen again.
> even though the responses were all 200 in your network tab, that doesn't mean they were successful since graphql will return a 200 http response, but in the result payload it will contain errors and an error status
Hmm, I didn't know about that. It would be great if I could catch these errors in the logs.
b
okay gotcha gotcha.. how long would a long time be if you had to guess?
g
> and it looks like it was a 500 error, so for some reason your server was down. had you just restarted or booted things up per chance?
No, I deployed the Helm chart 44 hours ago. Everything works after a full page refresh. But, as I said, these errors come back after a long idle period.
> okay gotcha gotcha.. how long would a long time be if you had to guess?
I would guess about 2 hours of idle time.
b
okay thanks! I'll see if I can try and reproduce and look into this some. You said you were looking at your GMS logs after receiving these error messages and you didn't see anything in there?
g
Yes. I deployed the Helm chart in Google Kubernetes Engine and used Cloud Monitoring to filter the logs. See below the filter I'm using:
```
resource.type="k8s_container"
resource.labels.cluster_name="dsc-prod"
resource.labels.namespace_name="prod"
resource.labels.pod_name:"datahub-gms-server"
severity=(ERROR OR WARNING OR CRITICAL OR ALERT OR EMERGENCY)
```
I tried increasing the searchService batch return size and enabling the cache, but that just increased memory consumption.
Do you have any idea of what could cause this?
b
ahh so this might be a performance thing then
do you see any logs in your elasticsearch pod?
it looks like it's failing on search specifically
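something like this should pull them (namespace is from your Cloud Monitoring filter; substitute the actual elasticsearch pod name, which I'm guessing at here):

```
kubectl get pods -n prod | grep elasticsearch
kubectl logs -n prod <elasticsearch-master-pod> --since=2h
```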
g
Let me check the elasticsearch logs
Strange... with the kubectl logs command, I found this error:
{"type": "server", "timestamp": "2022-08-07T155903,462Z", "level": "WARN", "component": "o.e.c.NodeConnectionsService", "cluster.name": "datahub-dev-elasticsearch", "node.name": "datahub-dev-elasticsearch-master-0", "message": "failed to connect to {datahub-dev-elasticsearch-master-2}{XA6xRBSaTf6glW3xpDFiYw}{6RzEuHayRLGI0H8yZc9S1A}{172.19.5.28}{172.19.5.28:9300}{cdfhilmrstw}{ml.machine_memory=2147483648, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true} (tried [1] times)", "cluster.uuid": "VivpgBRIT-CL6QmRX2MmBA", "node.id": "QSuXnBsEQ8OmOFB3abO82Q" ,
"stacktrace": ["org.elasticsearch.transport.ConnectTransportException: [datahub-dev-elasticsearch-master-2][172.19.5.28:9300] connect_exception",
"at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:1047) ~[elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$0(ActionListener.java:279) ~[elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.core.CompletableContext.lambda$addListener$0(CompletableContext.java:31) ~[elasticsearch-core-7.17.3.jar:7.17.3]",
"at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[?:?]",
"at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841) ~[?:?]",
"at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]",
"at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162) ~[?:?]",
"at org.elasticsearch.core.CompletableContext.completeExceptionally(CompletableContext.java:46) ~[elasticsearch-core-7.17.3.jar:7.17.3]",
"at org.elasticsearch.transport.netty4.Netty4TcpChannel.lambda$addListener$0(Netty4TcpChannel.java:58) ~[?:?]",
"at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578) ~[?:?]",
"at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571) ~[?:?]",
"at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550) ~[?:?]",
"at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491) ~[?:?]",
"at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616) ~[?:?]",
"at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609) ~[?:?]",
"at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[?:?]",
"at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321) ~[?:?]",
"at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337) ~[?:?]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:707) ~[?:?]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:620) ~[?:?]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:583) ~[?:?]",
"at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?]",
"at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) ~[?:?]",
"at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]",
"at java.lang.Thread.run(Thread.java:833) [?:?]",
"Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 172.19.5.28/172.19.5.28:9300",
"Caused by: java.net.ConnectException: Connection refused",
"at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]",
"at sun.nio.ch.Net.pollConnectNow(Net.java:672) ~[?:?]",
"at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946) ~[?:?]",
"at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330) ~[?:?]",
"at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334) ~[?:?]",
"... 7 more"] }
b
okay gotcha! this is helpful, let me see if I can get someone more familiar with elasticsearch and potential issues like this
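in the meantime, one thing worth checking is whether all the master nodes can actually see each other. something like this (pod and namespace names are guesses based on your log and filter above, and it assumes curl is available inside the elasticsearch container) would show the cluster health and node list:

```
kubectl -n prod exec datahub-dev-elasticsearch-master-0 -- curl -s "http://localhost:9200/_cluster/health?pretty"
kubectl -n prod exec datahub-dev-elasticsearch-master-0 -- curl -s "http://localhost:9200/_cat/nodes?v"
```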
g
One observation that may be important: I had to use a more recent version of Elasticsearch because of my Kubernetes cluster version. So I'm using 7.17.3, but the default community chart uses 7.16.2.
My cluster version is 1.21.13-gke.900, but we will upgrade to 1.22 soon.
o
A couple of things:
- Can you try ticking "Preserve log" in the network tab? It's possible there is some log loss due to a redirect or the refreshing.
- Can you expand the extensions section in the console error? There should be additional information in the subsections of the extensions about the error itself, if not a full stack trace then at least the basic error (see the example below).

From the path given in the error, it looks like it is happening on the browse call.
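For reference, an expanded error in the payload usually looks something like this (all the values here are made up, just to show where the extensions live):

```json
{
  "errors": [
    {
      "message": "An unknown error occurred.",
      "path": ["browse"],
      "extensions": {
        "code": 500,
        "type": "SERVER_ERROR"
      }
    }
  ],
  "data": null
}
```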
To clarify: there are two errors happening in this thread. The reproduction is happening on a different GraphQL path.
503s are errors on the ingress side
so it would make sense that you aren't getting a GMS error, it never reaches GMS
Also seems like some browser caching issues happening when the frontend is trying to get the currently authenticated user
g
Could you give me some tips on what I could do to resolve this problem?
It's not that frequent, but it does happen, and I want to avoid it.
b
The last persistence exception could be occurring when you're exhausting your SQL threadpool (we've seen similar exceptions in such cases)
Are you using a managed service?
g
Yes. I'm using Google Cloud SQL
Do you have any guide for setting up the connection pool?
Besides that, GMS sends a lot of these log messages. Is this a problem? @orange-night-91387
```
[pool-6-thread-1] WARN org.elasticsearch.client.RestClient:65 - request [POST http://...:9200/datahubpolicyindex_v2/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&ignore_throttled=true&search_type=query_then_fetch&batched_reduce_size=512&ccs_minimize_roundtrips=true] returned 1 warnings: [299 Elasticsearch-7.17.3-5ad023604c8d7416c9eb6c0eadb62b14e766caff "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]
```
o
No, that one's not an issue. Elasticsearch assumes you're using frozen indices because of a parameter being sent in the request. It happens on newer versions of ES because they've deprecated that parameter. The param should get removed from our requests eventually, but we're not actually using frozen indices.
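On the connection pool question from earlier: I'm not aware of a dedicated guide, but GMS reads its Ebean datasource and pool settings from environment variables, so you can override them through the GMS values in the Helm chart. The variable names below are my best recollection of what recent GMS versions expose, the numbers are placeholders, and it assumes your chart version supports extraEnvs for the GMS component, so double-check all of this against your chart and GMS version:

```yaml
# Sketch only: env var names and values are assumptions; verify against
# your datahub-gms version and chart before using.
datahub-gms:
  extraEnvs:
    - name: EBEAN_MIN_CONNECTIONS
      value: "2"
    - name: EBEAN_MAX_CONNECTIONS
      value: "50"
    - name: EBEAN_WAIT_TIMEOUT_MILLIS
      value: "1000"
    - name: EBEAN_MAX_INACTIVE_TIME_IN_SECS
      value: "120"
```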