# troubleshoot
g
Hi! I'm getting some errors while browsing the frontend, but when I open the network tab there is no problem with the calls to the backend. Does anyone have an idea what it could be?
The GMS server does not have any error log messages.
The problem persists even after running the restore indices job and refreshing the page. But when I do a hard refresh (Ctrl+F5), it works...
b
hey Patrick! so things are working now with ctrl + f5 for you? or are you still seeing issues?
even though the responses were all 200 in your network tab, that doesn't mean they were successful since graphql will return a 200 http response, but in the result payload it will contain errors and an error status
and it looks like it was a 500 error, so for some reason your server was down. had you just restarted or booted things up per chance?
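just to illustrate the payload-error point: a client-side check like this (only a sketch; the endpoint and query here are illustrative, not necessarily the exact calls the frontend makes) shows why a 200 status alone isn't enough:

```typescript
// Sketch: a GraphQL request can come back as HTTP 200 while the body still
// carries errors. Endpoint and query are illustrative only.
async function runGraphQLQuery(query: string): Promise<unknown> {
  const res = await fetch("/api/graphql", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query }),
  });

  const payload = await res.json();

  // res.ok is true even when the GraphQL layer failed; the real signal is
  // the `errors` array in the payload (each entry typically has a message,
  // a path, and an `extensions` object with more detail).
  if (payload.errors?.length) {
    console.error("GraphQL errors despite HTTP 200:", payload.errors);
    throw new Error(payload.errors[0].message);
  }
  return payload.data;
}
```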
g
After using Ctrl+F5 it works. However, when I leave the session open for a long time and try to use it again, these problems happen again.
> even though the responses were all 200 in your network tab, that doesn't mean they were successful since graphql will return a 200 http response, but in the result payload it will contain errors and an error status
Hmm, I didn't know about that. It would be great if I could catch these errors in the logs.
b
okay gotcha gotcha.. how long would a long time be if you had to guess?
g
> and it looks like it was a 500 error, so for some reason your server was down. had you just restarted or booted things up per chance?
No, I deployed the Helm chart 44 hours ago. Everything works after a full page refresh. But, as I said, these errors come back after a long idle period.
> okay gotcha gotcha.. how long would a long time be if you had to guess?
I would guess about 2 hours of idle time.
b
okay thanks! I'll see if I can try and reproduce and look into this some. You said you were looking at your GMS logs after receiving these error messages and you didn't see anything in there?
g
Yes. I deployed the Helm chart in Google Kubernetes Engine and used Cloud Monitoring to filter the logs. See below the filter I'm using:
```
resource.type="k8s_container"
resource.labels.cluster_name="dsc-prod"
resource.labels.namespace_name="prod"
resource.labels.pod_name:"datahub-gms-server"
severity=(ERROR OR WARNING OR CRITICAL OR ALERT OR EMERGENCY)
```
I tried increasing the searchService batch return size and enabling the cache, but that just increased memory consumption.
Do you have any idea of what could cause this?
b
ahh so this might be a performance thing then
do you see any logs in your elasticsearch pod?
it looks like it's failing on search specifically
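something like this should pull them (namespace is from your Cloud Monitoring filter; substitute the actual elasticsearch pod name, which I'm guessing at here):

```
kubectl get pods -n prod | grep elasticsearch
kubectl logs -n prod <elasticsearch-master-pod> --since=2h
```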
g
Let me check the elasticsearch logs
Strange... with the kubectl logs command, I found this error:
{"type": "server", "timestamp": "2022-08-07T155903,462Z", "level": "WARN", "component": "o.e.c.NodeConnectionsService", "cluster.name": "datahub-dev-elasticsearch", "node.name": "datahub-dev-elasticsearch-master-0", "message": "failed to connect to {datahub-dev-elasticsearch-master-2}{XA6xRBSaTf6glW3xpDFiYw}{6RzEuHayRLGI0H8yZc9S1A}{172.19.5.28}{172.19.5.28:9300}{cdfhilmrstw}{ml.machine_memory=2147483648, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true} (tried [1] times)", "cluster.uuid": "VivpgBRIT-CL6QmRX2MmBA", "node.id": "QSuXnBsEQ8OmOFB3abO82Q" ,
"stacktrace": ["org.elasticsearch.transport.ConnectTransportException: [datahub-dev-elasticsearch-master-2][172.19.5.28:9300] connect_exception",
"at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:1047) ~[elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$0(ActionListener.java:279) ~[elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.core.CompletableContext.lambda$addListener$0(CompletableContext.java:31) ~[elasticsearch-core-7.17.3.jar:7.17.3]",
"at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[?:?]",
"at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841) ~[?:?]",
"at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]",
"at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162) ~[?:?]",
"at org.elasticsearch.core.CompletableContext.completeExceptionally(CompletableContext.java:46) ~[elasticsearch-core-7.17.3.jar:7.17.3]",
"at org.elasticsearch.transport.netty4.Netty4TcpChannel.lambda$addListener$0(Netty4TcpChannel.java:58) ~[?:?]",
"at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578) ~[?:?]",
"at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571) ~[?:?]",
"at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550) ~[?:?]",
"at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491) ~[?:?]",
"at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616) ~[?:?]",
"at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609) ~[?:?]",
"at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[?:?]",
"at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321) ~[?:?]",
"at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337) ~[?:?]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:707) ~[?:?]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:620) ~[?:?]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:583) ~[?:?]",
"at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?]",
"at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) ~[?:?]",
"at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]",
"at java.lang.Thread.run(Thread.java:833) [?:?]",
"Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 172.19.5.28/172.19.5.28:9300",
"Caused by: java.net.ConnectException: Connection refused",
"at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]",
"at sun.nio.ch.Net.pollConnectNow(Net.java:672) ~[?:?]",
"at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946) ~[?:?]",
"at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330) ~[?:?]",
"at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334) ~[?:?]",
"... 7 more"] }
b
okay gotcha! this is helpful, let me see if I can get someone more familiar with elasticsearch and potential issues like this
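in the meantime, one thing worth checking is whether all the master nodes can actually see each other. something like this (pod and namespace names are guesses based on your log and filter above, and it assumes curl is available inside the elasticsearch container) would show the cluster health and node list:

```
kubectl -n prod exec datahub-dev-elasticsearch-master-0 -- curl -s "http://localhost:9200/_cluster/health?pretty"
kubectl -n prod exec datahub-dev-elasticsearch-master-0 -- curl -s "http://localhost:9200/_cat/nodes?v"
```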
g
One observation that may be important: I had to use a more recent version of Elasticsearch because of my Kubernetes cluster version. So I'm using 7.17.3, but the default community chart uses 7.16.2.
My cluster version is 1.21.13-gke.900, but we will upgrade to 1.22 soon.
o
A couple of things:
- Can you try ticking "Preserve log" in the network tab? It's possible there is some log loss due to a redirect or the refreshing.
- Can you expand the extensions section in the console error? There should be additional information in the subsections of the extensions about the error itself, if not a full stack trace then at least the basic error (see the example below).

From the path given in the error, it looks like it is happening on the browse call.
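For reference, an expanded error in the payload usually looks something like this (all the values here are made up, just to show where the extensions live):

```json
{
  "errors": [
    {
      "message": "An unknown error occurred.",
      "path": ["browse"],
      "extensions": {
        "code": 500,
        "type": "SERVER_ERROR"
      }
    }
  ],
  "data": null
}
```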
To clarify: there are two errors happening in this thread. The reproduction is happening on a different GraphQL path.
503s are errors on the ingress side
so it would make sense that you aren't getting a GMS error, it never reaches GMS
Also seems like some browser caching issues happening when the frontend is trying to get the currently authenticated user
g
Could you give me some tips on what I could do to resolve this problem?
It's not that frequent, but it does happen, and I want to avoid it.
b
The last persistence exception could be occurring when you're exhausting your SQL threadpool (we've seen similar exceptions in such cases)
Are you using a managed service?
g
Yes. I'm using Google Cloud SQL
Do you have any guide for setting up the connection pool?
Besides that, GMS sends a lot of these log messages. Is this a problem? @orange-night-91387
```
[pool-6-thread-1] WARN org.elasticsearch.client.RestClient:65 - request [POST http://...:9200/datahubpolicyindex_v2/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&ignore_throttled=true&search_type=query_then_fetch&batched_reduce_size=512&ccs_minimize_roundtrips=true] returned 1 warnings: [299 Elasticsearch-7.17.3-5ad023604c8d7416c9eb6c0eadb62b14e766caff "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]
```
o
No, that one's not an issue. Elasticsearch assumes you're using frozen indices because of a parameter being sent in the request. It happens on newer versions of ES because they've deprecated that parameter. The param should get removed from our requests eventually, but we're not actually using frozen indices.
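On the connection pool question from earlier: I'm not aware of a dedicated guide, but GMS reads its Ebean datasource and pool settings from environment variables, so you can override them through the GMS values in the Helm chart. The variable names below are my best recollection of what recent GMS versions expose, the numbers are placeholders, and it assumes your chart version supports extraEnvs for the GMS component, so double-check all of this against your chart and GMS version:

```yaml
# Sketch only: env var names and values are assumptions; verify against
# your datahub-gms version and chart before using.
datahub-gms:
  extraEnvs:
    - name: EBEAN_MIN_CONNECTIONS
      value: "2"
    - name: EBEAN_MAX_CONNECTIONS
      value: "50"
    - name: EBEAN_WAIT_TIMEOUT_MILLIS
      value: "1000"
    - name: EBEAN_MAX_INACTIVE_TIME_IN_SECS
      value: "120"
```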