Laxman Ch
03/25/2021, 3:57 PMJackie
03/25/2021, 5:56 PMLaxman Ch
03/25/2021, 6:02 PMJackie
03/25/2021, 6:03 PMERROR
log with logger name LLRealtimeSegmentDataManager_<segmentName>
Laxman Ch
03/25/2021, 6:17 PMXiang Fu
Laxman Ch
03/25/2021, 6:42 PMXiang Fu
Laxman Ch
03/26/2021, 1:08 PMLaxman Ch
03/26/2021, 1:09 PMAt pinot bootstrap, should we register a default uncaught exception handler which just logs them with full stack trace.@Xiang Fu: Should we have one?
Laxman Ch
03/31/2021, 11:29 AM2021/03/31 07:31:39.825 INFO [LLRealtimeSegmentDataManager_spanEventView__27__2251__20210331T0731Z] [HelixTaskExecutor-message_handle_thread] RealtimeDataResourceZKMetadata contains no information about sorted column for segment spanEventView__27__2251__20210331T0731Z
• External view of the table also doesn’t show this server for this specific segment
• Issue gets fixed after restart of problematic pinot-serverLaxman Ch
03/31/2021, 1:08 PM2021/03/31 07:31:37.786 INFO [PinotLLCRealtimeSegmentManager] [grizzly-http-server-1] Creating segment ZK metadata for new CONSUMING segment: spanEventView__27__2251__20210331T0731Z with start offset: 166204784 and creation time: 1617175897785
2021/03/31 07:31:37.786 INFO [SegmentSizeBasedFlushThresholdUpdater] [grizzly-http-server-1] spanEventView__27__2251__20210331T0731Z: Data from committing segment: Time 1h32m3s numRows 182331 threshold 182331 segmentSize(bytes) 326256819
2021/03/31 07:31:37.786 INFO [SegmentSizeBasedFlushThresholdUpdater] [grizzly-http-server-1] Committing segment size 326256819, current ratio 4.443249882066186E-4, setting threshold for spanEventView__27__2251__20210331T0731Z as 186363
2021/03/31 07:31:37.786 INFO [PinotLLCRealtimeSegmentManager] [grizzly-http-server-1] Persisting segment ZK metadata for segment: spanEventView__27__2251__20210331T0731Z
2021/03/31 07:31:37.854 INFO [RealtimeSegmentAssignment] [grizzly-http-server-1] Assigning segment: spanEventView__27__2251__20210331T0731Z with instance partitions: {""instancePartitionsName"":""spanEventView_CONSUMING"",""partitionToInstancesMap"":{""0_0"":[""Server_pinot-server-3.pinot-server.traceable.svc.cluster.local_8098"",""Server_pinot-server-0.pinot-server.traceable.svc.cluster.local_8098"",""Server_pinot-server-1.pinot-server.traceable.svc.cluster.local_8098"",""Server_pinot-server-2.pinot-server.traceable.svc.cluster.local_8098""]}} for table: spanEventView_REALTIME
2021/03/31 07:31:37.854 INFO [RealtimeSegmentAssignment] [grizzly-http-server-1] Assigned segment: spanEventView__27__2251__20210331T0731Z to instances: [Server_pinot-server-1.pinot-server.traceable.svc.cluster.local_8098, Server_pinot-server-2.pinot-server.traceable.svc.cluster.local_8098] for table: spanEventView_REALTIME
2021/03/31 07:31:37.854 INFO [PinotLLCRealtimeSegmentManager] [grizzly-http-server-1] Adding new CONSUMING segment: spanEventView__27__2251__20210331T0731Z to instances: [Server_pinot-server-1.pinot-server.traceable.svc.cluster.local_8098, Server_pinot-server-2.pinot-server.traceable.svc.cluster.local_8098]
2021/03/31 07:31:39.689 INFO [MessageDispatchStage] [HelixController-pipeline-default-traceable-views-(d7a6546b_DEFAULT)] Event d7a6546b_DEFAULT : Sending Message 6d73e84e-fa4d-4987-b026-7cb93397b7a9 to Server_pinot-server-2.pinot-server.traceable.svc.cluster.local_8098 transit spanEventView_REALTIME.spanEventView__27__2251__20210331T0731Z|[] from:OFFLINE to:CONSUMING, relayMessages: 0
2021/03/31 07:31:39.689 INFO [MessageDispatchStage] [HelixController-pipeline-default-traceable-views-(d7a6546b_DEFAULT)] Event d7a6546b_DEFAULT : Sending Message 803e30a5-3c3b-4cb2-9f2d-065ee9a78404 to Server_pinot-server-1.pinot-server.traceable.svc.cluster.local_8098 transit spanEventView_REALTIME.spanEventView__27__2251__20210331T0731Z|[] from:OFFLINE to:CONSUMING, relayMessages: 0
2021/03/31 07:31:40.937 INFO [MessageGenerationPhase] [HelixController-pipeline-default-traceable-views-(f69e3778_DEFAULT)] Event f69e3778_DEFAULT : Message already exists for Server_pinot-server-2.pinot-server.traceable.svc.cluster.local_8098 to transit spanEventView_REALTIME.spanEventView__27__2251__20210331T0731Z from OFFLINE to CONSUMING, isRelay: false
2021/03/31 07:31:40.937 INFO [MessageGenerationPhase] [HelixController-pipeline-default-traceable-views-(f69e3778_DEFAULT)] Event f69e3778_DEFAULT : Message already exists for Server_pinot-server-1.pinot-server.traceable.svc.cluster.local_8098 to transit spanEventView_REALTIME.spanEventView__27__2251__20210331T0731Z from OFFLINE to CONSUMING, isRelay: false
2021/03/31 07:31:41.892 INFO [MessageGenerationPhase] [HelixController-pipeline-default-traceable-views-(1de6a604_DEFAULT)] Event 1de6a604_DEFAULT : Message already exists for Server_pinot-server-2.pinot-server.traceable.svc.cluster.local_8098 to transit spanEventView_REALTIME.spanEventView__27__2251__20210331T0731Z from OFFLINE to CONSUMING, isRelay: false
Laxman Ch
03/31/2021, 1:09 PM2021/03/31 07:31:39.730 INFO [HelixTaskExecutor] [ZkClient-EventThread-22-zookeeper-0.zookeeper-headless.traceable.svc.cluster.local:2181,zookeeper-1.zookeeper-headless.traceable.svc.cluster.local:2181,zookeeper-2.zookeeper-headless.traceable.svc.cluster.local:2181/pinot] Scheduling message 6d73e84e-fa4d-4987-b026-7cb93397b7a9: spanEventView_REALTIME:spanEventView__27__2251__20210331T0731Z, OFFLINE->CONSUMING
2021/03/31 07:31:39.730 INFO [HelixStateTransitionHandler] [HelixTaskExecutor-message_handle_thread] handling message: 6d73e84e-fa4d-4987-b026-7cb93397b7a9 transit spanEventView_REALTIME.spanEventView__27__2251__20210331T0731Z|[] from:OFFLINE to:CONSUMING, relayedFrom: null
2021/03/31 07:31:39.811 INFO [HelixStateTransitionHandler] [HelixTaskExecutor-message_handle_thread] Instance Server_pinot-server-2.pinot-server.traceable.svc.cluster.local_8098, partition spanEventView__27__2251__20210331T0731Z received state transition from OFFLINE to CONSUMING on session 377be7c4e47028e, message id: 6d73e84e-fa4d-4987-b026-7cb93397b7a9
2021/03/31 07:31:39.811 INFO [SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel] [HelixTaskExecutor-message_handle_thread] SegmentOnlineOfflineStateModel.onBecomeConsumingFromOffline() : ZnRecord=6d73e84e-fa4d-4987-b026-7cb93397b7a9, {CREATE_TIMESTAMP=1617175899403, ClusterEventName=MessageChange, EXECUTE_START_TIMESTAMP=1617175899730, EXE_SESSION_ID=377be7c4e47028e, FROM_STATE=OFFLINE, MSG_ID=6d73e84e-fa4d-4987-b026-7cb93397b7a9, MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=spanEventView__27__2251__20210331T0731Z, READ_TIMESTAMP=1617175899722, RESOURCE_NAME=spanEventView_REALTIME, RESOURCE_TAG=spanEventView_REALTIME, SRC_NAME=pinot-controller-1.pinot-controller.traceable.svc.cluster.local_9000, SRC_SESSION_ID=177be7a05650257, STATE_MODEL_DEF=SegmentOnlineOfflineStateModel, STATE_MODEL_FACTORY_NAME=DEFAULT, TGT_NAME=Server_pinot-server-2.pinot-server.traceable.svc.cluster.local_8098, TGT_SESSION_ID=377be7c4e47028e, TO_STATE=CONSUMING}{}{}, Stat=Stat {_version=0, _creationTime=1617175899694, _modifiedTime=1617175899694, _ephemeralOwner=0}
2021/03/31 07:31:39.811 INFO [SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel] [HelixTaskExecutor-message_handle_thread] SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline() : ZnRecord=6d73e84e-fa4d-4987-b026-7cb93397b7a9, {CREATE_TIMESTAMP=1617175899403, ClusterEventName=MessageChange, EXECUTE_START_TIMESTAMP=1617175899730, EXE_SESSION_ID=377be7c4e47028e, FROM_STATE=OFFLINE, MSG_ID=6d73e84e-fa4d-4987-b026-7cb93397b7a9, MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=spanEventView__27__2251__20210331T0731Z, READ_TIMESTAMP=1617175899722, RESOURCE_NAME=spanEventView_REALTIME, RESOURCE_TAG=spanEventView_REALTIME, SRC_NAME=pinot-controller-1.pinot-controller.traceable.svc.cluster.local_9000, SRC_SESSION_ID=177be7a05650257, STATE_MODEL_DEF=SegmentOnlineOfflineStateModel, STATE_MODEL_FACTORY_NAME=DEFAULT, TGT_NAME=Server_pinot-server-2.pinot-server.traceable.svc.cluster.local_8098, TGT_SESSION_ID=377be7c4e47028e, TO_STATE=CONSUMING}{}{}, Stat=Stat {_version=0, _creationTime=1617175899694, _modifiedTime=1617175899694, _ephemeralOwner=0}
2021/03/31 07:31:39.811 INFO [HelixInstanceDataManager] [HelixTaskExecutor-message_handle_thread] Adding segment: spanEventView__27__2251__20210331T0731Z to table: spanEventView_REALTIME
2021/03/31 07:31:39.825 INFO [LLRealtimeSegmentDataManager_spanEventView__27__2251__20210331T0731Z] [HelixTaskExecutor-message_handle_thread] RealtimeDataResourceZKMetadata contains no information about sorted column for segment spanEventView__27__2251__20210331T0731Z
Laxman Ch
03/31/2021, 2:38 PMLaxman Ch
03/31/2021, 2:44 PMXiang Fu
Xiang Fu
Laxman Ch
03/31/2021, 6:48 PMsegment is still in consuming status in idealstatesI checked external view and the server is missing in the list. I didn’t check the ideal state. Its happening every few hours for us. Will check next time provide details.
Laxman Ch
03/31/2021, 6:49 PM2021/03/31 07:31:39.825 INFO [LLRealtimeSegmentDataManager_spanEventView__27__2251__20210331T0731Z] [HelixTaskExecutor-message_handle_thread] RealtimeDataResourceZKMetadata contains no information about sorted column for segment spanEventView__27__2251__20210331T0731Z
This is the last log line from the segment consumption thread before it dies.Mayank
Mayank
Laxman Ch
03/31/2021, 7:04 PMMayank
1. Once in a while some servers stop consuming.
2. Segment is in consuming in ideal-state
3. Server is missing from external view.
4. Logs seem to suggest that an attempt commit the segment may have caused this?
Laxman Ch
03/31/2021, 7:27 PM1. Once in a while some servers stop consuming some of the assigned partitions (Say 2 out of 16 partitions of a table) stops consuming from Kafka.
2. Segment is in consuming in ideal-state (DON'T KNOW. WILL PROVIDE THESE DETAILS WHEN IT HAPPENS AGAIN)
3. Server is missing from external view ONLY for the segments where the consumption thread dies.
4. Logs seem to suggest that an attempt commit an OLD segment and start a NEW segment may have caused this.
Laxman Ch
03/31/2021, 7:31 PM"spanEventView__27__2251__20210331T0731Z" : {
"Server_pinot-server-1.pinot-server.traceable.svc.cluster.local_8098" : "CONSUMING"
}
External view of the segment where consumption thread is running fine on pinot-server-2.
"spanEventView__13__2255__20210331T0620Z" : {
"Server_pinot-server-1.pinot-server.traceable.svc.cluster.local_8098" : "CONSUMING",
"Server_pinot-server-2.pinot-server.traceable.svc.cluster.local_8098" : "CONSUMING"
}
Xiang Fu
Xiang Fu
Laxman Ch
03/31/2021, 7:43 PMLaxman Ch
04/01/2021, 5:46 AM2. Segment is in consuming in ideal-state (DON’T KNOW. WILL PROVIDE THESE DETAILS WHEN IT HAPPENS AGAIN)Yes. All segments are in CONSUMING in ideal state. But they are not there in external view. cc: @Mayank @Xiang Fu
Jackie
04/01/2021, 5:42 PMspanEventView__13__2255__*
is the new consuming segment, what's the state of spanEventView__13__2254___*
?Jackie
04/01/2021, 5:43 PMLaxman Ch
04/14/2021, 9:17 AM"HelixTaskExecutor-message_handle_thread" #51 daemon prio=5 os_prio=0 cpu=70457.28ms elapsed=4885.80s tid=0x00007fe4e43d6000 nid=0x6e waiting for monitor entry [0x00007fe4aa6e5000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.codehaus.groovy.reflection.ClassInfo$GlobalClassSet.add(ClassInfo.java:477)
- waiting to lock <0x0000000702ed2218> (a org.codehaus.groovy.util.ManagedLinkedList)
at org.codehaus.groovy.reflection.ClassInfo$1.computeValue(ClassInfo.java:83)
at org.codehaus.groovy.reflection.ClassInfo$1.computeValue(ClassInfo.java:79)
at org.codehaus.groovy.reflection.GroovyClassValuePreJava7$EntryWithValue.<init>(GroovyClassValuePreJava7.java:37)
at org.codehaus.groovy.reflection.GroovyClassValuePreJava7$GroovyClassValuePreJava7Segment.createEntry(GroovyClassValuePreJava7.java:64)
at org.codehaus.groovy.reflection.GroovyClassValuePreJava7$GroovyClassValuePreJava7Segment.createEntry(GroovyClassValuePreJava7.java:55)
at org.codehaus.groovy.util.AbstractConcurrentMap$Segment.put(AbstractConcurrentMap.java:120)
at org.codehaus.groovy.util.AbstractConcurrentMap$Segment.getOrPut(AbstractConcurrentMap.java:100)
at org.codehaus.groovy.util.AbstractConcurrentMap.getOrPut(AbstractConcurrentMap.java:38)
at org.codehaus.groovy.reflection.GroovyClassValuePreJava7.get(GroovyClassValuePreJava7.java:94)
at org.codehaus.groovy.reflection.ClassInfo.getClassInfo(ClassInfo.java:144)
at org.codehaus.groovy.runtime.metaclass.MetaClassRegistryImpl.getMetaClass(MetaClassRegistryImpl.java:258)
at org.codehaus.groovy.runtime.InvokerHelper.getMetaClass(InvokerHelper.java:883)
at groovy.lang.GroovyObjectSupport.<init>(GroovyObjectSupport.java:34)
at groovy.lang.Script.<init>(Script.java:42)
at Script1.<init>(Script1.groovy)
at jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(java.base@11.0.10/Native Method)
at jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(java.base@11.0.10/NativeConstructorAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(java.base@11.0.10/DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(java.base@11.0.10/Constructor.java:490)
at org.codehaus.groovy.runtime.InvokerHelper.createScript(InvokerHelper.java:431)
at groovy.lang.GroovyShell.parse(GroovyShell.java:700)
at groovy.lang.GroovyShell.parse(GroovyShell.java:736)
at groovy.lang.GroovyShell.parse(GroovyShell.java:727)
at org.apache.pinot.core.data.function.GroovyFunctionEvaluator.<init>(GroovyFunctionEvaluator.java:73)
at org.apache.pinot.core.data.function.FunctionEvaluatorFactory.getExpressionEvaluator(FunctionEvaluatorFactory.java:91)
at org.apache.pinot.core.data.function.FunctionEvaluatorFactory.getExpressionEvaluator(FunctionEvaluatorFactory.java:78)
at org.apache.pinot.core.util.IngestionUtils.extractFieldsFromSchema(IngestionUtils.java:61)
at org.apache.pinot.core.util.IngestionUtils.getFieldsForRecordExtractor(IngestionUtils.java:50)
at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.<init>(LLRealtimeSegmentDataManager.java:1186)
at org.apache.pinot.core.data.manager.realtime.RealtimeTableDataManager.addSegment(RealtimeTableDataManager.java:314)
at org.apache.pinot.server.starter.helix.HelixInstanceDataManager.addRealtimeSegment(HelixInstanceDataManager.java:133)
at org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline(SegmentOnlineOfflineStateModelFactory.java:164)
at org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeConsumingFromOffline(SegmentOnlineOfflineStateModelFactory.java:88)
at jdk.internal.reflect.GeneratedMethodAccessor723.invoke(Unknown Source)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@11.0.10/DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(java.base@11.0.10/Method.java:566)
at org.apache.helix.messaging.handling.HelixStateTransitionHandler.invoke(HelixStateTransitionHandler.java:404)
at org.apache.helix.messaging.handling.HelixStateTransitionHandler.handleMessage(HelixStateTransitionHandler.java:331)
- locked <0x000000072defa200> (a org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel)
at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:97)
at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:49)
at java.util.concurrent.FutureTask.run(java.base@11.0.10/FutureTask.java:264)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.10/ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.10/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base@11.0.10/Thread.java:834)
Laxman Ch
04/14/2021, 9:20 AMXiang Fu
Xiang Fu
Laxman Ch
04/14/2021, 10:15 AMLaxman Ch
04/14/2021, 10:16 AMLaxman Ch
04/14/2021, 9:49 PM-Dgroovy.use.classvalue=true
JVM argument as suggested in GROOVY-8067. Workaround seems to be working fine. No occurrence in last 12 hours run (Incident used to occur every 2-3 hours earlier). Will continue to use this as workaround and upgrade to 0.7.1 as needed.
@Xiang Fu @Mayank: fyiMayank
Mayank
Xiang Fu
Suraj
04/15/2021, 12:44 AMLaxman Ch
04/15/2021, 3:22 AMworkaround is not recommended as it could lead to memory leaks• GROOVY-7591 is fixed in 2.4.5. We are on 2.4.8 • This is actually due to a JDK bug and that got fixed in JDK 9
Xiang Fu
Ravi Singal
04/15/2021, 4:16 AMLaxman Ch
04/15/2021, 4:31 AMLaxman Ch
04/15/2021, 6:20 AMXiang Fu
Laxman Ch
04/15/2021, 7:09 AMMayank
Xiang Fu
Mayank
Laxman Ch
04/15/2021, 5:51 PMMayank