# troubleshooting
j
Good morning (in Seattle) folks. I wanted some help troubleshooting a Pinot (0.6.0) upsert table. For context:
1. This table was deployed to our staging environment and production environment with the exact same schema and tablespec. It works fine in staging streaming junk data, but not so much in production on real data.
2. Retention time is 10 days.
3. After periods of idleness, we are seeing cases where the production instance returns no data. Try again 10 minutes later and everything is fine.
4. Querying for the age of the newest record, it’s about 2 minutes old in production, which seems right (a sketch of that check follows below).
5. Some observations I noticed:
   a. Our time column (processed_at) is not the same as our sorted column index (created_at_seconds).
   b. We are on Pinot 0.6.0 (old bug?).
   c. We have only two upsert tables like this on the cluster, providing different views of the data.
   d. The cluster is resourced for “testing.”
Does Pinot evict idle tables out of memory? Could it be slow to reload them because of the index? Is it the resources? Is there a known bug I’m hitting? cc: @Elon @Xiang Fu @Neha Pawar
FYI: @Chundong Wang @Lakshmanan Velusamy
👍 1
I’ve reproduced this behavior twice. Yesterday upon creation of the tables. And today, having left them idle for the last 14 hours.
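A minimal sketch of the freshness check mentioned in point 4 above, assuming the broker's SQL endpoint is given a JSON body with an "sql" field and that processed_at is epoch milliseconds; the table name is illustrative, copied from the segmentsConfig shared later in the thread:
{
  "sql": "SELECT MAX(processed_at) AS newest_processed_at FROM enriched_station_orders_v1_14_rt_upsert_v2_0 LIMIT 1"
}
Comparing newest_processed_at against the current epoch milliseconds gives the roughly two-minute lag described above.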
x
For the idle table, are your queries timing out?
j
no error, just no results in the query ui
ran a size() op through the swagger and got a bunch of ‘-1’ on the segments and such
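That size check corresponds to the controller's table size endpoint in the Swagger UI; a -1 there generally means the server did not report a size for a segment, for example because it was unreachable. A rough, illustrative sketch of that shape (field names are an assumption, not copied from the actual response):
{
  "tableName": "enriched_station_orders_v1_14_rt_upsert_v2_0",
  "reportedSizeInBytes": -1,
  "estimatedSizeInBytes": -1
}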
x
How long did you wait for the query response?
@Jackie might have some more insights
e
Could this be due to a direct memory OOM? You can find out by looking at the server logs.
x
Leaving it idle should be fine
My feeling is the server got restarted as well
j
before, it took about 10 seconds and then I would get no results; eventually it took a little less time and returned results, then results became fast.
getting 0 results again now
x
10 sec is some internal default timeout
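If the 10-second default is the limiter, a longer timeout can be requested per query. A sketch assuming the broker request accepts a queryOptions string and that the option is named timeoutMs (worth verifying against the 0.6.0 docs):
{
  "sql": "SELECT COUNT(*) FROM enriched_station_orders_v1_14_rt_upsert_v2_0",
  "queryOptions": "timeoutMs=60000"
}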
j
and then results again…
j
How long have you been running this table? Has any segment passed the 10-day retention?
j
looks from the logs like there was a server restart about 4 minutes ago
the table is a day old
j
If you have only one replica, then server restart will cause data loss
j
The oldest data from the stream is around 10 days old.
"segmentsConfig": {
  "schemaName": "enriched_station_orders_v1_14_rt_upsert_v2_0",
  "retentionTimeUnit": "DAYS",
  "retentionTimeValue": "10",
  "timeColumnName": "processed_at",
  "timeType": "MILLISECONDS",
  "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
  "segmentPushFrequency": "daily",
  "segmentPushType": "APPEND",
  "replicasPerPartition": "3"
},
Replicas look like 3
e
When did it last occur?
was this on the staging or production cluster?
j
prod
j
Was the server restarted normally, or just killed somehow?
j
10:15am @Elon on server 0
I didn’t request the restart, if that’s what you’re asking. I’m not seeing anything in the Kubernetes log prior to the restart. Let me check the other servers.
e
You can check the logs for the servers in Kibana; I'm seeing this:
java.lang.RuntimeException: Inconsistent data read. Index data file /var/pinot/server/data/index/enriched_customer_orders_v1_14_rt_upsert_v2_0_REALTIME/enriched_customer_orders_v1_14_rt_upsert_v2_0__8__4__20210618T0857Z/v3/columns.psf is possibly corrupted
Today at 10:18am
You can ignore the "Cannot find classloader for class" errors - that happens when the server starts and will be fixed in an upcoming PR.
🙌 1
j
Found the error on server-2
e
data read error?
j
This error is logged when the magic marker validation fails, which means the data file is corrupted somehow
👍 1
Probably because of some hard failure during segment creation
Restarting the server should try to download a new copy from the deep storage
j
Is this an area where stability fixes were made in 0.7.1?
j
AFAIK no. This error should be able to auto-recover though
Can you please provide the query stats for the empty response?
j
how do I get those?
Also, right now our sorted column index is not on the same column as our time column. Will this cause performance degradation for the queries on the upserted data?
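For reference, the sorted column lives in tableIndexConfig, separate from the time column declared in segmentsConfig. A sketch of the relevant fragment, assuming the standard sortedColumn list (values illustrative, not copied from the actual tablespec):
"tableIndexConfig": {
  "sortedColumn": [
    "created_at_seconds"
  ],
  "loadMode": "MMAP"
},
A sorted index mainly helps filters on created_at_seconds itself; filters on other columns, such as processed_at, would not benefit from it.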
e
Would have to test that as well - depends on the queries
j
just a normal select *
@Jackie We’re intermittently getting the error: [ { "message": "ServerTableMissing:\nFailed to find table: enriched_station_orders_v1_14_rt_upsert_v2_1_REALTIME", "errorCode": 230 } ]
j
If you are using the query console, you can show the JSON response which should have the query stats inside
The ServerTableMissing error is not normal. Does it happen when the server is restarted unintentionally?
j
@Jackie how do I show the json?
nvm, i see it
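For reference, the stats portion of the broker JSON response looks roughly like the sketch below (field names follow Pinot's standard broker response; values are illustrative). For an empty result, numServersResponded coming back lower than numServersQueried would point at the restarting server:
{
  "exceptions": [],
  "numServersQueried": 3,
  "numServersResponded": 2,
  "numSegmentsQueried": 24,
  "numSegmentsProcessed": 24,
  "numSegmentsMatched": 0,
  "numDocsScanned": 0,
  "totalDocs": 1250000,
  "timeUsedMs": 45
}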
We are seeing this error, but not sure if it’s related:
@timestamp: Jun 18, 2021 @ 13:52:47.195 -07:00
_id: w3HlIHoB6R61qWfdxh39
_index: logging-production-us-central1:.k8s-container-logs-001288
_score: -  
_type: _doc
kubernetes.cluster_name: data-cluster
kubernetes.cluster_region: us-central1
kubernetes.container_name: server
kubernetes.labels.app: pinot
kubernetes.namespace_name: pinot-dev
kubernetes.pod_name: pinot-upsert-server-zonal-2
payload.text: Terminating due to java.lang.OutOfMemoryError: Java heap space
I’m much more curious about this because it seems to happen with regularity.
x
Terminating due to java.lang.OutOfMemoryError: Java heap space
it’s oom
e
Should be fine, scaled up.