# troubleshooting
j
Good morning (in Seattle) folks. I wanted some help troubleshooting a Pinot (0.6.0) upsert table. For context:
1. This table was deployed to our staging environment and production environment with the exact same schema and tablespec. It works fine in staging streaming junk data, but not so much in production on real data.
2. Retention time is 10 days.
3. After periods of idleness, we are seeing cases where the production instance returns no data. Try again 10 minutes later and everything is fine.
4. Querying for the age of the newest record, it’s about 2 minutes old in production, which seems right (a sketch of that check follows below).
5. Some observations I noticed:
   a. Our time column (processed_at) is not the same as our sorted column index (created_at_seconds).
   b. We are on Pinot 0.6.0 (old bug?).
   c. We have only two upsert tables like this on the cluster, providing different views of the data.
   d. The cluster is resourced for “testing.”
Does Pinot evict idle tables out of memory? Could it be slow to reload them because of the index? Is it the resources? Is there a known bug I’m hitting? cc: @Elon @Xiang Fu @Neha Pawar
FYI: @Chundong Wang @Lakshmanan Velusamy
👍 1
I’ve reproduced this behavior twice. Yesterday upon creation of the tables. And today, having left them idle for the last 14 hours.
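A minimal sketch of the freshness check mentioned in point 4 above, assuming the broker's SQL endpoint is given a JSON body with an "sql" field and that processed_at is epoch milliseconds; the table name is illustrative, copied from the segmentsConfig shared later in the thread:
{
  "sql": "SELECT MAX(processed_at) AS newest_processed_at FROM enriched_station_orders_v1_14_rt_upsert_v2_0 LIMIT 1"
}
Comparing newest_processed_at against the current epoch milliseconds gives the roughly two-minute lag described above.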
x
For the idle table, are your queries timing out?
j
no error, just no results in the query ui
ran a size() op through the swagger and got a bunch of ‘-1’ on the segments and such
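That size check corresponds to the controller's table size endpoint in the Swagger UI; a -1 there generally means the server did not report a size for a segment, for example because it was unreachable. A rough, illustrative sketch of that shape (field names are an assumption, not copied from the actual response):
{
  "tableName": "enriched_station_orders_v1_14_rt_upsert_v2_0",
  "reportedSizeInBytes": -1,
  "estimatedSizeInBytes": -1
}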
x
How long did you wait for the query response?
@Jackie might have some more insights
e
Could this be due to a direct memory OOM? You can find out by looking at the server logs.
x
Leaving it idle should be fine
My feeling is the server got restarted as well
j
before, it took about 10 seconds and then I would get no results; eventually it took a little less time and returned results, then results became fast.
getting 0 results again now
x
10 sec is some internal default timeout
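If the 10-second default is the limiter, a longer timeout can be requested per query. A sketch assuming the broker request accepts a queryOptions string and that the option is named timeoutMs (worth verifying against the 0.6.0 docs):
{
  "sql": "SELECT COUNT(*) FROM enriched_station_orders_v1_14_rt_upsert_v2_0",
  "queryOptions": "timeoutMs=60000"
}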
j
and then results again…
j
How long have you been running this table? Has any segment passed the 10-day retention?
j
looks from the logs like there was a server restart about 4 minutes ago
the table is a day old
j
If you have only one replica, then server restart will cause data loss
j
The oldest data from the stream is around 10 days old.
"segmentsConfig": {
  "schemaName": "enriched_station_orders_v1_14_rt_upsert_v2_0",
  "retentionTimeUnit": "DAYS",
  "retentionTimeValue": "10",
  "timeColumnName": "processed_at",
  "timeType": "MILLISECONDS",
  "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
  "segmentPushFrequency": "daily",
  "segmentPushType": "APPEND",
  "replicasPerPartition": "3"
},
Replicas look like 3
e
When did it last occur?
was this on the staging or production cluster?
j
prod
j
Was the server restarted normally, or just killed somehow?
j
10:15am @Elon on server 0
I didn’t request the restart, if that’s what you’re asking. I’m not seeing anything in the Kubernetes log prior to the restart. Let me check the other servers.
e
You can check the logs for the servers in Kibana; I'm seeing this:
java.lang.RuntimeException: Inconsistent data read. Index data file /var/pinot/server/data/index/enriched_customer_orders_v1_14_rt_upsert_v2_0_REALTIME/enriched_customer_orders_v1_14_rt_upsert_v2_0__8__4__20210618T0857Z/v3/columns.psf is possibly corrupted
Today at 10:18am
You can ignore the "Cannot find classloader for class" errors - that happens when the server starts and will be fixed in an upcoming PR.
🙌 1
j
Found the error on server-2
e
data read error?
j
This error is logged when the magic marker validation fails, which means the data file is corrupted somehow
👍 1
Probably because of some hard failure during segment creation
Restarting the server should try to download a new copy from the deep storage
j
Is this an area where stability fixes were made in 0.7.1?
j
AFAIK no. This error should be able to auto-recover though
Can you please provide the query stats for the empty response?
j
how do I get those?
Also, right now our sorted column index is not on the same column as our time column. Will this cause performance degradation for the queries on the upserted data?
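For reference, the sorted column lives in tableIndexConfig, separate from the time column declared in segmentsConfig. A sketch of the relevant fragment, assuming the standard sortedColumn list (values illustrative, not copied from the actual tablespec):
"tableIndexConfig": {
  "sortedColumn": [
    "created_at_seconds"
  ],
  "loadMode": "MMAP"
},
A sorted index mainly helps filters on created_at_seconds itself; filters on other columns, such as processed_at, would not benefit from it.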
e
Would have to test that as well - depends on the queries
j
just a normal select *
@Jackie We’re intermittently getting the error: [ { "message": "ServerTableMissing:\nFailed to find table: enriched_station_orders_v1_14_rt_upsert_v2_1_REALTIME", "errorCode": 230 } ]
j
If you are using the query console, you can show the JSON response which should have the query stats inside
The ServerTableMissing error is not normal. Does it happen when the server is restarted unintentionally?
j
@Jackie how do I show the json?
nvm, i see it
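For reference, the stats portion of the broker JSON response looks roughly like the sketch below (field names follow Pinot's standard broker response; values are illustrative). For an empty result, numServersResponded coming back lower than numServersQueried would point at the restarting server:
{
  "exceptions": [],
  "numServersQueried": 3,
  "numServersResponded": 2,
  "numSegmentsQueried": 24,
  "numSegmentsProcessed": 24,
  "numSegmentsMatched": 0,
  "numDocsScanned": 0,
  "totalDocs": 1250000,
  "timeUsedMs": 45
}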
We are seeing this error, but not sure if it’s related:
@timestamp: Jun 18, 2021 @ 13:52:47.195 -07:00
_id: w3HlIHoB6R61qWfdxh39
_index: logging-production-us-central1:.k8s-container-logs-001288
_score: -  
_type: _doc
kubernetes.cluster_name: data-cluster
kubernetes.cluster_region: us-central1
kubernetes.container_name: server
kubernetes.labels.app: pinot
kubernetes.namespace_name: pinot-dev
kubernetes.pod_name: pinot-upsert-server-zonal-2
payload.text: Terminating due to java.lang.OutOfMemoryError: Java heap space
I’m much more curious about this because it seems to happen with regularity.
x
Terminating due to java.lang.OutOfMemoryError: Java heap space
it’s oom
e
Should be fine, scaled up.