Apache Pinot #troubleshooting

Venkat Boina(VB)

05/13/2023, 7:12 PM

@Elon Does passthrough support GROUP BY ROLLUP? I am using version 0.12 and getting an exception because it does not recognise the field declared inside the rollup.

Chris Han

05/15/2023, 7:47 PM

I'm trying to update an `IDEAL STATE` for a table in Zookeeper. The `IDEAL STATE` JSON I need to update is over 769,000 characters long (there are over 8000 segments), and when I try to update it I'm receiving a `Bad Request` response, presumably because the request data to Zookeeper is too long. I need to manually update the `DEAD` server IPs with `ALIVE` server IPs. I have over 8000 of these entries:

...
    "table_OFFLINE_8697": {
      "Server_10.193.7.135_8098": "ONLINE"
    },
    "table_OFFLINE_8698": {
      "Server_10.193.7.135_8098": "ONLINE"
    },
...

Is there a way I can iteratively update the `IDEAL STATE` that doesn't require me to upload the entire document? Is there another way I can "migrate" the segments from one server to another within the Zookeeper configs?
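For the 8000 repetitive replacements themselves, a small script can rewrite the segment-to-instance map locally. This is only a minimal sketch: it assumes the map has been exported as JSON in the shape shown in the snippet above, and the function name and the replacement address `Server_10.0.0.1_8098` are made up for illustration. It does not address the size limit on the upload.

```python
import json


def reassign_segments(segment_map, dead_instance, alive_instance):
    """Return (new_map, changed_count) with dead_instance renamed to alive_instance.

    segment_map has the shape {segment_name: {instance_name: state}}.
    The input is not modified.
    """
    updated = {}
    changed = 0
    for segment, instances in segment_map.items():
        new_instances = {}
        for instance, state in instances.items():
            if instance == dead_instance:
                # Keep the state, swap only the instance name.
                new_instances[alive_instance] = state
                changed += 1
            else:
                new_instances[instance] = state
        updated[segment] = new_instances
    return updated, changed


# Two entries copied from the message above; the new address is hypothetical.
segment_map = {
    "table_OFFLINE_8697": {"Server_10.193.7.135_8098": "ONLINE"},
    "table_OFFLINE_8698": {"Server_10.193.7.135_8098": "ONLINE"},
}
updated, changed = reassign_segments(
    segment_map, "Server_10.193.7.135_8098", "Server_10.0.0.1_8098"
)
print(json.dumps(updated, indent=2))
print(f"{changed} entries rewritten")
```

If a segment already lists the alive instance alongside the dead one, this sketch overwrites that entry, so such cases would need a check first.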

Ethan Huang

05/16/2023, 3:27 AM

Hello team, I am trying to add a range index on an existing column without a dictionary index, and I got the exception shown in the image. After reading the code, I found that Pinot allows creating a range index for no-dictionary columns (`DefaultIndexCreatorProvider#newRangeIndexCreator`, `RangeIndexHandler#handleNonDictionaryBasedColumn`). However, the `BitSlicedRangeIndexCreator` relies on the min and max value of the indexing column, but the `minValue` and `maxValue` are both `null` in `ColumnMetadata` when the column has no dictionary. Is it a bug, or is additional configuration needed to avoid this exception? BTW, the version is the 0.12.1 release. Thanks.

Venkat Boina(VB)

05/16/2023, 7:44 AM

@channel Does passthrough support GROUP BY ROLLUP? I am using version 0.12 and getting an exception because it does not recognise the field declared inside the rollup. @Elon or @Mayank

Lee Wei Hern Jason

05/16/2023, 8:44 AM

Hi Team, is it possible to get the auth token from environment variables in extra configs? I tried using ${} but the value didn't get assigned in the pod.

envFrom:
    - secretRef:
        name: pinot-secrets

    extra:
      configs: |-
        pinot.set.instance.id.to.hostname=true
        pinot.minion.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
        pinot.minion.storage.factory.s3.region=ap-southeast-1
        pinot.minion.segment.fetcher.protocols=file,http,s3
        pinot.minion.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
        segment.fetcher.auth.token=${PINOT_SEGMENT_FETCHER_AUTH_TOKEN}
        task.auth.token=${PINOT_SEGMENT_FETCHER_AUTH_TOKEN}
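One way to rule out the placeholder syntax itself is to render the same lines outside the pod before they are handed to the chart. The following is a generic sketch using Python's `string.Template`; it is not how the Helm chart or Pinot expands variables, and `render_config` and the sample token value are made up for illustration.

```python
import os
from string import Template


def render_config(text, env=None):
    """Substitute ${NAME} placeholders from env.

    Raises KeyError if a referenced variable is missing, which makes
    an unset secret visible instead of leaving the placeholder in place.
    """
    env = os.environ if env is None else env
    return Template(text).substitute(env)


config = (
    "segment.fetcher.auth.token=${PINOT_SEGMENT_FETCHER_AUTH_TOKEN}\n"
    "task.auth.token=${PINOT_SEGMENT_FETCHER_AUTH_TOKEN}\n"
)
# A dummy value stands in for the secret here.
rendered = render_config(config, {"PINOT_SEGMENT_FETCHER_AUTH_TOKEN": "example-token"})
print(rendered)
```

A literal `$` elsewhere in the config text would have to be written as `$$` for `Template` to leave it alone.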

Lvszn Peng

05/16/2023, 12:19 PM

Hi team, when I upgraded Pinot from 0.9.3 to 0.12.1, the pinot-server showed me an error: `Exception in thread "main" java.lang.NoSuchFieldError: JAVA_11`. Is the Java version too low?

Ehsan Irshad

05/16/2023, 1:21 PM

Hi Team. What are the generic guidelines to fine-tune queries? Is my method below correct? (Here I am not considering the underlying resources, like the number of brokers and servers or node sizes.)

1. Reduce `numSegmentsProcessed` by segment pruning on the broker.
2. Reduce `numEntriesScannedPostFilter` by adding more filters in the query.
3. Because of 2, `numEntriesScannedInFilter` will increase, so make it 0 by adding indexes.
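The three steps above can be sketched as a small check over the counters returned with a query response. This is only an illustration: the function name and the thresholds are hypothetical, not Pinot recommendations, and the sample numbers are made up.

```python
def tuning_hints(stats, max_segments=100, max_post_filter=1_000_000):
    """Return the steps from the list above that the counters suggest revisiting.

    stats is a dict holding the three counters named in the message.
    The thresholds are hypothetical defaults chosen for the example.
    """
    hints = []
    if stats.get("numSegmentsProcessed", 0) > max_segments:
        hints.append("1: many segments processed, check segment pruning")
    if stats.get("numEntriesScannedPostFilter", 0) > max_post_filter:
        hints.append("2: many entries scanned after filtering, tighten the filters")
    if stats.get("numEntriesScannedInFilter", 0) > 0:
        hints.append("3: filter evaluated by scanning, consider an index on the filter columns")
    return hints


# Made-up counters for illustration.
sample = {
    "numSegmentsProcessed": 250,
    "numEntriesScannedInFilter": 1_200_000,
    "numEntriesScannedPostFilter": 4_000,
}
hints = tuning_hints(sample)
for hint in hints:
    print(hint)
```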

Deepak Arumugham

05/16/2023, 1:43 PM

Hi all, we are trying to evaluate a use case of performing full-text queries on our Parquet files (TBs) in GCS buckets. Is Pinot the right solution for our use case? Can we use GCS as our deep storage in Pinot?

Chris Han

05/17/2023, 3:35 PM

I run out of Java heap space when executing queries via the Query Console using v2. Is there guidance on how to appropriately size the heap space? This error is from my server logs

Exception in thread "idle-connection-reaper" java.lang.OutOfMemoryError: Java heap space

Deepak Arumugham

05/17/2023, 10:54 PM

We are trying to ingest Parquet files from GCS buckets. And we are planning to use GCS as our deepstore. We've installed Pinot via helm charts. Our Configmap would look like this

controller.data.dir=<gs://pinot-data-dir>

pinot.controller.storage.factory.class.gs=org.apache.pinot.plugin.filesystem.GcsPinotFS

pinot.controller.segment.fetcher.protocols=file,http,gs

pinot.controller.segment.fetcher.gs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher

Even though we have provided the correct GCS data directory for the controller, the segments are getting created locally on the Pinot cluster's disk, and soon we run into java.lang.OutOfMemoryError: Java heap space. Our Parquet files are in the 50-500 MB range. We are under the impression that on ingestion, the data would be processed and the segments would be created in GCS buckets. Am I missing something here? How can we solve this? Any pointers would be helpful.

Michael Roman Wengle

05/18/2023, 5:34 AM

We face the following issues with the `RealtimeToOfflineSegmentsTask`:

no native library is found for os.name=Linux and os.arch=aarch64
null
java.lang.NullPointerException
	at xerial.larray.impl.LArrayLoader$NativeLib.extractLibraryFile(LArrayLoader.java:182)

[...]

java.lang.UnsatisfiedLinkError: 'long xerial.larray.impl.LArrayNative.mmap(long, int, long, long)'
	at xerial.larray.impl.LArrayNative.mmap(Native Method) ~[pinot-all-0.13.0-SNAPSHOT-jar-with-dependencies.jar:0.13.0-SNAPSHOT-12d86902a84d4bc78b6f2f7bc8bd002659ee61cb]

The minions are deployed on Graviton nodes in k8s (official Pinot Helm chart). Did anyone experience the same problem? Is there a way to solve the issue or do we need to switch to x86 k8s nodes?

no native library is found for os.name Linux and os.arch aarch64.log

Eaugene Thomas

05/18/2023, 7:48 AM

Hi team, do we have any API in the Pinot controller to get the table size before replication? The current table size API gives the total size of the table including replication.

Deena Dhayalan

05/18/2023, 8:31 AM

Hi team, I have a question about how memory is distributed during segment creation (batch ingestion, which we will do the Hadoop way). I need to know how much heap memory (32 GB RAM) is needed for the number of threads I specify in segmentCreationJobParallelism, for a file size of approximately 500 MB. Let's say I have 10 ORC files in my rawdata folder to ingest.

Tommaso Peresson

05/18/2023, 11:48 AM

Hello there, is there a way to set `ConcurrentTasksPerWorker` in the minion config at runtime for a `SegmentGenerationAndPushTask` task? Thanks

Tanmay Varun

05/18/2023, 5:16 PM

One small bug in command documentation, page https://docs.pinot.apache.org/basics/getting-started/kubernetes-quickstart

helm install -n pinot-quickstart kafka kafka/kafka --set replicas=1,zookeeper.image.tag=latest

replicas --> replicaCount

Deepak Arumugham

05/19/2023, 5:51 AM

Team, I used Spark to ingest data and found a strange case of a segment turning to BAD state after ingestion.

Caught Exception in state transition from OFFLINE -> ONLINE for resource

Can you please provide any insights on this? Once the ingestion is complete, the segment goes to BAD state.

And on trying to query, we are getting

{
  "errorCode": 305,
  "message": "null:\n1 segments unavailable: [xyz_OFFLINE_2021-11-16-17_2022-09-21-00_0]"
}

Sanjay

05/19/2023, 1:07 PM

Hi, I am running a `standalone` ingestion and it tries to copy the input files into the `/tmp` directory, which eventually causes a space issue. Is there any parameter to change this to some other `mount` path?

Tommaso Peresson

05/19/2023, 4:04 PM

Hello, I'm trying to optimise Minion ingestion with GCS as deep store. Currently, scheduling `SegmentGenerationAndPushTask` tasks takes minutes and I don't know how to debug and optimise it. I thought it was wildcards in the input format triggering a long scan on GCS (as it is a flat FS), but removing them doesn't help. Can someone please help me with a checklist of things to look for to optimise this process?

J Vossler

05/19/2023, 5:16 PM

We are using Prometheus/Promtail and storing metrics data in Mimir to be displayed in Grafana. Our Pinot javaagent is set up to use port 8008 instead of 8888. Are there any standards for what port to use for metrics scrapes? Or do we just pick anything that is not used?

Raveendra Yerraguntla

05/20/2023, 6:59 PM

Hello. This question is about performance. The query below takes almost 10 seconds on GCP with 3 n2-standard-2 instances. All the fields are string fields except timestamp, which is timestamp indexed. What kind of indexes do I need to build for better performance? I have many more time series queries to be displayed from Superset, but all are timing out. I am looking for guidance on index creation and performance improvement.

SELECT query, product_name, COUNT(*) FROM "default"."clicksTable" WHERE product_name != 'null' GROUP BY product_name, query ORDER BY COUNT(*) DESC LIMIT 10000;

Tanmay Varun

05/20/2023, 10:07 PM

Hi team, one query: will setting this function name in the table config work correctly with Apache Kafka's partitioning logic, assuming the same number of partitions? (Assuming the Apache Kafka default partitioner: MurmurHash(key) % numPartitions.)

"segmentPartitionConfig": {
      "columnPartitionMap": {
          "merchantId": {
            "functionName": "Murmur",
            "numPartitions": 36
          }
      }
    },
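To compare both sides on real keys, the Kafka half of the question can be reproduced offline. The sketch below reimplements the 32-bit murmur2 hash and the `toPositive(hash) % numPartitions` step that Kafka's default partitioner applies to the serialized bytes of a non-null key. The key `merchant-42` and its UTF-8 encoding are assumptions for illustration; the bytes must match whatever the producer's key serializer actually emits. Whether Pinot's `Murmur` function yields the same partition IDs is not asserted here and should be checked against the partitions Pinot reports.

```python
def murmur2(data: bytes) -> int:
    """32-bit MurmurHash2 with Kafka's seed, returned as an unsigned int."""
    mask = 0xFFFFFFFF
    m = 0x5BD1E995
    r = 24
    length = len(data)
    h = (0x9747B28C ^ length) & mask
    # Process the input four bytes at a time, little-endian.
    for i in range(0, length - length % 4, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * m) & mask
        k ^= k >> r
        k = (k * m) & mask
        h = (h * m) & mask
        h ^= k
    # Mix in the 1-3 trailing bytes, if any.
    tail = data[length - length % 4:]
    if tail:
        h ^= int.from_bytes(tail, "little")
        h = (h * m) & mask
    h ^= h >> 13
    h = (h * m) & mask
    h ^= h >> 15
    return h


def kafka_partition(key_bytes: bytes, num_partitions: int) -> int:
    """Partition for a non-null key: clear the sign bit, then take the modulo."""
    return (murmur2(key_bytes) & 0x7FFFFFFF) % num_partitions


# Hypothetical merchantId value, serialized as UTF-8.
partition = kafka_partition("merchant-42".encode("utf-8"), 36)
print(partition)
```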

Ayush Chauhan (Tech)

05/21/2023, 6:09 AM

Can we please add support for PreparedStatement in the go client as we have for the Java client?

Abhijeet Kushe

05/21/2023, 1:35 PM

We are using a realtime table with Kinesis. Yesterday we increased our shards from 2 to 4, but we are seeing only 2 consumers at a time instead of 4. This is our configuration:

Abhijeet Kushe

05/21/2023, 1:35 PM

{
  "tableName": "workflowEvents",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "eventTimestamp",
    "timeType": "MILLISECONDS",
    "schemaName": "workflowEvents",
    "replicasPerPartition": "4",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "1826",
    "segmentPushType": "APPEND"
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kinesis",
      "stream.kinesis.topic.name": "prod-rel-cdp-dl-workflow-metrics-stream",
      "region": "us-east-1",
      "shardIteratorType": "LATEST",
      "stream.kinesis.consumer.type": "lowlevel",
      "stream.kinesis.fetch.timeout.millis": "30000",
      "stream.kinesis.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "stream.kinesis.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kinesis.KinesisConsumerFactory",
      "realtime.segment.flush.threshold.size": "1000000",
      "realtime.segment.flush.threshold.time": "1h"
    }
  },
  "upsertConfig": {
    "mode": "FULL"
  },
  "routing": {
    "instanceSelectorType": "strictReplicaGroup"
  },
  "metadata": {
    "customConfigs": {}
  }
}

Abhijeet Kushe

05/21/2023, 1:36 PM

We also made a server properties change. The default is 1 (https://docs.pinot.apache.org/configuration-reference/server). Is that related?

pinot.server.instance.max.parallel.refresh.threads=3

Sid

05/22/2023, 10:20 AM

Hi Team, I have been trying to implement this Groovy function to transform the timestamp column in events, but somehow it's not getting saved in the table config and throws an error:

"transformConfigs": [
  {
    "columnName": "event_time",
    "transformFunction": "groovy('{\"returnType\":\"TIMESTAMP\", \"isSingleValue\":true}','def truncated = event_timestamp.substring(0, event_timestamp.lastIndexOf('.') + 3);return FromDateTime(truncated, 'yyyy-MM-dd''T''HHmmss.SSS')', event_timestamp)"
  }
]

I updated the Groovy settings in the broker, controller, and server and restarted all of them. Yet the error shows: Groovy Transform function has been disabled. I would appreciate it if any insight could be shared.

Tanmay Varun

05/22/2023, 9:58 PM

Hi team, I switched to a new Kafka cluster midway, and now my Pinot servers are not able to read since they are looking at a higher offset. How do I reset them?

Sonit Rathi

05/23/2023, 4:02 AM

Please help. I added new columns, paused consumption, reloaded segments, and resumed consumption, but segments are not getting created. I am getting the error below.

Ehsan Irshad

05/23/2023, 7:07 AM

Hi Team. I am trying to understand the sortedIndex. It seems it can add a lot of value, but I didn't manage to get any performance benefits. I have a few questions.

1. Does it work for realtime tables? For both consuming and committed segments? Will it be created for all the segments when I reload the table after modifying the config?
2. Does it only work for offline tables?
3. Will it work automatically for the online-to-offline flow? Or do we need to sort the data first?
4. I want to sort the data based on the city column in my data, which is a realtime table but can be converted to hybrid.

Jatin

05/23/2023, 9:56 AM

Hi Team, I have a column update_date which contains 'null'. Now I want to get the day for it using day(FromDateTime(updated_date, 'yyyy-MM-dd')), but it is showing this error:

[ { "message": "QueryExecutionError:\nProcessingException(errorCode:450, messageInternalError\njava.lang.NullPointerException)\n\tat org.apache.pinot.common.response.ProcessingException.deepCopy(ProcessingException.java:146)\n\tat org.apache.pinot.common.exception.QueryException.getException(QueryException.java:172)\n\tat org.apache.pinot.common.exception.QueryException.getException(QueryException.java:167)", "errorCode": 200 } ]