Apache Pinot #troubleshooting

Abhishek Tanwade

07/08/2022, 4:23 AM

Hello everyone, can anyone share some documentation on loading data into a Pinot table? Apache Pinot is deployed on Azure Kubernetes Service.

Alexander Vivas

07/08/2022, 10:01 AM

Good day everyone, does anyone know if Pinot has segment backwards compatibility? I’ve been running Pinot 0.6.0 and am thinking of upgrading to the latest available version, which seems to be 0.10.0. Do you have a migration guide for that?

Abdullah Jaffer

07/08/2022, 10:24 AM

I have this table config that needs to ingest data from ORC files saved in S3, but it's not ingesting any data.


{
  "OFFLINE": {
    "tableName": "sales_by_order_OFFLINE",
    "tableType": "OFFLINE",
    "segmentsConfig": {
      "schemaName": "sales_by_order",
      "retentionTimeUnit": "DAYS",
      "retentionTimeValue": "10000",
      "replication": "2",
      "segmentPushFrequency": "HOURLY",
      "segmentPushType": "REFRESH",
      "replicasPerPartition": "1"
    },
    "tenants": {
      "broker": "DefaultTenant",
      "server": "DefaultTenant"
    },
    "tableIndexConfig": {
      "invertedIndexColumns": [],
      "noDictionaryColumns": [],
      "rangeIndexVersion": 2,
      "autoGeneratedInvertedIndex": false,
      "createInvertedIndexDuringSegmentGeneration": false,
      "sortedColumn": [],
      "bloomFilterColumns": [],
      "loadMode": "MMAP",
      "onHeapDictionaryColumns": [],
      "varLengthDictionaryColumns": [],
      "enableDefaultStarTree": false,
      "enableDynamicStarTreeCreation": false,
      "aggregateMetrics": false,
      "nullHandlingEnabled": false,
      "rangeIndexColumns": []
    },
    "metadata": {},
    "quota": {},
    "task": {
      "taskTypeConfigsMap": {
        "SegmentGenerationAndPushTask": {
          "schedule": "0 * * * * ?",
          "tableMaxNumTasks": "28"
        }
      }
    },
    "routing": {},
    "query": {},
    "ingestionConfig": {
      "batchIngestionConfig": {
        "batchConfigMaps": [
          {
            "input.fs.className": "org.apache.pinot.plugin.filesystem.S3PinotFS",
            "input.fs.prop.region": "ap-southeast-1",
            "inputDirURI": "s3 link",
            "includeFileNamePattern": "glob:**/*.orc",
            "excludeFileNamePattern": "glob:**/*.tmp",
            "inputFormat": "orc"
          }
        ],
        "segmentIngestionType": "REFRESH",
        "segmentIngestionFrequency": "HOURLY"
      }
    },
    "isDimTable": false
  }
}
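
For reading the "schedule" value in the config above: assuming it is interpreted as a Quartz-style cron expression (fields in the order seconds, minutes, hours, day-of-month, month, day-of-week), "0 * * * * ?" would fire at second 0 of every minute, while an hourly trigger would usually be written "0 0 * * * ?". The helper below is a hypothetical Python illustration that only labels the fields; it is not part of Pinot.

```python
# Hypothetical helper (not a Pinot API): label the fields of a
# Quartz-style cron expression so each position is easy to read.
QUARTZ_FIELDS = ["seconds", "minutes", "hours", "day_of_month", "month", "day_of_week"]


def describe_quartz(expr: str) -> dict:
    """Split a 6- or 7-field Quartz-style expression into named fields."""
    parts = expr.split()
    if len(parts) not in (6, 7):
        raise ValueError(f"expected 6 or 7 fields, got {len(parts)}")
    fields = dict(zip(QUARTZ_FIELDS, parts))
    if len(parts) == 7:
        fields["year"] = parts[6]  # optional trailing year field
    return fields


# The schedule from the table config: minutes is "*", so it matches every minute.
every_minute = describe_quartz("0 * * * * ?")
# An hourly variant pins both seconds and minutes to 0.
hourly = describe_quartz("0 0 * * * ?")
```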

Kevin Liu

07/08/2022, 5:22 PM

Hi folks. I have two questions. 1. Why is RealtimeToOfflineSegmentsTask executed in a single thread? It easily times out with a large amount of data. Are there any restrictions? 2. Is there an API for converting a segment to records (GenericRow) directly from S3?

Alice

07/10/2022, 2:44 PM

Hi team, I’m using the lookUp function according to this doc: https://docs.pinot.apache.org/users/user-guide-query/lookup-udf-join. But the query result shows No Record(s) found. I’ve set isDimTable=true and primaryKeyColumns (error_category) in my offline table config. Here is my query:

select error_category, lookUp(dim_table_name, insight_id, error_category, error_category) insight_id from fact_table_name

I think I’m not using the lookUp function correctly, because a query without it, like select error_category from fact_table_name, does return some records. Does anybody know how to configure lookUp?
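
As a rough mental model of what a dimension lookup does, the Python sketch below indexes a dimension table by its primary key and resolves one column per fact row. The row values and the helper names (build_index, look_up) are hypothetical, and returning None for a miss is a choice of this sketch, not a statement about Pinot. It may also be worth comparing the query above with the linked doc's examples, which appear to pass the table name, lookup column, and join-key name as single-quoted string literals.

```python
# Toy model of a dimension-table lookup keyed by primary key.
# All data below is made up for illustration.
dim_table = [
    {"error_category": "timeout", "insight_id": 101},
    {"error_category": "auth", "insight_id": 102},
]
fact_table = [{"error_category": "timeout"}, {"error_category": "unknown"}]


def build_index(rows, key):
    """Index dimension rows by their primary-key column."""
    return {row[key]: row for row in rows}


def look_up(index, column, key_value):
    """Return `column` from the dimension row matching `key_value`, else None."""
    row = index.get(key_value)
    return None if row is None else row[column]


index = build_index(dim_table, "error_category")
result = [
    (fact["error_category"], look_up(index, "insight_id", fact["error_category"]))
    for fact in fact_table
]
```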

Marlon Félix

07/11/2022, 4:41 PM

Hello everyone! I'm writing an article for Medium about a real-world implementation of Apache Pinot as a way of studying. For that, I'm using a Strimzi cluster and the Twitter API Kafka connector (available at https://www.confluent.io/hub/jcustenborder/kafka-connect-twitter) running on Minikube to get data from Twitter's API and ingest it into Pinot. I followed the steps explained in this video

https://www.youtube.com/watch?v=Jc03u8rXc2w

making some adaptations to run on Kubernetes. That way I was able to infer the schema of the "twitter-sample.json" file attached to this message, generating the schema file "twitter-old-schema.json". After that I had to remove some fields ("schema.type", "schema.fields", "schema.optional", "schema.name") and remove the "payload." prefix from every column to generate the file "twitter-schema.json". Then, with this schema file and the table config file "twitter-config.json", I created the REALTIME table "twitter-status-events" (using the column "CreatedAt" as the datetime column) using pinot-admin.sh inside the Pinot controller's pod. But for some reason I'm not getting any records in this table. Do you have any idea what I'm doing wrong? (More information in the replies to this comment.)

Marlon Félix

07/11/2022, 4:43 PM

twitter-schema.json twitter-sample.json twitter-old-schema.json twitter-config.json

harnoor

07/11/2022, 8:10 PM

Hi, do we have this feature https://github.com/apache/pinot/pull/6120#issue-717507183 documented in the Pinot docs? (I couldn’t find it.) Is it recommended for speeding up regexp queries?

troywinter

07/12/2022, 8:15 AM

Hi all, I got this exception using Trino version 389 with Pinot version 0.9.3. How should I resolve this?


io.grpc.StatusRuntimeException: UNKNOWN
	at io.grpc.Status.asRuntimeException(Status.java:535)
	at io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:648)
	at io.trino.plugin.pinot.client.PinotGrpcDataFetcher$PinotGrpcServerQueryClient$ResponseIterator.computeNext(PinotGrpcDataFetcher.java:266)
	at io.trino.plugin.pinot.client.PinotGrpcDataFetcher$PinotGrpcServerQueryClient$ResponseIterator.computeNext(PinotGrpcDataFetcher.java:253)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:146)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:141)
	at io.trino.plugin.pinot.client.PinotGrpcDataFetcher.endOfData(PinotGrpcDataFetcher.java:85)
	at io.trino.plugin.pinot.PinotSegmentPageSource.getNextPage(PinotSegmentPageSource.java:114)
	at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:311)
	at io.trino.operator.Driver.processInternal(Driver.java:410)
	at io.trino.operator.Driver.lambda$process$10(Driver.java:313)
	at io.trino.operator.Driver.tryWithLock(Driver.java:698)
	at io.trino.operator.Driver.process(Driver.java:305)
	at io.trino.operator.Driver.processForDuration(Driver.java:276)
	at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:740)
	at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
	at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:488)
	at io.trino.$gen.Trino_389____20220712_080400_2.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

Harish Bohara

07/12/2022, 9:35 AM

Does anyone know how to extract data from nested JSON? I'm not sure how to extract “data.device” and put it in the “device” column.


Event coming in kafka:
{
  "user_id": "1234",
  "data": {
    "device": "abcd"
  }
}


Schema I need for table:
[
  {
    "name": "user_id",
    "dataType": "STRING"
  },
  {
    "name": "device",
    "dataType": "STRING"
  }
]
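
To make the desired mapping concrete, here is a minimal Python sketch that pulls "data.device" out of the Kafka event and places it in a top-level "device" field. The helper extract_path is hypothetical and only shows the shape of the transformation. Inside Pinot this would normally be done with an ingestion transform (for example a JSON-path function such as jsonPathString, to be confirmed against the docs for your version) rather than client-side code.

```python
import json


def extract_path(record: dict, dotted_path: str):
    """Walk a dotted path such as 'data.device'; return None if any key is missing."""
    current = record
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# The event from the message above.
event = json.loads('{"user_id": "1234", "data": {"device": "abcd"}}')

# Flattened row matching the two-column schema (user_id, device).
row = {
    "user_id": event["user_id"],
    "device": extract_path(event, "data.device"),
}
```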


Alice

07/12/2022, 11:14 AM

Hi team, could you help me see what’s going on here? I set “replicasPerPartition”: “2” in my table config and assigned this table to tenant_a (server-6, server-7). Due to limited resources, I migrated this table to tenant_b (server-9). Then one segment has the following status. Based on previous experience, I thought data migration needs some time and the segment status would recover soon. But it seems to be stuck this time. Is there anything I can do to fix it?

Alice

07/12/2022, 11:44 AM

Hi, some segments of one Pinot table are in a bad status. When I call the /tables/{realtimeTableName}/consumingSegmentsInfo API to see segment consuming info, I find these segments have no consuming info. What are the possible reasons for this?

Eaugene Thomas

07/13/2022, 11:30 AM

Hi, I was referring to

https://youtu.be/cNnwMF0pOJ8

for playing around with Pinot setup & data ingestion. When querying, the result of the OFFLINE and REALTIME tables queried separately differs from that of the combined select query for the table. Can someone share some insight into the reasons for this?

Stuart Millholland

07/13/2022, 4:22 PM

Has anyone had trouble configuring ingestion aggregations via the UI? I add the section and it saves, but it doesn't stick.

Ashish

07/13/2022, 7:22 PM

Is there a behavior change between 0.9.0 and 0.10.0 related to the Kafka client? In 0.9.0, we could see Pinot committing the offset, but in 0.10.0, Pinot never commits the offset to Kafka and consumer lag keeps growing.

Harish Bohara

07/13/2022, 8:18 PM

Does anyone know how to store only a single row per day (or per hour) if all the columns are the same for a given row? I get 30-50M rows per day, of which fewer than 1000 are unique row combinations. I want to store one unique combination for each hour. Yes, the same row can repeat, but only in the next hour.

E.g. if there are 3 rows in 1 hour:
col_1, col_2, col_3, hour_1
col_1, col_2, col_3, hour_1
col_1, col_200, col_3, hour_1

Rows in DB should be, for each hour:
col_1, col_2, col_3, hour_1
col_1, col_200, col_3, hour_1

E.g. if there are 4 rows across 2 hours:
col_1, col_2, col_3, hour_1
col_1, col_2, col_3, hour_1
col_1, col_200, col_3, hour_1
col_1, col_2, col_3, hour_2

Rows in DB should be, for each hour:
col_1, col_2, col_3, hour_1
col_1, col_200, col_3, hour_1
col_1, col_2, col_3, hour_2
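
The requested behavior amounts to keeping the first occurrence of each (columns, hour) combination. The Python sketch below shows that rule on the example rows from the message; the function name dedupe_per_hour is hypothetical, and this illustrates only the expected result, not a Pinot feature or configuration.

```python
def dedupe_per_hour(rows):
    """Keep the first occurrence of each full row (all columns plus the hour bucket)."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


# Example rows from the message: the hour bucket is the last element of each tuple.
rows = [
    ("col_1", "col_2", "col_3", "hour_1"),
    ("col_1", "col_2", "col_3", "hour_1"),
    ("col_1", "col_200", "col_3", "hour_1"),
    ("col_1", "col_2", "col_3", "hour_2"),
]
deduped = dedupe_per_hour(rows)
```

Because the hour is part of the key, the same column combination is kept once per hour rather than once overall.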

André Siefken

07/14/2022, 8:19 AM

Hi folks, quick question: using the pinot-java-client with broker-list JsonAsyncHttpPinotClientTransport, am I supposed to reuse a single Connection across all query requests, or create a new Connection from the ConnectionFactory for each query? Or in other words, is the HTTP connection pool held by the Connection instance or the ConnectionFactory?

Deepika Eswar

07/14/2022, 11:13 AM

Does Pinot support connecting to Tableau for reporting?

Ethan Yu

07/14/2022, 8:33 PM

Hi, I'm trying to run Pinot on a Kubernetes cluster and ingest realtime data from Kafka. However, when I try to ingest data, Pinot seems to fail and stops ingesting at a specific point. I tested this by running two different Pinot Kubernetes clusters at the same time and having both ingest from Kafka, yet both seemed to stop at almost exactly the same time. If I run Pinot on an individual machine it seems to work, but for some reason it does not on Kubernetes. The config I'm running for Pinot is 3 controllers, 30 servers, 1 minion, and 3 ZooKeepers.

Alice

07/15/2022, 3:57 AM

Hi team, I have a question about the star-tree index. If I add more columns to dimensionsSplitOrder in the star-tree index config and restart the servers, will the existing segments recreate the star-tree index based on the new config? Or will only new segments create the star-tree index based on the new config?

Jacob M

07/17/2022, 3:50 PM

hi! in the past, i've always had a primary key where i'm doing some equality filtering in a where clause and have used segmentPartitionConfig and bloomFilterColumns to make sure i'm really only querying a single segment & single server. i'm trying to configure a table to support queries that don't necessarily have any equality clause in the where but will always have a time clause, like where created > X. i've noticed all my queries hit all the servers and all the segments. am i doing something wrong? i thought time columns had some special handling maybe! (if it helps, this is an offline table)
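
For context on what time-based pruning would look like conceptually: if each segment records the minimum and maximum value of its time column, a filter like where created > X can skip every segment whose maximum is not greater than X. The Python sketch below shows that rule with made-up segments. The Segment class and prune_by_time are hypothetical, and whether such pruning is active for a given table is an assumption to check in the routing configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    name: str
    min_time: int  # smallest value of the time column in the segment
    max_time: int  # largest value of the time column in the segment


def prune_by_time(segments, created_after):
    """Keep only segments that can contain rows with created > created_after."""
    return [s for s in segments if s.max_time > created_after]


# Made-up segments covering three consecutive time ranges.
segments = [
    Segment("seg_0", 0, 99),
    Segment("seg_1", 100, 199),
    Segment("seg_2", 200, 299),
]

# "where created > 150" can skip seg_0 entirely.
selected = prune_by_time(segments, 150)
```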

chandarasekaran m

07/18/2022, 4:02 AM

Hi team, how can I parse a Kafka header (in bytes) and filter based on a specific field? Any code samples?

Kevin Liu

07/18/2022, 8:11 AM

Hi folks,


GenericRowFileWriter class:

  /**
   * Writes the given row into the files.
   */
  public void write(GenericRow genericRow)
      throws IOException {
    _offsetStream.writeLong(_nextOffset);
    byte[] bytes = _serializer.serialize(genericRow);
    _dataStream.write(bytes);
    _nextOffset += bytes.length;
  }

I use GenericRowFileWriter to write GenericRow objects to the record.data file. It takes several hours to write more than 20 million records. Why is writing so slow?
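
The write method above follows an offset-file plus data-file layout: the start offset of each record goes to one stream as an 8-byte long, and the serialized bytes go to another. The Python sketch below mirrors that layout with in-memory buffers so the pattern can be tested in isolation. RowFileWriter and read_row are hypothetical names, the payloads are made up, and big-endian longs are assumed to match Java's writeLong. With this pattern, throughput usually depends heavily on whether the underlying streams are buffered, which the snippet above does not show.

```python
import io
import struct


class RowFileWriter:
    """Sketch of the offset + data layout: one 8-byte start offset per record."""

    def __init__(self, offset_stream, data_stream):
        self._offset_stream = offset_stream
        self._data_stream = data_stream
        self._next_offset = 0

    def write(self, payload: bytes) -> None:
        # Record where this payload starts, then append the payload itself.
        self._offset_stream.write(struct.pack(">q", self._next_offset))
        self._data_stream.write(payload)
        self._next_offset += len(payload)


def read_row(offsets: bytes, data: bytes, index: int) -> bytes:
    """Read record `index` using its start offset and the next record's start."""
    start = struct.unpack_from(">q", offsets, index * 8)[0]
    if (index + 1) * 8 < len(offsets):
        end = struct.unpack_from(">q", offsets, (index + 1) * 8)[0]
    else:
        end = len(data)  # last record runs to the end of the data
    return data[start:end]


offset_buf, data_buf = io.BytesIO(), io.BytesIO()
writer = RowFileWriter(offset_buf, data_buf)
for payload in (b"alpha", b"be", b"gamma"):
    writer.write(payload)
```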

shivam

07/18/2022, 12:37 PM

We are getting this error on one of our brokers:


{
  "id": "10008ad94fa0022__brokerResource",
  "simpleFields": {},
  "mapFields": {
    "HELIX_ERROR     20220718-101026.000050 STATE_TRANSITION 0dc81776-5d32-452b-8c82-ae66fd33a5e6": {
      "AdditionalInfo": "Exception while executing a state transition task span_event_view_REALTIMEjava.lang.reflect.InvocationTargetException\n\tat java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.base/java.lang.reflect.Method.invoke(Method.java:566)\n\tat org.apache.helix.messaging.handling.HelixStateTransitionHandler.invoke(HelixStateTransitionHandler.java:404)\n\tat org.apache.helix.messaging.handling.HelixStateTransitionHandler.handleMessage(HelixStateTransitionHandler.java:331)\n\tat org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:97)\n\tat org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:49)\n\tat java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base/java.lang.Thread.run(Thread.java:829)\nCaused by: java.lang.IllegalStateException: Failed to find table config for table: span_event_view_REALTIME\n\tat shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:518)\n\tat org.apache.pinot.broker.routing.RoutingManager.buildRouting(RoutingManager.java:304)\n\tat org.apache.pinot.broker.broker.helix.BrokerResourceOnlineOfflineStateModelFactory$BrokerResourceOnlineOfflineStateModel.onBecomeOnlineFromOffline(BrokerResourceOnlineOfflineStateModelFactory.java:80)\n\t... 12 more\n",
      "Class": "class org.apache.helix.messaging.handling.HelixStateTransitionHandler",
      "MSG_ID": "dfc7f986-0406-49fc-b6f4-5101630efb17",
      "Message state": "READ"
    },
    "HELIX_ERROR     20220718-101026.000081 STATE_TRANSITION e65c8151-76d4-4267-83ad-48dabdd66eae": {
      "AdditionalInfo": "Message execution failed. msgId: dfc7f986-0406-49fc-b6f4-5101630efb17, errorMsg: java.lang.reflect.InvocationTargetException",
      "Class": "class org.apache.helix.messaging.handling.HelixStateTransitionHandler",
      "MSG_ID": "dfc7f986-0406-49fc-b6f4-5101630efb17",
      "Message state": "READ"
    }
  },
  "listFields": {}
}

Quick fix: we restarted our brokers, but it's still not clear what went wrong. Need help! //@harnoor

Stuart Coleman

07/18/2022, 1:40 PM

hi - we have two pinot tables both consuming from the same Kafka topic. Both are using the low level consumer. One is a hybrid table and one is a realtime only table. We have an issue where one record is missing from the realtime table but is present in the hybrid table. We have looked in the logs and can see no Warn or Error messages at the time the record was lost. The only log of interest is that we have an idle consumer at that time and the stream is recreated. Are there any known scenarios in which message loss is possible?

Priyank Bagrecha

07/18/2022, 5:25 PM

Hello, we have a use case for an offline table but we don't have a time column for segment config. What is the suggested route if there is one? Thank you!

Abhijeet Kushe

07/18/2022, 6:30 PM

My hope was to see the existing segments being repartitioned by accountId across different instances and the number of replicasPerPartition increase from 1 to 3. However, I did not see any changes in the existing segments, nor did I see a change in replicas, nor did I see any new props added to the segments. https://docs.pinot.apache.org/operators/operating-pinot/tuning/routing


column.accountId.partitionFunction = Modulo
column.accountId.numPartitions = 4
column.accountId.partitionValues = 1
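
To illustrate what a modulo partition function with numPartitions = 4 does, the Python sketch below maps integer account IDs to partition IDs. It assumes accountId is a non-negative integer and uses plain modulo; the account IDs are made up, and Pinot's actual implementation may treat edge cases such as negative values differently.

```python
def modulo_partition(value: int, num_partitions: int) -> int:
    """Map an integer key to a partition ID in [0, num_partitions)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return value % num_partitions


# Made-up account IDs, partitioned into 4 buckets as in the metadata above.
assignments = {
    account_id: modulo_partition(account_id, 4)
    for account_id in (1, 5, 9, 42)
}
```

With sequential or patterned IDs, plain modulo can skew rows toward a few partitions, which is one common reason to prefer a hash-based function.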

Mayank

07/18/2022, 6:31 PM

You have a replication of 1, and you are also requesting 1 replica to be up at all times, so the rebalancer won’t be able to work.

Mayank

07/18/2022, 6:32 PM

Also, I recommend using Murmur rather than Modulo.

Abhijeet Kushe

07/18/2022, 6:34 PM

I see, sure, I can use Murmur. I also tried to call the rebalance endpoint with minAvailableReplicas: 0, but I now get the following message:


Instance reassigned, table is already balanced