Apache Pinot #troubleshooting

Shreeram Goyal

03/17/2023, 11:15 AM

Hi, I am using Pinot release v0.12.0 and have set my time boundary to the max value of the time column of the offline segments using the Swagger API POST /tables/{tableName}/timeBoundary. I tried querying the data residing on the offline servers from both the Pinot query console and Presto. I found that while I get the correct data on the Pinot query console, the last row is missing on Presto. Can someone please help me understand and debug this?

himanshu yadav

03/17/2023, 1:56 PM

Hi, has anyone ever tried to bootstrap a realtime upsert table using Flink? Our Pinot version is 0.11.0 and we are facing this issue: https://apache-pinot.slack.com/archives/C01S5EHPS2U/p1679035695669099

Jun

03/19/2023, 4:17 PM

Hi Team, I found a potentially flaky test. Could anyone help me confirm that? https://github.com/apache/pinot/issues/10442 (Sorry, I should not have posted this to #CDRCA57FC)

Varagini Karthik

03/20/2023, 9:59 AM

Hi All, I'm trying to execute TEXT_MATCH from Trino on a Pinot table. I am getting the following error:

trino error: line 4:10: Function 'text_match' not registered

this is my query

Select *
   from pinot.default.jobTitles
   where TEXT_MATCH(jobTitle, 'Java Developer')

[10:40 AM] Any idea how to resolve this?

[10:40 AM] Trino version 403, Pinot version 0.10.0

Rajat Yadav

03/20/2023, 5:01 PM

How do I enable the V2 multi-stage engine in Pinot? Can anyone please share the steps and where to add the configurations in the Helm charts?

Lewis Yobs

03/20/2023, 5:06 PM

<https://docs.pinot.apache.org/developers/advanced/v2-multi-stage-query-engine#how-to-enable-the-multi-stage-query-engine>
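Once the engine is enabled on the cluster as the linked page describes, a query is typically opted in per request. The following is a minimal sketch that only builds the JSON body for the broker's `POST /query/sql` endpoint; it assumes the `useMultistageEngine=true` query option described in the docs, and the table name `myTable` is hypothetical. Sending the request and the cluster-side configuration are left out.

```python
import json


def build_multistage_query(sql: str) -> bytes:
    """Build the JSON body for a broker POST /query/sql request."""
    payload = {
        "sql": sql,
        # Assumption: per-query option that routes the query to the multi-stage engine.
        "queryOptions": "useMultistageEngine=true",
    }
    return json.dumps(payload).encode("utf-8")


# Hypothetical table name, for illustration only.
body = build_multistage_query("SELECT COUNT(*) FROM myTable")
print(body.decode("utf-8"))
```

The bytes returned here can be posted with any HTTP client using a `Content-Type: application/json` header.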

Sid

03/20/2023, 6:43 PM

Hi team, I've been exploring Apache Pinot for the first time. I'm unable to make the filter function work on Pinot tables consuming events from Kafka. I want to filter events based on the event_names field in each Kafka event. I get the error below, and I tried setting up the Groovy field in the controller.conf file, but still no luck.

org.apache.pinot.segment.loca^Cjava.lang.RuntimeException: Caught exception while executing filter function:
Caused by: java.lang.NumberFormatException: For input string: "{event_name}"

Any help would be appreciated.

Rajat Yadav

03/21/2023, 5:54 AM

Hi team, I am executing this query through V2 multi-stage engine:

SELECT count(*)
FROM
  (Select COUNT(*)
   from users where country IN ('INDIA')) AS virtual_table
LIMIT 1000;

But I am getting the following error:

[
  {
    "message": "TableDoesNotExistError",
    "errorCode": 190
  }
]

Even though the table exists. Does anyone know why this is happening?

arun udaiyar

03/21/2023, 7:53 AM

Hi Team, I am using the Helm chart to run Pinot on a Kubernetes cluster. I now have a requirement to add a Java JKS file into the container. What is the best way to do this?

Rajat Yadav

03/21/2023, 9:49 AM

Hi team, do we have any configuration to enable only the V2 multi-stage engine in Pinot? @Mayank @guru

Shreeram Goyal

03/21/2023, 5:41 PM

I keep getting this error while running a query using Presto, even though I have the port open for gRPC @Mayank @Xiang Fu:

io.grpc.StatusRuntimeException: UNKNOWN
	at io.grpc.Status.asRuntimeException(Status.java:535)
	at io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:648)
	at com.facebook.presto.pinot.PinotSegmentPageSource.getNextPage(PinotSegmentPageSource.java:204)
	at com.facebook.presto.operator.ScanFilterAndProjectOperator.processPageSource(ScanFilterAndProjectOperator.java:295)
	at com.facebook.presto.operator.ScanFilterAndProjectOperator.getOutput(ScanFilterAndProjectOperator.java:260)
	at com.facebook.presto.operator.Driver.processInternal(Driver.java:426)
	at com.facebook.presto.operator.Driver.lambda$processFor$9(Driver.java:309)
	at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:730)
	at com.facebook.presto.operator.Driver.processFor(Driver.java:302)
	at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1079)
	at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:166)
	at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:599)
	at com.facebook.presto.$gen.Presto_0_279_686ef1d____20230309_045351_1.run(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

Sid

03/21/2023, 6:19 PM

Hi Team, what sort of time column should be created when the timestamp format in Kafka events keeps changing? A few examples:

2023-03-21T110317.55803331Z
2023-03-09T101458.656Z
2023-03-09T101523.137+00:00

How can this be standardized in the schema?
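One possible approach, sketched here under assumptions rather than taken from the thread, is to normalize every variant to epoch milliseconds before ingestion and store the result in a single numeric time column. The function name `to_epoch_millis` is hypothetical. The pattern accepts the time part with or without colons, since the examples above appear without them, and it truncates fractional seconds to microseconds.

```python
import re
from datetime import datetime, timedelta, timezone

# Date, "T", time with optional colons, optional fraction, then "Z" or a numeric offset.
_TS_PATTERN = re.compile(
    r"(\d{4})-(\d{2})-(\d{2})T(\d{2}):?(\d{2}):?(\d{2})(?:\.(\d+))?(Z|[+-]\d{2}:?\d{2})"
)
_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(ts: str) -> int:
    """Convert one of the timestamp variants above to epoch milliseconds (UTC)."""
    match = _TS_PATTERN.fullmatch(ts)
    if match is None:
        raise ValueError(f"unrecognized timestamp format: {ts!r}")
    year, month, day, hour, minute, second, frac, tz = match.groups()
    # Keep at most microsecond precision; pad short fractions such as ".656".
    micros = int((frac or "0")[:6].ljust(6, "0"))
    if tz == "Z":
        offset = timedelta(0)
    else:
        sign = 1 if tz[0] == "+" else -1
        digits = tz[1:].replace(":", "")
        offset = sign * timedelta(hours=int(digits[:2]), minutes=int(digits[2:]))
    moment = datetime(
        int(year), int(month), int(day), int(hour), int(minute), int(second),
        micros, tzinfo=timezone(offset),
    )
    # Integer division avoids float rounding on the millisecond value.
    return (moment - _EPOCH) // timedelta(milliseconds=1)


for sample in (
    "2023-03-21T110317.55803331Z",
    "2023-03-09T101458.656Z",
    "2023-03-09T101523.137+00:00",
):
    print(sample, to_epoch_millis(sample))
```

Whether this runs in the producer, in a stream processor, or is re-expressed as an ingestion-time transform depends on the pipeline; the sketch only shows the normalization logic.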

Jack Luo

03/21/2023, 8:33 PM

Hi Team. I noticed that the native text index seems to allocate memory on the heap rather than using memory-mapped pages. In our deployment, with the native text index enabled, the entire heap (192GB) is consumed every half an hour under our use case. We had to revert to the legacy Lucene-based text index. Does the Pinot team plan to support file-backed memory (MMAP) for the native text index in the future?

Rajat Yadav

03/22/2023, 8:14 AM

How do I delete the old tasks in the minion? I am seeing those tasks again and again and am not able to push new data. Pinot version: 0.10

Shreeram Goyal

03/22/2023, 12:52 PM

Hi, I am facing a few issues when querying via Presto, mainly on the offline servers, which hold the major chunk of our data, with some tables having 30G of data. I have 6 servers with 32G RAM and have configured 2 RGs with 3 servers each. The issues are:

1. I am running heavy queries via Presto which are routed directly to servers without involving brokers (checked using explain plan), and I am facing memory issues where memory isn't released after the query completes, eventually leading to the server going down on another query. I have tried different configs for heap and direct memory; currently my configs are xmx=16G and DirectMemory=12G.

2. On running multiple queries together, they are all routed to a single RG via Presto. Ideally this shouldn't be the case, or correct me if I am wrong.

It would be great if I could get some insights on the potential causes and workarounds, if any, other than vertical scaling!

03/22/2023, 7:18 PM

Hi, I asked this in #C016ZKW1EPK earlier and @Mayank suggested I post here instead https://apache-pinot.slack.com/archives/C016ZKW1EPK/p1679508026624729

Sid

03/23/2023, 6:14 AM

Hi Team, I was discussing with @saurabh dubey and here are a few suggestions that I would appreciate being considered for Pinot:

• API support for schema generation from a sample JSON file.
• Logs for Pinot servers on the UI or through an API. Currently, during my PoC of Pinot, I have to check the Docker logs to see if stream ingestion has any issue.

👍 1

Bharath

03/23/2023, 9:11 AM

Hello, I'm looking for some help related to Apache Pinot ZooKeeper. My dev team is using Pinot for querying data sets. However, the team needs a ZooKeeper URL to make calls from a Java application. Currently pinot-controller is exposed for accessing the UI from the AWS EKS cluster. Would exposing pinot-zookeeper in the same way as pinot-controller work for this use case? I'm not sure about it, so I wanted to get clarification. Apache Pinot is set up on AWS EKS using this guide: https://docs.pinot.apache.org/basics/getting-started/kubernetes-quickstart

Tamás Nádudvari

03/23/2023, 12:40 PM

Hi, I ran into a problem when I tried to upgrade from 0.11.0 to 0.12.0. Right after the controller restarted with 0.12.0, it started throwing exceptions about being unable to get the consuming segments info for our hybrid table. Did anyone else run into something like this?

Rajat Yadav

03/23/2023, 1:34 PM

Hi team, while running queries from Superset we are getting the error: 2 out of 4 servers responded. The dataset is very large, around 700 million. We have 4 servers [1 core, 15G memory]. Does anyone know whether this is an infra issue or a query processing error?

Rajat Yadav

03/23/2023, 3:16 PM

Hi team, We have an existing OFFLINE table and we want to load more segments to that table. Is there any way to do that?

Mark Needham

03/23/2023, 3:50 PM

Yes - you should be able to load more segments the same way you loaded the initial ones. You just need to make sure the names of those segments don't clash with the ones you already have.
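The name check mentioned above can be done before pushing. This is a minimal sketch with hypothetical segment names; how the list of existing segment names is obtained (for example from the controller) is left out and is an assumption of the example.

```python
def clashing_segment_names(existing, incoming):
    """Return the names present in both collections, sorted for stable output."""
    return sorted(set(existing) & set(incoming))


# Hypothetical segment names, for illustration only.
existing = ["myTable_2023-03-01_0", "myTable_2023-03-02_0"]
incoming = ["myTable_2023-03-02_0", "myTable_2023-03-03_0"]
print(clashing_segment_names(existing, incoming))
```

An empty result means none of the new segments share a name with an existing one.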

Zhuangda Z

03/23/2023, 7:42 PM

Hi folks, I ran into a deserializing problem where it doesn’t support parsing a JSON col

abhinav wagle

03/24/2023, 3:04 AM

Hello, any pointers on how to fix the BROKER_SEGMENT_UNAVAILABLE_ERROR_CODE: 305 error? https://github.com/apache/pinot/blob/master/pinot-common/src/main/java/org/apache/pinot/common/exception/QueryException.java#L67

Malte Granderath

03/24/2023, 11:09 AM

Hey 👋 Is there any way yet to upload segments from the minion directly to the deep store? I saw this thread but maybe something has changed since then

Bharath

03/24/2023, 12:00 PM

Hello everyone. Does anyone know how to create the zookeeper url that is running in AWS EKS? Need this zookeeper url string to connect to pinot cluster. I tried several methods but no luck. In short I want to expose the URL with a LoadBalancer. However, the pod failed health checks repeatedly so I couldn't essentially get a LoadBalancer url for zookeeper. Any help would be great. Below is the snippet from the documentation page. https://docs.pinot.apache.org/users/clients/java

Sid

03/24/2023, 1:54 PM

Hi, does anyone know how to reduce the number of segments on a realtime table? Currently it keeps increasing by the number of Kafka partitions, making the queries slow. Merge rollup is also not working on the realtime table.

Utsav kansara

03/25/2023, 1:28 AM

Hi Guys, I am trying to call the controller endpoint to enable logging as per: https://docs.pinot.apache.org/operators/operating-pinot/managing-logs For some reason it keeps failing with the following exception.

Mar 25, 2023 12:51:08 AM org.glassfish.jersey.internal.Errors logErrors
WARNING: The following warnings have been detected: WARNING: Unknown HK2 failure detected:
MultiException stack 1 of 3
org.glassfish.hk2.api.UnsatisfiedDependencyException: There was no object available for injection at SystemInjecteeImpl(requiredType=LoggerFileServer,parent=PinotControllerLogger,qualifiers={},position=-1,optional=false,self=false,unqualified=null,1825910288)
	at org.jvnet.hk2.internal.ThreeThirtyResolver.resolve(ThreeThirtyResolver.java:51)
	at org.jvnet.hk2.internal.ClazzCreator.resolve(ClazzCreator.java:188)
	at org.jvnet.hk2.internal.ClazzCreator.resolveAllDependencies(ClazzCreator.java:211)
	at org.jvnet.hk2.internal.ClazzCreator.create(ClazzCreator.java:334)
	at org.jvnet.hk2.internal.SystemDescriptor.create(SystemDescriptor.java:463)
	at org.glassfish.jersey.inject.hk2.RequestContext.findOrCreate(RequestContext.java:59)
	at org.jvnet.hk2.internal.Utilities.createService(Utilities.java:2102)
	at org.jvnet.hk2.internal.ServiceLocatorImpl.internalGetService(ServiceLocatorImpl.java:758)
	at org.jvnet.hk2.internal.ServiceLocatorImpl.internalGetService(ServiceLocatorImpl.java:721)
	at org.jvnet.hk2.internal.ServiceLocatorImpl.getService(ServiceLocatorImpl.java:691)
	at org.glassfish.jersey.inject.hk2.AbstractHk2InjectionManager.getInstance(AbstractHk2InjectionManager.java:160)
	at org.glassfish.jersey.inject.hk2.ImmediateHk2InjectionManager.getInstance(ImmediateHk2InjectionManager.java:30)
	at org.glassfish.jersey.internal.inject.Injections.getOrCreate(Injections.java:105)
	at org.glassfish.jersey.server.model.MethodHandler$ClassBasedMethodHandler.getInstance(MethodHandler.java:260)
	at org.glassfish.jersey.server.internal.routing.PushMethodHandlerRouter.apply(PushMethodHandlerRouter.java:51)
	at org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:86)
	at org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:89)
	at org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:89)
	at org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:89)
	at org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:89)
	at org.glassfish.jersey.server.internal.routing.RoutingStage.apply(RoutingStage.java:69)
	at org.glassfish.jersey.server.internal.routing.RoutingStage.apply(RoutingStage.java:38)
	at org.glassfish.jersey.process.internal.Stages.process(Stages.java:173)
	at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:247)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
	at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
	at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:234)
	at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:684)
	at org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer.service(GrizzlyHttpContainer.java:356)
	at org.glassfish.grizzly.http.server.HttpHandler$1.run(HttpHandler.java:200)
	at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:569)
	at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:549)
	at java.base/java.lang.Thread.run(Thread.java:829)
MultiException stack 2 of 3
java.lang.IllegalArgumentException: While attempting to resolve the dependencies of org.apache.pinot.controller.api.resources.PinotControllerLogger errors were found

Sid

03/25/2023, 6:45 AM

Hi Team, my segment generation and push task has been in the not_started state for a while now. Here is the table config file. What am I missing here?

{
  "OFFLINE": {
    "tableName": "fullfillment_created_schema_OFFLINE",
    "tableType": "OFFLINE",
    "segmentsConfig": {
      "schemaName": "fullfillment_created_schema",
      "replication": "1",
      "replicasPerPartition": "1",
      "segmentPushType": "APPEND",
      "timeColumnName": "event_timestamp",
      "minimizeDataMovement": false,
      "segmentPushFrequency": "DAILY"
    },
    "tenants": {
      "broker": "DefaultTenant",
      "server": "DefaultTenant"
    },
    "tableIndexConfig": {
      "invertedIndexColumns": [],
      "noDictionaryColumns": [],
      "autoGeneratedInvertedIndex": false,
      "createInvertedIndexDuringSegmentGeneration": false,
      "sortedColumn": [],
      "bloomFilterColumns": [],
      "loadMode": "MMAP",
      "onHeapDictionaryColumns": [],
      "varLengthDictionaryColumns": [],
      "enableDefaultStarTree": false,
      "enableDynamicStarTreeCreation": false,
      "aggregateMetrics": false,
      "nullHandlingEnabled": false,
      "optimizeDictionary": false,
      "optimizeDictionaryForMetrics": false,
      "noDictionarySizeRatioThreshold": 0,
      "rangeIndexColumns": [],
      "rangeIndexVersion": 2
    },
    "metadata": {},
    "quota": {},
    "task": {
      "taskTypeConfigsMap": {
        "SegmentGenerationAndPushTask": {
          "schedule": "0 */10 * * * ?"
        }
      }
    },
    "routing": {},
    "query": {},
    "ingestionConfig": {
      "batchIngestionConfig": {
        "batchConfigMaps": [
          {
            "inputFormat": "json",
            "inputFormat": "json",
            "input.fs.className": "org.apache.pinot.plugin.filesystem.S3PinotFS",
            "fs.prop.region": "ap-south-1",
            "fs.prop.accessKey": "asdasdasd",
            "fs.prop.secretKey": "asdasdas",
            "inputDirURI": "s3://json-prod/FulfillmentCreateFailedEvent/event_date=2023-03-24/",
            "includeFileNamePattern": "glob:**/*.json.gz"
          }
        ],
        "consistentDataPush": false
      },
      "continueOnError": false,
      "rowTimeValueCheck": false,
      "segmentTimeValueCheck": true
    },
    "isDimTable": false
  }
}
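The posted config repeats the "inputFormat" key inside batchConfigMaps. The sketch below does not diagnose why the task stays unscheduled; it only shows one way to flag repeated keys in a JSON config before submitting it, using the standard library's `object_pairs_hook`. The function name and the shortened sample are hypothetical.

```python
import json


def find_duplicate_keys(text: str) -> list:
    """Return every key that is repeated within a single JSON object."""
    duplicates = []

    def hook(pairs):
        # json.loads calls this once per object with its (key, value) pairs in order.
        seen = set()
        for key, _ in pairs:
            if key in seen:
                duplicates.append(key)
            seen.add(key)
        return dict(pairs)

    json.loads(text, object_pairs_hook=hook)
    return duplicates


# Shortened, hypothetical fragment shaped like the batchConfigMaps entry above.
sample = (
    '{"batchConfigMaps": [{"inputFormat": "json", "inputFormat": "json", '
    '"inputDirURI": "s3://example-bucket/path/"}]}'
)
print(find_duplicate_keys(sample))
```

A plain `json.loads` silently keeps the last value for a repeated key, which is why the hook is needed to see the repetition at all.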

Jack Luo

03/25/2023, 8:49 AM

Hi Team, I have an aggregation query like the following:

EXPLAIN PLAN FOR SELECT 
  zone, 
  count(*) 
FROM 
  "table" 
WHERE 
  (
    _timestampMillis <= 1679691885000 
    AND _timestampMillis > 1679432712000
  ) 
  AND (
    text_match(
      "json_data", '"instance*33554433"'
    ) 
    AND json_extract_scalar(
      "json_data", '$.instance', 'INT', 
      0
    ) = 33554433
  ) 
GROUP BY 
  zone 
ORDER BY 
  count(*) desc 
LIMIT 
  10

The goal is to perform an exact match on JSON documents by first performing a fuzzy text_match and then performing json_extract_scalar only on the matching rows. The reason for using this approach to search JSON rather than leveraging the JSON index is its much lower memory and disk usage, i.e. the JSON index is too expensive. However, the default query planner's behavior is not ideal. Although text_match alone returns results in double-digit milliseconds, text_match combined with json_extract_scalar returns results 75-100x slower. The root cause, I believe, is that Pinot's query planner decides to execute text_match and json_extract_scalar concurrently rather than one after another. The actual query plan is as follows:

{
    "rows": [
      [
        "BROKER_REDUCE(sort:[count(*) DESC],limit:10)",
        1,
        0
      ],
      [
        "COMBINE_GROUP_BY",
        2,
        1
      ],
      [
        "PLAN_START(numSegmentsForThisPlan:52)",
        -1,
        -1
      ],
      [
        "GROUP_BY(groupKeys:zone, aggregations:count(*))",
        3,
        2
      ],
      [
        "TRANSFORM_PASSTHROUGH(zone)",
        4,
        3
      ],
      [
        "PROJECT(zone)",
        5,
        4
      ],
      [
        "DOC_ID_SET",
        6,
        5
      ],
      [
        "FILTER_AND",
        7,
        6
      ],
      [
        "FILTER_TEXT_INDEX(indexLookUp:text_index,operator:TEXT_MATCH,predicate:text_match(json_data,'\"instance*33554433\"'))",
        8,
        7
      ],
      [
        "FILTER_RANGE_INDEX(indexLookUp:range_index,operator:RANGE,predicate:(_timestampMillis > '1679432712000' AND _timestampMillis <= '1679691885000'))",
        9,
        7
      ],
      [
        "FILTER_EXPRESSION(operator:EQ,predicate:jsonextractscalar(json_data,'$.instance','INT','0') = '33554433')",
        10,
        7
      ]
    ]
  },
}
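Each row in the plan above has the shape [operator, operator id, parent id], so the plan can be read as a tree. The following is a minimal sketch that renders such rows as an indented tree; the operator strings are abbreviated from the plan above, and the helper name `render_plan` is hypothetical. It makes visible that all three filters are direct children of the same FILTER_AND.

```python
from collections import defaultdict


def render_plan(rows):
    """Render EXPLAIN PLAN rows of [operator, id, parent_id] as indented lines."""
    children = defaultdict(list)
    for operator, node_id, parent_id in rows:
        if node_id < 0:
            # Marker rows such as PLAN_START use -1 and are not part of the tree.
            continue
        children[parent_id].append((node_id, operator))

    lines = []

    def walk(parent_id, depth):
        for node_id, operator in children[parent_id]:
            lines.append("  " * depth + operator)
            walk(node_id, depth + 1)

    walk(0, 0)  # the root operator reports parent id 0
    return lines


# Operator names abbreviated from the actual plan above.
actual_plan = [
    ["BROKER_REDUCE", 1, 0],
    ["COMBINE_GROUP_BY", 2, 1],
    ["PLAN_START", -1, -1],
    ["GROUP_BY", 3, 2],
    ["TRANSFORM_PASSTHROUGH", 4, 3],
    ["PROJECT", 5, 4],
    ["DOC_ID_SET", 6, 5],
    ["FILTER_AND", 7, 6],
    ["FILTER_TEXT_INDEX", 8, 7],
    ["FILTER_RANGE_INDEX", 9, 7],
    ["FILTER_EXPRESSION", 10, 7],
]
print("\n".join(render_plan(actual_plan)))
```

In the printed tree the text index, range index, and expression filters share one indentation level under FILTER_AND.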

The optimized query plan for our use case should be the following:

{
    "rows": [
      [
        "BROKER_REDUCE(sort:[count(*) DESC],limit:10)",
        1,
        0
      ],
      [
        "COMBINE_GROUP_BY",
        2,
        1
      ],
      [
        "PLAN_START(numSegmentsForThisPlan:52)",
        -1,
        -1
      ],
      [
        "GROUP_BY(groupKeys:zone, aggregations:count(*))",
        3,
        2
      ],
      [
        "TRANSFORM_PASSTHROUGH(zone)",
        4,
        3
      ],
      [
        "PROJECT(zone)",
        5,
        4
      ],
      [
        "DOC_ID_SET",
        6,
        5
      ],
      [
        "FILTER_AND",
        7,
        6
      ],
      [
        "FILTER_EXPRESSION(operator:EQ,predicate:jsonextractscalar(json_data,'$.instance','INT','0') = '33554433')",
        8,
        7
      ],
      [
        "FILTER_AND",
        9,
        8
      ],
      [
        "FILTER_RANGE_INDEX(indexLookUp:range_index,operator:RANGE,predicate:(_timestampMillis > '1679432712000' AND _timestampMillis <= '1679691885000'))",
        10,
        9
      ],
      [
        "FILTER_TEXT_INDEX(indexLookUp:text_index,operator:TEXT_MATCH,predicate:text_match(json_data,'\"instance*33554433\"'))",
        11,
        9
      ]
    ]
  },
}

Does the Pinot team have any plans to implement this optimization in the near future? If not, would the Pinot team be interested in a pull request to optimize this query?

➕ 1