Apache Pinot #troubleshooting

Sadim Nadeem

07/05/2021, 3:53 PM

@Mayank @Xiang Fu @Jackie @Kishore G @Daniel Lavoie @Ken Krugler @Neha Pawar We are trying the upsert setup mentioned above by @Radhika, but the table count is coming up as zero. We have followed all the steps listed at https://docs.pinot.apache.org/basics/data-import/upsert. We are using the Apache Samza API (see the attached code snippet) for the partition-by step that the doc describes:

"Partition the input stream by the primary key. An important requirement for the Pinot upsert table is to partition the input stream by the primary key. For Kafka messages, this means the producer shall set the key in the send API. If the original stream is not partitioned, then a streaming processing job (e.g. Flink) is needed to shuffle and repartition the input stream into a partitioned one for Pinot's ingestion."

upsert samza java streaming app snippet.txt

Sadim Nadeem

07/05/2021, 3:58 PM

We can see this streaming application publishing data to the output topic on which the Pinot upsert table is created. Before writing the data, we shuffle it with the Samza API so that records with the same key go to the same partition. Still, the data is not ingested by Pinot and the table count comes up as zero. The table schema and table creation script are shared above by @Radhika.
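The partitioning requirement quoted above can be illustrated with a small sketch. This is not the Samza job from the attachment and not Kafka's partitioner: `partition_for` is a hypothetical helper that uses CRC32 as a stand-in hash (Kafka's default partitioner uses murmur2), and the records and partition count are made up. It only shows the property that upsert relies on, namely that every record with the same primary key maps to the same partition.

```python
import zlib
from collections import defaultdict


def partition_for(primary_key: str, num_partitions: int) -> int:
    """Map a primary key to a partition with a deterministic hash.

    CRC32 is a stand-in for illustration only; it is not the hash
    that Kafka's default partitioner uses.
    """
    return zlib.crc32(primary_key.encode("utf-8")) % num_partitions


# Hypothetical records: (primary key, payload).
records = [
    ("order-1", "created"),
    ("order-2", "created"),
    ("order-1", "shipped"),
    ("order-2", "cancelled"),
    ("order-1", "delivered"),
]

NUM_PARTITIONS = 4  # assumed partition count of the output topic

# Collect the set of partitions each key was routed to.
partitions_by_key = defaultdict(set)
for key, _payload in records:
    partitions_by_key[key].add(partition_for(key, NUM_PARTITIONS))

# Each key should appear in exactly one partition.
for key, partitions in sorted(partitions_by_key.items()):
    print(key, "->", sorted(partitions))
```

The same check can be applied to a real topic by consuming it and confirming that no key shows up on more than one partition.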

Carlos Domínguez

07/08/2021, 9:40 PM

Hi guys!

Carlos Domínguez

07/08/2021, 9:40 PM

I have a question regarding Kafka integration with Pinot

Carlos Domínguez

07/08/2021, 9:42 PM

Thanks in advance!

Prashant Pandey

07/13/2021, 5:47 AM

Hi everyone, good morning 🙂. I need some help debugging slow queries in our Pinot cluster. We are running the following query:

SELECT api_id, service_name, service_id, api_name, COUNT(*)
FROM myTable
WHERE tenant_id = 'someTenantId'
  AND (api_id IS NOT NULL
       AND start_time_millis >= 1625039026768
       AND start_time_millis < 1625643826768)
GROUP BY api_id, service_name, service_id, api_name
ORDER BY PERCENTILETDIGEST99(duration_millis) DESC
LIMIT 10000

And these are the query stats:

timeUsedMs: 1077
numDocsScanned: 560325713
totalDocs: 3103044892
numServersQueried: 8
numServersResponded: 8
numSegmentsQueried: 623
numSegmentsProcessed: 115
numSegmentsMatched: 115
numConsumingSegmentsQueried: 4
numEntriesScannedInFilter: 25000000
numEntriesScannedPostFilter: 2801628565
numGroupsLimitReached: false
partialResponse: -
minConsumingFreshnessTimeMs: 1626154723247

The most conspicuous of these stats is numEntriesScannedInFilter. The troubleshooting guide says that if this number is too high, we should consider adding an index on the column. We don't have an index on it, and our segment config is:

"segmentsConfig": {
      "timeType": "MILLISECONDS",
      "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
      "timeColumnName": "start_time_millis",
      "retentionTimeUnit": "DAYS",
      "retentionTimeValue": "7",
      "replicasPerPartition": "1",
      "schemaName": "rawServiceView"
    }

As you can see, the timeColumnName is start_time_millis, and therefore we haven't added any index on this column (our reasoning is that segments would be pruned on this column anyway, so we don't need an extra index). myTable is a real-time table. If I remove the filter on start_time_millis, then numEntriesScannedInFilter becomes 0. What are we doing wrong here?
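As a rough cross-check of the stats posted above, the ratios between them can be computed directly. The numbers are copied from the message; the variable names are made up for this sketch, and reading the post-filter ratio as "columns read per matched document" is an assumption rather than something stated in the thread.

```python
# Query stats copied from the message above.
stats = {
    "numDocsScanned": 560325713,
    "totalDocs": 3103044892,
    "numSegmentsQueried": 623,
    "numSegmentsProcessed": 115,
    "numEntriesScannedInFilter": 25000000,
    "numEntriesScannedPostFilter": 2801628565,
}

# Fraction of all documents that matched the filter.
selectivity = stats["numDocsScanned"] / stats["totalDocs"]

# Fraction of queried segments that were actually processed.
segments_processed = stats["numSegmentsProcessed"] / stats["numSegmentsQueried"]

# Entries read after filtering, per matched document.
post_filter_per_doc = stats["numEntriesScannedPostFilter"] / stats["numDocsScanned"]

# Entries scanned during filtering, relative to matched documents.
filter_per_doc = stats["numEntriesScannedInFilter"] / stats["numDocsScanned"]

print(f"selectivity:         {selectivity:.3f}")
print(f"segments processed:  {segments_processed:.3f}")
print(f"post-filter per doc: {post_filter_per_doc:.1f}")
print(f"filter per doc:      {filter_per_doc:.3f}")
```

The post-filter ratio comes out to exactly 5, which would be consistent with the query reading five columns per matched document (the four group-by columns plus duration_millis). On that reading, the entries scanned in the filter are small next to the entries read after the filter.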

Kishore G

07/13/2021, 6:20 AM

there is nothing wrong, it's working as expected

Kishore G

07/13/2021, 6:20 AM

a segment is either
• no match
• full match

Kishore G

07/13/2021, 6:20 AM

• partial match

Kishore G

07/13/2021, 6:21 AM

no match or full match will not add to numEntriesScannedInFilter

Kishore G

07/13/2021, 6:21 AM

but the partial ones will have to scan to evaluate the time filter

Kishore G

07/13/2021, 6:22 AM

if you want to bring this down further, you can try a range index on the time column
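The three cases described above can be sketched as a comparison between a segment's time range and the filter bounds. This is only an illustration of the idea, not Pinot's implementation: `classify_segment` is a hypothetical helper, and the segment time ranges are invented. The filter bounds are the ones from the query.

```python
def classify_segment(seg_min: int, seg_max: int, lower: int, upper: int) -> str:
    """Classify a segment against the filter lower <= t < upper.

    seg_min and seg_max are the smallest and largest time values
    stored in the segment (both inclusive).
    """
    if seg_max < lower or seg_min >= upper:
        return "no match"       # no row can pass the filter
    if seg_min >= lower and seg_max < upper:
        return "full match"     # every row passes the filter
    return "partial match"      # rows have to be checked individually


# Filter bounds from the query: start_time_millis >= LOWER AND < UPPER.
LOWER = 1625039026768
UPPER = 1625643826768

# Hypothetical segment time ranges, for illustration only.
segments = {
    "seg_before": (1624000000000, 1625000000000),
    "seg_straddles_lower": (1625000000000, 1625100000000),
    "seg_inside": (1625100000000, 1625600000000),
    "seg_straddles_upper": (1625600000000, 1625700000000),
    "seg_after": (1625700000000, 1625800000000),
}

for name, (seg_min, seg_max) in segments.items():
    print(name, "->", classify_segment(seg_min, seg_max, LOWER, UPPER))
```

Only the two segments that straddle a filter bound fall into the partial-match case, which is the case that contributes to numEntriesScannedInFilter according to the explanation above.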

Prashant Pandey

07/13/2021, 6:38 AM

Thanks for the reply @Kishore G. We'll add an index on start_time_millis and report back.

Kishore G

07/13/2021, 6:41 AM

it should be range index

Prashant Pandey

07/13/2021, 6:45 AM

Yes, adding a range index only @Kishore G. We already have inverted indices for the other two fields. Let me test it and report back.
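A minimal sketch of the config change discussed above, assuming the range index is declared through a `rangeIndexColumns` list under `tableIndexConfig` in the table config. Every other key a real table config needs is omitted, and the fragment should be checked against the Pinot indexing docs for the version in use.

```python
import json

# Fragment of a Pinot table config (assumed layout; other keys omitted).
table_config_fragment = {
    "tableIndexConfig": {
        # Columns that should get a range index.
        "rangeIndexColumns": ["start_time_millis"],
    }
}

# Print the fragment as JSON so it can be merged into the table config.
print(json.dumps(table_config_fragment, indent=2))
```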

Bruce Ritchie

07/13/2021, 10:01 PM

So, um, the dependencies seem to be a wee bit out of date for some things. Hitting this while attempting to run the Spark job on EMR 6.3.0 / JDK 11: https://issues.apache.org/jira/browse/LANG-1384

Caused by: java.lang.NullPointerException
        at org.apache.commons.lang3.SystemUtils.isJavaVersionAtLeast(SystemUtils.java:1626)
        at org.apache.spark.storage.StorageUtils$.<init>(StorageUtils.scala:207)
        at org.apache.spark.storage.StorageUtils$.<clinit>(StorageUtils.scala)

Saurabh Dwivedy

07/14/2021, 11:32 AM

hello

Saurabh Dwivedy

07/14/2021, 11:32 AM

I am trying to follow the steps outlined on the link https://docs.pinot.apache.org/basics/data-import/batch-ingestion for setting up a schema and table data in Pinot

Saurabh Dwivedy

07/14/2021, 11:33 AM

I am able to upload the schema and the table structure using the command:
bin/pinot-admin.sh AddTable \
  -tableConfigFile /path/to/table-config.json \
  -schemaFile /path/to/table-schema.json -exec

Saurabh Dwivedy

07/14/2021, 11:34 AM

But when I try to load the CSV file data into the table using:
bin/pinot-admin.sh LaunchDataIngestionJob \
  -jobSpecFile /tmp/pinot-quick-start/batch-job-spec.yml

Saurabh Dwivedy

07/14/2021, 11:34 AM

I am getting the following error: "Caused by: java.net.URISyntaxException: Illegal character in scheme name at index 0: <http://localhost:9000>/tables/transcript/schema"

Saurabh Dwivedy

07/14/2021, 11:34 AM

I am unable to understand why - I am doing nothing special - just following the steps outlined in the document

Saurabh Dwivedy

07/14/2021, 11:34 AM

Can anyone help me with this issue

Saurabh Dwivedy

07/14/2021, 12:07 PM

I’m running Pinot on local Mac

Saurabh Dwivedy

07/14/2021, 12:07 PM

Not on spark etc

Saurabh Dwivedy

07/14/2021, 1:56 PM

that's why it was unable to locate the file and giving the error accordingly.

Saurabh Dwivedy

07/14/2021, 1:56 PM

Pinot is amazing

Luiz Gabriel Lima Pinheiro

07/14/2021, 2:59 PM

Hello! I am trying to create or update a job spec over HTTP. Is there a LaunchDataIngestionJob HTTP endpoint I can call? I could not find one in the Swagger interface for uploading the jobSpec YAML file.

Kishore G

07/14/2021, 3:22 PM

@Luiz Gabriel Lima Pinheiro is this for production or poc?