# troubleshooting

  • Priyank Bagrecha
    06/09/2022, 10:17 PM
    Has anyone tried using Cube.js to query a Pinot table?

  • Nikhil Varma
    06/10/2022, 5:00 AM
    Hi all, is anyone running Pinot locally and using a pinot-controller.service unit file for systemd?
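    For anyone searching later, a minimal systemd unit sketch for the controller (paths, user, and ZK address are assumptions about a typical layout, not from this thread):
    ```ini
    # /etc/systemd/system/pinot-controller.service (hypothetical paths and addresses)
    [Unit]
    Description=Apache Pinot Controller
    After=network.target

    [Service]
    Type=simple
    User=pinot
    ExecStart=/opt/pinot/bin/pinot-admin.sh StartController -zkAddress localhost:2181 -controllerPort 9000
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target
    ```
    After installing the unit, systemctl daemon-reload && systemctl enable --now pinot-controller would start it and keep it running across reboots.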

  • Alice
    06/10/2022, 11:27 AM
    Hi, could a star-tree index be applied to an OFFLINE table?
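    For reference, star-tree indexes are defined under tableIndexConfig and can be applied to OFFLINE tables; a minimal sketch (dimension and metric names are placeholders):
    ```json
    "tableIndexConfig": {
      "starTreeIndexConfigs": [
        {
          "dimensionsSplitOrder": ["country", "deviceType"],
          "skipStarNodeCreationForDimensions": [],
          "functionColumnPairs": ["SUM__impressions"],
          "maxLeafRecords": 10000
        }
      ]
    }
    ```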

  • Stuart Millholland
    06/10/2022, 5:45 PM
    So we have data that contains an event_id that is obviously very high cardinality: a unique value per row. The thing is, we don't really have any use for the field other than audit purposes, so we're considering getting rid of it. But let's say we did keep it, and we employ the Pinot managed offline flows to roll things up. Can you ignore a field during rollup, so that you keep the detail for a time, but as data moves to offline you default it to something so that it effectively gets rolled up?
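    For context, a sketch of the task config in question. mergeType "rollup" only collapses rows whose dimension values all match, so a unique-per-row column would need to be aggregated away (e.g., declared as a metric with an aggregationType, which assumes it can be numeric) or dropped from the schema for rollup to take effect. Names and periods below are placeholders:
    ```json
    "task": {
      "taskTypeConfigsMap": {
        "RealtimeToOfflineSegmentsTask": {
          "bucketTimePeriod": "1d",
          "bufferTimePeriod": "1d",
          "mergeType": "rollup",
          "event_id.aggregationType": "max"
        }
      }
    }
    ```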

  • Ali Atıl
    06/13/2022, 1:22 PM
    Hello everyone, is it possible to update loadMode after table creation? Thanks
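    For reference, a table config can generally be updated after creation through the controller REST API, followed by a segment reload; a sketch with placeholder host and table name (whether loadMode specifically takes effect on reload is worth verifying):
    ```bash
    # PUT the full, updated table config (containing the new loadMode)
    curl -X PUT "http://localhost:9000/tables/myTable" \
      -H "Content-Type: application/json" \
      -d @table-config.json

    # Reload segments so config changes get picked up
    curl -X POST "http://localhost:9000/segments/myTable/reload"
    ```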

  • AHMEDSHEHATA
    06/13/2022, 3:47 PM
    Hello guys, I am trying to use Pinot with the Presto connector. Most things work smoothly, but I have two issues accessing my tables. First, when I query with aggregations I get a Trino error: Segment query returned '50001' rows per split, maximum allowed is '50000' rows. I tried pinot.segments-per-split in the connector config, and I also tried "realtime.segment.flush.autotune.initialRows" as a table config, with no luck. The second issue: I need guidance on getting table metadata; I use Pinot in Superset and fetching the table metadata fails.
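    On the row limit: if that error text comes from Trino's Pinot connector, the cap is usually raised in the Trino catalog properties file rather than in the Pinot table config; a sketch (the exact property name should be checked against your connector version's docs):
    ```properties
    # etc/catalog/pinot.properties
    connector.name=pinot
    pinot.controller-urls=pinot-controller:9000
    # raises the per-split row cap on segment queries
    pinot.max-rows-per-split-for-segment-queries=1000000
    ```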

  • Lars-Kristian Svenøy
    06/13/2022, 7:31 PM
    Hello team 👋 I have a job which builds segments and publishes them to deep store. Currently I have a problem: since I am building so many segments, I am bottlenecking on IOPS/throughput on my volumes. Is there any way to build the segments completely in memory without flushing to disk? I have been looking around the APIs but can't seem to find anything like that. Any help appreciated 🙏
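    One generic workaround, not Pinot-specific (a sketch assuming you control the host and the job's working directory is configurable): back the segment build directory with tmpfs so intermediate writes never hit the block device:
    ```bash
    # Size the mount to the peak on-disk footprint of a segment build; path and size are placeholders
    sudo mount -t tmpfs -o size=64g tmpfs /opt/pinot/segment-build
    # then point the segment generation job's working/output directory at /opt/pinot/segment-build
    ```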

  • Nikhil
    06/14/2022, 12:39 AM
    👋 Hi folks, a question on the ZooKeeper setup for a Pinot cluster. We have set up a Pinot cluster (2 controllers, 2 brokers, 10 servers) with a single ZooKeeper instance running on a standalone server with an EBS volume, deployed using the ZK packaged in the Pinot binary. The service was started via pinot-admin.sh StartZookeeper. What's the recommended practice for deploying a fault-tolerant ZK setup? How do we set up metadata sync across multiple ZK instances?
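    For what it's worth, the standard pattern is a dedicated 3- or 5-node ZooKeeper ensemble instead of the ZK bundled with Pinot; replication is handled by ZooKeeper itself, so there is no separate metadata sync to configure. A minimal zoo.cfg sketch (hostnames and paths are placeholders):
    ```properties
    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    # every member lists the full ensemble; each host also needs a matching myid file in dataDir
    server.1=zk1:2888:3888
    server.2=zk2:2888:3888
    server.3=zk3:2888:3888
    ```
    Pinot components would then be pointed at the comma-separated list, e.g. -zkAddress zk1:2181,zk2:2181,zk3:2181.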

  • Shreesh Mansotra
    06/14/2022, 7:18 AM
    Hi folks, I am working on connecting my Pinot database with Superset. I am able to connect the database and run queries in the SQL Editor section of the Superset UI, but I am unable to create charts or access the Explore section of the UI. Is it due to the fact that I have a JSON column in the table? Please help.

  • Lovin Singla
    06/14/2022, 11:06 AM
    Hello team, I am trying to use a 3-node ZooKeeper cluster with Pinot. I need to connect it to a Java application in such a way that if one node is not available, it automatically connects using another node. Currently I'm using ConnectionFactory.fromZookeeper(zkurl), which takes only one ZK URL and retries infinitely to get a connection through it. Can anyone please help me with a way to achieve a failsafe connection mechanism? Thanks
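    One thing worth trying (an assumption based on standard ZooKeeper client behavior, not a Pinot-specific feature): ZooKeeper connection strings accept a comma-separated host list, and the client fails over among the hosts:
    ```java
    import org.apache.pinot.client.Connection;
    import org.apache.pinot.client.ConnectionFactory;

    // All three ZK nodes in one connection string; hostnames and the
    // trailing cluster path are placeholders for your environment.
    Connection connection = ConnectionFactory.fromZookeeper(
        "zk1:2181,zk2:2181,zk3:2181/PinotCluster");
    ```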

  • Akash Yadav
    06/14/2022, 12:31 PM
    Hi @everyone, I was exploring Pinot for one of my use cases: I want to create segments of users based on their actions, e.g. users who have made a payment using a credit card with amount > 100 at least 2 times -> segment 1.
    My event payload would be something like this:
    ```json
    {
      "eventid": "xyz",
      "userId": "abc",
      "amount": "100",
      "paymode": "CC"
    }
    ```
    Table:
    ```sql
    CREATE TABLE pay_events (
        event_id serial PRIMARY KEY,
        user_id VARCHAR(50) NOT NULL,
        amount int NOT NULL,
        paymode varchar(4) NOT NULL
    );
    ```
    The query for getting the data from Pinot would be something like this:
    ```sql
    SELECT pe.user_id
    FROM pay_events AS pe
    WHERE pe.paymode = 'CC' AND pe.amount > 100
    GROUP BY pe.user_id
    HAVING count(pe.event_id) > 1;
    ```
    A segment can have a million users; we need to extract all of them somehow, and that output will be used by downstream services for sending bulk campaigns and notifications. My questions are: 1. Is Pinot the right choice for this use case? 2. Is there any scalable way of fetching all the users other than paginating with LIMIT and OFFSET? Do let me know if you need any clarification. Thanks 😀

  • Satyam Raj
    06/14/2022, 6:03 PM
    Hey guys, we're using the pinot-java-client to fire point queries on a Pinot table by user_id (inverted indexed). When the API that holds this client is hit with a lot of requests (200 rpm), the application crashes because of too many open threads. I just wanted to check whether the queries are async or sync in nature. Implemented like below:
    ```java
    Connection connection = ConnectionFactory.fromHostList(this.pinotConfig.getBrokerUrl());
    ResultSetGroup resultSet = connection.execute(new Request("sql", query));
    ```
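    If that snippet runs on every request, one likely culprit (an assumption, not confirmed in this thread) is constructing a new Connection per call; the client is meant to be created once and shared. A sketch:
    ```java
    import org.apache.pinot.client.Connection;
    import org.apache.pinot.client.ConnectionFactory;
    import org.apache.pinot.client.Request;
    import org.apache.pinot.client.ResultSetGroup;

    public class PinotQueryService {
      // One Connection for the life of the application instead of one per request.
      private final Connection connection;

      public PinotQueryService(String brokerUrl) {
        this.connection = ConnectionFactory.fromHostList(brokerUrl);
      }

      public ResultSetGroup query(String sql) {
        return connection.execute(new Request("sql", sql));
      }
    }
    ```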

  • Priyank Bagrecha
    06/15/2022, 12:32 AM
    I asked this in #CDRCA57FC but didn't have much luck, so trying here. What is the recommended approach for batch ingestion of data from, say, S3 or Hive into Pinot: minion-based ingestion vs. standalone ingestion vs. ingestion via a Spark job? Are there any pros/cons among the three?

  • Priyank Bagrecha
    06/15/2022, 12:40 AM
    Has anyone tried deploying Pinot over a service mesh like Envoy or Istio? If yes, could you please share any learnings? Even better if you are integrating with Presto or Looker.

  • Grace Walkuski
    06/15/2022, 4:54 PM
    Hi! I'm trying to update my retentionTimeValue for a table, and the endpoint doesn't seem to be working; can someone help? The body on the left has 742 days and the call is successful, but the returned response still has 738 days…

  • Prashant Pandey
    06/16/2022, 6:48 AM
    Hi team, my Pinot servers are consuming from a topic but very slowly (one table has a rate of 22 events/s). This is despite high consumer lag on these topics AND the servers' resources being under-utilised. What might be the reason behind this?

  • Visar Buza
    06/16/2022, 7:59 AM
    Hi everyone, my team and I have set up a Pinot cluster and started testing it out. To speed up queries we configured a star-tree index. I am not seeing as much of a performance improvement for the percentileest and percentiletdigest functions as for avg or sum; the latter perform much better. I was wondering how the star-tree actually stores percentileest: does it store the whole data structure and then check it for the specific percentile, for example the 95th, or does it do something else? I'd appreciate any help or advice, or a pointer to documentation I may have missed.
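    For reference, percentile aggregations are listed as star-tree function/column pairs like any other; the pre-aggregated value stored for them is a serialized sketch (a digest) rather than a single number, and merging sketches at query time costs more than adding up plain sums, which would be consistent with the smaller speedup. That explanation is an inference, so treat it as unconfirmed. A sketch with placeholder columns:
    ```json
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["country", "deviceType"],
        "functionColumnPairs": [
          "SUM__latencyMs",
          "PERCENTILE_EST__latencyMs",
          "PERCENTILE_TDIGEST__latencyMs"
        ],
        "maxLeafRecords": 10000
      }
    ]
    ```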

  • harnoor
    06/16/2022, 12:48 PM
    ```
    numSegmentsQueried=21380, numSegmentsProcessed=90
    ```
    Hi, we haven't set any bloom filters and don't use any partitioning. The query has a time-range filter. I just wanted to confirm that in the above example, time-based pruning of segments happened at the broker level, and after the broker layer only 90 segments were queried on the servers, right? And if we set bloom filters, are the segments pruned at the server or the broker?
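    For reference, bloom filters are configured per column under tableIndexConfig and are consulted on the servers to skip segments for equality predicates; a sketch (column name is a placeholder):
    ```json
    "tableIndexConfig": {
      "bloomFilterColumns": ["userId"]
    }
    ```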

  • Kevin Peng
    06/16/2022, 8:07 PM
    Hi, I just started playing with ingesting data, so I decided to use the query console and modify the example given:
    ```sql
    INSERT INTO "baseballStats"
    FROM FILE 's3://my-bucket/public_data_set/baseballStats/rawdata/'
    OPTION(taskName=myTask-s3)
    OPTION(input.fs.className=org.apache.pinot.plugin.filesystem.S3PinotFS)
    OPTION(input.fs.prop.accessKey=my-key)
    OPTION(input.fs.prop.secretKey=my-secret)
    OPTION(input.fs.prop.region=us-west-2)
    ```
    I just changed the table name, location, and AWS credentials, but when I run it I get:
    ```json
    [
      {
        "message": "QueryExecutionError:\nshaded.org.apache.commons.httpclient.HttpException: Unable to get tasks states map. Error code 400, Error message: {\"code\":400,\"error\":\"No task is generated for table: segments_aggregated, with task type: SegmentGenerationAndPushTask\"}\n\tat org.apache.pinot.common.minion.MinionClient.executeTask(MinionClient.java:123)\n\tat org.apache.pinot.core.query.executor.sql.SqlQueryExecutor.executeDMLStatement(SqlQueryExecutor.java:95)\n\tat org.apache.pinot.controller.api.resources.PinotQueryResource.executeSqlQuery(PinotQueryResource.java:120)\n\tat org.apache.pinot.controller.api.resources.PinotQueryResource.handlePostSql(PinotQueryResource.java:100)",
        "errorCode": 200
      }
    ]
    ```
    I am also seeing this in the terminal:
    ```
    2022/06/15 02:02:05.604 ERROR [JobDispatcher] [HelixController-pipeline-task-QuickStartCluster-(e322dd58_TASK)] Job configuration is NULL for TaskQueue_SegmentGenerationAndPushTask_Task_SegmentGenerationAndPushTask_cafd03d1-f383-48ba-aea6-bc0a934522db_1655153184751
    ```
    Any ideas what I am doing wrong, or where I can dig in more? I am running this off the latest Docker image for Pinot.

  • Stuart Millholland
    06/17/2022, 2:10 PM
    In the offline flows configuration, I'm using a schedule to run the job; right now in my local test env it's set to run every minute. I'm interested in hearing from anyone with production experience running that job: how often do you decide to run it? The docs say "frequently is better, as extra tasks will not be scheduled unless required". We are planning on ingesting ~30 GB of data per day, resulting in about 100 segments. Our time window for rollup is 1d, so segments older than 1d will be rolled up and moved to the offline table. I'm just curious about thoughts on how often we should run that rollup task.
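    For reference, the schedule is a Quartz cron expression in the task config; a sketch of an hourly run (the periods are placeholders):
    ```json
    "task": {
      "taskTypeConfigsMap": {
        "RealtimeToOfflineSegmentsTask": {
          "schedule": "0 0 * * * ?",
          "bucketTimePeriod": "1d",
          "bufferTimePeriod": "1d"
        }
      }
    }
    ```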

  • Norman he
    06/17/2022, 8:19 PM
    How do I know the timestamp at which my data was ingested into a realtime table? Is there a hidden timestamp I can access? If not, what is the best way to track when a record became available in Pinot realtime?
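    If there is no built-in column for this, one workaround is to stamp rows at ingestion time with a transform (a sketch; it assumes the now() scalar function is available as an ingestion transform in your Pinot version, and the column name is a placeholder):
    ```json
    "ingestionConfig": {
      "transformConfigs": [
        {
          "columnName": "ingestionTimeMs",
          "transformFunction": "now()"
        }
      ]
    }
    ```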

  • Stuart Millholland
    06/20/2022, 7:29 PM
    Another question on managed offline flows. We are getting exactly the results we want in our OFFLINE table, with data nicely rolled up. What we're seeing, though, is that the REALTIME segment is not getting destroyed. The logs say: Trying to destroy segment : immutable_events__0__0__20220620T1729Z, and there's no indication that anything failed that I can find, but the REALTIME segment is still there.

  • abhinav wagle
    06/20/2022, 11:06 PM
    Hello, I am trying to build the Pinot codebase using docker-build.sh and running into the following issue. Any pointers on how to get around it?
    ```
    executor failed running [/bin/sh -c git clone ${PINOT_GIT_URL} ${PINOT_BUILD_DIR} &&     cd ${PINOT_BUILD_DIR} &&     git checkout ${PINOT_BRANCH} &&     mvn install package -DskipTests -Pbin-dist -Pbuild-shaded-jar -Djdk.version=${JDK_VERSION} -T1C &&     mkdir -p ${PINOT_HOME}/configs &&     mkdir -p ${PINOT_HOME}/data &&     cp -r build/* ${PINOT_HOME}/. &&     chmod +x ${PINOT_HOME}/bin/*.sh]: exit code: 1
    ```

  • Alice
    06/20/2022, 11:48 PM
    Hi, team. I want to extract a timestamp from a Kafka topic. There are two kinds of message structure in this topic. One message structure is like { "version": "2.0", "body": { "timestamp": { "$time": "1655768409120" } } } and the other is like { "version": "3.0", "body": { "timestamp": 1655768409120 } }. I can extract "timestamp" separately with { "columnName": "timestamp", "transformFunction": "jsonPathLong(body, '$.timestamp.$time')" } or { "columnName": "timestamp", "transformFunction": "jsonPathLong(body, '$.timestamp')" }. Any idea how to extract "timestamp" from the above two message structures using just one function?
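    One possibility, untested: Pinot supports Groovy ingestion transforms, so a single function could branch on the message shape. This sketch assumes the JSON decoder hands body to Groovy as a parsed map, which should be verified against your decoder's actual behavior:
    ```json
    {
      "columnName": "ts",
      "transformFunction": "Groovy({body.timestamp instanceof Map ? Long.valueOf(body.timestamp['$time'].toString()) : Long.valueOf(body.timestamp.toString())}, body)"
    }
    ```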

  • Alice
    06/21/2022, 10:21 AM
    Hi, I use jsonPathLong to extract the timestamp and it's OK: { "columnName": "timestamp", "transformFunction": "jsonPathLong(body, '$.timestamp')" }. Then I want to update the table config and add this transform function: { "columnName": "hoursSinceEpoch", "transformFunction": "toEpochHours(timestamp)" }. But it returned the following error. I had already added this column to the table schema.

  • AHMEDSHEHATA
    06/21/2022, 2:58 PM
    Hello folks, is there a way to handle a nullable boolean? (jsonPath throws an error, and jsonPathString as well, during ingestion.)
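    If the failure is on missing or null values, the jsonPath extraction functions accept an optional default value as a third argument, which may sidestep the error; a sketch (field and column names are placeholders):
    ```json
    {
      "columnName": "myFlag",
      "transformFunction": "jsonPathString(payload, '$.myFlag', 'false')"
    }
    ```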

  • Priyank Bagrecha
    06/21/2022, 11:24 PM
    I have a design / data model question. I am trying to count distinct accounts allocated to A/B test buckets for an A/B test over a time interval. So I have three fields: the event time, the account id, and a hash map serialized to a string, where the key is the experiment id and the value is the experiment bucket id, so basically {e1:b1,e2:b2,e3:b3...}. An account can be in multiple A/B tests. The query is going to be SELECT exp_id, event_ts, bucket_id, DISTINCTCOUNTHLL(account_id) FROM table WHERE exp_id = <exp id> AND event_ts > start_time AND event_ts < end_time GROUP BY event_ts, bucket_id. event_ts has a granularity of some time interval, so it is not a problem of high cardinality.

  • Alice
    06/22/2022, 1:14 AM
    Hi, my team is debating whether the Pinot table partition config depends on the Kafka topic partitioning policy. We're using the lowlevel consumer type for the Kafka stream. According to one doc (the first picture), it doesn't depend on the Kafka partitioning policy, but according to another doc (the second picture), it does. Could somebody help make this clear?

  • Alice
    06/22/2022, 2:40 AM
    Only sum, max, and min are supported as of today for the RealtimeToOfflineSegmentsTask, right?

  • harry singh
    06/22/2022, 5:00 AM
    Hi, we are using Trino to fetch data stored in Pinot via a BI tool. In the table, the field created_at stores the epoch timestamp as an integer. While querying, the BI tool generates a filter on that field with the syntax from_unixtime(table1.created_at) AT TIME ZONE 'Asia/Kolkata'. This doesn't get pushed down, and Trino tries to load the entire table. Any workaround for this? Also, is there a Looker-Pinot connector coming in the near future?
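    One common workaround, sketched under the assumption that the BI tool's generated filter can be rewritten: move the conversion to the literal side so the comparison happens on the raw integer column, leaving Pinot a pushdown-friendly range predicate (to_unixtime is a Trino function; the dates are placeholders):
    ```sql
    -- The bounds are computed once by Trino; created_at itself stays untouched,
    -- so the range filter can be pushed down to Pinot.
    SELECT *
    FROM table1
    WHERE table1.created_at >= to_unixtime(TIMESTAMP '2022-06-22 00:00:00 Asia/Kolkata')
      AND table1.created_at <  to_unixtime(TIMESTAMP '2022-06-23 00:00:00 Asia/Kolkata')
    ```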