# general
  • w

    Weixiang Sun

    09/28/2021, 6:07 PM
    What does “disabling the realtime table” mean? No streaming ingestion and no queries served? I don't see any specific documentation about it.
    m
    • 2
    • 1
  • c

    Carl

    09/30/2021, 2:16 AM
    Hi team, we are currently evaluating a solution using a Pinot hybrid table to produce a dataset with both S3 offline historical data and Kafka real-time data. Are there any documents where we can find information about what the hybrid table setup does and doesn't support, regarding e.g. ingestion, querying, and retention? Thanks.
    x
    • 2
    • 5
  • d

    Dunith Dhanushka

    09/30/2021, 3:50 AM
    I know the Lambda architecture is old-school. But is it correct to say that Pinot fits into the ‘serving layer’ there?
    m
    • 2
    • 1
  • d

    Dan DC

    09/30/2021, 12:49 PM
    Is there a way to configure the desired segment size for the segment creation job? I've got some small Avro files and the job seems to create a segment per file; is this how it works? I'd like to squash these small files into one bigger segment. Do I need to pre-process them myself before running the job?
    m
    • 2
    • 2
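If pre-processing turns out to be necessary, `avro-tools concat` can merge Avro files that share a schema; the batching logic itself is simple. A sketch of grouping small files into roughly segment-sized batches (hypothetical file names and sizes, not a Pinot API):

```python
# Group small input files into batches of roughly a target size, so each
# batch can be merged (e.g. with `avro-tools concat`) into one file, and
# hence one segment.

def batch_by_size(files, target_bytes):
    """files: list of (name, size_bytes); returns a list of batches."""
    batches, current, current_size = [], [], 0
    for name, size in files:
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches

files = [("part-0.avro", 40), ("part-1.avro", 50),
         ("part-2.avro", 30), ("part-3.avro", 80)]
print(batch_by_size(files, target_bytes=100))
```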
  • s

    Shishpal Vishnoi

    10/01/2021, 3:09 AM
    I have Parquet data placed in S3 under prefixes such as s3://my_bucket/logs/year=2018/month=01/day=23/*. I want to use these partitions (year, month, day) to filter data by partition value in Pinot. How can I do it?
    m
    y
    • 3
    • 10
  • p

    Prabhakar Reddy

    10/01/2021, 3:19 AM
    Does Pinot support the ARM architecture, so it can leverage the AWS EC2 Graviton 2 processor?
    m
    x
    • 3
    • 9
  • d

    Dan DC

    10/01/2021, 2:29 PM
    I'm looking through the code to see if I can load config properties from env vars instead of files. It doesn't seem like this is supported at the moment; can someone confirm? This is particularly useful for credentials.
    m
    x
    • 3
    • 5
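A common workaround when env-var loading isn't supported natively is to render the properties file from the environment at container startup. A minimal sketch; the property keys below are hypothetical, purely for illustration:

```python
# Render a properties file from environment variables at startup, so
# credentials never live in the checked-in config file.
import os
from string import Template

def render_config(template_text, env=None):
    # Replace ${VAR} placeholders; unknown placeholders are left intact
    # rather than raising, so partial templates still render.
    return Template(template_text).safe_substitute(env or os.environ)

template = "s3.accessKey=${PINOT_S3_ACCESS_KEY}\ns3.secretKey=${PINOT_S3_SECRET_KEY}\n"
rendered = render_config(template, {"PINOT_S3_ACCESS_KEY": "AKIA-EXAMPLE",
                                    "PINOT_S3_SECRET_KEY": "dummy"})
print(rendered)
```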
  • r

    Romeo

    10/03/2021, 10:12 PM
    Hi all, is there a way to specify retention for old updates when using upsert tables? My understanding from the docs is that upserts are still append-only, just that querying shows the newest record. I want to configure it such that all updates are deleted after 180 days except the newest one, even if it's older than 180 days; thus if no new upserts come in within 180 days, only the last upsert will exist. The use case is a serving layer as described in the enterprise application development section of the docs. Thanks
    👍 1
    m
    m
    y
    • 4
    • 10
  • k

    Karin Wolok

    10/04/2021, 10:19 AM
    👋 Heyyyy to all the new 🍷 members! 😄 Please tell us who you are and what brought you here! @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User
    👍 1
    s
    j
    +3
    • 6
    • 6
  • d

    Dan DC

    10/04/2021, 10:45 AM
    Hi, I'm experiencing a strange issue today. My realtime table has a bucketTimePeriod of 1d and a bufferTimePeriod of 2d. My offline workflows are not running and I can see a message in the logs that says "Window data overflows into CONSUMING segments for partition of segments..." then "Found no eligible segments for task: RealtimeToOfflineSegmentsTask with window [1555200000 - 1641600000]. Skipping task generation...". Note these timestamps seem to be in seconds instead of milliseconds. I can see the segment.end.time and segment.start.time values are in seconds, and I'm not sure whether this was the case before. Looking through the code I can see TimeUtils computes the window using milliseconds, so this is why the window spans years instead of 2 days. I'm trying to figure out why this is happening now; any help is appreciated.
    m
    n
    • 3
    • 7
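The unit mismatch described above can be checked directly: the window bounds from the log are exactly one day apart if read as milliseconds (matching a 1d bucket), but read as epoch seconds they are years apart:

```python
from datetime import datetime, timezone

start, end = 1555200000, 1641600000  # window bounds from the task log

# Read as milliseconds (what the task generator computes with),
# the window is exactly one day long:
assert end - start == 86_400_000  # 1d in ms

# But the same numbers read as epoch *seconds* (the unit the segment
# metadata apparently uses) are almost three years apart:
print(datetime.fromtimestamp(start, tz=timezone.utc).date())  # 2019-04-14
print(datetime.fromtimestamp(end, tz=timezone.utc).date())    # 2022-01-08
```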
  • s

    Subin T P

    10/05/2021, 1:38 PM
    Hello🙌 I would like to track the average waiting time of messages in the Pinot input topic. Do we have any metric for that?
    s
    • 2
    • 3
  • p

    Prashant Pandey

    10/06/2021, 7:42 AM
    Hi folks I am trying to build a docker image from the source as:
    Copy code
    ./docker-build.sh pinot:new-range-index master https://github.com/apache/incubator-pinot.git
    This gives me an error:
    Copy code
    executor failed running [/bin/sh -c git clone ${PINOT_GIT_URL} ${PINOT_BUILD_DIR} &&     cd ${PINOT_BUILD_DIR} &&     git checkout ${PINOT_BRANCH} &&     mvn install package -DskipTests -Pbin-dist -Pbuild-shaded-jar -Dkafka.version=${KAFKA_VERSION} -Djdk.version=${JDK_VERSION} &&     mkdir -p ${PINOT_HOME}/configs &&     mkdir -p ${PINOT_HOME}/data &&     cp -r pinot-distribution/target/apache-pinot-*-bin/apache-pinot-*-bin/* ${PINOT_HOME}/. &&     chmod +x ${PINOT_HOME}/bin/*.sh]: exit code: 1
    Anything that I need to do/configure to fix this?
    x
    j
    • 3
    • 11
  • s

    Shubham Dhal

    10/10/2021, 12:12 PM
    Hey, excited to join and looking forward to learning from the discussions here. I am a quant dev who recently shifted to an SWE role and work on streaming infra. I came across Pinot since a lot of my work revolves around Kafka (Kafka Streams, Connect, etc.), and systems like Pinot/Druid fit really well into the ecosystem. For someone who has worked entirely in an investment bank as a desk quant, coming across Apache projects and the community is really exciting. Hope to someday learn enough to be able to contribute to one of these projects :)
    👋 6
    m
    • 2
    • 1
  • m

    Manish Soni

    10/11/2021, 7:31 AM
    Hi Team, I was going through the Pinot Upsert flow documentation and youtube video(

    https://www.youtube.com/watch?v=CnSnLKQLuXc

    ) and I have a couple of questions regarding it: 1. Why do we really need to partition the input stream based on the primary key? 2. Since we maintain a primary-key index, records could go to any segment and we could still update the index accordingly; why must records with the same primary key go to the same segment?
    y
    • 2
    • 2
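A toy model (not Pinot code) of why key-based partitioning matters for upserts: each server builds its primary-key-to-latest-record map only from the stream partitions it consumes, so if versions of one key were spread across partitions owned by different servers, neither server could invalidate the other's copy.

```python
def route(key, num_partitions):
    # Stand-in key-based partitioner (real streams use their own hash)
    return sum(key.encode()) % num_partitions

events = [("user1", "v1"), ("user1", "v2"), ("user2", "v1")]

# Key-based partitioning: every version of "user1" lands in one
# partition, so the single owning server sees v1 superseded by v2.
partitions = {0: [], 1: []}
for key, version in events:
    partitions[route(key, 2)].append((key, version))

for pid, records in partitions.items():
    latest = {}
    for key, version in records:  # per-partition upsert metadata
        latest[key] = version     # older versions are superseded
    print(pid, latest)
```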
  • s

    suraj kamath

    10/12/2021, 6:38 AM
    Hi Folks, we are trying to construct a tabular view from data in Pinot. E.g.: get the list of top 10 userIds from Table A, then get the names of those users via a lookup against Table B. Is this supported using lookup?
    k
    • 2
    • 3
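For reference, Pinot's LOOKUP UDF decorates fact-table rows with dimension-table columns; a sketch with the table and column names from the question (hypothetical), assuming Table B is registered as a dimension table with userId as its primary key:

```sql
SELECT userId,
       LOOKUP('tableB', 'userName', 'userId', userId) AS userName
FROM tableA
LIMIT 10
```

The arguments are: dimension table name, column to fetch, the dimension table's join key, and the fact-table value to join on.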
  • c

    Charles

    10/12/2021, 9:13 AM
    Hi Folks, I am using PrestoDB, but found an issue when using ORDER BY. Could anyone help check?
    d
    x
    k
    • 4
    • 24
  • l

    Luis Fernandez

    10/12/2021, 2:56 PM
    hey friends, our team is trying to query Pinot from the query console, and we are trying to understand some of the latency. We are currently executing a query like the following:
    select * from table where user_id = x
    when we first hit a query like this we get more than 500ms; after we hit it again we get good results. I guess it's because the segment gets closer to memory, but I was wondering why something like this would happen; 500ms is definitely outside our expectations for query latency. The table has indexing and it's a realtime table. Our current config for noDictionaryColumns:
    Copy code
    "noDictionaryColumns": [
            "click_count",
           "impression_count",      
    ],
    so that we can aggregate in our dimensions using "aggregateMetrics": true. Segment flushing configurations:
    Copy code
    "realtime.segment.flush.threshold.rows": "0",
    "realtime.segment.flush.threshold.time": "24h",
    "realtime.segment.flush.segment.size": "250M"
    we have a rangeIndex on serve_time, which is an epoch timestamp rounded to the hour. We have an inverted index on user_id, which is also the sorted column, as well as a partition map with 4 partitions using modulo. We chose 4 partitions because the consuming topic has 4 partitions. The consuming topic is getting around 5k messages a second. Finally, we currently have 2 servers with 4 GB of Java heap and 10 GB on the machine itself, 4 CPUs, and 500 GB of disk space. At the moment of writing this message we have 96 segments in this table. Metrics from when we issue a query like the one seen above:
    Copy code
    timeUsedMs: 264
    numDocsScanned: 40
    totalDocs: 401325330
    numServersQueried: 2
    numServersResponded: 2
    numSegmentsQueried: 93
    numSegmentsProcessed: 93
    numSegmentsMatched: 4
    numConsumingSegmentsQueried: 1
    numEntriesScannedInFilter: 0
    numEntriesScannedPostFilter: 320
    numGroupsLimitReached: false
    partialResponse: -
    minConsumingFreshnessTimeMs: 1634050463550
    offlineThreadCpuTimeNs: 0
    realtimeThreadCpuTimeNs: 159743463
    could anyone direct me on what to look into? Even for queries like the one above, numDocsScanned and numEntriesScannedPostFilter don't seem high per the troubleshooting steps.
    r
    k
    +2
    • 5
    • 70
  • l

    lalit bhagtani

    10/13/2021, 3:40 PM
    Hi all, I have one question regarding the failure of a server node. Here is the situation: my Pinot cluster is ingesting data from Kafka, and before a segment is completed, the server hosting the consuming segment dies. When a new server comes up, will it start consuming these lost records from Kafka again, and if yes, how will it know that it has to consume from that offset on that partition? Thanks
    m
    • 2
    • 2
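From the docs' description of low-level (partition-level) consumption, the start offset of each consuming segment comes from the previous segment's committed end offset stored in segment metadata (ZooKeeper), so rows consumed but never committed are simply re-consumed by the replacement server. A toy sketch of that checkpointing idea (not Pinot internals):

```python
# Persisted end offset of the last *completed* segment for a partition;
# Pinot keeps the equivalent in segment metadata in ZooKeeper.
committed = {"topic-partition-3": 120}

stream = list(range(150))  # offsets 0..149 still available in Kafka

def resume_offset(partition):
    # A replacement server resumes from the last committed end offset,
    # not from wherever the dead server had reached in-memory.
    return committed.get(partition, 0)

start = resume_offset("topic-partition-3")
replayed = stream[start:]
print(start, len(replayed))  # the 30 uncommitted rows are re-consumed
```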
  • p

    Prateek Singhal

    10/13/2021, 11:43 PM
    Hi Folks, I have a couple of questions regarding Star-Tree index: 1. In the image attached, is it possible for me to get D1-V1 and D1-Star as results of the same query? The assumption here is that since nodes in star-tree index are pre-aggregated, can I somehow pull two of them out in one-go. (I guess subqueries with 2 different filter conditions would be one solution, but Pinot does not support that) 2. In the realtime table, is it possible to use star-tree index? My understanding is that since star-tree index requires pre-aggregation, it may not be applicable to real-time tables. If that’s the case, is it possible to activate star-tree index without upserts?
    m
    k
    • 3
    • 5
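For reference, a star-tree index is declared in the table config roughly as below; the dimension and metric names (D1, D2, V1) are taken from the question, and exact keys may vary by Pinot version:

```json
"tableIndexConfig": {
  "starTreeIndexConfigs": [
    {
      "dimensionsSplitOrder": ["D1", "D2"],
      "skipStarNodeCreationForDimensions": [],
      "functionColumnPairs": ["SUM__V1"],
      "maxLeafRecords": 10000
    }
  ]
}
```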
  • s

    suraj kamath

    10/18/2021, 6:17 AM
    Hi Team, I was looking at the ID_SET and IN_SUB_QUERY provisions in Pinot for handling subqueries, referring to the below video:

    https://www.youtube.com/watch?v=HryANqHnTQk&t=686s

    Here I have a few questions: 1. Is ID_SET only supported for integer values? 2. Is there support for alphanumeric strings? Any pointers would be helpful.
    k
    j
    j
    • 4
    • 4
  • m

    Manish Soni

    10/18/2021, 6:22 AM
    Hi Team, is there any document available where I can find the definitions of the counters/metrics Pinot exposes for Prometheus?
    j
    • 2
    • 1
  • v

    Vibhor Jain

    10/18/2021, 12:05 PM
    Hi Team, as part of handling duplicates in our hybrid table, we thought of using "mergeType": "dedup" for moving data from the realtime to the offline table. The problem we are facing is that one of our columns stores an encrypted value, and even for duplicate rows this value changes every time. Is there a way to perform "dedup" on a subset of columns when moving data to the offline table via minion?
    m
    c
    • 3
    • 5
  • k

    Kamal Chavda

    10/18/2021, 4:32 PM
    Hi All, any advice/suggestions on how to handle null values in a date column whose valid values can equal the default 1970-01-01 in Pinot (e.g. date of birth)? In my realtime table schema I have the date defined as below under dateTimeFieldSpecs:
    Copy code
    {
      "name": "date_of_birth",
      "dataType": "TIMESTAMP",
      "format": "1:DAYS:TIMESTAMP",
      "granularity": "1:DAYS"
    }
    m
    • 2
    • 3
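One approach worth checking (version-dependent, so treat this as a sketch rather than a confirmed fix) is enabling Pinot's null handling in the table config so that true nulls become distinguishable from legitimate 1970-01-01 values via IS NULL / IS NOT NULL predicates:

```json
"tableIndexConfig": {
  "nullHandlingEnabled": true
}
```

With this set at ingestion time, a query like `SELECT ... WHERE date_of_birth IS NOT NULL` can exclude the genuinely missing dates.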
  • a

    Ali Atıl

    10/20/2021, 8:41 AM
    Hey, does the H3 index only apply to the ST_Distance function? If so, any suggestions for querying points that lie inside a polygon in the fastest way possible? I have a table with latitude and longitude columns.
    m
    y
    • 3
    • 2
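For reference, Pinot's geospatial functions include ST_Contains, ST_Point, and ST_GeomFromText alongside ST_Distance; a sketch of a point-in-polygon filter with a made-up table name and polygon (whether the H3 index accelerates it depends on the Pinot version, so it's worth benchmarking):

```sql
SELECT *
FROM my_geo_table
WHERE ST_Contains(
        ST_GeomFromText('POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))'),
        ST_Point(longitude, latitude)  -- ST_Point takes x (longitude) first
      ) = 1
LIMIT 10
```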
  • k

    kauts shukla

    10/20/2021, 8:43 AM
    Hi All, I have a defined schema: "name": "properties", "dataType": "JSON". I'm consuming messages from Kafka. In the table the value is coming in as NULL, whereas in the Kafka topic the data looks as expected: {"type": "event", "ip": "127.0.0.1", "created_at": 1634102442620, "properties": {"city": "abc", "clinic": "", "symptomId": "", "treatmentId": ""}}. Any help on why this is happening?
    m
    n
    • 3
    • 3
  • a

    Alexander Vivas

    10/20/2021, 12:02 PM
    hey guys, good afternoon. I'd like to know if any of you have had to code a client to query Pinot in Scala. Is there a library to connect to Pinot for Scala clients?
    k
    • 2
    • 4
  • a

    Arpit

    10/20/2021, 1:59 PM
    Hi, I have created a realtime table on a 0.8.0 Pinot cluster. Data is getting into Pinot, but I see this log message for one segment: "Stopping consumption due to row limit nRows=100000 numRowsIndxed=10000 numRowsconsumed=100000". I also checked the debug endpoint in Swagger and it shows the below result for the segment
    k
    m
    +2
    • 5
    • 23
  • m

    Map

    10/20/2021, 4:19 PM
    Hi, if Pinot expects a field to be numeric but receives a string value, how does Pinot handle it?
    k
    a
    • 3
    • 14
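Not an authoritative statement of Pinot's behavior (the thread presumably answers that), but the usual pattern in ingestion pipelines is parse-if-possible, otherwise fall back to the field's default null value. A sketch; the fallback shown (INT min) is an assumption, not necessarily Pinot's choice:

```python
def coerce_to_int(value, default_null=-2**31):
    # Numeric strings like "42" parse fine; non-numeric values (or None)
    # fall back to the default null value instead of failing the row.
    try:
        return int(value)
    except (TypeError, ValueError):
        return default_null

print(coerce_to_int("42"))    # 42
print(coerce_to_int("oops"))  # -2147483648
```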
  • g

    Grant Sherrick

    10/22/2021, 3:02 PM
    👋 Hi y’all, I’m taking the day off work to volunteer on Pinot. I mostly work on streaming-data-related tasks day to day; would it be reasonable if I gave implementing a
    KafkaThriftMessageDecoder
    a try? I haven’t seen anyone clamoring for it, but I thought it might be worth giving it a go.
    👋 1
    k
    • 2
    • 11
  • a

    Arpit

    10/25/2021, 3:17 PM
    Hi, I would like to set up a Pinot cluster with multiple controllers, servers, and brokers on different hosts. I can see in the documentation that the controllers should have a shared volume. Should the servers, brokers, and controllers running on different hosts be reachable from each other?
    m
    s
    +2
    • 5
    • 14