Apache Pinot #general

kant

07/11/2020, 10:49 PM

Can I do pagination using Presto and Pinot on the joined result set? For example, can I get 10K rows at a time, up to say 1M? Currently we use ES, which displays only 10K rows (without joins) by default. Beyond that there is the scroll API, but it doesn't work well and takes a lot of time.

Mayank

07/13/2020, 5:37 PM

Can you elaborate a bit on your usage of BYTES types columns?

kant

07/15/2020, 7:39 AM

what's a good aws instance for pinot and presto?

Ankit

07/20/2020, 6:06 PM

Hi all, can anyone point out what the ideal state of a table means in Pinot? I'm getting the exception: "Ideal state of table tableName does not exist"

Mayank

07/21/2020, 2:29 AM

No, this is the number of segments the query had to process.

Kishore G

07/21/2020, 8:58 PM

Amazing video by @User on Deploying Pinot on Kubernetes

https://www.youtube.com/watch?v=UR6rYEMZYLA

🙏 1

👍 10

Mayank

07/22/2020, 5:02 AM

Seems something went wrong in executing the plan (building the operators). Unfortunately it does not say much more. What version are you running?

Damiano

07/22/2020, 10:38 PM

Hello everybody! I would like to test Presto; I read that it supports window functions. So, can I query Pinot with window functions using Presto?

Damiano

07/23/2020, 2:01 PM

Hello everybody! I finally got my small cluster up and running, thank you all for the support! 🙂 I am doing a final test to understand whether I need to add one more node. However, to make one thing a little clearer, I would like to know if we can "organize" data inside a Pinot Server by a specific column. For those of you who know Citus, I am referring to the distribution key for shards. Basically, if we have a specific column that is often used in a group by clause, how can we store documents that have the same value for that column on the same server? I think it is important because, for example, in my custom aggregation function I need to sort the documents of each segment (in aggregateGroupBySV()) before working on them (I am trying to do something similar to what window functions do). I know that a Server has multiple segments and that the document order in segments could be random, BUT if I have all the documents for that specific key on the same server, I could avoid sorting everything again in extractFinalResult(), which is called at the Broker level. I know there is a merge() method used to merge the results of each segment. If I can do something after that merge, I can shift all the computation to the Server level instead of the Broker. Otherwise the Broker has to work with all the results of each Server and then sort and compute (in my case).
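To make the "distribution key" idea above concrete, here is a minimal sketch in plain Python of hash-based placement by a group-by column. It is only an illustration of the concept, not Pinot's API: assign_server, partition_rows and the sample rows are hypothetical names and data.

```python
import zlib
from collections import defaultdict


def assign_server(key: str, num_servers: int) -> int:
    """Map a group-by key to a server index with a stable hash (hypothetical helper)."""
    return zlib.crc32(key.encode("utf-8")) % num_servers


def partition_rows(rows, key_field, num_servers):
    """Group rows by the server index that owns each row's key."""
    servers = defaultdict(list)
    for row in rows:
        servers[assign_server(row[key_field], num_servers)].append(row)
    return dict(servers)


# Illustrative rows only; the column names are assumptions.
rows = [
    {"strategy_id": "a", "profit": 3},
    {"strategy_id": "b", "profit": 5},
    {"strategy_id": "a", "profit": 7},
]
placement = partition_rows(rows, "strategy_id", num_servers=3)

# Record which servers hold each key; with hash placement it is always exactly one.
owners = {}
for server, server_rows in placement.items():
    for row in server_rows:
        owners.setdefault(row["strategy_id"], set()).add(server)
```

Because every row with the same key lands on one server, a per-key sort or window-style computation would never need rows from another server.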

Buchi Reddy

07/23/2020, 8:45 PM

Simple question: Does broker send the list of segments to be queried to Server along with the query? I think not but want to double check.

Apoorva Moghey

07/26/2020, 1:31 PM

A simple question: does it make sense to keep segment retention at 4 years for realtime tables? Should we go for a hybrid table? What is the recommendation? Or does this decision depend on other factors? In our case, we will always be ingesting data via Kafka.

Apoorva Moghey

07/26/2020, 3:40 PM

Another simple question. I was following this doc https://docs.pinot.apache.org/operators/operating-pinot/tuning/realtime and it mentions *pinot.server.instance.realtime.alloc.offheap: true*. But I can't figure out the format of the Pinot server configuration: is it YAML or JSON? There is a parameter called -configFileName, but its description says "Broker Starter Config file". I am a little confused here. Can anybody help?

Ravi Singal

07/28/2020, 5:54 AM

We are planning to have a hybrid table with much longer retention for offline table. We are ingesting realtime data from kafka. How can we move the segments from realtime table to offline table after 7 days?

Kishore G

07/28/2020, 6:15 PM

Sorry for spamming, but a ZK explorer is something Pinot devs have been waiting on for years. Well, the wait is finally over (well, almost). Here is the new PR https://github.com/apache/incubator-pinot/pull/5763 that makes it super easy to explore cluster state and monitor Pinot.

🎉 12

🚀 1

👏 11

Damiano

08/01/2020, 8:04 AM

Hello, i do not remember where i should increase the timeout of the query during the debug of the code. Could anyone remind me?

Damiano

08/01/2020, 9:20 AM

Could anyone explain why Pinot is using GroupByOrderByCombineOperator instead of GroupByCombineOperator for queries like:

select max(profit) from transcript group by strategy_id

Renato Marroquín Mogrovejo

08/05/2020, 9:41 PM

hi there, new user here 😄 I just cloned incubator-pinot and was trying to run a specific test in the following way, from inside incubator-pinot/pinot-controller:

mvn test -Dtest=LeadControllerManagerTest

but I get this error

[ERROR] Failed to execute goal on project pinot-controller: Could not resolve dependencies for project org.apache.pinot:pinot-controller:jar:0.5.0-SNAPSHOT: Failed to collect dependencies at org.apache.pinot:pinot-common:jar:0.5.0-SNAPSHOT: Failed to read artifact descriptor for org.apache.pinot:pinot-common:jar:0.5.0-SNAPSHOT: Failure to find org.apache.pinot:pinot:pom:${revision}${sha1} in https://repo.maven.apache.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced -> [Help 1]

any ideas what it could be?

Kishore G

08/07/2020, 2:50 PM

Can you check the broker log, when you run the query via query console?

Mayank

08/07/2020, 4:38 PM

Pagination is currently not supported with group by

Neha Pawar

08/07/2020, 6:29 PM

Hey folks! Thanks for the great response on the Pinot videos we’ve released so far. We’re thinking about the next set of topics for videos, and would love your suggestions. Please vote for topics you’d be interested in watching, or add suggestions of your own. Thanks! https://poll.ly/#/LONwJ48K

🍷 3

Itzik Lavon

08/08/2020, 9:09 AM

Hi all, I'm new to this. Are there any good videos on how to gather data from S3 and query it? Another question: I currently use Athena to do the queries. Is Pinot better? Sorry for my ignorance 🙂

Ravikiran Katneni

08/10/2020, 2:51 AM

Hi all, I am new to Pinot. I am trying to ingest/upload a 7GB TPC-H Lineitem table into Pinot. The entire file is getting uploaded as a single segment. Does Pinot support any configuration to specify a segmentation column (a column based on which segments get created from the ingested file/data)? When I explicitly split the file into multiple files, multiple segments get created. Does Pinot expect pre-segmented data to be ingested/uploaded?
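The manual pre-splitting described above could be scripted roughly as follows. This is a minimal sketch in plain Python, not a Pinot feature: split_rows and rows_per_file are hypothetical names, and each yielded chunk would still have to be written to its own input file before ingestion.

```python
from itertools import islice


def split_rows(rows, rows_per_file):
    """Yield consecutive chunks of at most rows_per_file rows (hypothetical helper)."""
    if rows_per_file <= 0:
        raise ValueError("rows_per_file must be positive")
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, rows_per_file))
        if not chunk:
            return
        yield chunk


# Ten stand-in rows split into files of at most four rows each.
chunks = list(split_rows(range(10), 4))
```

The same generator works on an open file handle, so a large input never has to be loaded into memory at once.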

Andrew First

08/10/2020, 7:15 PM

Hi! I’m evaluating Pinot for the following use case and want to know if it’s a good fit, or any best practices to help achieve it.

• Ingest events for ~1B total users at ~100k/second
• Run aggregation queries on events filtered on individual user IDs at ~10k/second, each query completing in < 100ms

What I understand is that the data is organized primarily by time (segments) and secondarily (within a segment) by indexes. In this case, I tried sorting by user ID. To query for a particular user ID, it seems that each segment must be queried, since the data is not consolidated by user. The runtime would be O(s log n) where s is the number of segments in a particular timeframe and n is the number of events per segment. Thus, it seems that Pinot may not scale when there are tens/hundreds of thousands of segments and may not be a good fit here. However, this use case seems similar to the use cases at Linkedin, such as the “who’s viewed your profile” feature, which also would operate on events for individual users. Is my understanding correct, and is there anything I’m missing here? Would appreciate any thoughts or resources you could point me to. Thanks!
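As a rough illustration of the O(s log n) estimate above, the sketch below counts one binary search over a sorted column per segment. It only models the reasoning in the message and assumes no segment pruning; estimated_lookups and the segment counts are hypothetical.

```python
import math


def estimated_lookups(num_segments: int, events_per_segment: int) -> float:
    """O(s log n) estimate: one binary search over a sorted column per segment."""
    return num_segments * math.log2(events_per_segment)


# Same events per segment, 1000x more segments (illustrative numbers only).
few = estimated_lookups(100, 1_000_000)
many = estimated_lookups(100_000, 1_000_000)
ratio = many / few
```

Under this model the per-query cost grows linearly with the number of segments touched, which is why the segment count rather than the segment size dominates.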

Anthony Tran

08/11/2020, 1:57 AM

Hi Team! First post! We’re evaluating Pinot and wanted to get your thoughts on whether it’s a good fit for our use case and/or best practices to make it happen. The main complication we’re running into is that we may need to be able to mutate our data, which may not be a good fit for Pinot (maybe this can be avoided with some smarter data modeling or some future tech?). We’re attracted to Pinot because of its ability to perform fast aggregation and reduce eng cost from having to do things like precubing data.

• In particular we have two streams of order data (e.g. you can imagine booking details like total price in $, an order id, account id, user name, date, etc) that are flowing into our system.
• The two streams (let’s call them “Fast Stream” and “Accurate Stream”) of order data may overlap (i.e. the Fast Stream and the Accurate Stream may both have order info for “order 1” but Fast Stream may be the only one that has “order 2” or Accurate Stream may be the only one that has "order 3")
• Ideally we want to merge these streams together such that whenever they overlap (if they overlap), we use the data from Accurate Stream instead because it has richer user details and more accurate reporting of price.

We want to be able to do things like get time based aggregate totals based on account id quickly. Is there a good way to model this since we have two data sources we want to merge? Thanks so much for your help!
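The merge rule described above (the Accurate Stream wins whenever both streams have the same order) can be sketched in plain Python as follows. This only illustrates the desired semantics, not how Pinot would ingest or update the data; merge_orders, the order_id and price fields, and the sample records are assumptions.

```python
def merge_orders(fast, accurate):
    """Merge two lists of order dicts by order_id; accurate records win on overlap."""
    merged = {order["order_id"]: order for order in fast}
    # Updating with the accurate stream second overwrites any overlapping fast record.
    merged.update({order["order_id"]: order for order in accurate})
    return merged


# Order 1 is in both streams, order 2 only in fast, order 3 only in accurate.
fast = [{"order_id": 1, "price": 10.0}, {"order_id": 2, "price": 20.0}]
accurate = [{"order_id": 1, "price": 9.5}, {"order_id": 3, "price": 30.0}]
merged = merge_orders(fast, accurate)
```

Orders present in only one stream pass through unchanged, so the merged view covers the union of both streams.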

Adrian Cole

08/12/2020, 5:38 AM

aloha folks. I'm toying with ServiceManager to start 3-in-1, such that controller, server and broker start as one

Samuel Dehouck

08/14/2020, 12:45 AM

hey everyone, seems like images are missing for the indexing section of the documentation: https://docs.pinot.apache.org/basics/indexing/forward-index

Sundar Djeabalane

08/14/2020, 9:17 PM

Hi everyone! I’m looking for options to expose Presto (the Presto coordinator) outside Kubernetes with some basic auth. We are currently running Pinot and Presto inside Kubernetes. We have a requirement to expose Presto externally as a service so clients can connect via Presto and consume the data. Please let me know if anyone has implemented authentication at the Presto layer.

Adrian Cole

08/18/2020, 12:53 AM

aloha. I was wondering what the client upgrade policy would be for org.apache.pinot:pinot-java-client and org.apache.pinot:pinot-tools wrt order

Ankit

08/19/2020, 7:46 AM

Hi all, Pinot seems to have Map data type support for columns. Is it possible to query it? If so, is there any example I can refer to?

Joey Pereira

08/19/2020, 5:13 PM

I was having some trouble making a schema/table with transform configs, following the documentation at https://docs.pinot.apache.org/developers/advanced/ingestion-level-transformations#column-transformation (=> thread)