Apache Pinot #general

Shadab Anwar

02/16/2022, 7:10 PM

I want to create a tenant in my Pinot cluster, but the documentation does not clearly explain how I should do that in Kubernetes. Does tagging here mean labelling, as in Kubernetes, as mentioned in the docs? Find my release file:

pinot-release.yaml

Mohemmad Zaid Khan

02/17/2022, 7:11 AM

@User @User Is it deliberate that we don’t call toString() before calling hashCode() here in HashCodePartitionFunction? If it is not, then it’s a bug. https://github.com/apache/pinot/blob/master/pinot-segment-spi/src/main/java/org/ap[…]ache/pinot/segment/spi/partition/HashCodePartitionFunction.java Since we don’t call toString(), a different hashCode is generated for the same value when segment pruning is done by PartitionSegmentPruner, because it always calls toString() on the literal value before invoking getPartitionId.
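
A minimal sketch of the mismatch described above, written in Python rather than taken from Pinot's code. It emulates Java's String.hashCode() and Integer.hashCode(), and assumes the partition id is computed as abs(hashCode) % numPartitions; the helper names and the choice of 4 partitions are hypothetical.

```python
def java_string_hash(s: str) -> int:
    """Emulate java.lang.String.hashCode() as a signed 32-bit int."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # wrap to 32 bits like Java int
    return h - (1 << 32) if h >= (1 << 31) else h


def java_integer_hash(v: int) -> int:
    """java.lang.Integer.hashCode() returns the int value itself."""
    return v


def partition_id(hash_code: int, num_partitions: int) -> int:
    # Assumption: a simplified stand-in for the partition function,
    # abs(hashCode) % numPartitions.
    return abs(hash_code) % num_partitions


NUM_PARTITIONS = 4  # hypothetical

# Same logical value, hashed once as an Integer and once as its string form.
as_integer = partition_id(java_integer_hash(12345), NUM_PARTITIONS)   # 1
as_string = partition_id(java_string_hash("12345"), NUM_PARTITIONS)   # 3
print(as_integer, as_string)
```

If ingestion hashes the raw value while pruning hashes its string form, the two sides can disagree on the partition, which is the situation the message reports.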

Pavel Stejskal

02/17/2022, 10:39 AM

Hello! I’m running a PoC with Pinot on quite heavy data. I’ve got a table with ~20 billion rows and 5 predicates (pred1 has a cardinality of ~5 million unique values, pred2 and pred3 the same, with tight correlation; the last, pred5, has very low cardinality, tens of values). I need to achieve the best possible speed for lookups by these predicates over the whole range (20-50 billion rows). Currently my table creates bloom filters and inverted indexes for these predicates. The second problem is ingestion rate: apparently there is no problem reaching ~160k documents/s, which is insane relative to the resources needed, but at the same time query performance is very bad. The 6 servers are pretty busy with ingestion and GC, so queries take 20-50 seconds. My current setup is 6 servers and 1 controller, with split commit to S3 enabled. Because QPS will be low, I need to achieve low memory allocation for indexes/segments. Do I need to consider some kind of bucketing/hidden partitioning for predicate values, or is Pinot able to handle this data within an SLA of ~1000-3000 ms with proper indexing alone? I can imagine some sort of work delegation between servers, e.g. ~3-4 servers for consuming/segment creation and 6 servers allocated for querying. PS: I’ve got replication 1 to save space, as the final total will be ~20 TB. Segment size is currently 460MB (but the table is set to 1GB). Ingesting from 36 Kafka partitions. Any improvements, thoughts or tricks are welcome! 🙂

Weixiang Sun

02/18/2022, 9:28 PM

A quick question about realtime tables: all data inside the in-memory segment (mutable segment) should be in memory even though Pinot is columnar, right? As for offline segments, are only the columns in use loaded into memory?

Trust Okoroego

02/21/2022, 9:46 AM

Hello! I need to connect Presto to Pinot with basic auth. Could anyone point me to how I can set this in the pinot.properties Presto catalog configuration?

Minglei Zhang

02/21/2022, 12:44 PM

Hi, why do we use DISTINCTCOUNT instead of COUNTDISTINCT here?

Dan DC

02/21/2022, 1:06 PM

Hello, I've got a question about realtime tables. If I'm correct, the Kafka consumer group ID is built in the code using the table name and replica ID; however, I'm not able to find a consumer group for the table in my Kafka cluster. Is there a way to list all the consumer groups that a realtime table is using? It would look like those IDs are stored in ZK under ideal states, but I can't find them. Thanks

Chengxuan Wang

02/22/2022, 4:29 AM

Hello everyone, wondering if we have an st_setsrid-like function, as in crdb, to change the spatial reference system?

Prashant Pandey

02/22/2022, 6:14 AM

Hello team, I am trying to run the Realtime Provisioner for one of my tables with the following config:

RealtimeProvisioningHelper -tableConfigFile /Users/prashant.pandey/table_config.json -numPartitions 4 -pushFrequency null -numHosts 12 -numHours 2 -sampleCompletedSegmentDir /Users/prashant.pandey/segment_dir -ingestionRate 4750 -maxUsableHostMemory 10G -retentionHours 24

The segment is around 426M in size. But this returns the following:

Note:

* Table retention and push frequency ignored for determining retentionHours since it is specified in command
* See <https://docs.pinot.apache.org/operators/operating-pinot/tuning/realtime>
2022/02/22 11:41:31.825 INFO [RealtimeProvisioningHelperCommand] [main] 
Memory used per host (Active/Mapped)

numHosts --> 12              |
numHours
 2 --------> NA              |
2022/02/22 11:41:31.826 INFO [RealtimeProvisioningHelperCommand] [main] 
Optimal segment size

numHosts --> 12              |
numHours
 2 --------> NA              |
2022/02/22 11:41:31.826 INFO [RealtimeProvisioningHelperCommand] [main] 
Consuming memory

numHosts --> 12              |
numHours
 2 --------> NA              |
2022/02/22 11:41:31.827 INFO [RealtimeProvisioningHelperCommand] [main] 
Total number of segments queried per host (for all partitions)

numHosts --> 12              |
numHours
 2 --------> NA              |
Class transformation time: 0.271994872s for 4134 classes or 6.579459893565553E-5s per class

Why am I getting N/A values? Is the config incorrect?

Ali Atıl

02/22/2022, 7:49 AM

Hello everyone 🙂 https://github.com/apache/pinot/issues/6921 I was wondering if there is any update on this issue? Is there any work done on it or are you planning on implementing this feature in the near future? Wish everybody a great day!

KISHORE B R

02/22/2022, 12:52 PM

Hi, is there any way to view the contents stored in a segment?

Karin Wolok

02/22/2022, 1:36 PM

Meetup tomorrow!! Feel free to share with friends who you think would benefit. 🙂 https://www.meetup.com/apache-pinot/events/283880626/

❤️ 1

Karin Wolok

02/22/2022, 5:52 PM

Welcome 👋 to all the new Apache Pinot 🍷 community members! Please tell us who you are and what brought you here! 😃 @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User

🍷 1

🙂 3

Tiger Zhao

02/22/2022, 8:43 PM

Hi, just wondering how replicas work for realtime tables in terms of choosing which replicas to query. From what I can see, it appears that the broker randomly chooses which replica to use when querying.

KISHORE B R

02/23/2022, 2:08 PM

Hi, I was performing stream ingestion through Kafka on a standalone machine. I had 5 partitions created and hence 5 segments in Pinot. The parameter "segment.flush.threshold.size" is set to 10000. When I try ingesting 100k records, only 50k records are available. Will flushing of the consuming segments take time to update, or is 50k the upper bound for the configuration mentioned?

sunny

02/25/2022, 3:15 AM

Hi :-) I am new to Pinot. I am trying to test ACLs. I want to set a table ACL on a user. I checked that the controller and broker have ACL config, but whenever I add a table or change a table ACL, should I restart the controller / broker?

kaivalya apte

02/25/2022, 10:34 AM

Hello, I want to run RealtimeProvisioningHelper. Where can I find a sample completed segment?

Dan DC

02/28/2022, 12:37 PM

Hi, I've noticed my RealtimeToOfflineSegmentTask is not working anymore. I'm losing segments because they are not moved to the offline table. I only see 2 errors in the logs. One says "Job TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_.... exists in JobDAG but JobConfig is missing! Job might have been deleted manually from the JobQueue: TaskQueue_RealtimeToOfflineSegmentsTask, or left in the DAG due to failed clean-up attempt from last purge". The other error is specific to a table and says "Got unexpected instance state map: {<list of pinot servers here>} for segment: <segment name here>"

Saravanan Arumugam

03/01/2022, 5:31 PM

Hi everyone. My name is Saravanan. I got to know about Pinot from one of the YouTube videos by Kishore. It's interesting to see how things work and amazing to see the practical applications of this system. I am here to learn more about it and along the way contribute in any possible manner.

👋 1

Jaromir Hamala

03/02/2022, 8:10 AM

Hello, congratz on the tiered storage! I'm reading the announcement and it says: "Note that this is not implemented as lazy-loading - Pinot servers directly query data on the cloud and are never downloading the entire segments locally." May I ask how it works? I know close to nothing about S3, but I believe it's a dumb blob store. You have to download blobs with segments before querying them, don't you? Am I missing anything? Thanks for any hint!

Ayush Kumar Jha

03/03/2022, 5:55 AM

Hey everyone, this tiered storage thing sounds cool. Is it available for Azure Blob, or is it in the pipeline?

Shadab Anwar

03/03/2022, 9:21 AM

Hi, I just need a confirmation. When I created my tables, they did not have any data, but segments were created. I checked my S3 and there was no segment uploaded. However, as soon as data arrived in my tables, I checked and saw that segments were then uploaded to S3. So I wanted to confirm: are segments uploaded only when they have some data?

Lakshmanan Velusamy

03/03/2022, 9:56 PM

Hi Community, can the timezone argument for DATETRUNC come from another column in the table?

Diana Arnos

03/04/2022, 9:45 AM

Hello everyone 😄 Out of curiosity, do you have any idea when the next version will be released? 👀

Chengxuan Wang

03/04/2022, 2:12 PM

Hey, I was trying to use the geoindex feature in Pinot, but it seems the index doesn’t apply because numEntriesScannedInFilter is high (equal to the number of docs). The Pinot version is 0.8.0. The query is:

select count(*) from some_table where  st_distance(resto_st_point, st_point(116.459717 , 39.955734, 1)) < 3000

The table config is:

"fieldConfigList": [
  {
    "name": "resto_st_point",
    "encodingType": "RAW",
    "indexType": "H3",
    "properties": {
      "resolutions": "12"
    }
  }
],
.....
"noDictionaryColumns": [
  "resto_st_point"
],

And if I change the threshold to 300 (meters), the index hits.

Chengxuan Wang

03/04/2022, 4:26 PM

Another question related to geoindex: from the doc, it seems only ST_Distance can take advantage of the H3 index. How about ST_Contains?

Weixiang Sun

03/04/2022, 7:08 PM

Does Pinot provide any tool to merge small segments into bigger segments? We have a misconfiguration that created a lot of small segments, which is problematic. I am wondering if we can mitigate it by merging the small segments.

Karin Wolok

03/08/2022, 1:39 PM

Hey hey! 👋 Welcome all you new Pinot slack members! 🍷 ❤️ Would love to know who you are and what brought you here! Please take a moment and give us a short 1 liner about yourself! 👂 @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User

👋 2

🙃 1

thankyou 1

Monica

03/09/2022, 4:21 AM

Hey everyone, what size of data do you store in Pinot, how many machines do you use, and what are the machine configurations like? Our current business is about PB size, but we store data differently from Pinot. We use HBase to store the fields' inverted indexes and write row positions in another HBase table. Then we fetch the filtered records from HDFS. We use some techniques to reduce random IO, like compression, encoding, storing data in batches, caching, etc. Because our data is stored in a row format, it's really bad when query results hit large numbers of rows. As far as I know, when a query needs to read large segments (if it can't prune data by partition, star-tree...), is it painful for Pinot, because Pinot may need to download lots of segments from the segment store and rebuild each segment's index in the servers' memory?

Weixiang Sun

03/09/2022, 6:27 AM

When ingesting streaming data from Kafka, how can I concatenate an array of strings from one source column into a destination column as part of the ingestion configuration?
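
One possible shape for such a config, as a hedged sketch rather than a confirmed answer: it assumes a Groovy transform function is acceptable on the cluster and that the multi-value source column reaches the script as an array that supports join. The column names tags and tags_joined are hypothetical, and the Python helper only assembles and prints the JSON fragment for the table config.

```python
import json


def build_ingestion_config(source: str, dest: str, sep: str = ",") -> dict:
    """Build a table-config fragment that joins an array column into a string.

    Assumption: the cluster permits Groovy transform functions and passes
    the multi-value column to the script as an array.
    """
    script = "{" + f"{source}.join('{sep}')" + "}"
    return {
        "ingestionConfig": {
            "transformConfigs": [
                {
                    "columnName": dest,
                    "transformFunction": f"Groovy({script}, {source})",
                }
            ]
        }
    }


# Hypothetical column names, not taken from the question.
config = build_ingestion_config("tags", "tags_joined")
print(json.dumps(config, indent=2))
```

The destination column would also need to exist in the table schema as a single-value string column for the transformed value to land anywhere.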