Apache Pinot #general

murat migdisoglu

10/14/2020, 9:31 PM

@here I have another issue now. During realtime ingestion, after publishing the first segment with 50K rows, Pinot does not ingest any more data. Maybe it is not creating the new segment; I'm not sure. It's an append-type table ("segmentPushType": "APPEND") with "segmentPushFrequency": "HOURLY". Where might the issue be? I can't see any exception in any log file.

Cesar

10/15/2020, 7:20 PM

I'm seeing what I believe is very high latency (a few hundred ms) when processing some test queries in a local/test Pinot 0.5 setup, and I'm wondering what might be the cause. Can anyone please advise whether I'm doing something very wrong in this setup: https://www.dropbox.com/s/dio3zi5w1vdmclo/pinot-questions.txt?dl=0 ?

Venkatesan V

10/19/2020, 5:30 PM

Hello. I am trying to explore Pinot on Kubernetes and have a few questions: 1. The official Helm chart defines the broker and controller as StatefulSets. Is there any reason I need to keep them as StatefulSets as opposed to plain Deployments? As in, is some ordering needed? 2. The controller requires a PV to be attached as a volume (ref: https://github.com/apache/incubator-pinot/blob/master/kubernetes/helm/pinot/templates/controller/statefulset.yaml#L71). What is the nature of the data stored there, what purpose does it serve, and what is the general recommendation for the disk size? 3. The broker doesn't look like it needs any specific disk. So, extending 1 and 2 above, why does it need to be a StatefulSet? What ordering is needed here?

Chundong Wang

10/19/2020, 6:27 PM

About scalar functions: can such functions be used only as transform functions, or can they also be applied to the result of an aggregation? E.g. to return the greater value of two PERCENTILETDIGEST50

Seunghyun

10/19/2020, 10:56 PM

Do we have documentation on how to use a map column? I see the MAP_VALUE function at https://docs.pinot.apache.org/users/user-guide-query/supported-transformations#multi-value-column-functions However, I don’t see any instructions on how to configure the schema and store map values in segments.

Sri Surya

10/21/2020, 4:37 AM

Hello! I am very new to Pinot. Can anyone guide me on how to install and run Pinot on a local Windows machine?

Sri Surya

10/21/2020, 9:38 AM

When I try to run the command below (on Windows, using Git Bash), I get the following error, even though all packages installed correctly: $ bin/quick-start-streaming.sh Error: Could not find or load main class org.apache.pinot.tools.RealtimeQuickStart Caused by: java.lang.ClassNotFoundException: org.apache.pinot.tools.RealtimeQuickStart @here

DarrenApacheDrill

10/22/2020, 3:40 AM

Hiiiiii all . 👍 ❤️ 🙂

🎉 2

🍷 2

🥃 2

👏 2

Derek

10/23/2020, 2:57 PM

hi all, first time here 🙂 i have a question about segments in a realtime table. i want to repopulate our tables, so i used the rest API to delete all segments in our table, and then pushed a bunch of stuff into our kafka topic. after deleting the segments, no new data is showing up and no new segments are being created. is there something else i need to do for new segments to be created?

Chundong Wang

10/23/2020, 8:05 PM

Any recommendation to do rolling aggregation (eg movingAvg of past 7 days for each hour of last 24 hours) efficiently inside Pinot?

Itzik Lavon

10/24/2020, 4:44 PM

hi guys, i have a small question: what are the differences between Pinot and ClickHouse (performance-wise, if someone has tried them both)? ClickHouse is a bit “easier” to use, but optimizations are harder, although it seems to perform really well even without any special config. Pinot, the way i experienced it at least, is much harder to set up, has no additional indexes, and insertion is only via Kafka or another stream processor.

Noah Prince

10/25/2020, 6:42 PM

How does Pinot scale with offline tables? I get the impression that every offline segment is loaded into an active offline server, which implies all of your offline data is loaded in some server. This seems very expensive, especially for something like 2 year old data. Does pinot lazily load old segments based on query demand? And how do indexes scale into offline tables?

Noah Prince

10/26/2020, 1:47 PM

@User and I were discussing my team modifying the Pinot server to include a lazy mode that would make it pull segments lazily as they are requested, using an LRU cache. It should just take some modification to the SegmentDataManager and maybe the table manager. This would allow using S3 as the primary storage, with Pinot as the query/caching layer for long-term historical tiers of data. Similar to the tiering example, you’d have a third set of lazy servers for reading data older than 2 weeks. This is explicitly to avoid large EBS volume costs for very large data sets.

My main concern is this: a moderately sized dataset for us is 130GB a day, and we have some that can be in the terabyte range per day. Using 500MB segments, you’re looking at ~260 segments a day. Maybe ~80k segments a year. In this case, broker pruning is very important because any segment query sent to the lazy server means materializing data from S3. This data is mainly time series, which means segments would be in time-bound chunks. Does the Pinot broker prune segments by time? How is the broker managing segments? Does it just have an in-memory list of all segments for all tables? If so, metadata pruning will become a bottleneck for us on most queries. I’d like to see query time scale logarithmically with the size of the data.

Other concerns for us are around data types. It does not seem that Pinot supports data types we commonly use, like uint64, fixed-point numbers, etc. It also doesn’t seem to support nested data structures. How difficult would this be to add? Java BigInteger and BigDecimal could handle the former, assuming we implemented metadata handling. Nested data types are a little more nuanced.
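The lazy mode described above can be illustrated with a minimal sketch, assuming a size-bounded LRU cache in front of a deep store. `LazySegmentCache`, `fetch`, and `max_bytes` are hypothetical names chosen for illustration; none of this is Pinot's SegmentDataManager API.

```python
from collections import OrderedDict
from typing import Callable, List


class LazySegmentCache:
    """Hypothetical LRU cache that fetches a segment only on first access."""

    def __init__(self, fetch: Callable[[str], bytes], max_bytes: int) -> None:
        self._fetch = fetch          # assumption: downloads one segment, e.g. from S3
        self._max_bytes = max_bytes  # local storage budget for cached segments
        self._used = 0
        self._segments: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, name: str) -> bytes:
        if name in self._segments:
            # Cache hit: mark the segment as most recently used.
            self._segments.move_to_end(name)
            return self._segments[name]
        # Cache miss: materialize the segment from the deep store.
        data = self._fetch(name)
        self._segments[name] = data
        self._used += len(data)
        # Evict least recently used segments until back under budget,
        # always keeping the segment that was just requested.
        while self._used > self._max_bytes and len(self._segments) > 1:
            _, evicted = self._segments.popitem(last=False)
            self._used -= len(evicted)
        return data

    def cached(self) -> List[str]:
        """Segment names currently held, least recently used first."""
        return list(self._segments)
```

Every miss costs one fetch from the deep store, which is why pruning segments before they reach such a server matters so much in this design.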

🙌 2

👍 3

Dharak Kharod

10/26/2020, 9:51 PM

Hi, we are working on fact-dim join-like functionality via a Lookup UDF in Pinot. You can find more details in the GitHub issue https://github.com/apache/incubator-pinot/issues/6191 Kindly provide your feedback. Thanks.

Noah Prince

10/27/2020, 12:54 AM

Having some issues getting a custom StreamMessageDecoder plugin to load. Details in the thread, so as not to clog up the channel.

lâm nguyễn hoàng

10/27/2020, 8:45 PM

Hi everyone, please look at the error I get when adding the realtime table: error 500 (ClassNotFoundException: org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory) INFO [AddTableCommand] [main] {"code": 500, "error": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory"}

Seunghyun

10/27/2020, 8:54 PM

Thank you for the quick response. Time based pruning on the broker side is the optimization that would help for most of the time series use cases. We will work on this one.
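As a rough illustration of time-based pruning, here is a minimal sketch, assuming segments are non-overlapping, time-bound chunks with inclusive start and end times. `SegmentMeta` and `TimeIndex` are hypothetical names, not Pinot broker classes. Under that assumption, a sorted index plus binary search finds the matching range in logarithmic time.

```python
import bisect
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class SegmentMeta:
    """Hypothetical per-segment metadata: name plus inclusive time range."""
    name: str
    start_ms: int
    end_ms: int


class TimeIndex:
    """Sorted index over non-overlapping segments (an assumption of this sketch)."""

    def __init__(self, segments: Iterable[SegmentMeta]) -> None:
        self._segments = sorted(segments, key=lambda s: s.start_ms)
        # With non-overlapping segments, end times are sorted as well.
        self._starts = [s.start_ms for s in self._segments]
        self._ends = [s.end_ms for s in self._segments]

    def overlapping(self, query_start_ms: int, query_end_ms: int) -> List[SegmentMeta]:
        # First segment that ends at or after the query start.
        lo = bisect.bisect_left(self._ends, query_start_ms)
        # One past the last segment that starts at or before the query end.
        hi = bisect.bisect_right(self._starts, query_end_ms)
        return self._segments[lo:hi]
```

If segments can overlap in time, the end times are no longer sorted and this shortcut does not hold; an interval tree or a linear scan over the metadata would be needed instead.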

Noah Prince

10/28/2020, 2:02 PM

Does a segment always consist of columns.psf, creation.meta, index_map, metadata.properties? I’m thinking that for the S3 lazy loading, it might make sense to have separate caching settings for metadata vs columns.psf. For example, you may want to eagerly load all or most of the metadata, since it’s small and means segments can be eliminated quickly.

Noah Prince

10/28/2020, 6:23 PM

Is it possible to pause Kafka consumption on a table without disabling queries? It seems that disabling the table via ChangeTableState makes queries return empty results as well as pausing Kafka consumption.

Ravi Chikkam

10/30/2020, 12:37 AM

is this meeting only for uber employees?

Ravi Chikkam

10/30/2020, 2:51 AM

Are there any comparisons to Apache Druid?

Tanmay Movva

10/30/2020, 7:07 AM

Hello, I am trying to upload schemas and use the Swagger API in general, but I get TypeError: Failed to fetch and an "Undocumented" response for any API call. FYI, we have deployed Pinot in K8s.

Noah Prince

11/02/2020, 3:22 PM

How does fault tolerance work with servers in Pinot? I.e., what happens when a server crashes? My guess would be that, since the server is a Helix participant, the controller somehow sees that it has crashed and sends messages to other servers to take over the segments from the crashed server. Then, when the server reboots, the controller sees that a new server is available and starts distributing segments to it? Would that require a rebalance to truly get a bunch of segments back onto it?

Kenny Bastani

11/02/2020, 10:25 PM

Welcome @User! So excited to have you

Greg Simons

11/02/2020, 10:34 PM

Hey @User, welcome to the wonderful world of Apache Pinot. Great to have you on board!

❤️ 2

Kenny Bastani

11/02/2020, 11:04 PM

@here Hey all. I highly recommend spending some time with @User to talk about your use case and experience with Pinot. This is extremely crucial to the project and for our future. I rarely use @here notifications in this channel, but in this case, it’s important. Thanks everyone. calendly.com/karin-wolok

👍 4

Chundong Wang

11/04/2020, 10:03 PM

When it comes to real-time segments taking precedence over offline segments at query time, is the cutoff hard-coded to 24 hours, or is it a merge between real-time and offline segments, so that if there’s no data in real-time for the specified time range, aggregates from offline segments are served as the query result?

Noah Prince

11/05/2020, 4:52 PM

Also is there an easy way to just sink a Spark dataframe to pinot segments?

vmarchaud

11/06/2020, 9:43 AM

Hey, I'm looking to set up Pinot with GCS (in k8s), but I don't see how I'm supposed to add the plugin. Is there a repository with built plugins, or are they bundled by default with the Docker images? Thanks

vmarchaud

11/06/2020, 10:01 AM

Found out that the plugin only needs to be included using -Dplugins.include=pinot-gcs and that it's bundled by default. However, I'm trying out 0.6.0-RC and I got the following error:

2020/11/06 09:07:02.367 ERROR [PluginManager] [main] Failed to load plugin [pinot-gcs] from dir [/opt/pinot/plugins/pinot-file-system/pinot-gcs] 
java.lang.IllegalArgumentException: object is not an instance of declaring class

➕ 1