Apache Pinot #general

Dan Hill

05/30/2020, 6:24 PM

Ah, ok.

Kishore G

05/30/2020, 6:24 PM

0,1,2, ...

Kishore G

05/30/2020, 6:24 PM

you can also use it as partition_id

Dan Hill

05/30/2020, 6:24 PM

Cool

Dan Hill

05/30/2020, 8:16 PM

I only did a top-level look at the code. It looks like the segment name generation code uses methods called getMinValue and getMaxValue. Does it actually use values from the data source? Is there any rounding in the logic? If a replace job adds missing events to a time frame, and those events are later than the previous latest time, will I have to manually delete the old segments?

Kishore G

05/30/2020, 8:17 PM

it rounds off to the time granularity - daily in most cases
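
A minimal sketch of the rounding described above, assuming daily granularity and epoch-millisecond timestamps. The name format and the helper names (`to_days_since_epoch`, `sketch_segment_name`) are hypothetical illustrations, not Pinot's actual implementation.

```python
MILLIS_PER_DAY = 24 * 60 * 60 * 1000


def to_days_since_epoch(epoch_millis: int) -> int:
    """Round a timestamp down to whole days since the Unix epoch."""
    return epoch_millis // MILLIS_PER_DAY


def sketch_segment_name(table: str, min_millis: int, max_millis: int, sequence_id: int) -> str:
    """Build a name from the rounded min/max time values (assumed format)."""
    min_day = to_days_since_epoch(min_millis)
    max_day = to_days_since_epoch(max_millis)
    return f"{table}_{min_day}_{max_day}_{sequence_id}"


# Day 18412 since the epoch is 2020-05-30 (UTC).
day_start = 18412 * MILLIS_PER_DAY
original = sketch_segment_name("events", day_start + 1_000, day_start + 40_000_000, 0)
# A replace job that adds later events within the same day.
replaced = sketch_segment_name("events", day_start + 1_000, day_start + 80_000_000, 0)
```

Under this assumed format, both runs produce the same name because the later maximum still rounds down to the same day. If the new maximum crossed into the next day, the name would differ.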

Dan Hill

05/30/2020, 8:18 PM

Cool

Dan Hill

05/31/2020, 12:49 AM

Let's say I have a list of labels I want to attach to a row in Pinot and query by it, e.g. experiment info. If I eventually want to query these as dimensions, what's a good way to handle this if multiple experiments are running? A separate table that denormalizes the list of experiments into separate rows and writes each event multiple times? Serializing the experiment info as JSON and extracting? Is this type of index supported?

Kishore G

05/31/2020, 1:25 AM

If labels are simple strings, then use multi value field

Kishore G

05/31/2020, 1:27 AM

You can index this column
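
A sketch of what a multi-value label column could look like in a Pinot schema, written as Python dictionaries that serialize to JSON. The schema name `events` and the column names are hypothetical, and the choice of an inverted index is an assumption, since the thread does not say which index type was meant.

```python
import json

# Hypothetical schema: "labels" is a multi-value STRING dimension.
schema = {
    "schemaName": "events",
    "dimensionFieldSpecs": [
        {"name": "labels", "dataType": "STRING", "singleValueField": False},
    ],
    "metricFieldSpecs": [
        {"name": "count", "dataType": "LONG"},
    ],
}

# Assumed fragment of the table config that indexes the column.
table_config_fragment = {
    "tableIndexConfig": {
        "invertedIndexColumns": ["labels"],
    },
}

schema_json = json.dumps(schema, indent=2)
table_config_json = json.dumps(table_config_fragment, indent=2)
```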

Dan Hill

05/31/2020, 2:51 AM

Gotcha. Can a row be aggregated multiple times if there are multiple labels?

Mayank

05/31/2020, 5:01 PM

For multi valued columns, each value is considered a separate group to aggregate

Dan Hill

05/31/2020, 11:19 PM

I don't see examples of multi-value in the gitbook. Are there any examples in the github repository? If I had a row like

{
  "labels": [
    "a",
    "b"
  ],
  "count": 1
}

If I query it without specifying that I want to group by label, would it return sum(count)==1? If I add a group by on labels, would it return sum(count)==2?

Mayank

05/31/2020, 11:31 PM

Without group by count will be 1

Mayank

05/31/2020, 11:32 PM

With group by count will be one per label
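
A small simulation of the semantics described above, using the example row from the question. This is plain Python, not Pinot code, and the function names are hypothetical.

```python
from collections import defaultdict

rows = [{"labels": ["a", "b"], "count": 1}]


def sum_count(rows):
    """Aggregate without group by: each row contributes once."""
    return sum(row["count"] for row in rows)


def sum_count_group_by_labels(rows):
    """Group by a multi-value column: each value forms its own group."""
    totals = defaultdict(int)
    for row in rows:
        for label in row["labels"]:
            totals[label] += row["count"]
    return dict(totals)


total = sum_count(rows)
per_label = sum_count_group_by_labels(rows)
```

Here `total` is 1, while `per_label` holds one group per label, each with a sum of 1. The row is counted once in each group rather than producing a single sum of 2.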

Dan Hill

05/31/2020, 11:48 PM

Sweet

Dan Hill

06/01/2020, 4:55 PM

Hi. When's the JDBC connector coming out?

Dan Hill

06/01/2020, 6:21 PM

In the gitbook, I see comments talking about dictionary encoding for fields. It makes sense that this works for offline tables. What about realtime tables? Is there a real time dictionary that is created as values are processed? Are there any scale limits to this?

Kishore G

06/01/2020, 6:23 PM

yes, pinot creates dictionary on the fly as the values are processed.

Kishore G

06/01/2020, 6:25 PM

https://cwiki.apache.org/confluence/display/PINOT/Consuming+and+Indexing+rows+in+Realtime#ConsumingandIndexingrowsinRealtime-Dictionaries

Kishore G

06/01/2020, 6:27 PM

note that this is only true for consuming segments. once segments get flushed, it's similar to offline segments

👍 1
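
A minimal sketch of a dictionary built on the fly, as described for consuming segments: each new value gets the next integer id in arrival order. The class and method names are hypothetical, and the sorted rebuild on flush is an assumption about how an offline-style dictionary might be laid out.

```python
class MutableDictionary:
    """Assigns dictionary ids to values in the order they are first seen."""

    def __init__(self):
        self._value_to_id = {}
        self._id_to_value = []

    def index(self, value):
        """Return the id for a value, adding it if it has not been seen."""
        dict_id = self._value_to_id.get(value)
        if dict_id is None:
            dict_id = len(self._id_to_value)
            self._value_to_id[value] = dict_id
            self._id_to_value.append(value)
        return dict_id

    def lookup(self, dict_id):
        """Return the value stored under an id."""
        return self._id_to_value[dict_id]

    def sealed_values(self):
        """Assumed flush step: return the values in sorted order."""
        return sorted(self._id_to_value)


dictionary = MutableDictionary()
encoded = [dictionary.index(v) for v in ["b", "a", "b", "c"]]
```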

Shounak Kulkarni

06/03/2020, 3:21 PM

Hey all, I am trying to find an equation relating the segment size, total partitions, and off-heap memory on the server. I'm running into OOM on direct buffer memory during segment creation.


Server cores= 7
Server jvm heap= 3 gb
Server MaxDirectMemorySize = 10gb
Segment size = 100mb
Consuming segments on server=6

Load mode is MMAP, and just FYI, random load is generated, so almost all consuming segments go into completion during the same time slot. Thanks!

Subbu Subramaniam

06/03/2020, 4:00 PM

@User you may want to run the Realtime provisioning helper, as mentioned in https://engineering.linkedin.com/blog/2019/auto-tuning-pinot

Shounak Kulkarni

06/03/2020, 4:10 PM

@User actually I used this utility before for a different segment size; I'll run it for the current segment. Forgot to mention it before: the first segment creation cycle completed successfully, but the OOM occurred on the second segment completion cycle. Also, 2 out of 3 servers didn't hit the OOM; only one server had this issue.

Subbu Subramaniam

06/03/2020, 4:14 PM

You are saying you ran out of direct memory during segment generation, right? Reducing the number of rows may help, but I am not sure how many segments have even been generated, for your segment size to get large enough. You can try increasing the number of servers in the matrix that the tool produces. It gives an estimate of how much memory is used by the server in consuming and completed segments. Segment generation is extra, so maybe you are touching the ceiling there.

Shounak Kulkarni

06/03/2020, 4:18 PM

Ok, so what's the safe buffer that should be kept for the extra memory used in segment generation? And will there be retries on this transition if it fails due to OOM?

Subbu Subramaniam

06/03/2020, 4:19 PM

I am not sure about this, but perhaps one segment size worth (per partition consumed in the host)?

Subbu Subramaniam

06/03/2020, 4:20 PM

For us at LinkedIn, heap memory consumption during segment build has been a bigger problem than direct memory

Shounak Kulkarni

06/03/2020, 4:25 PM

If we keep one segment per partition equivalent as buffer then it will make the required direct memory almost double... Or the assumption that all consuming segments won't go into completion at once should be agreed upon...
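
A back-of-the-envelope sketch of the buffer discussed above, using the numbers from the posted configuration. It assumes the rule of thumb of one segment size per consuming partition that builds at the same time. This is an unverified estimate from the thread, not a documented formula, and the function name is hypothetical.

```python
def build_buffer_mb(segment_size_mb: int, concurrent_builds: int) -> int:
    """Extra direct memory reserved for segments being built at once."""
    return segment_size_mb * concurrent_builds


segment_size_mb = 100
consuming_segments = 6
max_direct_memory_mb = 10 * 1024

# Worst case: every consuming segment completes in the same time slot.
worst_case_mb = build_buffer_mb(segment_size_mb, consuming_segments)

# Best case: completions are staggered, one build at a time.
staggered_mb = build_buffer_mb(segment_size_mb, 1)

# Direct memory left for consuming and completed segments in the worst case.
remaining_mb = max_direct_memory_mb - worst_case_mb
```

With these inputs the worst-case buffer is 600 MB against 100 MB when builds are staggered, which is the trade-off raised in the message above.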

Kishore G

06/03/2020, 4:27 PM

Let’s move this to #C011C9JHN7R

👍 2