# general
  • d

    Dan Hill

    05/30/2020, 6:24 PM
    Ah, ok.
  • k

    Kishore G

    05/30/2020, 6:24 PM
    0,1,2, ...
  • k

    Kishore G

    05/30/2020, 6:24 PM
    you can also use it as partition_id
  • d

    Dan Hill

    05/30/2020, 6:24 PM
    Cool
  • d

    Dan Hill

    05/30/2020, 8:16 PM
    I only took a top-level look at the code. It looks like the segment name generation code uses methods called getMinValue and getMaxValue. Does it actually use values from the data source? Is there any rounding in the logic? If a replace job adds more missing events to a time frame that are after the latest previous time, then will I have to manually delete the old segments?
  • k

    Kishore G

    05/30/2020, 8:17 PM
    it rounds off to the time granularity - daily in most cases
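
To make the rounding concrete, here is a minimal sketch of what "rounds off to the time granularity" could look like for a daily granularity. The function names and the `table_minDay_maxDay_seq` name layout are hypothetical illustrations, not Pinot's actual naming code.

```python
from datetime import datetime, timezone

MILLIS_PER_DAY = 24 * 60 * 60 * 1000


def to_day_bucket(epoch_millis: int) -> int:
    """Floor an epoch-millisecond timestamp to whole days since the epoch."""
    return epoch_millis // MILLIS_PER_DAY


def segment_name(table: str, min_millis: int, max_millis: int, seq: int) -> str:
    # Hypothetical layout: the name depends only on the rounded min/max time.
    return f"{table}_{to_day_bucket(min_millis)}_{to_day_bucket(max_millis)}_{seq}"


def utc_millis(year, month, day, hour):
    return int(datetime(year, month, day, hour, tzinfo=timezone.utc).timestamp() * 1000)


# Original data ends at 01:00 UTC; the replacement adds events up to 23:00 UTC the same day.
early = utc_millis(2020, 5, 30, 1)
late = utc_millis(2020, 5, 30, 23)
old_name = segment_name("events", early, early, 0)
new_name = segment_name("events", early, late, 0)
print(old_name == new_name)
```

Under these assumptions, late events that fall within the same day leave the generated name unchanged, so a replacement segment would carry the same name as the one it replaces. Events that cross into a new day would change the name.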
  • d

    Dan Hill

    05/30/2020, 8:18 PM
    Cool
  • d

    Dan Hill

    05/31/2020, 12:49 AM
    Let's say I have a list of labels I want to attach to a row in Pinot and query by. E.g. experiment info. If I eventually want to query these as dimensions, what's a good way to handle this if multiple experiments are running? A separate table that denormalizes the list of experiments into separate rows and writes the events multiple times? Serializing the experiment info as json and extracting? Is this type of index supported?
  • k

    Kishore G

    05/31/2020, 1:25 AM
    If labels are simple strings, then use multi value field
  • k

    Kishore G

    05/31/2020, 1:27 AM
    You can index this column
  • d

    Dan Hill

    05/31/2020, 2:51 AM
    Gotcha. Can a row be aggregated multiple times if there are multiple labels?
  • m

    Mayank

    05/31/2020, 5:01 PM
    For multi valued columns, each value is considered a separate group to aggregate
  • d

    Dan Hill

    05/31/2020, 11:19 PM
    I don't see examples of multi-value in the gitbook. Are there any examples in the github repository? If I had a row like
    {
      "labels": [
        "a",
        "b"
      ],
      "count": 1
    }
    If I query it without specifying that I want to group by label, would it return sum(count)==1? If I add a groupby labels, would it return sum(count)==2?
  • m

    Mayank

    05/31/2020, 11:31 PM
    Without group by count will be 1
  • m

    Mayank

    05/31/2020, 11:32 PM
    With group by count will be one per label
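
The behaviour described above can be sketched in plain Python. This only simulates the semantics from the thread (each value of a multi-value column forms its own group); it is not Pinot code, and the helper names are made up for illustration.

```python
from collections import defaultdict

# The example row from the question: one event carrying two labels.
rows = [{"labels": ["a", "b"], "count": 1}]


def sum_count(rows):
    """Aggregate without a group by: each row contributes once."""
    return sum(row["count"] for row in rows)


def sum_count_by_label(rows):
    """Aggregate grouped by the multi-value column: a row contributes once per label."""
    totals = defaultdict(int)
    for row in rows:
        for label in row["labels"]:
            totals[label] += row["count"]
    return dict(totals)


print(sum_count(rows))
print(sum_count_by_label(rows))
```

In this sketch the ungrouped sum is 1, and the grouped result has one entry per label, each with a sum of 1. Adding up the per-group sums would give 2, which is the double counting the question was getting at.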
  • d

    Dan Hill

    05/31/2020, 11:48 PM
    Sweet
  • d

    Dan Hill

    06/01/2020, 4:55 PM
    Hi. When's the JDBC connector coming out?
  • d

    Dan Hill

    06/01/2020, 6:21 PM
    In the gitbook, I see comments talking about dictionary encoding for fields. It makes sense that this works for offline tables. What about realtime tables? Is there a real time dictionary that is created as values are processed? Are there any scale limits to this?
  • k

    Kishore G

    06/01/2020, 6:23 PM
    yes, pinot creates dictionary on the fly as the values are processed.
  • k

    Kishore G

    06/01/2020, 6:25 PM
    https://cwiki.apache.org/confluence/display/PINOT/Consuming+and+Indexing+rows+in+Realtime#ConsumingandIndexingrowsinRealtime-Dictionaries
  • k

    Kishore G

    06/01/2020, 6:27 PM
    note that this is only true for consuming segments. once segments get flushed, it's similar to offline segments
    👍 1
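
As a rough illustration of a dictionary built on the fly, the sketch below assigns the next free id to each value the first time it is seen and stores only ids in the column. The class and method names are hypothetical; Pinot's real implementation is described on the wiki page linked above.

```python
class MutableDictionary:
    """Toy dictionary encoder that grows as new values arrive."""

    def __init__(self):
        self._value_to_id = {}
        self._id_to_value = []

    def index_of(self, value):
        """Return the id for value, assigning the next free id on first sight."""
        dict_id = self._value_to_id.get(value)
        if dict_id is None:
            dict_id = len(self._id_to_value)
            self._value_to_id[value] = dict_id
            self._id_to_value.append(value)
        return dict_id

    def get(self, dict_id):
        """Map an id back to the original value."""
        return self._id_to_value[dict_id]


dictionary = MutableDictionary()
incoming = ["us", "de", "us", "fr"]
# The column stores ids instead of the raw strings.
forward_index = [dictionary.index_of(value) for value in incoming]
print(forward_index)
print([dictionary.get(i) for i in forward_index])
```

Ids here follow arrival order, which is what lets the dictionary grow while rows are still being consumed.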
  • s

    Shounak Kulkarni

    06/03/2020, 3:21 PM
    hey all i am trying to find an equation between the segment size, total partitions and offheap memory on server. Running into oom on direct buffer memory during segment creation.
    Server cores= 7
    Server jvm heap= 3 gb
    Server MaxDirectMemorySize = 10gb
    Segment size = 100mb
    Consuming segments on server=6
    load mode is MMAP and just fyi random load is generated so almost all consuming segments go into completion during same time slot. Thanks!
  • s

    Subbu Subramaniam

    06/03/2020, 4:00 PM
    @User you may want to run the Realtime provisioning helper, as mentioned in https://engineering.linkedin.com/blog/2019/auto-tuning-pinot
  • s

    Shounak Kulkarni

    06/03/2020, 4:10 PM
    @User actually I used this utility before for a different segment size, I'll run it on the current segment. Forgot to mention it before: the first segment creation cycle completed successfully, but the oom occurred on the second segment completion cycle, and 2 out of 3 servers didn't get an oom; only one server hit this issue.
  • s

    Subbu Subramaniam

    06/03/2020, 4:14 PM
    You are saying you ran out of direct memory during segment generation, right? Reducing the number of rows may help, but I am not sure how many segments have even been generated, for your segment size to get large enough. You can try increasing the number of servers in the matrix that the tool produces. It gives an estimate of how much memory is used by the server in consuming and completed segments. Segment generation is extra, so maybe you are touching the ceiling there
  • s

    Shounak Kulkarni

    06/03/2020, 4:18 PM
    Ok so what's the safe buffer that should be kept for the extra memory used in segment generation? and will there be retries done on this transition if it fails due to oom?
  • s

    Subbu Subramaniam

    06/03/2020, 4:19 PM
    I am not sure about this, but perhaps one segment size worth (per partition consumed in the host)?
  • s

    Subbu Subramaniam

    06/03/2020, 4:20 PM
    For us at LinkedIn, heap mem consumption during segment build has been a bigger problem than direct mem
  • s

    Shounak Kulkarni

    06/03/2020, 4:25 PM
    If we keep one segment per partition equivalent as buffer then it will make the required direct memory almost double... Or the assumption that all consuming segments won't go into completion at once should be agreed upon...
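
The "almost double" arithmetic can be written out as a back-of-the-envelope sketch. It uses only the numbers posted in this thread and the tentative rule of thumb of one extra segment's worth of memory per consuming partition. The function and parameter names are hypothetical, and the estimate ignores completed segments and any other direct-memory users on the server.

```python
def direct_memory_estimate_mb(consuming_segments, segment_size_mb,
                              buffer_segments_per_partition=1):
    """Worst case: every consuming segment goes into completion at the same time."""
    consuming_mb = consuming_segments * segment_size_mb
    build_buffer_mb = consuming_segments * buffer_segments_per_partition * segment_size_mb
    return consuming_mb + build_buffer_mb


# Numbers from the thread: 6 consuming segments of 100 MB each on one server.
without_buffer = direct_memory_estimate_mb(6, 100, buffer_segments_per_partition=0)
with_buffer = direct_memory_estimate_mb(6, 100)
print(without_buffer, with_buffer)
```

With these inputs the estimate goes from 600 MB to 1200 MB. If segment builds could be assumed to be staggered, the buffer term would shrink toward a single segment's worth.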
  • k

    Kishore G

    06/03/2020, 4:27 PM
    Let’s move this to #C011C9JHN7R
    👍 2