# general
  • t

    The Alchemist

    09/04/2019, 10:03 PM
    you guys are awesome, thank you! 🙇
  • m

    Mayank

    09/04/2019, 10:04 PM
    Would be interested in learning more about your use case @User
  • t

    The Alchemist

    09/04/2019, 10:09 PM
    excellent question. my domain is cybersecurity (bro/zeek logs). i’m specifically planning on loading lots of conn logs (https://docs.zeek.org/en/stable/scripts/base/protocols/conn/main.bro.html)
  • t

    The Alchemist

    09/04/2019, 10:11 PM
    data looks like:
    1258790493.773208	CcH8zVkCER7UopU1j	192.168.1.104	137	192.168.1.255	137	udp	dns	3.748891	350	0	S0	-	-	0	D	7	546	0	0	-	00:0b:db:4f:6b:10	ff:ff:ff:ff:ff:ff	-	-
    1258790451.402091	CMucAv3aPcRqfSv8Dj	192.168.1.106	138	192.168.1.255	138	udp	-	-	-	-	S0	-	-	0	D	1	229	0	0	-	00:0b:db:63:5e:a7	ff:ff:ff:ff:ff:ff	-	-
    1258790493.787448	CLa6hu4FVZF6WEkdFf	192.168.1.104	138	192.168.1.255	138	udp	-	2.243339	348	0	S0	-	-	0	D	2	404	0	0	-	00:0b:db:4f:6b:10	ff:ff:ff:ff:ff:ff	-	-
    1258790615.268111	C4MQVQ3FXAlnFCOgX8	192.168.1.106	137	192.168.1.255	137	udp	dns	3.764626	350	0	S0	-	-	0	D	7	546	0	0	-	00:0b:db:63:5e:a7	ff:ff:ff:ff:ff:ff	-	-
    1258790615.289842	CpVVNr3W2hrDoSO0C2	192.168.1.106	138	192.168.1.255	138	udp	-	2.251581	348	0	S0	-	-	0	D	2	404	0	0	-	00:0b:db:63:5e:a7	ff:ff:ff:ff:ff:ff	-	-
    1258790620.442107	C9cjp71VsiuBn6w6P9	192.168.1.104	68	192.168.1.1	67	udp	dhcp	0.034866	311	300	SF	-	-	0	Dd	1	339	1	328	-	00:0b:db:4f:6b:10	00:19:e3:e7:5d:23	-	-
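For readers following along, a minimal Python sketch of splitting one of the tab-separated conn rows above into named fields. The field names are an assumption taken from the leading columns of Zeek's documented conn.log layout; the sample rows carry more columns than are named here, so the rest are kept unnamed, and "-" is treated as an unset value.

```python
# Assumed names for the first 12 columns only (from Zeek's conn.log layout);
# any remaining columns are kept, unnamed, under "extra".
CONN_FIELDS = [
    "ts", "uid", "id.orig_h", "id.orig_p", "id.resp_h", "id.resp_p",
    "proto", "service", "duration", "orig_bytes", "resp_bytes", "conn_state",
]


def parse_conn_line(line):
    """Split one tab-separated conn.log row into a dict; '-' becomes None."""
    values = [None if v == "-" else v for v in line.rstrip("\n").split("\t")]
    record = dict(zip(CONN_FIELDS, values))
    record["extra"] = values[len(CONN_FIELDS):]
    return record


# The last sample row from the message above, rebuilt with explicit tabs.
sample = "\t".join([
    "1258790620.442107", "C9cjp71VsiuBn6w6P9", "192.168.1.104", "68",
    "192.168.1.1", "67", "udp", "dhcp", "0.034866", "311", "300", "SF",
    "-", "-", "0", "Dd", "1", "339", "1", "328", "-",
    "00:0b:db:4f:6b:10", "00:19:e3:e7:5d:23", "-", "-",
])

record = parse_conn_line(sample)
print(record["proto"], record["service"], record["conn_state"])
```

Values are left as strings here; converting `ts`, ports, and byte counts to numeric types would be a natural next step before loading them anywhere.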
  • t

    The Alchemist

    09/04/2019, 10:12 PM
    (full example file at https://github.com/dgunter/ParseZeekLogs/blob/master/examples/conn.log, for the curious)
  • t

    The Alchemist

    09/04/2019, 10:14 PM
    ppl tend to put this data into elasticsearch, but performance is terrible and it seems to be a poor fit for the data model
  • m

    Mayank

    09/04/2019, 10:14 PM
    I see, thanks for sharing
  • t

    The Alchemist

    09/04/2019, 10:15 PM
    of course… i’ll try to write a blog post about it
  • t

    The Alchemist

    09/04/2019, 10:17 PM
    any advice on hardware sizing requirements? i.e., how much RAM, how many nodes, etc.?
  • t

    The Alchemist

    09/04/2019, 10:17 PM
    i’m using the offline quickstart for now, so it’s all on one box
  • m

    Mayank

    09/04/2019, 10:18 PM
    What are your latency/throughput requirements?
  • t

    The Alchemist

    09/04/2019, 10:19 PM
    all data is offline. query latency would ideally be <5 seconds. throughput is low: a few queries a minute at most
  • t

    The Alchemist

    09/04/2019, 10:19 PM
    most queries are of the COUNT ... GROUP BY variety
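As a rough illustration of that query shape, here is the same kind of aggregation done in plain Python over the (proto, service) pairs from the six sample conn rows earlier in the thread. It only shows what a COUNT ... GROUP BY computes; it is not Pinot query syntax.

```python
from collections import Counter

# (proto, service) pairs copied from the six sample conn rows; "-" means unset.
rows = [
    ("udp", "dns"), ("udp", "-"), ("udp", "-"),
    ("udp", "dns"), ("udp", "-"), ("udp", "dhcp"),
]

# Equivalent in spirit to grouping by (proto, service) and counting rows per group.
counts = Counter(rows)

for (proto, service), n in counts.most_common():
    print(proto, service, n)
```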
  • m

    Mayank

    09/04/2019, 10:20 PM
    And data size?
  • m

    Mayank

    09/04/2019, 10:20 PM
    You could go the SSD route with limited RAM (say 64GB)
  • k

    Kishore G

    09/04/2019, 10:23 PM
    also, try star-tree for frequently used dimensions in filter and group by
  • k

    Kishore G

    09/04/2019, 10:23 PM
    https://engineering.linkedin.com/blog/2019/06/star-tree-index--powering-fast-aggregations-on-pinot
  • t

    The Alchemist

    09/04/2019, 10:24 PM
    unfortunately, our cluster only has HDDs 😞
  • t

    The Alchemist

    09/04/2019, 10:25 PM
    but yes, i totally plan on using star indices. read the paper. very interesting
  • t

    The Alchemist

    09/04/2019, 10:25 PM
    we have lots of RAM, though. 64GB is fine
  • k

    Kishore G

    09/04/2019, 10:26 PM
    yeah, with HDDs, star tree will help a lot more
    👍 1
  • t

    The Alchemist

    09/05/2019, 2:20 PM
    @User: is there a guide for configuring/enabling star tree indices? i don’t see much at https://pinot.readthedocs.io/en/latest/index_techniques.html#star-tree-index
  • t

    The Alchemist

    09/05/2019, 2:33 PM
    @here: should i use pinot from master or use the 0.1 release from March?
  • k

    Kishore G

    09/05/2019, 3:09 PM
    there was this commit which had instructions
    👍 1
  • k

    Kishore G

    09/05/2019, 3:09 PM
    https://github.com/apache/incubator-pinot/pull/3743/files
  • k

    Kishore G

    09/05/2019, 3:09 PM
    but it was reverted
  • k

    Kishore G

    09/05/2019, 3:10 PM
    @User why did you revert this?
  • k

    Kishore G

    09/05/2019, 3:10 PM
    use master
    👍 1
  • j

    Jackie

    09/05/2019, 6:43 PM
    @User Because we wanted to put out the engineering blog post, we pulled it temporarily. Will revisit that and submit another PR
  • c

    Chinmay Soman

    09/05/2019, 9:30 PM
    Quick question: during the offline segment push, it looks like if the segment already exists and the CRC is the same, we still update the segment metadata in Zk. Why is this done, exactly? FYI: this is causing issues in our setup, where there's no shared filesystem mounted on the controllers and we haven't moved to using the HDFS filesystem yet. So if the Spark job sends a redundant upload request to a different controller, the Zk download URI is modified to point to this new controller, which doesn't have the segment 😞 😞 😞
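To make the failure mode concrete, a minimal Python sketch of the behavior described in the question. Everything here is hypothetical (the `SegmentMetadata` class, the `push_segment` function, the URI format, and the controller host names); it models the report, not Pinot's controller code. With `skip_if_same_crc=False` a redundant push rewrites the download URI; with `True` the existing metadata is left alone, which is one possible way to avoid the problem.

```python
from dataclasses import dataclass


@dataclass
class SegmentMetadata:
    crc: int
    download_uri: str


def push_segment(store, name, crc, controller, skip_if_same_crc):
    """Hypothetical model of an offline segment push handled by `controller`."""
    existing = store.get(name)
    if existing is not None and existing.crc == crc and skip_if_same_crc:
        return existing  # identical segment: keep the original download URI
    # Otherwise the metadata (including the download URI) is overwritten.
    store[name] = SegmentMetadata(crc, f"http://{controller}/segments/{name}")
    return store[name]


zk = {}  # stand-in for the segment metadata kept in Zk
push_segment(zk, "seg_0", 1234, "controller-a", skip_if_same_crc=False)
push_segment(zk, "seg_0", 1234, "controller-b", skip_if_same_crc=False)
print(zk["seg_0"].download_uri)  # points at controller-b, which lacks the file
```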