Apache Pinot #general

The Alchemist

09/04/2019, 10:03 PM

you guys are awesome, thank you! 🙇

Mayank

09/04/2019, 10:04 PM

Would be interested in learning more about your use case @User

The Alchemist

09/04/2019, 10:09 PM

excellent question. my domain is cybersecurity (bro/zeek logs). i’m specifically planning on loading lots of conn logs (https://docs.zeek.org/en/stable/scripts/base/protocols/conn/main.bro.html)

The Alchemist

09/04/2019, 10:11 PM

data looks like:

1258790493.773208	CcH8zVkCER7UopU1j	192.168.1.104	137	192.168.1.255	137	udp	dns	3.748891	350	0	S0	-	-	0	D	7	546	0	0	-	00:0b:db:4f:6b:10	ff:ff:ff:ff:ff:ff	-	-
1258790451.402091	CMucAv3aPcRqfSv8Dj	192.168.1.106	138	192.168.1.255	138	udp	-	-	-	-	S0	-	-	0	D	1	229	0	0	-	00:0b:db:63:5e:a7	ff:ff:ff:ff:ff:ff	-	-
1258790493.787448	CLa6hu4FVZF6WEkdFf	192.168.1.104	138	192.168.1.255	138	udp	-	2.243339	348	0	S0	-	-	0	D	2	404	0	0	-	00:0b:db:4f:6b:10	ff:ff:ff:ff:ff:ff	-	-
1258790615.268111	C4MQVQ3FXAlnFCOgX8	192.168.1.106	137	192.168.1.255	137	udp	dns	3.764626	350	0	S0	-	-	0	D	7	546	0	0	-	00:0b:db:63:5e:a7	ff:ff:ff:ff:ff:ff	-	-
1258790615.289842	CpVVNr3W2hrDoSO0C2	192.168.1.106	138	192.168.1.255	138	udp	-	2.251581	348	0	S0	-	-	0	D	2	404	0	0	-	00:0b:db:63:5e:a7	ff:ff:ff:ff:ff:ff	-	-
1258790620.442107	C9cjp71VsiuBn6w6P9	192.168.1.104	68	192.168.1.1	67	udp	dhcp	0.034866	311	300	SF	-	-	0	Dd	1	339	1	328	-	00:0b:db:4f:6b:10	00:19:e3:e7:5d:23	-	-
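
A minimal sketch of turning one of the rows above into named fields before loading it anywhere. The column names are an assumption taken from the Zeek conn.log documentation linked earlier (the sample rows have no header line), and `parse_conn_line` is a hypothetical helper, not part of Zeek or Pinot.

```python
# Column names are an assumption based on the Zeek conn.log documentation;
# the sample rows in the chat carry no header, so the mapping is positional.
CONN_FIELDS = [
    "ts", "uid", "id.orig_h", "id.orig_p", "id.resp_h", "id.resp_p",
    "proto", "service", "duration", "orig_bytes", "resp_bytes",
    "conn_state", "local_orig", "local_resp", "missed_bytes", "history",
    "orig_pkts", "orig_ip_bytes", "resp_pkts", "resp_ip_bytes",
    "tunnel_parents", "orig_l2_addr", "resp_l2_addr", "vlan", "inner_vlan",
]


def parse_conn_line(line):
    """Split one tab-separated conn.log row into a dict; '-' (unset) becomes None."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONN_FIELDS):
        raise ValueError(f"expected {len(CONN_FIELDS)} fields, got {len(values)}")
    return {name: (None if value == "-" else value)
            for name, value in zip(CONN_FIELDS, values)}


# Last sample row from the chat, rebuilt with explicit tabs so the example
# does not depend on how whitespace survived copy and paste.
SAMPLE = ("1258790620.442107 C9cjp71VsiuBn6w6P9 192.168.1.104 68 192.168.1.1 67 "
          "udp dhcp 0.034866 311 300 SF - - 0 Dd 1 339 1 328 - "
          "00:0b:db:4f:6b:10 00:19:e3:e7:5d:23 - -")
sample_line = "\t".join(SAMPLE.split())

record = parse_conn_line(sample_line)
print(record["proto"], record["service"], record["duration"])
```

All values stay strings here; converting ports, byte counts, and timestamps to numeric types would be the next step when defining a schema.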

The Alchemist

09/04/2019, 10:12 PM

(full example file at https://github.com/dgunter/ParseZeekLogs/blob/master/examples/conn.log, for the curious)

The Alchemist

09/04/2019, 10:14 PM

ppl tend to put this data into elasticsearch, but performance is terrible and it seems to be a poor fit for the data model

Mayank

09/04/2019, 10:14 PM

I see, thanks for sharing

The Alchemist

09/04/2019, 10:15 PM

of course… i’ll try to write a blog post about it

The Alchemist

09/04/2019, 10:17 PM

any advice on hardware sizing requirements? i.e., how much RAM, how many nodes, etc.?

The Alchemist

09/04/2019, 10:17 PM

i’m using the offline quickstart for now, so it’s all on one box

Mayank

09/04/2019, 10:18 PM

What are your latency/throughput requirements?

The Alchemist

09/04/2019, 10:19 PM

all data is offline. query latency would ideally be <5 seconds. throughput is low: a few queries a minute at most

The Alchemist

09/04/2019, 10:19 PM

most queries are of the COUNT ... GROUP BY variety
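
For illustration, here is what a COUNT ... GROUP BY over the conn data computes, sketched in plain Python rather than as a Pinot query. The (proto, service) pairs are copied from the six sample rows above; `count_group_by` is a hypothetical helper, and the table name in the comment is made up.

```python
from collections import Counter

# (proto, service) pairs taken from the six sample conn rows in the chat.
rows = [
    ("udp", "dns"),
    ("udp", "-"),
    ("udp", "-"),
    ("udp", "dns"),
    ("udp", "-"),
    ("udp", "dhcp"),
]


def count_group_by(rows, key_index):
    """Count rows per distinct value of the column at key_index.

    Roughly what a query shaped like
        SELECT COUNT(*) FROM conn GROUP BY service
    would return (the table name `conn` is hypothetical).
    """
    return Counter(row[key_index] for row in rows)


by_service = count_group_by(rows, 1)
for service, n in by_service.most_common():
    print(service, n)
```

Every such query touches each matching row unless the counts are pre-aggregated, which is why the group-by columns matter for index choice.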

Mayank

09/04/2019, 10:20 PM

And data size?

Mayank

09/04/2019, 10:20 PM

You could go the SSD route with limited RAM (say 64GB)

Kishore G

09/04/2019, 10:23 PM

also, try star-tree for frequently used dimensions in filter and group by

Kishore G

09/04/2019, 10:23 PM

https://engineering.linkedin.com/blog/2019/06/star-tree-index--powering-fast-aggregations-on-pinot

The Alchemist

09/04/2019, 10:24 PM

unfortunately, our cluster only has HDDs 😞

The Alchemist

09/04/2019, 10:25 PM

but yes, i totally plan on using star-tree indices. read the paper. very interesting

The Alchemist

09/04/2019, 10:25 PM

we have lots of RAM, though. 64GB is fine

Kishore G

09/04/2019, 10:26 PM

yeah, with HDDs, star-tree will help a lot more

👍 1

The Alchemist

09/05/2019, 2:20 PM

@User: is there a guide for configuring/enabling star tree indices? i don’t see much at https://pinot.readthedocs.io/en/latest/index_techniques.html#star-tree-index

The Alchemist

09/05/2019, 2:33 PM

@here: should i use pinot from master or use the 0.1 release from March?

Kishore G

09/05/2019, 3:09 PM

there was this commit which had instructions

👍 1

Kishore G

09/05/2019, 3:09 PM

https://github.com/apache/incubator-pinot/pull/3743/files

Kishore G

09/05/2019, 3:09 PM

but it was reverted

Kishore G

09/05/2019, 3:10 PM

@User why did you revert this?

Kishore G

09/05/2019, 3:10 PM

use master

👍 1

Jackie

09/05/2019, 6:43 PM

@User Because we wanted to put out the engineering blog post, we pulled it temporarily. Will revisit that and submit another PR

Chinmay Soman

09/05/2019, 9:30 PM

Quick question: during the offline segment push, it looks like if the segment already exists and the CRC is the same, we still update the segment metadata in ZK. Why is this done, exactly? FYI: this is causing issues in our setup, where there's no shared filesystem mounted on the controllers and we haven't moved to using the HDFS filesystem yet. So if the Spark job sends a redundant upload request to a different controller, the ZK download URI is changed to point at this new controller, which doesn't have the segment 😞 😞 😞
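
The problem described above can be reproduced with a toy model. This is only a sketch under stated assumptions: `ToySegmentStore`, its fields, and the download URI shape are hypothetical and do not reflect Pinot's controller code. It contrasts always rewriting the metadata with skipping the update when the CRC is unchanged.

```python
import zlib


class ToySegmentStore:
    """Hypothetical stand-in for segment metadata kept in ZK; not Pinot code."""

    def __init__(self, skip_if_crc_unchanged):
        self.skip_if_crc_unchanged = skip_if_crc_unchanged
        self.metadata = {}  # segment name -> {"crc": int, "download_uri": str}

    def upload(self, name, payload, controller):
        """Record an upload; return True if the metadata was written."""
        crc = zlib.crc32(payload)
        existing = self.metadata.get(name)
        if (self.skip_if_crc_unchanged
                and existing is not None
                and existing["crc"] == crc):
            # Redundant push of identical data: keep the original download URI.
            return False
        self.metadata[name] = {
            "crc": crc,
            # Made-up URI layout, for illustration only.
            "download_uri": f"http://{controller}/segments/{name}",
        }
        return True


payload = b"segment bytes"

# Behaviour described in the chat: the second, redundant upload repoints the URI.
always_update = ToySegmentStore(skip_if_crc_unchanged=False)
always_update.upload("seg_0", payload, "controller-1")
always_update.upload("seg_0", payload, "controller-2")
print(always_update.metadata["seg_0"]["download_uri"])

# Alternative: an unchanged CRC leaves the existing metadata alone.
skip_same = ToySegmentStore(skip_if_crc_unchanged=True)
skip_same.upload("seg_0", payload, "controller-1")
skip_same.upload("seg_0", payload, "controller-2")
print(skip_same.metadata["seg_0"]["download_uri"])
```

Without shared storage, only the controller that received the first upload holds the file, so the second variant keeps the URI pointing at a controller that can actually serve it.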