# general
d
Could anyone explain why Pinot is using GroupByOrderByCombineOperator instead of GroupByCombineOperator for queries like:
select max(profit) from transcript group by strategy_id
?
x
This is used in SQL mode
If you read through the code in
CombinePlanNode
GroupByCombineOperator
is the old PQL implementation, which sorts results by the aggregated values. You can think of it as an operator that always appends the clause
ORDER BY max(profit) DESC
to your query.
GroupByOrderByCombineOperator is the new SQL implementation, which takes the sorting params from the query; they can be null, as in your sample query.
d
ok, got it! so using SQL it will never be called, right?
x
correct, SQL syntax will always call
GroupByOrderByCombineOperator
👍 1
d
@Xiang Fu for you, what is the correct place to put a method that must be called at "server level" (not broker) and can handle the whole sequence of blocks?
my aggregation needs the whole sequence of blocks to do its logic (instead of using the aggregate() method for each block)
x
hmm
d
i think it's in the CombineOperator we were talking about, right?
x
so you need all the blocks to compute and need to hold all the blocks in memory?
no intermediate state?
d
@Xiang Fu exactly... i know it could sound "strange" but i am implementing something like window functions in Pinot, so i must work with the entire sequence of blocks because i need to do a cumulative sum
x
CombineOperator is used to combine multiple other operators and expects the things to merge to be partial results
d
exactly, i thought after that merge i could work with the entire sequence of blocks, no?
x
I don’t think so, the place to look is the
InstanceResponseOperator
I suggest you go over the example query plan first
then we can think of how to do this
also it would be great to create an issue in github for window functions
with some sample queries, so we can get more insights and recommendations from the community
d
yeah i think it could be great. Druid for example has an extension for window functions
@Xiang Fu wait, if we move the logic into InstanceResponseOperator, does the Broker run that code?
x
I suggest we go through one example ?
instanceBlock is the block after all combines and just before converting it to dataTable and sending back to broker
one thing i’m a bit confused about is that if we cannot process partial data in a segment/block, then it means we cannot process it in the server either, as the server only holds partial data as well.
d
@Xiang Fu you are right, if we want to do something that always works we need to move the logic to the broker
but in my case i will use partitions so i know that all the data is inside a single node
x
then in the aggregationfunction, try to hold the data
and do your logic in reduce method
d
yes i can "concatenate" the blocks
extractFinalResult()
^^ there?
is that method not called from broker?
x
yes
in
extractFinalResult
but
extractFinalResult
will be called in broker
hmmm
d
but in that way i will move all the logic into the broker, so not good
exactly, that's the problem
i should avoid moving all the load into the broker
do you know Citus? in Citus there are distribution keys, so each shard has all the records of that specific partition key, i would like to do something like that in Pinot. Partitions could be good to isolate rows having the same key (i will use it in
group by
)
but moving all the load to the broker is bad because then it has to handle all the nodes' responses... heavy load.
x
true
then basically we need to have a way to do intra/inter server level aggregation (merge)
in that mode we need to ensure, via partitioning, that all the shard data is on the same server
d
yes, correct
x
hmm
d
so for that reason i was thinking where to put the logic
x
could you share the query somehow
d
in postgresql it is
x
I’m thinking maybe we can run query twice
d
```sql
explain analyze
WITH t2 AS (SELECT strategy_id, MAX(profit) OVER (PARTITION BY strategy_id ORDER BY id RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) - profit as drawdown
	FROM trades)
SELECT strategy_id, MAX(drawdown) max_drawdown
FROM t2
GROUP BY strategy_id
ORDER BY max_drawdown;
```
x
first round is to extract the sequence/stats, then the second query will do the aggregation
d
for a simple table like:
```sql
CREATE TABLE trades (
   id Serial,
   strategy_id Integer,
   profit Integer,
   PRIMARY KEY (strategy_id, id)
);
```
i need the running max and then i need to subtract a value from it
....and after all that, get the MAX
x
ic
so need to run
select strategy_id, max(profit) group by strategy_id
first
then do
```sql
SELECT strategy_id, MAX(max_profit - profit) as max_drawdown
FROM t2
GROUP BY strategy_id
```
just the
max_profit
values are different per
strategy_id
d
I need to do max(drawdown); the drawdown is the running max - profit
yes everything grouped by strategy_id
x
what’s the cardinality of
strategy_id
?
have you tried presto as a workaround right now ?
d
the max(profit) is the running max
x
the first max group by query should be able to push down to pinot to execute
d
no i haven't used Presto because, talking with Kishore, he told me that it will be slower because it moves everything to the Presto worker
x
the outer aggregation will be processed by presto
hmmm
true
as the outer query will also read from pinot
it’s actually a join on
strategy_id
🙂
d
what do you mean exactly? moving the running max to presto and then the rest to pinot?
x
I think it will not work
nvm
presto still needs to read the entire table in order to do the computation
d
doing this aggregation is easy if we work on the entire sequence. the steps are:
x
since the outer aggregation cannot be pushed down
d
1. working with sorted rows
2. just one loop over the entire sequence saving the running max
3. simple subtraction
sure... this logic applies for ALL the strategies (group by strategy_id); roughly, the loop is sketched below
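A minimal plain-Java sketch of those steps (assuming the profits are already sorted by id; the names here are just illustrative, not Pinot code):
```java
import java.util.List;

public class DrawdownSketch {
    // Max drawdown over an id-ordered profit sequence:
    // one pass keeping the running max, subtract the current profit, track the largest gap.
    static double maxDrawdown(List<Double> profitsSortedById) {
        double runningMax = Double.NEGATIVE_INFINITY;
        double maxDrawdown = Double.NEGATIVE_INFINITY;
        for (double profit : profitsSortedById) {
            runningMax = Math.max(runningMax, profit);
            maxDrawdown = Math.max(maxDrawdown, runningMax - profit);
        }
        return maxDrawdown;
    }

    public static void main(String[] args) {
        // profits 5, 2, 8, 2 -> running max 5, 5, 8, 8 -> drawdowns 0, 3, 0, 6 -> max drawdown 6
        System.out.println(maxDrawdown(List.of(5.0, 2.0, 8.0, 2.0)));
    }
}
```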
x
right
d
where can i do a fast test for it? i mean... maybe not the best solution, but where can i get the whole sequence to put this logic?
x
I feel we may need to add a new interface in AggregationFunction to separate the server and broker result-extraction logic
d
YEAH
that will be a very smart solution
x
we made an assumption that partial results are the same across server and broker
d
because of the partitions, right?
x
right
d
yes i think without partitions...the job must be done via broker
because all the rows with the same strategy_id must be on the same node
(mandatory)
x
check this method
```java
/**
 * Extracts the intermediate result from the aggregation result holder (aggregation only).
 * TODO: Support serializing/deserializing null values in DataTable and use null as the empty intermediate result
 */
IntermediateResult extractAggregationResult(AggregationResultHolder aggregationResultHolder);
```
in
AggregationFunction
you can create your own holder
which keeps all the blocks
just in
extractAggregationResult
you do the scan twice
also you can keep a separate max during the merge/scan
basically the
AggregationResultHolder
is a tuple of (Double maxProfit, List<Double> profit)
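For illustration, the held state could look roughly like this (a plain-Java sketch of the tuple idea; it is not wired into Pinot's actual holder classes and the names are made up):
```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the (maxProfit, List<profit>) tuple: every value from every block is appended
// here, and extractAggregationResult() runs only after all blocks on the server have been
// aggregated, so it can do the full running-max scan (previous sketch) over `profits` there.
public class DrawdownState {
    double maxProfit = Double.NEGATIVE_INFINITY;      // the separate max kept during the merge/scan
    final List<Double> profits = new ArrayList<>();   // all profit values seen so far

    void add(double profit) {                          // called per value from aggregate()/aggregateGroupBySV()
        maxProfit = Math.max(maxProfit, profit);
        profits.add(profit);
    }

    public static void main(String[] args) {
        DrawdownState state = new DrawdownState();
        for (double p : new double[]{5, 2, 8, 2}) {
            state.add(p);
        }
        System.out.println(state.maxProfit + " " + state.profits); // 8.0 [5.0, 2.0, 8.0, 2.0]
    }
}
```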
d
a fast recap: in aggregateGroupBySV() i will add all values into the groupByResultHolder, and then in extractAggregationResult i will have all the values i need for the computation, right?
x
yes
that’s the simplest thing you can give a try
one example you can follow is avg
the avg function keeps count and sum in the resultHolder
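The pattern behind that example is roughly the following (a sketch of the idea, not Pinot's actual avg implementation): the (sum, count) partials can be merged anywhere, server or broker, unlike the full sequence the drawdown needs.
```java
// Sketch of an avg-style intermediate result: (sum, count) pairs from different
// segments/servers can be merged, and the final average is only computed at the end.
public class AvgState {
    double sum;
    long count;

    void apply(double value) { sum += value; count++; }                      // per-row aggregate
    void merge(AvgState other) { sum += other.sum; count += other.count; }   // combine / broker merge
    double finish() { return count == 0 ? 0.0 : sum / count; }               // final result

    public static void main(String[] args) {
        AvgState a = new AvgState();
        AvgState b = new AvgState();
        a.apply(1); a.apply(2);  // "server 1"
        b.apply(3);              // "server 2"
        a.merge(b);
        System.out.println(a.finish()); // 2.0
    }
}
```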
d
ok so "extractAggregationResult() will be called after all the blocks
after all right?
i mean after that method Pinot will not process other blocks right?
x
yes
at server level
and you still need to save all the block values into your
AggregationResultHolder
e.g. use a set to remove duplicate values
d
yes ok, very good... but there is another important thing to consider... i MUST deal with an ordered sequence, for example my aggregator should be select... DRAWDOWN(id, profit), it means that i need to sort by id and then do the logic (running max) over the profit column
in this situation is it better to create a sorted index on the ID column?
x
yes, but that won’t guarantee the ordering across multiple segments
if those segments are processed in parallel
d
so blocks can have random order
x
right
d
ok but extractAggregationResult will be called ONE time only...so maybe i can do a sort() inside that method? or maybe when i add the values i can use a list that is already ordered
what do you think?
x
you can do that for now
or you need to alter the query plan to make the query execution sequential and order the segments as well
d
but in this case it will be slower for sure, no?
i think it is not a problem, maybe i can do it in parallel and then add a sort inside extractAggregationResult()
or as i told you i can use a sorted map in the AggregationResultHolder
@Xiang Fu maybe a SortedMap object?
x
you can try both 🙂
I think sorted map will work
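For example, the holder could key the values by id so they always come back in id order, no matter which segment was processed first (a sketch under that assumption; class and field names are made up):
```java
import java.util.TreeMap;

// Sketch: replace the List<Double> in the holder with a TreeMap keyed by id, so
// extractAggregationResult() can iterate the profit values in ascending id order
// even if the segments/blocks were aggregated in a random order.
public class OrderedHolder {
    final TreeMap<Long, Double> profitsById = new TreeMap<>();

    void add(long id, double profit) {
        profitsById.put(id, profit); // TreeMap keeps keys sorted; duplicate ids collapse to one entry
    }

    public static void main(String[] args) {
        OrderedHolder h = new OrderedHolder();
        h.add(3, 8.0); h.add(1, 5.0); h.add(2, 2.0);   // out-of-order inserts
        System.out.println(h.profitsById.values());    // [5.0, 2.0, 8.0] -> id order, ready for the running-max scan
    }
}
```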
d
ok
i try
thank you so much
@Xiang Fu still there? with batch ingestion how can i specify the partition of my segment? for example i have three files in /mydir/: a.csv, b.csv, c.csv. I know Pinot will create three segments, is there a way in the yaml to specify the partitions?
x
@Jackie could you provide more instructions here for data partitioning
j
@Damiano You can configure the partitioning for a table within the table config. When creating a segment, if the source data is already partitioned, Pinot can automatically figure out the partition of the segment
d
@Jackie ok I found the segmentPartitionConfig setting. Is that what you meant? However for batch ingestion I did not find a way to specify which segments to put in which partitions. Will Pinot understand it automatically by looking at the field I set in the config and the PartitionFunction (like MURMUR, Modulo etc), or do I need to format the input in some manner? (I use csv)
j
@Damiano Yes. In order to make partitioning work, the input data should already be partitioned with the PartitionFunction specified in the partition config. When creating a segment, Pinot will read the partition config and use it to figure out the partition of the input file.
d
@Jackie ok, just a fast example to understand it better. I create the schema of the table, setting the number of partitions (10) and the function i would like to use (Modulo). Then, regarding the batch ingestion, let's suppose a CSV with 10M rows; at the moment, to create the segments, I split the big csv with 10M rows into 100 smaller CSVs that have 100,000 rows each. Can I continue using this setup or do I have to format my csv in a different manner? In this example what does Pinot do internally? Will it assign those 100 segments to my 10 partitions? Or can I simply ingest the big csv, because Pinot will then assign each row to the correct partition by looking at the partition key? If true, how will Pinot create segments INSIDE each partition? Thank you!!
j
@Damiano When you split the CSV file, you should partition them with the same function configured in the Pinot config so that each file only contains rows for one partition. When Pinot creates the segments, it will use the partition config to match the input file and figure out which partition it belongs to, then assign it accordingly
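To make that concrete, the split could look roughly like this (a sketch assuming the Modulo partition function with 10 partitions on strategy_id, which is the second column of the example CSV; the file names and paths are made up):
```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Sketch: split an input CSV into one file per partition, using the same Modulo function
// configured in the Pinot partition config, so every output file only contains rows of a
// single partition (Pinot then builds one segment per input file).
public class CsvPartitionSplitter {
    public static void main(String[] args) throws IOException {
        int numPartitions = 10;
        List<String> lines = Files.readAllLines(Path.of("/mydir/trades.csv")); // id,strategy_id,profit
        String header = lines.get(0);

        PrintWriter[] writers = new PrintWriter[numPartitions];
        for (int p = 0; p < numPartitions; p++) {
            writers[p] = new PrintWriter(Files.newBufferedWriter(Path.of("/mydir/partition_" + p + ".csv")));
            writers[p].println(header);
        }
        for (String line : lines.subList(1, lines.size())) {
            long strategyId = Long.parseLong(line.split(",")[1]); // strategy_id column
            int partition = (int) (strategyId % numPartitions);   // Modulo partition function
            writers[partition].println(line);
        }
        for (PrintWriter w : writers) {
            w.close();
        }
    }
}
```
Each of those per-partition files could then be split further if you want more than one segment per partition, as long as every piece still holds rows from only one partition.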
d
@Jackie do I have to use a specific name for the csv? I do not know, maybe 1.csv for partition one... 2.csv for partition two and so on... However, before using partitions I split the big csv to create segments. Now I have to split the csv to create partitions, right? In this case how can I create the segments inside the partitions? Because if I create 10 partitions there will be 10 CSVs, so how will Pinot create the segments inside them?
j
No, there is no requirement for the input file name. When you split the csv file, you can create multiple files for each partition (e.g. 10 segments per partition, or 9 for partition 0 and 11 for partition 1). Pinot will create one segment per input file
d
wait @Jackie i am still a bit confused about that, sorry. Before, you said that i need to split my big CSV into smaller files to "group" rows with the same partition key. So, for 10 partitions i should create 10 files. I missed the second part you just wrote. If i can create multiple files, for example splitting my 10M-row file into 1000 smaller files, it means that every file will have ~10k rows. I suppose that i should group in the same segment (csv file) the rows with the same partition key, and then Pinot will assign that segment (all the rows of that csv file) to the partition. Right? The important thing is putting in the same segment the rows with the same key, then Pinot internally will assign it to partition 0 or 1 or 2 etc... Ok? Is what i just said correct? Lastly, if i have an offline table where i ingest data every day (adding new rows via CSV), should i do a similar thing? I mean, my table has 10M rows, then tomorrow i need to add new data to that table, let's suppose 10 rows in total. In that case i should create 10 files (one per row) to match the partition keys, right? Obviously if two rows have the same partition i will put both into the same CSV, but is the logic correct? Thank you very much!
j
@Damiano There is only one requirement for partitioning to work: all the rows within the same input file are within the same partition.
But if the volume of the data is small (e.g. less than 1000 rows per day), then I don’t think the data is worth partitioning because of the overhead of processing tiny segments
d
@Jackie I read on Slack that Pinot will merge small segments. I think it is a feature under development. However, in my case I must use partitions to ensure all the rows of a partitioning key are on the same node and not spread across the whole cluster.
j
Can you elaborate more on your use case? What is the total data size and how much data will you generate every day?
d
@Jackie I am using Pinot on many projects, but the one where I need partitions is a project that needs to analyze stocks. I do backtests of strategies, using Pinot to retrieve statistics of the strategies' trades. As I told you, I need partitions because I am implementing logic similar to window functions, so I must deal with the entire sequence of rows (for example to calculate the running sum). For this reason I need "windows". Another example could be calculating moving averages. So the job is (a) adding the historical trades of the strategies to monitor, the 10M I said, and (b) adding new trades every day. Partitions are a must in this case because if I do not use them I must move the window logic from the server to the broker, and that's no good.
j
@Damiano I see. In that case, you need to create and push 10 segments (one per partition) every day. The reason why I asked about the data size is that for a small amount of data (e.g. ~10M rows) you can also consider refreshing the whole table every day instead of keeping on adding tiny segments.
d
@Jackie yes i can do that too, it is not a problem, it will take a few seconds. But i think with the next release this problem should be solved because they are implementing segment merging. However yes, i will refresh the entire table every day, thank you!