Evan Galpin
08/15/2022, 11:43 PM
Mayank
Evan Galpin
08/16/2022, 12:17 PM
Evan Galpin
08/16/2022, 12:23 PM
Mayank
Mayank
Mayank
Evan Galpin
08/16/2022, 2:29 PM
Given a column foo holding INT values, and 5 rows with the values:
1, 1, 2, 3, 3
Could these rows appear in the following order in the segment and still be considered sorted? (This assumes that the sorted index only holds start and end indexes.)
2
1
1
3
3
where the sorted index would look like:
Value   start_index   end_index
-----   -----------   ---------
2       0             0
1       1             2
3       3             4
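For illustration, the start/end table in this message can be derived mechanically from the row order. This is a minimal Python sketch of run-length grouping, not Pinot's implementation; the function name build_run_index is hypothetical.

```python
from itertools import groupby


def build_run_index(values):
    """Return (value, start_index, end_index) for each run of equal consecutive values."""
    index = []
    position = 0
    for value, run in groupby(values):
        length = sum(1 for _ in run)  # number of rows in this run
        index.append((value, position, position + length - 1))
        position += length
    return index


if __name__ == "__main__":
    rows = [2, 1, 1, 3, 3]  # the row order proposed in the question
    for value, start, end in build_run_index(rows):
        print(value, start, end)
```

Note that this only records runs of adjacent equal values; it says nothing about whether the values themselves are in ascending order.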
Evan Galpin
08/16/2022, 2:35 PM
Evan Galpin
08/16/2022, 2:45 PM
Evan Galpin
08/16/2022, 4:18 PM
Mayank
2, 1, 1, 3, 3 is not a sorted sequence, so it won’t be considered a sorted column in Pinot.
Mayank
Evan Galpin
08/16/2022, 4:37 PM
Is that based on Pinot doing pre-flight checks of some kind to confirm whether the data is sorted? Or is it a technical requirement? I'm sure the example in the docs [1] is simplified, but it does look like sorting all values in the column would not be as vital as ensuring that rows for a given memberId are contiguous. So long as there is a start and end index for a given memberId and all docs between those start/end indexes have the same memberId, it seems like the intended functionality of the sorted index would be supported. Thoughts?
[1] https://docs.pinot.apache.org/basics/indexing/forward-index#sorted-forward-index-with-run-length-encoding
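The distinction in this exchange, between a column that is sorted and one whose equal values are merely contiguous, can be made concrete with a small sketch. This is illustrative Python only; is_sorted and is_grouped are hypothetical helpers and do not reflect how Pinot performs its check.

```python
def is_sorted(values):
    """True if values are in non-decreasing order."""
    return all(a <= b for a, b in zip(values, values[1:]))


def is_grouped(values):
    """True if every distinct value occupies one contiguous run of rows."""
    seen = set()
    previous = object()  # sentinel that compares unequal to any row value
    for value in values:
        if value != previous:
            if value in seen:
                return False  # value reappears after its run ended
            seen.add(value)
            previous = value
    return True


if __name__ == "__main__":
    proposed = [2, 1, 1, 3, 3]
    print(is_sorted(proposed), is_grouped(proposed))
```

Every sorted sequence is also grouped, but the reverse does not hold, which is the gap the question is probing.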
Mayank
Evan Galpin
08/16/2022, 6:59 PM
“There is indeed a check during segment generation that checks ordering of rows and marks as sorted in the metadata.”
Is this something that could be “forced” via a configuration setting of some kind? If data does not need to be sorted globally per segment but rather grouped per memberId, it could be more flexible to allow this?
“Note, for realtime segment generation, sorting happens within Pinot.”
Right, that would be really awesome to take advantage of in a stream processing system: effectively transplanting realtime segment generation into Flink/Beam/Spark Streaming etc. That way all the features of segment creation (e.g. sorted index) could be taken advantage of in distributed compute frameworks that operate per-element (e.g. Flink, Beam), so that processing large amounts of data for backfill could be done without using the resources of the Pinot cluster itself.
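When segments are built outside Pinot, one way to satisfy an ordering check like the one described is to sort each batch of rows on the column before handing it to segment creation. A minimal sketch, assuming rows are plain dicts and memberId is the sort column; no Pinot API is used here, and sort_for_segment is a hypothetical helper.

```python
from operator import itemgetter


def sort_for_segment(records, sort_column):
    """Return a new list of records ordered by sort_column (stable sort)."""
    return sorted(records, key=itemgetter(sort_column))


if __name__ == "__main__":
    # Hypothetical batch collected by an external job before segment creation.
    batch = [
        {"memberId": 2, "event": "a"},
        {"memberId": 1, "event": "b"},
        {"memberId": 1, "event": "c"},
        {"memberId": 3, "event": "d"},
        {"memberId": 3, "event": "e"},
    ]
    for record in sort_for_segment(batch, "memberId"):
        print(record["memberId"], record["event"])
```

Because the sort is stable, rows with the same memberId keep their original relative order.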
Mayank
Evan Galpin
08/16/2022, 7:29 PM
Mayank