Slackbot
03/01/2023, 5:41 AM
Gian Merlino
03/01/2023, 7:10 AM
Vadim
03/01/2023, 4:53 PM
David McHealy
03/01/2023, 5:00 PM
Vadim
03/01/2023, 5:07 PM
Lee Rhodes
03/01/2023, 7:06 PM
Vadim
03/01/2023, 7:13 PM
Lee Rhodes
03/01/2023, 7:16 PM
Karan Kumar
03/02/2023, 11:39 AM
Karan Kumar
03/02/2023, 11:53 AM
ItemsSketch with K = 32768
• We are adding byte arrays to the sketch, which are strings -> bytes.
• Theoretically, we do not have any upper bound on N.
• The number of splits is generally N/3_000_000.
We are downsampling the sketch based on average element sizes. We store the avg element size * sketch.getN(). If this crosses 300MB, we downsample to k/2.
Lee Rhodes
03/02/2023, 9:52 PM
Vadim
03/02/2023, 9:53 PM
Lee Rhodes
03/02/2023, 9:53 PM
Vadim
03/02/2023, 9:55 PM
Vadim
03/02/2023, 9:55 PM
Lee Rhodes
03/02/2023, 9:56 PM
Lee Rhodes
03/02/2023, 9:56 PM
We are downsampling the sketch based on average element sizes. We store the avg element size * sketch.getN(). If this crosses 300MB, we downsample to k/2.
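A minimal sketch of the size check quoted above, assuming the estimate is simply the average element size times the item count, that "300MB" means 300 * 1024 * 1024 bytes, and that k is halved whenever the estimate crosses that threshold. `SketchSizePolicy`, `estimatedBytes`, and `nextK` are hypothetical names used only for illustration; they are not part of the DataSketches API, and the real downsampling would be done by the sketch itself.

```java
final class SketchSizePolicy {
    // Threshold from the message above; treating 1 MB as 2^20 bytes is an assumption.
    static final long MAX_BYTES = 300L * 1024 * 1024;

    // Estimated footprint: average element size times the item count.
    static long estimatedBytes(double avgElementBytes, long itemCount) {
        return (long) (avgElementBytes * itemCount);
    }

    // Returns the k to use next: k/2 once the estimate crosses the threshold,
    // otherwise k unchanged. k is never reduced below 1.
    static int nextK(int k, double avgElementBytes, long itemCount) {
        if (estimatedBytes(avgElementBytes, itemCount) > MAX_BYTES && k > 1) {
            return k / 2;
        }
        return k;
    }
}
```

For example, with k = 32768 and 100-byte elements, the check would leave k alone at one million items and halve it at four million.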
Lee Rhodes
03/02/2023, 9:57 PM
Lee Rhodes
03/02/2023, 10:01 PM
Karan Kumar
03/04/2023, 5:06 PM
Downsample this sketch, dropping about half of the keys that are currently retained
All records are pushed to the sketch.
Once the sketch reaches 300 MB, we downsample it.
After all the records are read, and after a series of downsampling cycles if required, we get evenly spaced quantiles via getQuantiles(sketch.getN()/target_row_batch_size()).
These data boundaries are then used to partition the data so that we get even cuts.
Lee Rhodes
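A minimal sketch of the boundary step Karan describes above, using an exact sort as a stand-in for the quantile sketch. `EvenSplitPoints`, `numQuantiles`, and `evenlySpaced` are hypothetical names, not DataSketches calls. The stand-in returns m values at fractional ranks 0, 1/(m-1), ..., 1, which is one reading of "evenly spaced"; the library's documentation is the place to confirm what the real method returns.

```java
import java.util.Arrays;

final class EvenSplitPoints {
    // Number of evenly spaced values to request, mirroring N / target rows per partition.
    static int numQuantiles(long n, long targetRowsPerPartition) {
        return (int) Math.max(1, n / targetRowsPerPartition);
    }

    // Exact stand-in for an evenly spaced quantile query over a non-empty input:
    // returns m values at fractional ranks 0, 1/(m-1), ..., 1 (min first, max last).
    static String[] evenlySpaced(String[] items, int m) {
        String[] sorted = items.clone();
        Arrays.sort(sorted);
        String[] out = new String[m];
        for (int i = 0; i < m; i++) {
            double rank = (m == 1) ? 0.0 : (double) i / (m - 1);
            int idx = (int) Math.round(rank * (sorted.length - 1));
            out[i] = sorted[idx];
        }
        return out;
    }
}
```

Unlike a sketch, the exact version needs all items in memory, so it only illustrates what the boundaries mean, not how they would be computed at scale.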
03/04/2023, 8:02 PM
Lee Rhodes
03/04/2023, 8:16 PM
Karan Kumar
03/05/2023, 2:31 AM
Lee Rhodes
03/05/2023, 2:33 AM
Lee Rhodes
03/05/2023, 2:35 AM
Karan Kumar
03/05/2023, 2:44 AM
Lee Rhodes
03/05/2023, 2:45 AM
Karan Kumar
03/05/2023, 2:46 AM
0.063%
came out to be 3M so we had a distribution like:
2M: 2, 4M: 320, 5M: 270, 6M: 6, 7M: 28, 8M: 4023, 9M: 24, 10M: 4, 11M: 4
So we increased the K value so that the error rate goes down.
Karan Kumar
03/05/2023, 2:48 AM
Lee Rhodes
03/05/2023, 2:51 AM
Lee Rhodes
03/05/2023, 2:51 AM
Lee Rhodes
03/05/2023, 2:54 AM
Karan Kumar
03/05/2023, 3:00 AM
Karan Kumar
03/05/2023, 3:02 AM
Lee Rhodes
03/05/2023, 3:06 AM
Karan Kumar
03/05/2023, 3:08 AM
Lee Rhodes
03/05/2023, 3:08 AM
Lee Rhodes
03/05/2023, 3:08 AM
Lee Rhodes
03/05/2023, 3:11 AM
Karan Kumar
03/05/2023, 3:13 AM
Lee Rhodes
03/05/2023, 4:17 AM
Lee Rhodes
03/05/2023, 4:17 AM
Lee Rhodes
03/05/2023, 4:26 AM
Karan Kumar
03/05/2023, 4:28 AM
Lee Rhodes
03/05/2023, 4:28 AM
Lee Rhodes
03/05/2023, 4:29 AM
Karan Kumar
03/05/2023, 4:33 AM
Lee Rhodes
03/05/2023, 4:34 AM
Karan Kumar
03/05/2023, 4:35 AM
Lee Rhodes
03/05/2023, 4:35 AM
Lee Rhodes
03/05/2023, 4:36 AM
Karan Kumar
03/05/2023, 4:37 AM
1M: 1, 4M: 261, 5M: 268, 7M: 3, 8M: 4091, 11M: 1
Karan Kumar
03/05/2023, 4:38 AM
Lee Rhodes
03/05/2023, 4:41 AM
Lee Rhodes
03/05/2023, 4:42 AM
Lee Rhodes
03/05/2023, 4:43 AM
Lee Rhodes
03/05/2023, 4:44 AM
Lee Rhodes
03/05/2023, 4:45 AM
Lee Rhodes
03/05/2023, 4:52 AM
Karan Kumar
03/05/2023, 4:52 AM
Karan Kumar
03/05/2023, 4:52 AM
Karan Kumar
03/05/2023, 4:53 AM
Lee Rhodes
03/05/2023, 4:55 AM
Lee Rhodes
03/05/2023, 4:56 AM
Karan Kumar
03/05/2023, 4:57 AM
Lee Rhodes
03/05/2023, 4:58 AM
Lee Rhodes
03/05/2023, 5:00 AM
Karan Kumar
03/05/2023, 5:02 AM
Karan Kumar
03/05/2023, 5:03 AM
Lee Rhodes
03/05/2023, 5:05 AM
Karan Kumar
03/05/2023, 5:06 AM
Lee Rhodes
03/05/2023, 5:07 AM
Lee Rhodes
03/05/2023, 5:09 AM
Karan Kumar
03/05/2023, 5:10 AM
Lee Rhodes
03/05/2023, 5:10 AM
Karan Kumar
03/05/2023, 5:10 AM
Lee Rhodes
03/05/2023, 5:12 AM
Lee Rhodes
03/05/2023, 5:12 AM
Lee Rhodes
03/10/2023, 2:06 AM
Gian Merlino
03/10/2023, 2:19 AM
Gian Merlino
03/10/2023, 2:20 AM
Lee Rhodes
03/10/2023, 5:17 AM
Lee Rhodes
03/10/2023, 5:30 AM
Lee Rhodes
03/10/2023, 5:31 AM
Lee Rhodes
03/10/2023, 5:49 AM
Lee Rhodes
03/10/2023, 6:00 AM
Lee Rhodes
03/10/2023, 6:11 AM
Gian Merlino
03/10/2023, 9:12 AM
Karan Kumar
03/10/2023, 1:49 PM
Lee Rhodes
03/11/2023, 1:59 AM
Lee Rhodes
03/11/2023, 2:03 AM
Karan Kumar
03/11/2023, 5:00 AM
Question, to obtain the final partitions, what sketch functions are you using: getQuantiles(…), getPMF(…), getCDF(…)
We are using sketch.getQuantiles(). We get the various ranks and partition the data per worker according to them.
Lee Rhodes
03/11/2023, 6:49 PM
T[] sketch.getQuantiles(int evenlySpaced) returns an array of quantile values that define an array of buckets, where each bucket is assigned a value from the quantile array.
Then you rescan the input stream and place each item in the bucket where the value of the item is <= that bucket's value and > the value of the bucket just below it. Is this correct?
Lee Rhodes
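The bucket rule in the 6:49 PM message (item <= a bucket's value and > the value just below it) can be sketched as a lower-bound binary search over the sorted boundary array. `BucketAssigner` and `bucketFor` are hypothetical names, and the boundaries are assumed to be sorted ascending under the same ordering used for the items.

```java
final class BucketAssigner {
    // Returns the index of the first boundary that is >= item, i.e. the bucket b with
    // boundaries[b - 1] < item <= boundaries[b]. Returns boundaries.length when the
    // item is greater than every boundary.
    static int bucketFor(String[] boundaries, String item) {
        int lo = 0;
        int hi = boundaries.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (boundaries[mid].compareTo(item) < 0) {
                lo = mid + 1; // boundary is below the item, look to the right
            } else {
                hi = mid;     // boundary is >= item, it may be the answer
            }
        }
        return lo;
    }
}
```

An item equal to a boundary lands in that boundary's bucket, which matches the "<=" side of the rule.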
03/11/2023, 6:54 PM
Adarsh Sanjeev
03/12/2023, 8:44 AM
Lee Rhodes
03/12/2023, 7:57 PM
the property we need is to ensure that any change in timestamp value while getting quantiles would be a split point (I think this is something quantile sketches don’t really handle well).
This is the first I have heard of this issue. Has this been reported to us as an issue on our website? If we don’t know about an issue, we can’t investigate it 🙂. I can guess that it may be because quantile sketches are a stochastic sampling process, and in that process some date transitions get skipped over, resulting in a transition that effectively occurs at some other nearby point but not exactly where the transition occurs in the raw data (but I can’t be sure). If you know the first part of your input string is a date field, why don’t you just do an initial scan over your data and spray the items into date buckets? You don’t need a sketch for that.
If we are say, partitioning by month, we use a different bucket for each month and use a different sketch to guarantee this property.
What different sketch?
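A minimal sketch of the initial scan suggested above, assuming every item is a string that starts with a `yyyy-MM` date prefix. The prefix format, `DateBuckets`, and `byMonth` are assumptions for illustration; the actual key layout is not given in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class DateBuckets {
    // Groups items by their leading "yyyy-MM" prefix (assumed format).
    // A TreeMap keeps the buckets in month order.
    static Map<String, List<String>> byMonth(List<String> items) {
        Map<String, List<String>> buckets = new TreeMap<>();
        for (String item : items) {
            String month = item.substring(0, 7); // assumes at least 7 leading date characters
            buckets.computeIfAbsent(month, k -> new ArrayList<>()).add(item);
        }
        return buckets;
    }
}
```

Because every month gets its own bucket, a change of month is always a split point, regardless of what any sketch does inside a bucket.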
Gian Merlino
03/13/2023, 9:05 PM
If you know the first part of your input string is a date field, why don’t you just do an initial scan over your data and spray the items into date buckets? You don’t need a sketch for that.
This is basically what we're doing today. The issue is that we have a certain memory budget to use across all buckets, and because we're managing one sketch per bucket, we need to manage that budget as we add things to the various buckets. Whenever we're about to exceed our memory budget, we pick a bucket and call downSample on the sketch for that bucket. The way we pick the bucket… "works" but is probably not ideal.
Gian Merlino
03/13/2023, 9:06 PM
This is the first I have heard of this issue. Has this been reported to us as an issue on our website?
No, since we figured it was out of scope for any of the sketches you have
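A minimal sketch of the shared memory budget Gian describes in the 9:05 PM message: one size estimate per bucket, and a shrink step whenever the total would exceed the budget. The thread does not say how the bucket is picked, so choosing the largest bucket first is purely an assumption here, and halving the byte estimate only stands in for calling downSample on that bucket's sketch. `BucketBudget` and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

final class BucketBudget {
    private final long budgetBytes;
    private final Map<String, Long> bytesByBucket = new HashMap<>();
    private long totalBytes = 0;

    BucketBudget(long budgetBytes) {
        this.budgetBytes = budgetBytes;
    }

    // Records growth of one bucket's sketch, then shrinks buckets until the total fits.
    void add(String bucket, long bytes) {
        bytesByBucket.merge(bucket, bytes, Long::sum);
        totalBytes += bytes;
        while (totalBytes > budgetBytes) {
            String victim = largestBucket(); // assumed policy: largest bucket first
            long before = bytesByBucket.get(victim);
            long after = before / 2;         // stand-in for downSample on that sketch
            if (after == before) {
                break;                       // nothing left to reclaim
            }
            bytesByBucket.put(victim, after);
            totalBytes -= (before - after);
        }
    }

    // Returns the bucket with the largest current estimate (ties broken arbitrarily).
    private String largestBucket() {
        String best = null;
        long bestBytes = -1;
        for (Map.Entry<String, Long> e : bytesByBucket.entrySet()) {
            if (e.getValue() > bestBytes) {
                best = e.getKey();
                bestBytes = e.getValue();
            }
        }
        return best;
    }

    long bytes(String bucket) {
        return bytesByBucket.getOrDefault(bucket, 0L);
    }

    long totalBytes() {
        return totalBytes;
    }
}
```

Largest-first keeps the number of downsampled buckets small, but it also concentrates the accuracy loss in the biggest buckets; other policies would trade that off differently.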
Vadim
03/13/2023, 10:28 PM