# dev
g
Hi!
v
Welcome! We sure do love DataSketches in Druid ❤️
d
I don't know. They are sketchy. (sorry)
v
Fun fact about DataSketches in Druid: up until Druid 0.23, DataSketches were only used by Druid as user-configurable data types. With the introduction of multi-dimensional range partitioning a year ago in Druid 0.23, DataSketches are now relied upon by 'core' Druid for accurately distributing data between segments. So even if a user doesn't utilize DataSketches as part of their data schema, they are still probably using them internally, especially if they use SQL-based ingestion, which uses range partitioning by default.
l
@Vadim thanks for the observation. I assume you use one of our quantile sketches for the range partitioning. I am curious: which one of the quantile sketches are you using? e.g., the traditional quantiles, KLL, REQ? What size K are you configuring the sketch with? What is the data type you are feeding the sketch with: Strings or generic items, or ? How big can N be? What is the max # of splits across all dimensions? Finally what is your tolerance for how equal the splits have to be in size? e.g., 1%, 10%? Understanding use cases like this is important for us.
v
All great questions! I am not the sketch expert to answer them; I am more of a sketch cheerleader. That said, I believe it is some quantile sketch. I know that as part of SQL-based ingestion (24.0) we added downsampling to the mix, which was really cool. We also had an issue in an early dev version where the segment sizes were coming out really wacky, and it was traced to us hilariously misconfiguring the sketches. I am sure @kfaraz and @Gian Merlino could answer your questions more directly.
l
Configuring the sketches based on your application is where I might be able to help.
k
cc @Adarsh Sanjeev, who has also done a lot of work on DataSketches when used for partitioning.
• We are using an ItemsSketch with K = 32768.
• We are adding byte arrays to the sketch, which are strings -> bytes.
• Theoretically, we do not have any upper bound on N.
• Number of splits is generally N/3_000_000.
We are downsampling the sketch based on average element sizes: we store the avg element size * sketch.getN(). If this crosses 300 MB we downsample to K/2.
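For context, the collection scheme described above boils down to roughly the following. This is a minimal illustrative sketch, not Druid's actual code: it assumes the classic quantiles ItemsSketch from datasketches-java (the getInstance factory signature varies between library versions), uses String keys for simplicity, and bases the footprint estimate on retained items times average key size so that downsampling actually shrinks the estimate.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;

import org.apache.datasketches.quantiles.ItemsSketch;

// Illustrative only: feed keys into a classic quantiles ItemsSketch and halve K
// whenever the estimated footprint crosses a ~300 MB budget.
public class PartitionKeyCollector {
  private static final int INITIAL_K = 32_768;              // K = 2^15, as in the thread
  private static final long MAX_BYTES = 300L * 1024 * 1024; // ~300 MB budget

  // Factory signature is version-dependent in datasketches-java; adjust as needed.
  private ItemsSketch<String> sketch =
      ItemsSketch.getInstance(INITIAL_K, Comparator.<String>naturalOrder());
  private long totalKeyBytes = 0;

  public void add(String key) {
    sketch.update(key);
    totalKeyBytes += key.getBytes(StandardCharsets.UTF_8).length;

    long avgKeyBytes = totalKeyBytes / Math.max(1, sketch.getN());
    // The thread describes the estimate as avgSize * getN(); basing it on retained items
    // (an assumption here) is what lets a downsample actually reduce the footprint.
    // Note: the retained-items accessor name may differ across library versions.
    while (avgKeyBytes * sketch.getRetainedItems() > MAX_BYTES && sketch.getK() > 2) {
      sketch = sketch.downSample(sketch.getK() / 2);   // drops roughly half the retained keys
    }
  }

  public ItemsSketch<String> getSketch() {
    return sketch;
  }
}
```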
l
That is a huge value of K and would correspond to a rank error of 64 parts-per-million, or 0.0064%!
v
haha, I love parts-per-million. The best unit to measure error and CO2
l
So are you taking a single sketch and giving it ~ 3 million split-points?
v
I think it gets N (number of rows) / 3,000,000 splits because it aims to make 3M rows per segment
so like if you have 90M rows in your dataset then 30 splits
l
I don’t understand this sentence:
We are downsampling the sketch based on average element sizes: we store the avg element size * sketch.getN(). If this crosses 300 MB we downsample to K/2.
What are you doing exactly?
Now I understand, I misread your bullet above. But I still don't understand the above sentence. What are you doing when you "downsample the sketch"?
k
Downsample this sketch, dropping about half of the keys that are currently retained
All records are pushed to the sketch. Once the sketch reaches 300 MB we downsample it. After all the records are read, and a series of downsampling cycles if required, we get evenly spaced quantiles by getQuantiles(sketch.getN() / target_row_batch_size()). These data boundaries are then used to partition the data so that we get even cuts.
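To make the split-point step concrete, here is a hedged sketch of that last call: evenly spaced quantile values pulled from the sketch, which then serve as partition boundaries. The helper name computeBoundaries and the parameter targetRowsPerSegment are illustrative, not Druid identifiers.

```java
import org.apache.datasketches.quantiles.ItemsSketch;

public final class SplitPoints {
  /**
   * Illustrative helper mirroring getQuantiles(sketch.getN() / target_row_batch_size()).
   * getQuantiles(m) returns m values at evenly spaced normalized ranks, typically
   * including the min and max items seen, so the interior values act as split points.
   */
  static String[] computeBoundaries(ItemsSketch<String> sketch, long targetRowsPerSegment) {
    int numSplits = (int) Math.max(2, sketch.getN() / targetRowsPerSegment);
    return sketch.getQuantiles(numSplits);
  }
}
```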
l
Are you going inside the sketch and removing half of the retained entries?
Or are you using the downSample(newK) method with a smaller K?
k
We call downSample(newK) method with a smaller K
l
Whew! So why do you feel you need such a large K to start with?
If you start with a smaller K, it will speed up your processing 🙂
k
We were initially on K = 4096, and then we saw use cases where 37B records were going into a single sketch
l
OK, and what happened?
k
With a target size of 5M per partition, the error rate of 0.063% came out to be 3M rows, so we had a distribution like:
2M: 2, 4M: 320, 5M: 270, 6M: 6, 7M: 28, 8M: 4023, 9M: 24, 10M: 4, 11M: 4
So we increased the K value so that the error rate would go down.
Since we do not know how many records will come into the system, we went with the biggest K value and then downsampled as we hit the memory limits.
l
But if you downsize 3 times, you will end up with a 4096 sketch, with the same error. I’ll have to think about this. Are the numbers above the resulting partition sizes of 18 partitions?
Do you have large numbers of duplicate rows?
And how successful is this strategy? How much variance across the partition sizes are you getting?
k
The numbers above indicate num_rows_in_partition -> num_partitions. So we had around 4.6k partitions. Our accuracy for the smaller datasets has improved.
Regarding duplicate rows, we cannot control that. There may or may not be duplicates
l
Let me understand your numbers: So at the end of processing one huge number of rows, your numbers represent the distribution of the resulting ~4.6K partitions?
k
Yes
l
So the range of partition sizes was from 2M rows to 11M rows
This is very very interesting.
Hmm, that is not very good. Let me think about this. Could you send me a sample of what a single row might look like that is fed to the sketch? Just one or two.
k
I will need to dig that up. Will send it to you in a few hours?
l
no problem. Curious, where are you located? India? 🙂
(sorry for the break, my wife called me for dinner 🙂 )
Your little distribution implies 62M rows and 4.68K partitions (like you said).
k
Yes, located in India. We were able to repro the issue by pumping longs (1B) into the items sketch.
l
I have a hunch that the variance you are seeing is not due to sketch error, but due to a likely power-law distribution of duplicates. But I could be wrong. 🙂
Hmmmm. How did you generate the 1B longs?
k
We generated the longs serially in ascending order so that when we get the quantiles we know the segment sizes by just looking at the value.
l
You mean 1,2,3,…., 1B?
k
Yup
l
and you got a similar distribution with a lot of variance?
How many splits did you request?
k
Yes, the variance we got (rows_in_partition -> num_partitions):
1M: 1, 4M: 261, 5M: 268, 7M: 3, 8M: 4091, 11M: 1
It was similar
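A hedged reconstruction of that repro, for anyone who wants to try it: feed the serial longs 1..N into an ItemsSketch<Long> (the type mentioned later in the thread), ask for evenly spaced quantiles, and read the partition sizes directly off the boundary values, since value equals rank for this input. K, N, and the target size are taken from the thread; the factory signature is version-dependent and the class name SerialLongsRepro is just illustrative.

```java
import java.util.Comparator;
import org.apache.datasketches.quantiles.ItemsSketch;

// Illustrative repro sketch: with serial longs 1..N, a boundary's value is its true rank,
// so adjacent boundary differences are the actual partition sizes.
public class SerialLongsRepro {
  public static void main(String[] args) {
    final long n = 1_000_000_000L;             // 1B serial longs, as in the thread
    final long targetRowsPerSegment = 5_000_000L;
    final int k = 4096;

    ItemsSketch<Long> sketch = ItemsSketch.getInstance(k, Comparator.<Long>naturalOrder());
    for (long i = 1; i <= n; i++) {
      sketch.update(i);
    }

    // +1 so the returned boundaries bracket each partition (first and last are min and max).
    int numBoundaries = (int) (n / targetRowsPerSegment) + 1;
    Long[] boundaries = sketch.getQuantiles(numBoundaries);

    for (int i = 1; i < boundaries.length; i++) {
      System.out.println("partition " + i + ": " + (boundaries[i] - boundaries[i - 1]) + " rows");
    }
  }
}
```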
l
Hmmm, I’m beginning to think we may have a bug in the item sketch.
A few more questions, if you don’t mind. I’d like to reproduce your experiment.
Did you, by chance try feeding the longs into the DoublesSketch?
Did you configure the ItemSketch with a generic <Long>?
Also, the little distribution you show above must be a sampling of the output, because the total rows is only 36M.
I mean sampling of the resulting partitions.
k
Won't the total rows be 33B?
1M*1+4M*261+7M*3+8M*4091+11M*1
IIRC we did not use the DoublesSketch
l
Oh right! Stupid me, I was misreading your data! 🤯
OK, now I get 35B
k
It was most likely ItemsSketch<Long>
l
But if your input was just 1B rows, what is going on?
Or was this real data?
k
Oops, I misread the notes. I do not have the variance for the test data
This was most likely the real data.
l
OK, I think I have enough information to reproduce your experiment. I’m also going to try the DoubleSketch to compare. And your experiment also used a lgK of 15?
k
Nope we did not experiment with that
l
Confirming: you set K = 32,768
It is getting late for me, I probably won’t get much time to spend on this until Monday.
k
We set K=4096 for the above experimentation
l
OK, I’ll try K=4096 as well.
k
Thank you!!
l
No problem, I hope you haven’t been struggling with this issue for very long. Anytime you see something strange like this you should put an issue on our datasketches-java repo. That way our whole team can see it. 🙂
Cheers for now,
🙌 2
@Gian Merlino @Vadim @Karan Kumar @Adarsh Sanjeev Here is my initial analysis of what is causing the high variance in your partition sizes. The short answer is that you are inadvertently requesting too many split points given the size of the maximum input stream and the size of the quantile sketch. In other words, the implied resolution you are requesting is very close to, or smaller than, the error limit of the sketch, which causes a lot of noise and variation in the resulting partition sizes.
To illustrate this I have run some simulations of what you have described to me about your use case, and it comes down to these key parameters:
• Input N can be as large as 37B.
• Classic quantiles sketch => ItemsSketch<String> with K = 2^15.
• tgtPartSize: target partition size of 3M items.
• Number of split points calculated by N / tgtPartSize.
I first decided to model this using the classic quantiles DoublesSketch, as handling strings is much, much slower. I felt that if I can reproduce the problem with the DoublesSketch, then all I have to do is some spot measurements with the ItemsSketch to eliminate anything peculiar to the ItemsSketch. (I haven't done this last step, because what I have so far is quite convincing, I think, but I will.)
My model has these parameters:
• lgK ranges from 10 to 15
• inputN ranges from 1B to 39B
• tgtPartSize = 2.5M items (which is a little worse than 3M)
Varying these parameters produces these metrics:
• NumPart: number of partitions = SplitPoints + 1
• ImpRes: implied resolution = 1 / NumPart
• hMean: average resulting partition size
• hMax: maximum resulting partition size
• hMin: minimum resulting partition size
• hRange: hMax - hMin
• hRange%: hRange / hMean * 100
• Variance: the computed variance of all the partition sizes
• StdDev: the computed standard deviation of all the partition sizes
• StdDev/Mean%
• SkErr: the error of the sketch determined by K
• Retained items: gives a sense of the size of the sketch (in items, not bytes)
(to be continued)
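The core of this analysis can be captured in a few lines of arithmetic: compare the implied resolution (1 / number of partitions) against the sketch's rank error. The epsilon below is the ~64 ppm figure quoted earlier for K = 2^15; the factor-of-two margin is just an illustrative threshold, not anything from the library.

```java
// Illustrative check: when the implied resolution approaches or drops below the
// sketch's rank error, the requested split points are finer than the sketch can resolve.
public class ResolutionCheck {
  public static void main(String[] args) {
    long[] inputNs = {1_000_000_000L, 10_000_000_000L, 37_000_000_000L};
    long tgtPartSize = 2_500_000L;
    double sketchRankError = 6.4e-5;   // ~64 ppm at K = 2^15, as quoted in the thread

    for (long n : inputNs) {
      long numPartitions = n / tgtPartSize;
      double impliedResolution = 1.0 / numPartitions;
      System.out.printf("N=%d  partitions=%d  impliedRes=%.2e  sketchErr=%.2e  noisy=%b%n",
          n, numPartitions, impliedResolution, sketchRankError,
          impliedResolution < 2 * sketchRankError);
    }
  }
}
```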
g
yeah I think that makes sense: intuitively, we are asking for a lot of cut-points, meaning that we're very sensitive to errors. if we have 10 B input items, we're going to want about 2,000 cut-points, which means we want to get the 0.0005 ranked item, the 0.001 ranked item, the 0.0015 ranked item, and so on. so rank error has to be even smaller than that
is that matching what you are finding as well?
l
(Continuing) Let’s look at the following chart:
This first chart confirms that as we configure the sketch with larger K, not only does the native error of the sketch go down (green), but the range of the variation in partition sizes (hRange%) and the standard deviation of that variation (StdDev/Mean%) go down as well. These last two metrics are normalized by the mean partition size so we can get a sense of the impact. As you can see from the title of the chart, N = 1B, and with a target partition size of 2.5M this results in only 400 partitions; thus 1/400 => 2.5e-3 implied resolution (black line). The dashed curves are plotted on the right vertical axis. The key metrics of hRange% and StdDev/Mean% certainly improve with larger K.
(Continuing) Now let’s look at the next couple of plots:
The chart on the left is much more revealing. Now I'm plotting against N, keeping the other parameters constant. Here N varies from 1e9 items to 39e9 items in steps of 2e9. LgK is constant at 15, TgtPartSize is constant at 2.5e6, and since the sketch is a constant size, SkErr is constant at 8e-5. The black curve now decreases dramatically with each increase in N, and at the largest N it actually crosses the green line and becomes less than the sketch error. This is bad news, and you can see the results in the erratic plots of hRange% and StdDev/Mean%. At N = 35B the variation in partition size is about 52% of the average size of the partitions! This explains the wild variations you are seeing. The StdDev/Mean% is smaller because, if you examine the Gaussian, +1 StdDev covers only about 34% of the area under the curve. The second chart just confirms the underlying cause: if you always keep the target size of the resulting partitions the same (here 2.5M items), then as N increases it demands more and more splits from the sketch, to the point where it demands more resolution than the sketch is designed to handle and you get very noisy results.
🙌 1
(Continuing) We have had quite a bit of internal discussion in our team about this, and here are some of our thoughts.
• Frankly, this use case has come as a bit of a surprise. When we designed these sketches, we anticipated large N, and that's not the problem. What we didn't anticipate was large N combined with a large number of split points. This specific use case requires values of K much larger than what we originally anticipated anyone ever needing.
• We feel this needs a new sketch design specifically targeted at very large N and a very large number of split points, where K can scale much larger than 2^15. This is why we would like to huddle with you folks to understand more about your environment and your constraints.
I hope you find this analysis interesting. Cheers,
@Gian Merlino I think this analysis answers your question 🙂
g
i find this very interesting! especially the tantalizing prospect of a sketch more closely suited to our needs 🙂
k
Super exciting stuff!!
l
@Karan Kumar Question, to obtain the final partitions, what sketch functions are you using: getQuantiles(…), getPMF(…), getCDF(…)?
And then what do you do to create the partitions?
k
Question, to obtain the final partitions, what sketch functions are you using: getQuantiles(…), getPMF(…), getCDF(…)
We are using sketch.getQuantiles(). We get the various ranks and partition the data per worker according to them.
l
The T[] sketch.getQuantiles(int evenlySpaced) call returns an array of quantile values that define an array of buckets, where each bucket is assigned a value from the quantile array. Then you rescan the input stream and, for each item, you place it in a bucket where the value of the item is <= that bucket's value and > the value of the bucket just below it. Is this correct?
This will produce a quasi-sorted partition array, where the overall array is sorted but the values in each partition are not sorted, just somewhat "close" together. So now my question is: why is this quasi-sorted property important?
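To spell out that rescan step: given the boundary array from getQuantiles, each item lands in the bucket whose boundary value is the smallest one at or above the item. A minimal illustrative assignment using binary search follows; the class name BucketAssigner and the clamping of the first and last buckets are assumptions, not anything from Druid or DataSketches.

```java
import java.util.Arrays;

// Illustrative only: boundaries come from getQuantiles and include the min and max values.
// Bucket i (0-based) holds items x with boundaries[i] < x <= boundaries[i + 1],
// matching the "item <= a bucket value and > the bucket just below it" rule above.
public final class BucketAssigner {
  static int bucketFor(String item, String[] boundaries) {
    int idx = Arrays.binarySearch(boundaries, item);
    int upper = (idx >= 0) ? idx : -idx - 1;   // index of the smallest boundary >= item
    // Clamp so the minimum item and anything above the last boundary still land in a bucket.
    return Math.max(0, Math.min(upper - 1, boundaries.length - 2));
  }
}
```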
a
Yes, that is correct. Druid uses primary partitioning by timestamp (the first value of each array is normally that, if that is relevant), so the property we need is to ensure that any change in timestamp value while getting quantiles would be a split point (I think this is something quantile sketches don't really handle well). If we are, say, partitioning by month, we use a different bucket for each month and use a different sketch to guarantee this property.
l
the property we need is to ensure that any change in timestamp value while getting quantiles would be a split point (I think this is something quantile sketches don’t really handle well).
This is the first I have heard of this issue. Has this been reported to us as an issue on our website? If we don’t know about an issue, we can’t investigate it 🙂 . I can guess that it may be because quantile sketches are a stochastic sampling process, and in that process some date transitions get skipped over resulting in a transition that effectively occurs at some other nearby point but not exactly where the transition occurs in the raw data (but I can’t be sure). If you know the first part of your input string is a date field, why don’t you just do an initial scan over your data and spray the items into date buckets? You don’t need a sketch for that.
If we are say, partitioning by month, we use a different bucket for each month and use a different sketch to guarantee this property.
What different sketch?
g
If you know the first part of your input string is a date field, why don’t you just do an initial scan over your data and spray the items into date buckets? You don’t need a sketch for that.
This is basically what we're doing today. The issue is that we have a certain memory budget to use across all buckets, and because we're managing one sketch per bucket, we need to manage that budget as we add things to the various buckets. Whenever we're about to exceed our memory budget, we pick a bucket and call downSample on the sketch for that bucket. The way we pick the bucket… "works" but is probably not ideal.
This is the first I have heard of this issue. Has this been reported to us as an issue on our website?
No, since we figured it was out of scope for any of the sketches you have
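A rough illustration of the bucketed scheme described above: one quantile sketch per time bucket sharing a single memory budget, with some bucket chosen for downSample whenever the budget would be exceeded. Everything here other than the downSample call itself (the byte estimate, the AVG_KEY_BYTES constant, and the "largest bucket" policy) is an assumption for illustration, not Druid's actual logic.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

import org.apache.datasketches.quantiles.ItemsSketch;

// Illustrative only: one sketch per time bucket, all sharing one memory budget.
public class BucketedKeyCollector {
  private static final long TOTAL_BUDGET_BYTES = 300L * 1024 * 1024;
  private static final int INITIAL_K = 1 << 15;
  private static final long AVG_KEY_BYTES = 64;   // hypothetical average key size

  private final Map<String, ItemsSketch<String>> sketches = new HashMap<>();

  public void add(String timeBucket, String key) {
    sketches
        .computeIfAbsent(timeBucket,
            b -> ItemsSketch.getInstance(INITIAL_K, Comparator.<String>naturalOrder()))
        .update(key);
    if (estimatedTotalBytes() > TOTAL_BUDGET_BYTES) {
      downsampleLargestBucket();
    }
  }

  private long estimatedTotalBytes() {
    long retained = 0;
    for (ItemsSketch<String> s : sketches.values()) {
      retained += s.getRetainedItems();   // accessor name may differ by library version
    }
    return retained * AVG_KEY_BYTES;
  }

  private void downsampleLargestBucket() {
    // One possible policy: pick the bucket retaining the most items and halve its K.
    String largest = null;
    long maxRetained = -1;
    for (Map.Entry<String, ItemsSketch<String>> e : sketches.entrySet()) {
      long retained = e.getValue().getRetainedItems();
      if (retained > maxRetained) {
        maxRetained = retained;
        largest = e.getKey();
      }
    }
    if (largest != null) {
      ItemsSketch<String> s = sketches.get(largest);
      sketches.put(largest, s.downSample(Math.max(2, s.getK() / 2)));
    }
  }
}
```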
v
Heads up everyone: I made a new channel called #C04TB8M30DV for everything DataSketches-related. Please join if interested.