# dev
g
Hi!
v
Welcome! We sure do love DataSketches in Druid ❤️
d
I don't know. They are sketchy. (sorry)
v
Fun fact about DataSketches in Druid: up until Druid 0.23, DataSketches were only used by Druid as user-configurable data types. With the introduction of multi-dimensional range partitioning a year ago in Druid 0.23, DataSketches are now relied upon by 'core' Druid for accurately distributing data between segments. So even if a user doesn't utilize DataSketches as part of their data schema, they are still probably using them internally, especially if they use SQL-based ingestion, which uses range partitioning by default.
l
@Vadim thanks for the observation. I assume you use one of our quantile sketches for the range partitioning. I am curious: which one of the quantile sketches are you using? e.g., the traditional quantiles, KLL, REQ? What size K are you configuring the sketch with? What is the data type you are feeding the sketch with: Strings or generic items, or ? How big can N be? What is the max # of splits across all dimensions? Finally what is your tolerance for how equal the splits have to be in size? e.g., 1%, 10%? Understanding use cases like this is important for us.
v
All great questions! I am not the sketch expert to answer them; I am more of a sketch cheerleader. That said, I believe it is some quantile sketch. I know that as part of SQL-based ingestion (24.0) we added downsampling to the mix, which was really cool. We also had an issue in an early dev version where the segment sizes were coming out really wacky, and it was traced to us hilariously misconfiguring the sketches. I am sure @kfaraz and @Gian Merlino could answer your questions more directly.
l
Configuring the sketches based on your application is where I might be able to help.
k
cc @Adarsh Sanjeev, who has also done a lot of work on DataSketches when used for partitioning.
• We are using an ItemsSketch with K = 32768.
• We are adding byte arrays to the sketch, which are strings -> bytes.
• Theoretically, we do not have any upper bound on N.
• Number of splits is generally N/3_000_000.
We are downsampling the sketch based on average element sizes: we store the avg element size * sketch.getN(). If this crosses 300 MB we downsample to K/2.
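For context, the collection scheme described above boils down to roughly the following. This is a minimal illustrative sketch, not Druid's actual code: it assumes the classic quantiles ItemsSketch from datasketches-java (the getInstance factory signature varies between library versions), uses String keys for simplicity, and bases the footprint estimate on retained items times average key size so that downsampling actually shrinks the estimate.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;

import org.apache.datasketches.quantiles.ItemsSketch;

// Illustrative only: feed keys into a classic quantiles ItemsSketch and halve K
// whenever the estimated footprint crosses a ~300 MB budget.
public class PartitionKeyCollector {
  private static final int INITIAL_K = 32_768;              // K = 2^15, as in the thread
  private static final long MAX_BYTES = 300L * 1024 * 1024; // ~300 MB budget

  // Factory signature is version-dependent in datasketches-java; adjust as needed.
  private ItemsSketch<String> sketch =
      ItemsSketch.getInstance(INITIAL_K, Comparator.<String>naturalOrder());
  private long totalKeyBytes = 0;

  public void add(String key) {
    sketch.update(key);
    totalKeyBytes += key.getBytes(StandardCharsets.UTF_8).length;

    long avgKeyBytes = totalKeyBytes / Math.max(1, sketch.getN());
    // The thread describes the estimate as avgSize * getN(); basing it on retained items
    // (an assumption here) is what lets a downsample actually reduce the footprint.
    // Note: the retained-items accessor name may differ across library versions.
    while (avgKeyBytes * sketch.getRetainedItems() > MAX_BYTES && sketch.getK() > 2) {
      sketch = sketch.downSample(sketch.getK() / 2);   // drops roughly half the retained keys
    }
  }

  public ItemsSketch<String> getSketch() {
    return sketch;
  }
}
```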
l
That is a huge value of K and would correspond to a rank error of 64 parts-per-million, or 0.0064%!
v
haha, I love parts-per-million. The best unit to measure error and CO2
l
So are you taking a single sketch and giving it ~ 3 million split-points?
v
I think it gets N (number of rows) / 3,000,000 splits because it aims to make 3M rows per segment
so like if you have 90M rows in your dataset then 30 splits
l
I don’t understand this sentence:
We are downsampling the sketch based on average element sizes: we store the avg element size * sketch.getN(). If this crosses 300 MB we downsample to K/2.
What are you doing exactly?
Now I understand, I misread your bullet above. But I still don't understand the above sentence. What are you doing when you "downsample the sketch"?
k
Downsample this sketch, dropping about half of the keys that are currently retained
All records are pushed to the sketch. Once the sketch reaches 300 MB we downsample it. After all the records are read, and a series of downsampling cycles if required, we get evenly spaced quantiles by getQuantiles(sketch.getN() / target_row_batch_size()). These data boundaries are then used to partition the data so that we get even cuts.
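To make the split-point step concrete, here is a hedged sketch of that last call: evenly spaced quantile values pulled from the sketch, which then serve as partition boundaries. The helper name computeBoundaries and the parameter targetRowsPerSegment are illustrative, not Druid identifiers.

```java
import org.apache.datasketches.quantiles.ItemsSketch;

public final class SplitPoints {
  /**
   * Illustrative helper mirroring getQuantiles(sketch.getN() / target_row_batch_size()).
   * getQuantiles(m) returns m values at evenly spaced normalized ranks, typically
   * including the min and max items seen, so the interior values act as split points.
   */
  static String[] computeBoundaries(ItemsSketch<String> sketch, long targetRowsPerSegment) {
    int numSplits = (int) Math.max(2, sketch.getN() / targetRowsPerSegment);
    return sketch.getQuantiles(numSplits);
  }
}
```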
l
Are you going inside the sketch and removing half of the retained entries?
Or are you using the downSample(newK) method with a smaller K?
k
We call downSample(newK) method with a smaller K
l
Whew! So why do you feel you need such a large K to start with?
If you start with a smaller K, it will speed up your processing 🙂
k
We were initially on K = 4096, and then we saw use cases where 37B records were going into a single sketch
l
OK, and what happened?
k
With a target size of 5M per partition, the error rate of 0.063% came out to be 3M rows, so we had a distribution like:
2M: 2, 4M: 320, 5M: 270, 6M: 6, 7M: 28, 8M: 4023, 9M: 24, 10M: 4, 11M: 4
So we increased the K value so that the error rate would go down.
Since we do not know how many records will come into the system, we went with the biggest K value and then downsampled as we hit the memory limits.
l
But if you downsize 3 times, you will end up with a 4096 sketch, with the same error. I’ll have to think about this. Are the numbers above the resulting partition sizes of 18 partitions?
Do you have large numbers of duplicate rows?
And how successful is this strategy? How much variance across the partition sizes are you getting?
k
The numbers above indicate num_rows_in_partition -> num_partitions. So we had around 4.6k partitions. Our accuracy for the smaller datasets has improved.
Regarding duplicate rows, we cannot control that. There may or may not be duplicates
l
Let me understand your numbers: So at the end of processing one huge number of rows, your numbers represent the distribution of the resulting ~4.6K partitions?
k
Yes
l
So the range of partition sizes was from 2M rows to 11M rows
This is very very interesting.
Hmm, that is not very good. Let me think about this. Could you send me a sample of what a single row might look like that is fed to the sketch? Just one or two.
k
I will need to dig that up. Will send it to you in a few hours?
l
no problem. Curious, where are you located? India? 🙂
(sorry for the break, my wife called me for dinner 🙂 )
Your little distribution implies 62M rows and 4.68K partitions (like you said).
k
Yes, located in India. We were able to repro the issue by pumping longs (1B) into the items sketch.
l
I have a hunch that the variance you are seeing is not due to sketch error, but due to a likely power-law distribution of duplicates. But I could be wrong. 🙂
Hmmmm. How did you generate the 1B longs?
k
We generated the longs serially in ascending order so that when we get the quantiles we know the segment sizes by just looking at the value.
l
You mean 1,2,3,…., 1B?
k
Yup
l
and you got a similar distribution with a lot of variance?
How many splits did you request?
k
Yes, the variance we got (rows_in_partition -> num_partitions):
1M: 1, 4M: 261, 5M: 268, 7M: 3, 8M: 4091, 11M: 1
It was similar
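A hedged reconstruction of that repro, for anyone who wants to try it: feed the serial longs 1..N into an ItemsSketch<Long> (the type mentioned later in the thread), ask for evenly spaced quantiles, and read the partition sizes directly off the boundary values, since value equals rank for this input. K, N, and the target size are taken from the thread; the factory signature is version-dependent and the class name SerialLongsRepro is just illustrative.

```java
import java.util.Comparator;
import org.apache.datasketches.quantiles.ItemsSketch;

// Illustrative repro sketch: with serial longs 1..N, a boundary's value is its true rank,
// so adjacent boundary differences are the actual partition sizes.
public class SerialLongsRepro {
  public static void main(String[] args) {
    final long n = 1_000_000_000L;             // 1B serial longs, as in the thread
    final long targetRowsPerSegment = 5_000_000L;
    final int k = 4096;

    ItemsSketch<Long> sketch = ItemsSketch.getInstance(k, Comparator.<Long>naturalOrder());
    for (long i = 1; i <= n; i++) {
      sketch.update(i);
    }

    // +1 so the returned boundaries bracket each partition (first and last are min and max).
    int numBoundaries = (int) (n / targetRowsPerSegment) + 1;
    Long[] boundaries = sketch.getQuantiles(numBoundaries);

    for (int i = 1; i < boundaries.length; i++) {
      System.out.println("partition " + i + ": " + (boundaries[i] - boundaries[i - 1]) + " rows");
    }
  }
}
```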
l
Hmmm, I’m beginning to think we may have a bug in the item sketch.
A few more questions, if you don’t mind. I’d like to reproduce your experiment.
Did you, by chance try feeding the longs into the DoublesSketch?
Did you configure the ItemSketch with a generic <Long>?
Also, the little distribution you show above must be a sampling of the output, because the total rows is only 36M.
I mean sampling of the resulting partitions.
k
Won't the total rows be 33B?
1M*1+4M*261+7M*3+8M*4091+11M*1
IIRC we did not use the DoublesSketch
l
Oh right! Stupid me, I was misreading your data! 🤯
OK, now I get 35B
k
It was most likely ItemsSketch<Long>
l
But if your input was just 1B rows, what is going on?
Or was this real data?
k
Oops, I misread the notes. I do not have the variance for the test data
This was most likely the real data.
l
OK, I think I have enough information to reproduce your experiment. I’m also going to try the DoubleSketch to compare. And your experiment also used a lgK of 15?
k
Nope we did not experiment with that
l
Confirming: you set K = 32,768
It is getting late for me, I probably won’t get much time to spend on this until Monday.
k
We set K=4096 for the above experimentation
l
OK, I’ll try K=4096 as well.
k
Thank you!!
l
No problem, I hope you haven’t been struggling with this issue for very long. Anytime you see something strange like this you should put an issue on our datasketches-java repo. That way our whole team can see it. 🙂
Cheers for now,
🙌 2
@Gian Merlino @Vadim @Karan Kumar @Adarsh Sanjeev Here is my initial analysis of what is causing the high variance in your partition sizes. The short answer is that you are inadvertently requesting too many split points given the size of the maximum input stream and the size of the quantile sketch. In other words, the implied resolution you are requesting is very close to, or smaller than, the error limit of the sketch, which causes a lot of noise and variation in the resulting partition sizes.
To illustrate this I have run some simulations of what you have described to me about your use case, and it comes down to these key parameters:
• Input N can be as large as 37B.
• Classic quantiles sketch => ItemsSketch<String> with K = 2^15.
• tgtPartSize: target partition size of 3M items.
• Number of split points calculated by N / tgtPartSize.
I first decided to model this using the classic quantiles DoublesSketch, as handling strings is much, much slower. I felt that if I can reproduce the problem with the DoublesSketch, then all I have to do is some spot measurements with the ItemsSketch to eliminate anything peculiar to the ItemsSketch. (I haven't done this last step, because what I have so far is quite convincing, I think, but I will.)
My model has these parameters:
• lgK ranges from 10 to 15
• inputN ranges from 1B to 39B
• tgtPartSize = 2.5M items (which is a little worse than 3M)
Varying these parameters produces these metrics:
• NumPart: number of partitions = SplitPoints + 1
• ImpRes: implied resolution = 1 / NumPart
• hMean: average resulting partition size
• hMax: maximum resulting partition size
• hMin: minimum resulting partition size
• hRange: hMax - hMin
• hRange%: hRange / hMean * 100
• Variance: the computed variance of all the partition sizes
• StdDev: the computed standard deviation of all the partition sizes
• StdDev/Mean%
• SkErr: the error of the sketch determined by K
• Retained items: gives a sense of the size of the sketch (in items, not bytes)
(to be continued)
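The core of this analysis can be captured in a few lines of arithmetic: compare the implied resolution (1 / number of partitions) against the sketch's rank error. The epsilon below is the ~64 ppm figure quoted earlier for K = 2^15; the factor-of-two margin is just an illustrative threshold, not anything from the library.

```java
// Illustrative check: when the implied resolution approaches or drops below the
// sketch's rank error, the requested split points are finer than the sketch can resolve.
public class ResolutionCheck {
  public static void main(String[] args) {
    long[] inputNs = {1_000_000_000L, 10_000_000_000L, 37_000_000_000L};
    long tgtPartSize = 2_500_000L;
    double sketchRankError = 6.4e-5;   // ~64 ppm at K = 2^15, as quoted in the thread

    for (long n : inputNs) {
      long numPartitions = n / tgtPartSize;
      double impliedResolution = 1.0 / numPartitions;
      System.out.printf("N=%d  partitions=%d  impliedRes=%.2e  sketchErr=%.2e  noisy=%b%n",
          n, numPartitions, impliedResolution, sketchRankError,
          impliedResolution < 2 * sketchRankError);
    }
  }
}
```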
g
yeah I think that makes sense: intuitively, we are asking for a lot of cut-points, meaning that we're very sensitive to errors. if we have 10 B input items, we're going to want about 2,000 cut-points, which means we want to get the 0.0005 ranked item, the 0.001 ranked item, the 0.0015 ranked item, and so on. so rank error has to be even smaller than that
is that matching what you are finding as well?
l
(Continuing) Let’s look at the following chart:
This first chart confirms that as we configure the sketch with larger K, not only does the native error of the sketch go down (green), but the range of the variation in partition sizes (hRange%) and the standard deviation of that variation (StdDev/Mean%) go down as well. These last two metrics are normalized by the mean partition size so we can get a sense of the impact. As you can see from the title of the chart, N = 1B, and with a target partition size of 2.5M this results in only 400 partitions; thus 1/400 => 2.5e-3 implied resolution (black line). The dashed curves are plotted on the right vertical axis. The key metrics of hRange% and StdDev/Mean% certainly improve with larger K.
(Continuing) Now let’s look at the next couple of plots:
The chart on the left is much more revealing. Now I'm plotting against N, keeping the other parameters constant. Here N varies from 1e9 items to 39e9 items in steps of 2e9. LgK is constant at 15, TgtPartSize is constant at 2.5e6, and since the sketch is a constant size, SkErr is constant at 8e-5. The black curve now decreases dramatically with each increase in N, and at the largest N it actually crosses the green line and becomes less than the sketch error. This is bad news, and you can see the results in the erratic plots of hRange% and StdDev/Mean%. At N = 35B the variation in partition size is about 52% of the average size of the partitions! This explains the wild variations you are seeing. The StdDev/Mean% is smaller because, if you examine the Gaussian, +1 StdDev covers only about 34% of the area under the curve. The second chart just confirms the underlying cause: if you always keep the target size of the resulting partitions the same (here 2.5M items), then as N increases it demands more and more splits from the sketch, to the point where it demands more resolution than the sketch is designed to handle and you get very noisy results.
🙌 1
(Continuing) We have had quite a bit of internal discussion in our team about this, and here are some of our thoughts.
• Frankly, this use case has come as a bit of a surprise. When we designed these sketches, we anticipated large N, and that's not the problem. What we didn't anticipate was large N combined with a large number of split points. This specific use case requires values of K much larger than what we originally anticipated anyone ever needing.
• We feel this needs a new sketch design specifically targeted at very large N and a very large number of split points, where K can scale much larger than 2^15. This is why we would like to huddle with you folks to understand more about your environment and your constraints.
I hope you find this analysis interesting. Cheers,
@Gian Merlino I think this analysis answers your question 🙂
g
i find this very interesting! especially the tantalizing prospect of a sketch more closely suited to our needs 🙂
k
Super exciting stuff!!
l
@Karan Kumar Question, to obtain the final partitions, what sketch functions are you using: getQuantiles(…), getPMF(…), getCDF(…)?
And then what do you do to create the partitions?
k
Question, to obtain the final partitions, what sketch functions are you using: getQuantiles(…), getPMF(…), getCDF(…)
We are using sketch.getQuantiles(). We get the various ranks and partition the data per worker according to them.
l
The T[] sketch.getQuantiles(int evenlySpaced) call returns an array of quantile values that define an array of buckets, where each bucket is assigned a value from the quantile array. Then you rescan the input stream and, for each item, you place it in a bucket where the value of the item is <= that bucket's value and > the value of the bucket just below it. Is this correct?
This will produce a quasi-sorted partition array, where the overall array is sorted but the values in each partition are not sorted, just somewhat "close" together. So now my question is: why is this quasi-sorted property important?
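To spell out that rescan step: given the boundary array from getQuantiles, each item lands in the bucket whose boundary value is the smallest one at or above the item. A minimal illustrative assignment using binary search follows; the class name BucketAssigner and the clamping of the first and last buckets are assumptions, not anything from Druid or DataSketches.

```java
import java.util.Arrays;

// Illustrative only: boundaries come from getQuantiles and include the min and max values.
// Bucket i (0-based) holds items x with boundaries[i] < x <= boundaries[i + 1],
// matching the "item <= a bucket value and > the bucket just below it" rule above.
public final class BucketAssigner {
  static int bucketFor(String item, String[] boundaries) {
    int idx = Arrays.binarySearch(boundaries, item);
    int upper = (idx >= 0) ? idx : -idx - 1;   // index of the smallest boundary >= item
    // Clamp so the minimum item and anything above the last boundary still land in a bucket.
    return Math.max(0, Math.min(upper - 1, boundaries.length - 2));
  }
}
```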
a
Yes, that is correct. Druid uses primary partitioning by timestamp (the first value of each array is normally that, if that is relevant), so the property we need is to ensure that any change in timestamp value while getting quantiles would be a split point (I think this is something quantile sketches don't really handle well). If we are, say, partitioning by month, we use a different bucket for each month and use a different sketch to guarantee this property.
l
the property we need is to ensure that any change in timestamp value while getting quantiles would be a split point (I think this is something quantile sketches don’t really handle well).
This is the first I have heard of this issue. Has this been reported to us as an issue on our website? If we don’t know about an issue, we can’t investigate it 🙂 . I can guess that it may be because quantile sketches are a stochastic sampling process, and in that process some date transitions get skipped over resulting in a transition that effectively occurs at some other nearby point but not exactly where the transition occurs in the raw data (but I can’t be sure). If you know the first part of your input string is a date field, why don’t you just do an initial scan over your data and spray the items into date buckets? You don’t need a sketch for that.
If we are say, partitioning by month, we use a different bucket for each month and use a different sketch to guarantee this property.
What different sketch?
g
If you know the first part of your input string is a date field, why don’t you just do an initial scan over your data and spray the items into date buckets? You don’t need a sketch for that.
This is basically what we're doing today. The issue is that we have a certain memory budget to use across all buckets, and because we're managing one sketch per bucket, we need to manage that budget as we add things to the various buckets. Whenever we're about to exceed our memory budget, we pick a bucket and call downSample on the sketch for that bucket. The way we pick the bucket… "works" but is probably not ideal.
This is the first I have heard of this issue. Has this been reported to us as an issue on our website?
No, since we figured it was out of scope for any of the sketches you have
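A rough illustration of the bucketed scheme described above: one quantile sketch per time bucket sharing a single memory budget, with some bucket chosen for downSample whenever the budget would be exceeded. Everything here other than the downSample call itself (the byte estimate, the AVG_KEY_BYTES constant, and the "largest bucket" policy) is an assumption for illustration, not Druid's actual logic.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

import org.apache.datasketches.quantiles.ItemsSketch;

// Illustrative only: one sketch per time bucket, all sharing one memory budget.
public class BucketedKeyCollector {
  private static final long TOTAL_BUDGET_BYTES = 300L * 1024 * 1024;
  private static final int INITIAL_K = 1 << 15;
  private static final long AVG_KEY_BYTES = 64;   // hypothetical average key size

  private final Map<String, ItemsSketch<String>> sketches = new HashMap<>();

  public void add(String timeBucket, String key) {
    sketches
        .computeIfAbsent(timeBucket,
            b -> ItemsSketch.getInstance(INITIAL_K, Comparator.<String>naturalOrder()))
        .update(key);
    if (estimatedTotalBytes() > TOTAL_BUDGET_BYTES) {
      downsampleLargestBucket();
    }
  }

  private long estimatedTotalBytes() {
    long retained = 0;
    for (ItemsSketch<String> s : sketches.values()) {
      retained += s.getRetainedItems();   // accessor name may differ by library version
    }
    return retained * AVG_KEY_BYTES;
  }

  private void downsampleLargestBucket() {
    // One possible policy: pick the bucket retaining the most items and halve its K.
    String largest = null;
    long maxRetained = -1;
    for (Map.Entry<String, ItemsSketch<String>> e : sketches.entrySet()) {
      long retained = e.getValue().getRetainedItems();
      if (retained > maxRetained) {
        maxRetained = retained;
        largest = e.getKey();
      }
    }
    if (largest != null) {
      ItemsSketch<String> s = sketches.get(largest);
      sketches.put(largest, s.downSample(Math.max(2, s.getK() / 2)));
    }
  }
}
```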
v
Heads up everyone: I made a new channel called #C04TB8M30DV for everything DataSketches-related. Please join if interested.