# data-sketches
  • Lee Rhodes (03/28/2023, 11:25 PM)
    It all depends on what you are trying to do, and on whether there are other downstream uses of your raw data or the sketches. There are applications where the raw data is kept and the sketches are thrown away (after computing whatever metric was needed), and there are applications where the sketches are kept (sometimes for years) and the raw data is tossed (or archived on high-density media). There is no one answer; it is something each use-case must figure out. However, there is one very important property you should be concerned about: the backward compatibility of whatever library you are using, i.e. its ability to read and interpret stored sketch images from previous versions, which can go back years. This is something we have paid close attention to, as we have users that still need to read or merge sketches from versions that are 10 years old. Does t-digest have that property? I have no idea.
  • Alexander Saydakov (05/19/2023, 9:14 PM)
    In the process of making changes in Druid to work with the fresh datasketches-java 4.0.0, I noticed a suspicious thing: there is functionality to update the sketch only if the incoming value would set a new min or max.
  • Alexander Saydakov (05/19/2023, 9:14 PM)
    https://github.com/apache/druid/blob/269137c6828c5a2a4f7308c33d239c2c019c87b7/inde[…]exing/common/task/batch/parallel/distribution/StringSketch.java
  • Alexander Saydakov (05/19/2023, 9:15 PM)
    I believe that the sketch must observe all values to estimate the distribution properly
  • Alexander Saydakov (05/19/2023, 10:51 PM)
    could someone explain what is going on?
  • Gian Merlino (05/20/2023, 12:45 AM)
    @Alexander Saydakov The logic seems to be that, in this particular area of the code, we want to add distinct values from a stream to the quantiles sketch. Whenever we get a new value, we first check it against a Bloom filter (`DedupInputRowFilter`) for approximate deduplication; if it passes the filter, we add it to the `ItemsSketch`. Later on, we use the `ItemsSketch` to get quantiles, min, and max. Because the Bloom filter can give us false positives, it's possible we don't add every distinct item to the `ItemsSketch`. In particular, we might miss the true min and true max, which is problematic since we need lower/upper bounds on the data. So the code you're looking at "fixes up" the sketch by adding the true min and max, in case they were filtered out by the Bloom filter.
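    The pipeline described above can be sketched in plain Java. This is a hedged illustration only: `DedupPipeline`, the `HashSet`, and the sorted list are stand-ins invented here for Druid's `DedupInputRowFilter` and the `ItemsSketch`, chosen so the min/max fix-up step is visible.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of the dedup-then-sketch pipeline. A HashSet stands in for
// Druid's DedupInputRowFilter (a Bloom filter) and a sorted list stands
// in for the ItemsSketch; both are hypothetical simplifications, not the
// actual Druid classes.
class DedupPipeline {
    private final Set<String> seen = new HashSet<>();      // stand-in for the Bloom filter
    private final List<String> items = new ArrayList<>();  // stand-in for the ItemsSketch
    private String min = null;
    private String max = null;

    void add(String key) {
        // Track the true min/max independently of the filter, so the sketch
        // can be "fixed up" even if the filter wrongly drops an extreme.
        if (min == null || key.compareTo(min) < 0) min = key;
        if (max == null || key.compareTo(max) > 0) max = key;
        // A real Bloom filter may report a false positive here, silently
        // dropping a genuinely new key; a HashSet never does.
        if (seen.add(key)) {
            items.add(key);
        }
    }

    // Mirror of the fix-up step: ensure the true min and max are present
    // before quantiles are read off the distinct items.
    List<String> finish() {
        if (min != null && !items.contains(min)) items.add(min);
        if (max != null && !items.contains(max)) items.add(max);
        Collections.sort(items);
        return items;
    }
}
```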
  • Alexander Saydakov (05/20/2023, 12:47 AM)
    why feed distinct items to quantiles sketch?
  • Alexander Saydakov (05/20/2023, 12:47 AM)
    sketch has to see all items to estimate distribution
  • Gian Merlino (05/20/2023, 12:50 AM)
    The motivation is very similar to the discussion upthread, starting around https://apachedruidworkspace.slack.com/archives/C04TB8M30DV/p1678747770123229 We have a stream of data that we are going to aggregate by some grouping key. We also want to partition the aggregated data by ranges of the grouping key, into some number of equally sized partitions. We figure out the partition boundaries by feeding the initial, unaggregated stream of data into this Bloom filter + `ItemsSketch` pair, then using quantiles from the `ItemsSketch` as the partition boundaries. The `ItemsSketch` has to see distinct items only, because we want to estimate the distribution of the aggregated data, not of the original stream.
  • Gian Merlino (05/20/2023, 12:51 AM)
    The items for the `ItemsSketch` in this case are the distinct grouping keys for each row of the original stream.
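    For intuition, here is the exact (non-sketch) counterpart of that computation: given the full set of distinct grouping keys, the p-th partition boundary is simply the key at an evenly spaced rank. The class and method names are illustrative, not Druid's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Exact version of what the ItemsSketch approximates in this pipeline:
// pick (numPartitions - 1) evenly spaced ranks among the distinct keys
// as range-partition boundaries. Illustrative names, not Druid's.
class ExactPartitioner {
    static List<Long> boundaries(TreeSet<Long> distinctKeys, int numPartitions) {
        Long[] sorted = distinctKeys.toArray(new Long[0]);
        List<Long> out = new ArrayList<>();
        for (int p = 1; p < numPartitions; p++) {
            // rank of the p-th boundary among the n distinct keys
            int idx = (int) ((long) sorted.length * p / numPartitions);
            out.add(sorted[idx]);
        }
        return out;
    }
}
```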
  • Alexander Saydakov (05/20/2023, 12:55 AM)
    ok, I am not sure this is the best way, but at least this is intentional. thanks
  • Gian Merlino (05/20/2023, 1:08 AM)
    I'm not sure either 🙂 but yes it's intentional
  • Gian Merlino (05/20/2023, 1:08 AM)
    np!
  • Alexander Saydakov (09/30/2023, 1:16 AM)
    I would like to discuss how to add Theta sketch compression to Druid. The compression was released in May 2023 as part of Apache datasketches-java 4.0.0. The DataSketches team was reluctant to enable it by default (as part of the existing serialization); a separate method has to be called by the user of the library. The Druid Theta sketch aggregator does not take advantage of the compression yet. I can see several approaches:
    1. Change the Druid code to always use compression.
    2. Use compression by default, with a parameter to disable it.
    3. Add a parameter to enable compression if desired, with no compression by default.
    I am not sure how to add a parameter to an aggregator that can be easily set by the user. Can this be done on a field or table level?
  • Alexander Saydakov (10/02/2023, 6:31 PM)
    The compression certainly has some overhead, but the overall effect on a big system is not that straightforward and can even speed things up, depending on many factors such as the distribution of sketches (the number of small and large sketches). Essentially the tradeoff is between I/O and CPU. Many big systems are I/O bound, so it makes sense to spend more CPU to reduce I/O. Besides that, by default Theta sketches are stored as an ordered array, and the merge (union) operation can take advantage of that: there is no need to iterate over the whole input sketch; iteration stops as soon as the next value is greater than the theta of the union. For a compressed sketch that means reading less data, not decompressing the whole thing.
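    The early-stop property described above can be illustrated on a plain sorted array of hash values. This is a simplified model invented here, not the datasketches-java union code: because an ordered sketch stores its entries ascending, a union can stop reading at the first entry that reaches its theta.

```java
import java.util.TreeSet;

// Simplified model of the early-stop walk over an ordered Theta sketch.
// Real sketches store 64-bit hash values; plain longs stand in here.
class EarlyStopUnion {
    private final TreeSet<Long> retained = new TreeSet<>();
    private final long theta; // entries >= theta are outside the sample

    EarlyStopUnion(long theta) { this.theta = theta; }

    // entries must be ascending, as in an ordered compact sketch
    int update(long[] sortedEntries) {
        int read = 0;
        for (long h : sortedEntries) {
            if (h >= theta) break; // everything after this is also >= theta
            retained.add(h);
            read++;
        }
        return read; // number of entries actually read, not the sketch size
    }

    int retainedCount() { return retained.size(); }
}
```

For a compressed ordered sketch the same logic means only a prefix of the compressed data ever has to be decompressed.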
  • Karan Kumar (10/23/2023, 6:34 AM)
    I recently found an issue with https://github.com/apache/druid/pull/14334: DataSketches version 4.x spits out weird splits for an `ItemsSketch<Long>` when `sketch.getPartitionBoundaries(numSplits)` is called. When an items sketch is fed monotonically increasing numbers from 0 to 10 billion, the distribution for 40 splits is:
    0
    250025928
    500108450
    750197955
    1000017365
    1250101575
    1500191183
    1750009306
    2000098020
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    9999999999
    A similar items sketch in 3.2.0 generates the correct boundaries when `sketch.getQuantiles(40)` is used:
    0
    256423887
    512799757
    769177971
    1025548880
    1281929124
    1538304632
    1794673747
    2051315187
    2307669277
    2564042244
    2820409807
    3076787762
    3333187311
    3589563105
    3845931306
    4102575213
    4359042434
    4615417711
    4871794533
    5128171055
    5384552162
    5640921934
    5897299914
    6153942953
    6410313978
    6666657948
    6923038164
    7179411194
    7435787409
    7692168813
    7948547259
    8205184939
    8461560834
    8717940855
    8974345598
    9230745974
    9487159248
    9743589767
    9999999999
    Pseudo code used for the experiments:

```java
ItemsSketch<Long> sketch = ItemsSketch.getInstance(32768, Comparator.naturalOrder());

for (long i = 0; i < 10_000_000_000L; i++) {
  if (i % 100_000_000 == 0) {
    System.out.println("reached " + i);
  }
  sketch.update(i);
}

for (Long val : sketch.getQuantiles(40)) {
  System.out.println(val);
}
```
    cc @Alexander Saydakov @Lee Rhodes Starting 🧵 for discussion.
  • Lee Rhodes (10/23/2023, 10:32 PM)
    @Karan Kumar,
  • Lee Rhodes (10/23/2023, 10:33 PM)
    When I wrote a similar (but more revealing) test against 4.2.0, I got an excellent result:
    Partition Analysis:
    Boundary           Delta       Fraction%
    1
    199,969,717     199,969,716           2.000
    399,985,597     200,015,880           2.000
    599,999,650     200,014,053           2.000
    800,017,439     200,017,789           2.000
    1,000,035,248     200,017,809           2.000
    1,200,038,727     200,003,479           2.000
    1,400,053,887     200,015,160           2.000
    1,600,070,015     200,016,128           2.000
    1,800,082,165     200,012,150           2.000
    2,000,100,377     200,018,212           2.000
    2,200,164,013     200,063,636           2.001
    2,400,182,047     200,018,034           2.000
    2,600,196,540     200,014,493           2.000
    2,800,210,004     200,013,464           2.000
    3,000,226,546     200,016,542           2.000
    3,200,243,833     200,017,287           2.000
    3,399,999,385     199,755,552           1.998
    3,600,010,298     200,010,913           2.000
    3,800,030,576     200,020,278           2.000
    4,000,045,010     200,014,434           2.000
    4,200,055,030     200,010,020           2.000
    4,400,081,302     200,026,272           2.000
    4,600,094,724     200,013,422           2.000
    4,800,114,270     200,019,546           2.000
    5,000,119,607     200,005,337           2.000
    5,200,135,603     200,015,996           2.000
    5,400,158,879     200,023,276           2.000
    5,600,175,975     200,017,096           2.000
    5,800,197,847     200,021,872           2.000
    6,000,212,728     200,014,881           2.000
    6,200,227,844     200,015,116           2.000
    6,400,244,241     200,016,397           2.000
    6,600,221,588     199,977,347           2.000
    6,799,976,613     199,755,025           1.998
    6,999,997,424     200,020,811           2.000
    7,200,013,051     200,015,627           2.000
    7,400,028,055     200,015,004           2.000
    7,600,037,476     200,009,421           2.000
    7,800,046,117     200,008,641           2.000
    8,000,064,800     200,018,683           2.000
    8,200,088,478     200,023,678           2.000
    8,400,105,281     200,016,803           2.000
    8,600,000,395     199,895,114           1.999
    8,799,984,976     199,984,581           2.000
    8,999,996,017     200,011,041           2.000
    9,199,985,554     199,989,537           2.000
    9,400,005,507     200,019,953           2.000
    9,599,987,941     199,982,434           2.000
    9,800,002,023     200,014,082           2.000
    10,000,000,000     199,997,977           2.000
    This used the following code:
```java
@Test
public void checkItemsSketch() {
  ItemsSketch<Long> sk = ItemsSketch.getInstance(Long.class, 1 << 15, Comparator.naturalOrder());
  long n = 10_000_000_000L;
  long check = n / 10;
  for (long i = 1; i <= n; i++) {
    sk.update(i);
    if (i % check == 0) { println("reached " + i); }
  }
  println("Partition Analysis:");
  printf("%15s %15s %15s\n", "Boundary", "Delta", "Fraction%");
  GenericPartitionBoundaries<Long> gpb = sk.getPartitionBoundaries(50, INCLUSIVE);
  Long[] boundaries = gpb.boundaries;
  int numB = boundaries.length;
  long last = -1L;
  for (int i = 0; i < numB; i++) {
    if (i == 0) { last = boundaries[0]; printf("%,15d\n", boundaries[0]); continue; }
    long cur = boundaries[i];
    long delta = cur - last;
    last = cur;
    printf("%,15d %,15d %15.3f\n", cur, delta, (double) delta / n * 100.0);
  }
}
```
    Omitting the “print*” functions. I have not tried 4.0 or 4.1 yet.
  • Tarun Dhandharia (03/22/2024, 6:51 AM)
    Hi team, we are using the Theta sketch to get unique counts. However, we observe that if the percentage of distinct values is very high (>90%), the unique count estimate returned by the sketch can even exceed the total count. Is there any configuration that can ensure the unique count will always be less than the total count? Is there any other sketch that has such an option?
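    One workaround (a post-processing step suggested here, not a sketch configuration documented in the library) is to clamp the estimate to the exact row count, which is always an upper bound on the number of distinct values:

```java
// Hypothetical post-processing: a Theta sketch estimate carries relative
// error in both directions, so at a very high distinct ratio it can land
// above the exact row count. Clamping restores distinct <= total.
class ClampedEstimate {
    static double clamp(double sketchEstimate, long totalRowCount) {
        return Math.min(sketchEstimate, (double) totalRowCount);
    }
}
```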
  • Daniel Augusto (05/17/2024, 2:46 PM)
    I'm trying to generate two distributions using the quantiles data sketch and then compare them. Say, get the p25, p50 and p75 values from the 'reference' distribution and check what percentile they would fall at in the 'actual' distribution. Something like this:
    WITH reference as (
      SELECT dimension, DS_QUANTILES_SKETCH("feature") as ref_quantiles, 
      FROM reference_table
      GROUP BY 1
    ),
    actual as (
      SELECT dimension, DS_QUANTILES_SKETCH("feature") as act_quantiles
      FROM events_table
      GROUP BY 1
    )
    SELECT feature, 
        DS_RANK(act_quantiles, DS_GET_QUANTILE(ref_quantiles, 0.25)),
        DS_RANK(act_quantiles, DS_GET_QUANTILE(ref_quantiles, 0.50)),
        DS_RANK(act_quantiles, DS_GET_QUANTILE(ref_quantiles, 0.75))
    FROM actual JOIN reference ON actual.feature = reference.feature
    But I only got errors like the one below, with nothing else in the broker logs (DEBUG level is on):
    2024-05-17T13:45:18,964 WARN [sql[d2ac2fe3-4ade-4c2e-8c5e-6ec5fc3c12e8]] org.apache.druid.sql.http.SqlResource - Exception while processing sqlQueryId[d2ac2fe3-4ade-4c2e-8c5e-6ec5fc3c12e8] (org.apache.druid.error.DruidException: Query could not be planned. A possible reason is [SQL requires a join with 'SEARCH' condition that is not supported.])
  • Alexander Saydakov (05/17/2024, 6:57 PM)
    I don't think I can help with the SQL syntax, but I would like to comment on comparing distributions. In the core datasketches library there is a Kolmogorov-Smirnov test, but it is not available in Druid at the moment.
  • Alexander Saydakov (05/17/2024, 7:22 PM)
    I would think it should not be hard to add a post-aggregator implementing the KS test, similar to the Student's t-test for Tuple sketches. It should take two sketches and a p-value and return a boolean (true = different distributions). It could compute the raw delta as well, but then you would need to know what to do with it.
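    For reference, the D statistic such a post-aggregator would compare against its p-value threshold is the maximum gap between the two empirical CDFs. Here is a self-contained illustration on raw samples; the library's KS test operates on quantile sketches rather than arrays, so this is the idea only, not its implementation.

```java
import java.util.Arrays;

// Two-sample Kolmogorov-Smirnov D statistic on raw samples: the largest
// vertical distance between the two empirical CDFs. Illustrative only;
// the datasketches KS test works on quantile sketches instead.
class KsStatistic {
    static double dStatistic(double[] a, double[] b) {
        double[] x = a.clone(), y = b.clone();
        Arrays.sort(x);
        Arrays.sort(y);
        int i = 0, j = 0;
        double d = 0.0;
        while (i < x.length && j < y.length) {
            double v = Math.min(x[i], y[j]);
            // advance past all samples equal to v in both arrays (handles ties)
            while (i < x.length && x[i] == v) i++;
            while (j < y.length && y[j] == v) j++;
            double gap = Math.abs((double) i / x.length - (double) j / y.length);
            d = Math.max(d, gap);
        }
        return d; // 0 = identical empirical CDFs, 1 = fully separated
    }
}
```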
  • victor regalado (04/07/2025, 9:10 PM)
    Are there plans to support KLL sketches in Druid SQL?
  • Ben Smithgall (05/01/2025, 4:56 PM)
    Hello -- I had a question about Theta sketches and some behavior that differs between Druid 29 and Druid 31+. More details are in a thread.