# data-sketches
  • Lee Rhodes (03/28/2023, 11:25 PM)
    It all depends on what you are trying to do, and on whether there are other downstream uses of your raw data or the sketches. There are applications where the raw data is kept and the sketches are thrown away (after computing whatever metric was needed), and there are applications where the sketches are kept (sometimes for years) and the raw data is tossed (or archived on high-density media). There is no one answer; it is something each use-case must figure out. However, there is one very important property you should be concerned about: the backward compatibility of whatever library you are using, i.e. its ability to read and interpret stored sketch images from previous versions, which can go back years. This is something we have paid close attention to, as we have users that still need to read or merge sketches from versions that are 10 years old. Does t-digest have that property? I have no idea.
  • Alexander Saydakov (05/19/2023, 9:14 PM)
    In the process of making changes in Druid to work with the fresh datasketches-java 4.0.0, I noticed a suspicious thing: there is functionality to update the sketch only if the incoming value would set a new min or max.
  • Alexander Saydakov (05/19/2023, 9:14 PM)
    https://github.com/apache/druid/blob/269137c6828c5a2a4f7308c33d239c2c019c87b7/inde[…]exing/common/task/batch/parallel/distribution/StringSketch.java
  • Alexander Saydakov (05/19/2023, 9:15 PM)
    I believe that the sketch must observe all values to estimate the distribution properly
  • Alexander Saydakov (05/19/2023, 10:51 PM)
    could someone explain what is going on?
  • Gian Merlino (05/20/2023, 12:45 AM)
    @Alexander Saydakov The logic seems to be that, in this particular area of the code, we want to add distinct values from a stream to the quantiles sketch. Whenever we get a new value, we first check it against a Bloom filter (`DedupInputRowFilter`) for approximate deduplication; if it passes the filter, we add it to the `ItemsSketch`. Later on, we use the `ItemsSketch` to get quantiles, min, and max. Because the Bloom filter can give us false positives, it's possible we don't add every distinct item to the `ItemsSketch`. In particular, we might miss the true min and true max, which is problematic since we need lower/upper bounds on the data. So the code you're looking at "fixes up" the sketch by adding the true min and max, in case they were filtered out by the Bloom filter.
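    The pipeline described above can be sketched in plain Java. This is a hedged illustration only: `DedupPipeline`, the `HashSet`, and the sorted list are stand-ins invented here for Druid's `DedupInputRowFilter` and the `ItemsSketch`, chosen so the min/max fix-up step is visible.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of the dedup-then-sketch pipeline. A HashSet stands in for
// Druid's DedupInputRowFilter (a Bloom filter) and a sorted list stands
// in for the ItemsSketch; both are hypothetical simplifications, not the
// actual Druid classes.
class DedupPipeline {
    private final Set<String> seen = new HashSet<>();      // stand-in for the Bloom filter
    private final List<String> items = new ArrayList<>();  // stand-in for the ItemsSketch
    private String min = null;
    private String max = null;

    void add(String key) {
        // Track the true min/max independently of the filter, so the sketch
        // can be "fixed up" even if the filter wrongly drops an extreme.
        if (min == null || key.compareTo(min) < 0) min = key;
        if (max == null || key.compareTo(max) > 0) max = key;
        // A real Bloom filter may report a false positive here, silently
        // dropping a genuinely new key; a HashSet never does.
        if (seen.add(key)) {
            items.add(key);
        }
    }

    // Mirror of the fix-up step: ensure the true min and max are present
    // before quantiles are read off the distinct items.
    List<String> finish() {
        if (min != null && !items.contains(min)) items.add(min);
        if (max != null && !items.contains(max)) items.add(max);
        Collections.sort(items);
        return items;
    }
}
```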
  • Alexander Saydakov (05/20/2023, 12:47 AM)
    why feed distinct items to quantiles sketch?
  • Alexander Saydakov (05/20/2023, 12:47 AM)
    sketch has to see all items to estimate distribution
  • Gian Merlino (05/20/2023, 12:50 AM)
    The motivation is very similar to the discussion upthread, starting around https://apachedruidworkspace.slack.com/archives/C04TB8M30DV/p1678747770123229 We have a stream of data that we are going to aggregate by some grouping key. We also want to partition the aggregated data by ranges of the grouping key, into some number of equally sized partitions. We figure out the partition boundaries by feeding the initial, unaggregated stream of data into this Bloom filter + `ItemsSketch` pair, then using quantiles from the `ItemsSketch` as the partition boundaries. The `ItemsSketch` has to see distinct items only, because we want to estimate the distribution of the aggregated data, not of the original stream.
  • Gian Merlino (05/20/2023, 12:51 AM)
    The items for the `ItemsSketch` in this case are the distinct grouping keys for each row of the original stream.
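    For intuition, here is the exact (non-sketch) counterpart of that computation: given the full set of distinct grouping keys, the p-th partition boundary is simply the key at an evenly spaced rank. The class and method names are illustrative, not Druid's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Exact version of what the ItemsSketch approximates in this pipeline:
// pick (numPartitions - 1) evenly spaced ranks among the distinct keys
// as range-partition boundaries. Illustrative names, not Druid's.
class ExactPartitioner {
    static List<Long> boundaries(TreeSet<Long> distinctKeys, int numPartitions) {
        Long[] sorted = distinctKeys.toArray(new Long[0]);
        List<Long> out = new ArrayList<>();
        for (int p = 1; p < numPartitions; p++) {
            // rank of the p-th boundary among the n distinct keys
            int idx = (int) ((long) sorted.length * p / numPartitions);
            out.add(sorted[idx]);
        }
        return out;
    }
}
```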
  • Alexander Saydakov (05/20/2023, 12:55 AM)
    ok, I am not sure this is the best way, but at least this is intentional. thanks
  • Gian Merlino (05/20/2023, 1:08 AM)
    I'm not sure either 🙂 but yes it's intentional
  • Gian Merlino (05/20/2023, 1:08 AM)
    np!
  • Alexander Saydakov (09/30/2023, 1:16 AM)
    I would like to discuss how to add Theta sketch compression to Druid. The compression was released in May 2023 as part of Apache datasketches-java 4.0.0. The DataSketches team was reluctant to enable it by default (as part of the existing serialization); a separate method has to be called by the user of the library. The Druid Theta sketch aggregator does not take advantage of the compression yet. I can see several approaches:
    1. Change the Druid code to always use compression.
    2. Use compression by default, with a parameter to disable it.
    3. Add a parameter to enable compression if desired, with no compression by default.
    I am not sure how to add a parameter to an aggregator that can be easily set by the user. Can this be done on a field or table level?
  • Alexander Saydakov (10/02/2023, 6:31 PM)
    The compression certainly has some overhead, but the overall effect on a big system is not that straightforward and can even speed things up, depending on many factors such as the distribution of sketches (the number of small and large sketches). Essentially the tradeoff is between I/O and CPU. Many big systems are I/O bound, so it makes sense to spend more CPU to reduce I/O. Besides that, by default Theta sketches are stored as an ordered array, and the merge (union) operation can take advantage of that: there is no need to iterate over the whole input sketch; iteration stops as soon as the next value is greater than the theta of the union. For a compressed sketch that means reading less data, not decompressing the whole thing.
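    The early-stop property described above can be illustrated on a plain sorted array of hash values. This is a simplified model invented here, not the datasketches-java union code: because an ordered sketch stores its entries ascending, a union can stop reading at the first entry that reaches its theta.

```java
import java.util.TreeSet;

// Simplified model of the early-stop walk over an ordered Theta sketch.
// Real sketches store 64-bit hash values; plain longs stand in here.
class EarlyStopUnion {
    private final TreeSet<Long> retained = new TreeSet<>();
    private final long theta; // entries >= theta are outside the sample

    EarlyStopUnion(long theta) { this.theta = theta; }

    // entries must be ascending, as in an ordered compact sketch
    int update(long[] sortedEntries) {
        int read = 0;
        for (long h : sortedEntries) {
            if (h >= theta) break; // everything after this is also >= theta
            retained.add(h);
            read++;
        }
        return read; // number of entries actually read, not the sketch size
    }

    int retainedCount() { return retained.size(); }
}
```

For a compressed ordered sketch the same logic means only a prefix of the compressed data ever has to be decompressed.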
  • Karan Kumar (10/23/2023, 6:34 AM)
    I recently found an issue with https://github.com/apache/druid/pull/14334: DataSketches version 4.x spits out weird splits for an `ItemsSketch<Long>` when `sketch.getPartitionBoundaries(numSplits)` is called. When an items sketch is fed monotonically increasing numbers from 0 to 10 billion, the distribution for 40 splits is:
    0
    250025928
    500108450
    750197955
    1000017365
    1250101575
    1500191183
    1750009306
    2000098020
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    2147424894
    9999999999
    A similar items sketch in 3.2.0 generates the correct boundaries when `sketch.getQuantiles(40)` is used:
    0
    256423887
    512799757
    769177971
    1025548880
    1281929124
    1538304632
    1794673747
    2051315187
    2307669277
    2564042244
    2820409807
    3076787762
    3333187311
    3589563105
    3845931306
    4102575213
    4359042434
    4615417711
    4871794533
    5128171055
    5384552162
    5640921934
    5897299914
    6153942953
    6410313978
    6666657948
    6923038164
    7179411194
    7435787409
    7692168813
    7948547259
    8205184939
    8461560834
    8717940855
    8974345598
    9230745974
    9487159248
    9743589767
    9999999999
    Pseudo code used for the experiments:

```java
ItemsSketch<Long> sketch = ItemsSketch.getInstance(32768, Comparator.naturalOrder());

for (long i = 0; i < 10_000_000_000L; i++) {
  if (i % 100_000_000 == 0) {
    System.out.println("reached " + i);
  }
  sketch.update(i);
}

for (Long val : sketch.getQuantiles(40)) {
  System.out.println(val);
}
```
    cc @Alexander Saydakov @Lee Rhodes Starting 🧵 for discussion.
  • Lee Rhodes (10/23/2023, 10:32 PM)
    @Karan Kumar,
  • Lee Rhodes (10/23/2023, 10:33 PM)
    When I wrote a similar (but more revealing) test against 4.2.0, I got an excellent result:
    Partition Analysis:
    Boundary           Delta       Fraction%
    1
    199,969,717     199,969,716           2.000
    399,985,597     200,015,880           2.000
    599,999,650     200,014,053           2.000
    800,017,439     200,017,789           2.000
    1,000,035,248     200,017,809           2.000
    1,200,038,727     200,003,479           2.000
    1,400,053,887     200,015,160           2.000
    1,600,070,015     200,016,128           2.000
    1,800,082,165     200,012,150           2.000
    2,000,100,377     200,018,212           2.000
    2,200,164,013     200,063,636           2.001
    2,400,182,047     200,018,034           2.000
    2,600,196,540     200,014,493           2.000
    2,800,210,004     200,013,464           2.000
    3,000,226,546     200,016,542           2.000
    3,200,243,833     200,017,287           2.000
    3,399,999,385     199,755,552           1.998
    3,600,010,298     200,010,913           2.000
    3,800,030,576     200,020,278           2.000
    4,000,045,010     200,014,434           2.000
    4,200,055,030     200,010,020           2.000
    4,400,081,302     200,026,272           2.000
    4,600,094,724     200,013,422           2.000
    4,800,114,270     200,019,546           2.000
    5,000,119,607     200,005,337           2.000
    5,200,135,603     200,015,996           2.000
    5,400,158,879     200,023,276           2.000
    5,600,175,975     200,017,096           2.000
    5,800,197,847     200,021,872           2.000
    6,000,212,728     200,014,881           2.000
    6,200,227,844     200,015,116           2.000
    6,400,244,241     200,016,397           2.000
    6,600,221,588     199,977,347           2.000
    6,799,976,613     199,755,025           1.998
    6,999,997,424     200,020,811           2.000
    7,200,013,051     200,015,627           2.000
    7,400,028,055     200,015,004           2.000
    7,600,037,476     200,009,421           2.000
    7,800,046,117     200,008,641           2.000
    8,000,064,800     200,018,683           2.000
    8,200,088,478     200,023,678           2.000
    8,400,105,281     200,016,803           2.000
    8,600,000,395     199,895,114           1.999
    8,799,984,976     199,984,581           2.000
    8,999,996,017     200,011,041           2.000
    9,199,985,554     199,989,537           2.000
    9,400,005,507     200,019,953           2.000
    9,599,987,941     199,982,434           2.000
    9,800,002,023     200,014,082           2.000
    10,000,000,000     199,997,977           2.000
    This used the following code:
```java
@Test
public void checkItemsSketch() {
  ItemsSketch<Long> sk = ItemsSketch.getInstance(Long.class, 1 << 15, Comparator.naturalOrder());
  long n = 10_000_000_000L;
  long check = n / 10;
  for (long i = 1; i <= n; i++) {
    sk.update(i);
    if (i % check == 0) { println("reached " + i); }
  }
  println("Partition Analysis:");
  printf("%15s %15s %15s\n", "Boundary", "Delta", "Fraction%");
  GenericPartitionBoundaries<Long> gpb = sk.getPartitionBoundaries(50, INCLUSIVE);
  Long[] boundaries = gpb.boundaries;
  int numB = boundaries.length;
  long last = -1L;
  for (int i = 0; i < numB; i++) {
    if (i == 0) { last = boundaries[0]; printf("%,15d\n", boundaries[0]); continue; }
    long cur = boundaries[i];
    long delta = cur - last;
    last = cur;
    printf("%,15d %,15d %15.3f\n", cur, delta, (double) delta / n * 100.0);
  }
}
```
    Omitting the “print*” functions. I have not tried 4.0 or 4.1 yet.
  • Tarun Dhandharia (03/22/2024, 6:51 AM)
    Hi team, we are using the Theta sketch to get unique counts. However, we observe that if the percentage of distinct values is very high (>90%), the unique count estimate returned by the sketch can even exceed the total count. Is there any configuration that can ensure the unique count will always be less than the total count? Is there any other sketch that has such an option?
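    One workaround (a post-processing step suggested here, not a sketch configuration documented in the library) is to clamp the estimate to the exact row count, which is always an upper bound on the number of distinct values:

```java
// Hypothetical post-processing: a Theta sketch estimate carries relative
// error in both directions, so at a very high distinct ratio it can land
// above the exact row count. Clamping restores distinct <= total.
class ClampedEstimate {
    static double clamp(double sketchEstimate, long totalRowCount) {
        return Math.min(sketchEstimate, (double) totalRowCount);
    }
}
```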
  • Daniel Augusto (05/17/2024, 2:46 PM)
    I'm trying to generate two distributions using the quantiles data sketch and then compare them. Say, get the p25, p50 and p75 values from the 'reference' distribution and check what percentile they would fall at in the 'actual' distribution. Something like this:
    WITH reference as (
      SELECT dimension, DS_QUANTILES_SKETCH("feature") as ref_quantiles, 
      FROM reference_table
      GROUP BY 1
    ),
    actual as (
      SELECT dimension, DS_QUANTILES_SKETCH("feature") as act_quantiles
      FROM events_table
      GROUP BY 1
    )
    SELECT feature, 
        DS_RANK(act_quantiles, DS_GET_QUANTILE(ref_quantiles, 0.25)),
        DS_RANK(act_quantiles, DS_GET_QUANTILE(ref_quantiles, 0.50)),
        DS_RANK(act_quantiles, DS_GET_QUANTILE(ref_quantiles, 0.75))
    FROM actual JOIN reference ON actual.feature = reference.feature
    But I only got errors like the one below, with nothing else in the broker logs (DEBUG level is on):
    2024-05-17T13:45:18,964 WARN [sql[d2ac2fe3-4ade-4c2e-8c5e-6ec5fc3c12e8]] org.apache.druid.sql.http.SqlResource - Exception while processing sqlQueryId[d2ac2fe3-4ade-4c2e-8c5e-6ec5fc3c12e8] (org.apache.druid.error.DruidException: Query could not be planned. A possible reason is [SQL requires a join with 'SEARCH' condition that is not supported.])
  • Alexander Saydakov (05/17/2024, 6:57 PM)
    I don't think I can help with the SQL syntax, but I would like to comment on comparing distributions. In the core datasketches library there is a Kolmogorov-Smirnov test, but it is not available in Druid at the moment.
  • Alexander Saydakov (05/17/2024, 7:22 PM)
    I would think it should not be hard to add a post-aggregator implementing the KS test, similar to the Student's t-test for Tuple sketches. It should take two sketches and a p-value and return a boolean (true = different distributions). It could compute the raw delta as well, but then you would need to know what to do with it.
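    For reference, the D statistic such a post-aggregator would compare against its p-value threshold is the maximum gap between the two empirical CDFs. Here is a self-contained illustration on raw samples; the library's KS test operates on quantile sketches rather than arrays, so this is the idea only, not its implementation.

```java
import java.util.Arrays;

// Two-sample Kolmogorov-Smirnov D statistic on raw samples: the largest
// vertical distance between the two empirical CDFs. Illustrative only;
// the datasketches KS test works on quantile sketches instead.
class KsStatistic {
    static double dStatistic(double[] a, double[] b) {
        double[] x = a.clone(), y = b.clone();
        Arrays.sort(x);
        Arrays.sort(y);
        int i = 0, j = 0;
        double d = 0.0;
        while (i < x.length && j < y.length) {
            double v = Math.min(x[i], y[j]);
            // advance past all samples equal to v in both arrays (handles ties)
            while (i < x.length && x[i] == v) i++;
            while (j < y.length && y[j] == v) j++;
            double gap = Math.abs((double) i / x.length - (double) j / y.length);
            d = Math.max(d, gap);
        }
        return d; // 0 = identical empirical CDFs, 1 = fully separated
    }
}
```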
  • victor regalado (04/07/2025, 9:10 PM)
    Are there plans to support KLL sketches in Druid SQL?
  • Ben Smithgall (05/01/2025, 4:56 PM)
    Hello -- I had a question about Theta sketches and some behavior that differs between Druid 29 and Druid 31+. More details are in a thread.