Maytas Monsereenusorn
03/28/2023, 8:16 PMAlexander Saydakov
03/28/2023, 8:30 PMMaytas Monsereenusorn
03/28/2023, 9:26 PMMaytas Monsereenusorn
03/28/2023, 9:26 PMLee Rhodes
03/28/2023, 11:25 PMGian Merlino
03/29/2023, 1:21 AMLee Rhodes
04/28/2023, 8:27 PMSlackbot
04/29/2023, 1:38 AMAlexander Saydakov
05/19/2023, 9:14 PMAlexander Saydakov
05/19/2023, 9:14 PMAlexander Saydakov
05/19/2023, 9:15 PMAlexander Saydakov
05/19/2023, 10:51 PMGian Merlino
05/20/2023, 12:45 AMDedupInputRowFilter
) for approximate deduplication, then if it passes the filter, we add it to the ItemsSketch
. Later on, we're going to use the ItemsSketch
to get quantiles, min, and max.
Because the Bloom filter can give us false positives, it's possible we don't add every distinct item to the ItemsSketch
. In particular we might miss the true min and true max, which is problematic since we need lower/upper bounds on the data. So, the code you're looking at "fixes up" the sketch by adding the true min and max, in case they had got filtered out by the Bloom filter.Alexander Saydakov
05/20/2023, 12:47 AMAlexander Saydakov
05/20/2023, 12:47 AMGian Merlino
05/20/2023, 12:50 AMItemsSketch
pair, then using quantiles from the ItemsSketch
as the partition boundaries.
The ItemsSketch
has to see distinct items only, because we want to estimate distribution of the aggregated data, not the original streamGian Merlino
05/20/2023, 12:51 AMItemsSketch
in this case are the distinct grouping keys for each row of the original streamAlexander Saydakov
05/20/2023, 12:55 AMGian Merlino
05/20/2023, 1:08 AMGian Merlino
05/20/2023, 1:08 AMYounes Naguib
05/24/2023, 2:06 PMjava.lang.OutOfMemoryError: Direct buffer memory
when I use "nominalEntries": 16384
. Works just fine when I use "nominalEntries": 4096
.
I have plenty of memory available, and -XX:MaxDirectMemorySize=24g
. Any ideas what I could be missing?Alexander Saydakov
05/25/2023, 9:05 PMAlexander Saydakov
09/30/2023, 1:16 AMAlexander Saydakov
10/02/2023, 6:20 PMAlexander Saydakov
10/02/2023, 6:31 PMKaran Kumar
10/23/2023, 6:34 AMItemSketch<Long>
when the function sketch.getPartitionBoundaries(numSplits)
When an item sketch with monotonically increasing number from 0 to 10 billion is chosen the distribution for 40 splits
0
250025928
500108450
750197955
1000017365
1250101575
1500191183
1750009306
2000098020
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
9999999999
Similar item sketch in the 3.2.0 generates the correct boundaries when the function sketch.getQuantiles(40)
is used.
0
256423887
512799757
769177971
1025548880
1281929124
1538304632
1794673747
2051315187
2307669277
2564042244
2820409807
3076787762
3333187311
3589563105
3845931306
4102575213
4359042434
4615417711
4871794533
5128171055
5384552162
5640921934
5897299914
6153942953
6410313978
6666657948
6923038164
7179411194
7435787409
7692168813
7948547259
8205184939
8461560834
8717940855
8974345598
9230745974
9487159248
9743589767
9999999999
Pseudo code used for experiments
ItemsSketch<Long> sketch = ItemsSketch.getInstance( 32768, Comparator.naturalOrder());
for(long i=0;i<10_000_000_000L;i++){
if(i%100_000_000==0){
System.out.println("reached "+i);
}
sketch.update(i);
}
for(Long val:sketch.getQuantiles(40)){
System.out.println(val);
}
cc @Alexander Saydakov @Lee Rhodes
Starting 🧵 for discussion.Lee Rhodes
10/23/2023, 10:32 PMLee Rhodes
10/23/2023, 10:33 PMPartition Analysis:
Boundary Delta Fraction%
1
199,969,717 199,969,716 2.000
399,985,597 200,015,880 2.000
599,999,650 200,014,053 2.000
800,017,439 200,017,789 2.000
1,000,035,248 200,017,809 2.000
1,200,038,727 200,003,479 2.000
1,400,053,887 200,015,160 2.000
1,600,070,015 200,016,128 2.000
1,800,082,165 200,012,150 2.000
2,000,100,377 200,018,212 2.000
2,200,164,013 200,063,636 2.001
2,400,182,047 200,018,034 2.000
2,600,196,540 200,014,493 2.000
2,800,210,004 200,013,464 2.000
3,000,226,546 200,016,542 2.000
3,200,243,833 200,017,287 2.000
3,399,999,385 199,755,552 1.998
3,600,010,298 200,010,913 2.000
3,800,030,576 200,020,278 2.000
4,000,045,010 200,014,434 2.000
4,200,055,030 200,010,020 2.000
4,400,081,302 200,026,272 2.000
4,600,094,724 200,013,422 2.000
4,800,114,270 200,019,546 2.000
5,000,119,607 200,005,337 2.000
5,200,135,603 200,015,996 2.000
5,400,158,879 200,023,276 2.000
5,600,175,975 200,017,096 2.000
5,800,197,847 200,021,872 2.000
6,000,212,728 200,014,881 2.000
6,200,227,844 200,015,116 2.000
6,400,244,241 200,016,397 2.000
6,600,221,588 199,977,347 2.000
6,799,976,613 199,755,025 1.998
6,999,997,424 200,020,811 2.000
7,200,013,051 200,015,627 2.000
7,400,028,055 200,015,004 2.000
7,600,037,476 200,009,421 2.000
7,800,046,117 200,008,641 2.000
8,000,064,800 200,018,683 2.000
8,200,088,478 200,023,678 2.000
8,400,105,281 200,016,803 2.000
8,600,000,395 199,895,114 1.999
8,799,984,976 199,984,581 2.000
8,999,996,017 200,011,041 2.000
9,199,985,554 199,989,537 2.000
9,400,005,507 200,019,953 2.000
9,599,987,941 199,982,434 2.000
9,800,002,023 200,014,082 2.000
10,000,000,000 199,997,977 2.000
This used the following code:
@Test
public
*void* checkItemsSketch() {
ItemsSketch<Long> sk = ItemsSketch._getInstance_(Long.*class*, 1 << 15, Comparator._naturalOrder_());
*long* n = 10_000_000_000L;
*long* check =n/10;
*for* (*long* i = 1; i <= n; i++) {
sk.update(i);
*if* (i%check == 0) { _println_("reached " + i); }
}
_println_("Partition Analysis:");
_printf_("%15s %15s %15s\n", "Boundary","Delta", "Fraction%");
GenericPartitionBoundaries<Long> gpb = sk.getPartitionBoundaries(50, _*INCLUSIVE*_);
Long[] boundaries = gpb.boundaries;
*int* numB = boundaries.length;
*long* last = -1L;
*for* (*int* i = 0; i < numB; i++) {
*if* (i == 0) { last = boundaries[0]; _printf_("%,15d\n",boundaries[0]); *continue*; }
*long* cur = boundaries[i];
*long* delta = cur - last;
last = cur;
_printf_("%,15d %,15d %15.3f\n", cur, delta, (*double*)delta/n * 100.0);
}
}
Omitting the “print*” functions.
I have not tried 4.0 or 4.1 yet.Lee Rhodes
10/23/2023, 10:49 PMTarun Dhandharia
03/22/2024, 6:51 AM