Is there any document on how theta-sketch columns ...
# general
c
Is there any document on how theta-sketch columns should be generated? The Pinot doc for DistinctCountThetaSketch mentions thetaSketchColumn. Is that column supposed to be the serialized binary (hex string, I suppose) of a sketch from the Theta Sketch framework?
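import java.io.FileOutputStream;
import org.apache.datasketches.theta.UpdateSketch; // package name in recent DataSketches releases
UpdateSketch sketch2 = UpdateSketch.builder().build();
for (int key = 50000; key < 150000; key++) sketch2.update(key);
// try-with-resources closes the file even if the write fails
try (FileOutputStream out2 = new FileOutputStream("ThetaSketch2.bin")) {
  out2.write(sketch2.compact().toByteArray()); // or hexString()
}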
m
sketch.compact().toByteArray()
👍 1
Of course, it needs to use the same DataSketches library that Pinot uses.
Mind giving back to the community by adding it to the docs, for the next guy who has this question?
(Perhaps after you have verified it works - so you can add more info as needed)
c
For sure. For context, I’m trying to figure out the correct way to do a moving-window DistinctCount outside Pinot. It looks like DISTINCTCOUNTRAWHLL and DistinctCountRawThetaSketch both provide a hex string that the application can process further.
m
Thank you. You can join #pinot-docsrus for instructions on how to add it.
👍 1
The hex string is on the retrieval side.
c
^^ I know. My understanding is:
1. You’d need to build a binary column with sketch.compact().toByteArray();
2. You’d query with distinctCountThetaSketch to get the count, passing a postAggregationExpressionToEvaluate, which in most cases would match the WHERE clause and would be evaluated on the brokers;
3. You could get the raw data via DistinctCountRawThetaSketch in the query, which returns the hex-encoded serialized sketch bytes.
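For the hex-encoded bytes mentioned above, here is a minimal sketch of converting between a sketch's serialized byte[] and a hex string using only the JDK. The class name SketchHex and its methods are hypothetical helpers, not part of Pinot or DataSketches. It also assumes the hex string is plain lower- or upper-case hex with no prefix, which should be checked against what your Pinot version actually returns. Turning the decoded bytes back into a sketch object is left out on purpose, since that call depends on the DataSketches version in use.

```java
// Hypothetical helper, not part of Pinot or DataSketches.
// Assumes plain hex with no "0x" prefix and no separators.
final class SketchHex {
    private static final char[] DIGITS = "0123456789abcdef".toCharArray();

    // Encode serialized sketch bytes (e.g. from sketch.compact().toByteArray())
    // as a lower-case hex string.
    static String toHex(byte[] bytes) {
        char[] out = new char[bytes.length * 2];
        for (int i = 0; i < bytes.length; i++) {
            int v = bytes[i] & 0xFF;           // treat the byte as unsigned
            out[2 * i] = DIGITS[v >>> 4];      // high nibble
            out[2 * i + 1] = DIGITS[v & 0x0F]; // low nibble
        }
        return new String(out);
    }

    // Decode a hex string (e.g. a raw-sketch query result) back into bytes.
    static byte[] fromHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string must have even length");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("non-hex character near index " + (2 * i));
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in bytes; in practice these would come from a serialized sketch.
        byte[] raw = {0x01, (byte) 0xAB, (byte) 0xFF, 0x00};
        String hex = toHex(raw);
        System.out.println(hex + " round-trips: "
            + java.util.Arrays.equals(raw, fromHex(hex)));
    }
}
```

With something like this, a moving-window job outside Pinot could call fromHex on each query result, deserialize the bytes with the same DataSketches version Pinot uses, and merge the resulting sketches itself.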