What is the performance implications of defining a...
# general
p
What is the performance implications of defining a dimension field in a schema as a string with a max length of Integer.MaxValue (2GB if all space is fully used)?
This field is a stringified json, usually between 20 chars and 4000 chars (~20% of the time). What is the impact of json indexing by it?
m
Will there be filtering on this column? If so, may be better to use JSON if it is indeed JSON.
p
There could potentially be some filtering, yes, though right now the use-case is simply to extract fields within this json field. What do you mean by "use JSON if it is indeed JSON"?
m
Use JSON indexing
p
I had it initially but by default Pinot's String are set to 512 chars in length. For the index to truly be useful I need to set this field's length to Integer.MaxValue (2^31-1 chars). I'm trying to understand if having fields this large should be a concern.
m
What's the max size?
Raw strings don't have an inv index at the moment, which is why I suggested JSON indexing
p
2³¹ -1, realistically I think the largest I've seen was around ~20k chars
m
Yeah, it will have perf impact. Do you really need the entire 2GB JSON blob? Or is there specific fields you can extract out from it?
p
Can you quantify (very roughly) what the perf impact would be?
The issue is that JSON blob is dynamic, there is fixed schema. In some cases we have text translations in the payload, which can be quite large.
m
I don't have any concrete numbers for that. I'll have to see the overall data size, and query
Do you need those text translations?
If not, may be filtering them out.
p
My consumers do, yes.
m
I see
What's your overall latency SLA requirement?
p
< 1 second for queries which access json blob. <100ms for top-level concrete fields.
m
Ok, may have to do some quick test for the largest json blob.
👍 1
p
Preliminary results tell me the largest json blob is 40500 chars long.
m
I meant like actually querying
p
I am, it just takes a while to get long-term historical data.
m
I'd say just do one day may be? then we can extrapolate?
p
Largest in the past month was 40k
m
Of course, with actual data, it will be more accurate.
p
Unfortunately I can not share the actual data, it may contain PII
m
Oh I was suggesting for you to run it
p
Ah I see now what you meant, I'll try that tomorrow probably.