# troubleshooting
p
i have a design / data model question. i am trying to count distinct accounts allocated to ab test buckets for an ab test over a time interval. i have three fields: the first is event time, the second is account id, and the third is a hash map serialized to a string. the key of the hash map is the experiment id and the value is the experiment bucket id. so basically
{e1:b1,e2:b2,e3:b3...}
an account can be in multiple ab tests. the query is going to be
```sql
SELECT exp_id, event_ts, bucket_id, DISTINCTCOUNTHLL(account_id)
FROM table
WHERE exp_id = <exp id> AND event_ts > start_time AND event_ts < end_time
GROUP BY event_ts, bucket_id
```
event_ts has a granularity of some time interval, so high cardinality is not a problem there.
m
Perhaps model the schema with the columns used in your query?
p
I am thinking of having a map field in the avro schema mapped to a multi-value string field in the pinot table schema, and then at query time i can use
`MAP_VALUE(mapColumn, 'myKey', valueColumn)`
to get the bucket id. i am about to give it a try.
now the hashmap is actually serialized as
`e1:e1#b1&e2:e2#b2...`
so i am trying to use a groovy script to add a derived complex field for quick testing
m
If you have the option, serialize the map to JSON, then you can use json extraction/transformation functions to map it to columns that are in your query
p
Yeah, that was the second option I looked at. I was hoping to avoid writing a flink job to transform the field for a fast poc, but it looks like the groovy script isn't working because it is not a simple expression.
in case anyone stumbles here later, this is how i was able to make the groovy expression work
```json
"ingestionConfig": {
  "transformConfigs": [
    {
      "columnName": "ab_test_map",
      "transformFunction": "Groovy({string_map.split('&').collectEntries{ entry -> [(entry.split(':').first()): entry.split(':').last()]}}, string_map)"
    }
  ]
},
```
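As a quick mental model of what that Groovy expression computes, here is a rough Python equivalent (illustrative only; Pinot runs the Groovy, not this):

```python
def parse_ab_test_map(string_map: str) -> dict:
    """Mirror the Groovy collectEntries above: split the serialized map
    on '&', then take the first ':'-segment of each entry as the key
    (experiment id) and the last segment as the value (bucket id)."""
    result = {}
    for entry in string_map.split('&'):
        parts = entry.split(':')
        result[parts[0]] = parts[-1]
    return result

print(parse_ab_test_map('e1:e1#b1&e2:e2#b2'))
# {'e1': 'e1#b1', 'e2': 'e2#b2'}
```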
@Mayank would you have the json field as
{"exp":[{"expId": "e1", "bucketId": "b1"},{"expId": "e2", "bucketId": "b2"}...]}
? how do i query that? i will have the value of an experiment id, i.e. `e1`, as a parameter for the query.
m
No, I am saying that, looking at your query, you don't need to store the serialized map. You can use transforms to directly extract and store columns from the input into a flat schema
p
well, the size of the map is not fixed, so i am not sure i follow how to directly extract it into columns for a flat schema
m
Not sure if I follow. If values are not in the map, can't they be extracted as null?
p
the key in the hashmap is the actual experiment id, and the value is the bucket id. i am assuming you are talking about an ingestion transform, unless you mean a transform at query time.
m
Oh I see. Have you looked at the json index? Looking at your query though, I'm wondering if the upstream events could have been experiment based rather than user based
p
that change will take a long time to happen. i can however explode one event into multiple events, with one event per entry in the map, if it comes down to that.
meanwhile i transformed the value in `myfield` from
`e1:e1#b1&e2:e2#b2...`
to
`e1:e1#b1,e2:e2#b2,...`
and the field in the schema is
```json
{
  "name": "myfield",
  "dataType": "STRING",
  "singleValueField": false
}
```
and then tried this query
```sql
SELECT VALUEIN(myfield, 'e1:e1#b1') AS bucket_id, COUNT(*) FROM mytable GROUP BY bucket_id
```
but it returns nothing. i do see entries with that value though, so i am not sure what i am missing
m
Hmm not sure if valueIn works like that. @Jackie ?
p
i am following
```sql
SELECT VALUEIN(RandomAirports, 'SEA') AS airport, COUNT(*)
FROM airlineStats
GROUP BY airport
```
from https://docs.pinot.apache.org/configuration-reference/functions/valuein#signature with data like
```
RandomAirports
SEA,PSC,PHX,MSY,ATL,TYS,DEN,CHS,PDX,LAX,EWR,SFO,PIT,RDU,RAP,LSE,SAN,SBN,IAH,OAK,BRO,JFK,SAT,ORD,ACY,DFW,BWI,TPA,BFL,BOS,SNA,ISN
SEA,PSC,PHX,MSY,ATL,TYS,DEN,CHS,PDX,LAX,EWR,SFO,PIT,RDU,RAP,LSE,SAN,SBN,IAH,OAK
SEA,PSC,PHX,MSY,ATL,TYS,DEN
null
null
null
SEA,PSC,PHX,MSY,ATL,TYS,DEN,CHS
SEA,PSC,PHX,MSY,ATL,TYS,DEN,CHS,PDX,LAX,EWR,SFO,PIT,RDU,RAP,LSE,SAN
```
m
Ah ok
p
if anyone stumbles across this, i was able to get it working by using
```json
"ingestionConfig": {
  "transformConfigs": [
    {
      "columnName": "derivedfield",
      "transformFunction": "Groovy({myfield.split('&')}, myfield)"
    }
  ]
},
```
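If it helps to see why the multi-value column plus VALUEIN groups correctly, here is a rough Python sketch of the semantics with hypothetical rows (not Pinot code; DISTINCTCOUNTHLL is approximated with an exact set here):

```python
from collections import defaultdict

def value_in(mv_cell, *values):
    """Rough emulation of VALUEIN on a multi-value cell: keep only the
    elements that appear in the given value list."""
    allowed = set(values)
    return [v for v in mv_cell if v in allowed]

# hypothetical rows after the split('&') ingestion transform above
rows = [
    {"account_id": 1, "derivedfield": ["e1:e1#b1", "e2:e2#b2"]},
    {"account_id": 2, "derivedfield": ["e1:e1#b2"]},
    {"account_id": 3, "derivedfield": ["e3:e3#b1"]},
]

# GROUP BY bucket_id with a distinct count (exact set instead of HLL)
buckets = defaultdict(set)
for row in rows:
    for bucket in value_in(row["derivedfield"], "e1:e1#b1", "e1:e1#b2"):
        buckets[bucket].add(row["account_id"])

print({b: len(accounts) for b, accounts in buckets.items()})
# {'e1:e1#b1': 1, 'e1:e1#b2': 1}
```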
and then doing this query
```sql
SELECT VALUEIN(derivedfield, 'e1:e1#b1', 'e1:e1#b2') AS bucket_id,
       DISTINCTCOUNTHLL(account_id) AS distinct_allocations
FROM   allocation_events
GROUP  BY bucket_id
```
to get a final response like
```
bucket_id    distinct_allocations
e1:e1#b1     403
e1:e1#b2     418
```
and adding event time as another SELECT column and GROUP BY column works as well
j
@Priyank Bagrecha Can you try adding a filter
`derivedfield = 'e1:e1#b1'`
in the query? Currently the query needs to scan all the documents
p
I can't, because derivedField can be
`e1:e1#b1,e2:e2#b2,...`
etc., but yeah, what i have right now won't scale well as it scans all documents
j
Adding a filter with the same values as in the
`VALUEIN`
should give you the same result without scanning all the documents
```sql
SELECT VALUEIN(myfield, 'e1:e1#b1') AS bucket_id, COUNT(*) FROM mytable GROUP BY bucket_id
```
->
```sql
SELECT VALUEIN(myfield, 'e1:e1#b1') AS bucket_id, COUNT(*) FROM mytable WHERE myfield = 'e1:e1#b1' GROUP BY bucket_id
```
and
```sql
SELECT VALUEIN(derivedfield, 'e1:e1#b1', 'e1:e1#b2') AS bucket_id,
       DISTINCTCOUNTHLL(account_id) AS distinct_allocations
FROM   allocation_events
GROUP  BY bucket_id
```
->
```sql
SELECT VALUEIN(derivedfield, 'e1:e1#b1', 'e1:e1#b2') AS bucket_id,
       DISTINCTCOUNTHLL(account_id) AS distinct_allocations
FROM   allocation_events
WHERE  derivedfield IN ('e1:e1#b1', 'e1:e1#b2')
GROUP  BY bucket_id
```
p
interesting, it works great. i am still confused why it works lol. for the sake of simplicity i'll replace `e1:e1#b1` by
`e1:b1`
etc. i have events where derivedField =
`e1:b1,e2:b2,e3:b3...`
so to me derivedField doesn't exist in ('e1:b1', 'e1:b2'). how does this work? is the IN evaluating every element in the array instead of evaluating the entire array?
j
It is modeled as a MV column, and Pinot will match every element against the predicate
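To make that concrete, here is a one-function Python sketch of my understanding of MV predicate semantics (not Pinot source): an `=` or `IN` predicate on a multi-value column matches a row if any element of the cell matches, so the WHERE clause prunes rows, and VALUEIN then picks which matching element each surviving row contributes to.

```python
def mv_matches(mv_cell, predicate_values):
    """A predicate like `derivedfield IN (...)` on a multi-value column
    matches the row if ANY element of the cell is in the value set."""
    return any(v in predicate_values for v in mv_cell)

print(mv_matches(["e1:b1", "e2:b2"], {"e1:b1", "e1:b2"}))  # True: first element matches
print(mv_matches(["e3:b3"], {"e1:b1", "e1:b2"}))           # False: no element matches
```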