# troubleshooting
p
i have a design / data model question. i am trying to count distinct accounts allocated to ab test buckets for an ab test over a time interval. i have three fields: the first is event time, the second is account id, and the third is a hash map serialized to a string. the key of the hash map is the experiment id and the value is the experiment bucket id. so basically
{e1:b1,e2:b2,e3:b3...}
an account can be in multiple ab tests. the query is going to be
```sql
SELECT exp_id, event_ts, bucket_id, DISTINCTCOUNTHLL(account_id)
FROM table
WHERE exp_id = <exp id> AND event_ts > start_time AND event_ts < end_time
GROUP BY event_ts, bucket_id
```
event_ts has a granularity of some time interval, so high cardinality is not a problem there.
m
Perhaps model the schema with the columns used in your query?
p
I am thinking of having a map field in the avro schema mapped to a multi-value string field in the pinot table schema, and then at query time i can use
`MAP_VALUE(mapColumn, 'myKey', valueColumn)`
to get the bucket id. i am about to give it a try.
now the hashmap is actually serialized as
`e1:e1#b1&e2:e2#b2...`
so i am trying to use a groovy script to add a derived complex field for quick testing
m
If you have the option, serialize the map to JSON, then you can use json extraction/transformation functions to map it to columns that are in your query
p
Yeah, that was the second option I looked at. I was hoping to avoid writing a flink job to transform the field for a fast poc, but it looks like the groovy script isn't working because it is not a simple expression.
in case anyone stumbles here later, this is how i was able to make the groovy expression work
```json
"ingestionConfig": {
  "transformConfigs": [
    {
      "columnName": "ab_test_map",
      "transformFunction": "Groovy({string_map.split('&').collectEntries{ entry -> [(entry.split(':').first()): entry.split(':').last()]}}, string_map)"
    }
  ]
},
```
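As a quick mental model of what that Groovy expression computes, here is a rough Python equivalent (illustrative only; Pinot runs the Groovy, not this):

```python
def parse_ab_test_map(string_map: str) -> dict:
    """Mirror the Groovy collectEntries above: split the serialized map
    on '&', then take the first ':'-segment of each entry as the key
    (experiment id) and the last segment as the value (bucket id)."""
    result = {}
    for entry in string_map.split('&'):
        parts = entry.split(':')
        result[parts[0]] = parts[-1]
    return result

print(parse_ab_test_map('e1:e1#b1&e2:e2#b2'))
# {'e1': 'e1#b1', 'e2': 'e2#b2'}
```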
@Mayank would you have the json field as
{"exp":[{"expId": "e1", "bucketId": "b1"},{"expId": "e2", "bucketId": "b2"}...]}
? how do i query that? i will have the value of an experiment id, i.e. `e1`, as a parameter for the query.
m
No, I am saying that, looking at your query, you don't need to store the serialized map. You can use transforms to directly extract and store columns from the input into a flat schema
p
well, the size of the map is not fixed, so i am not sure i follow how to directly extract it into columns for a flat schema
m
Not sure if I follow. If values are not in the map, can't they be extracted as null?
p
the key in the hashmap is the actual experiment id, and the value is the bucket id. i am assuming you are talking about an ingestion transform, unless you mean a transform at query time.
m
Oh I see. Have you looked at the json index? Looking at your query though, I'm wondering if the upstream events could have been experiment based rather than user based
p
that change will take a long time to happen. i can however explode one event into multiple events, with one event per entry in the map, if it comes down to that.
meanwhile i transformed the value in `myfield` from
`e1:e1#b1&e2:e2#b2...`
to
`e1:e1#b1,e2:e2#b2,...`
and the field in the schema is
```json
{
  "name": "myfield",
  "dataType": "STRING",
  "singleValueField": false
}
```
and then tried this query
```sql
SELECT VALUEIN(myfield, 'e1:e1#b1') AS bucket_id, COUNT(*) FROM mytable GROUP BY bucket_id
```
but it returns nothing. i do see entries with that value though, so i am not sure what i am missing
m
Hmm not sure if valueIn works like that. @Jackie ?
p
i am following
```sql
SELECT VALUEIN(RandomAirports, 'SEA') AS airport, COUNT(*)
FROM airlineStats
GROUP BY airport
```
from https://docs.pinot.apache.org/configuration-reference/functions/valuein#signature with data like
```
RandomAirports
SEA,PSC,PHX,MSY,ATL,TYS,DEN,CHS,PDX,LAX,EWR,SFO,PIT,RDU,RAP,LSE,SAN,SBN,IAH,OAK,BRO,JFK,SAT,ORD,ACY,DFW,BWI,TPA,BFL,BOS,SNA,ISN
SEA,PSC,PHX,MSY,ATL,TYS,DEN,CHS,PDX,LAX,EWR,SFO,PIT,RDU,RAP,LSE,SAN,SBN,IAH,OAK
SEA,PSC,PHX,MSY,ATL,TYS,DEN
null
null
null
SEA,PSC,PHX,MSY,ATL,TYS,DEN,CHS
SEA,PSC,PHX,MSY,ATL,TYS,DEN,CHS,PDX,LAX,EWR,SFO,PIT,RDU,RAP,LSE,SAN
```
m
Ah ok
p
if anyone stumbles across this, i was able to get it working by using
```json
"ingestionConfig": {
  "transformConfigs": [
    {
      "columnName": "derivedfield",
      "transformFunction": "Groovy({myfield.split('&')}, myfield)"
    }
  ]
},
```
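If it helps to see why the multi-value column plus VALUEIN groups correctly, here is a rough Python sketch of the semantics with hypothetical rows (not Pinot code; DISTINCTCOUNTHLL is approximated with an exact set here):

```python
from collections import defaultdict

def value_in(mv_cell, *values):
    """Rough emulation of VALUEIN on a multi-value cell: keep only the
    elements that appear in the given value list."""
    allowed = set(values)
    return [v for v in mv_cell if v in allowed]

# hypothetical rows after the split('&') ingestion transform above
rows = [
    {"account_id": 1, "derivedfield": ["e1:e1#b1", "e2:e2#b2"]},
    {"account_id": 2, "derivedfield": ["e1:e1#b2"]},
    {"account_id": 3, "derivedfield": ["e3:e3#b1"]},
]

# GROUP BY bucket_id with a distinct count (exact set instead of HLL)
buckets = defaultdict(set)
for row in rows:
    for bucket in value_in(row["derivedfield"], "e1:e1#b1", "e1:e1#b2"):
        buckets[bucket].add(row["account_id"])

print({b: len(accounts) for b, accounts in buckets.items()})
# {'e1:e1#b1': 1, 'e1:e1#b2': 1}
```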
and then doing this query
```sql
SELECT VALUEIN(derivedfield, 'e1:e1#b1', 'e1:e1#b2') AS bucket_id,
       DISTINCTCOUNTHLL(account_id) AS distinct_allocations
FROM   allocation_events
GROUP  BY bucket_id
```
to get a final response like
```
bucket_id    distinct_allocations
e1:e1#b1     403
e1:e1#b2     418
```
and adding event time as another SELECT column and GROUP BY column works as well
j
@Priyank Bagrecha Can you try adding a filter
`derivedfield = 'e1:e1#b1'`
in the query? Currently the query needs to scan all the documents
p
I can't, because derivedField can be
`e1:e1#b1,e2:e2#b2,...`
etc., but yeah, what i have right now won't scale well as it scans all documents
j
Adding a filter with the same values as in the
`VALUEIN`
should give you the same result without scanning all the documents
```sql
SELECT VALUEIN(myfield, 'e1:e1#b1') AS bucket_id, COUNT(*) FROM mytable GROUP BY bucket_id
```
->
```sql
SELECT VALUEIN(myfield, 'e1:e1#b1') AS bucket_id, COUNT(*) FROM mytable WHERE myfield = 'e1:e1#b1' GROUP BY bucket_id
```
and
```sql
SELECT VALUEIN(derivedfield, 'e1:e1#b1', 'e1:e1#b2') AS bucket_id,
       DISTINCTCOUNTHLL(account_id) AS distinct_allocations
FROM   allocation_events
GROUP  BY bucket_id
```
->
```sql
SELECT VALUEIN(derivedfield, 'e1:e1#b1', 'e1:e1#b2') AS bucket_id,
       DISTINCTCOUNTHLL(account_id) AS distinct_allocations
FROM   allocation_events
WHERE  derivedfield IN ('e1:e1#b1', 'e1:e1#b2')
GROUP  BY bucket_id
```
p
interesting, it works great. i am still confused why it works lol. for the sake of simplicity i'll replace `e1:e1#b1` by
`e1:b1`
etc. i have events where derivedField =
`e1:b1,e2:b2,e3:b3...`
so to me derivedField doesn't exist in ('e1:b1', 'e1:b2'). how does this work? is the IN evaluating every element in the array instead of evaluating the entire array?
j
It is modeled as a MV column, and Pinot will match every element against the predicate
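To make that concrete, here is a one-function Python sketch of my understanding of MV predicate semantics (not Pinot source): an `=` or `IN` predicate on a multi-value column matches a row if any element of the cell matches, so the WHERE clause prunes rows, and VALUEIN then picks which matching element each surviving row contributes to.

```python
def mv_matches(mv_cell, predicate_values):
    """A predicate like `derivedfield IN (...)` on a multi-value column
    matches the row if ANY element of the cell is in the value set."""
    return any(v in predicate_values for v in mv_cell)

print(mv_matches(["e1:b1", "e2:b2"], {"e1:b1", "e1:b2"}))  # True: first element matches
print(mv_matches(["e3:b3"], {"e1:b1", "e1:b2"}))           # False: no element matches
```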