Slackbot
06/05/2023, 2:41 PMClint Wylie
06/06/2023, 2:00 AMClint Wylie
06/06/2023, 2:01 AMClint Wylie
06/06/2023, 2:01 AMARRAY_AGG
work behind the scenes and is able to model most aggregators using native druid expressionsClint Wylie
06/06/2023, 2:03 AMClint Wylie
06/06/2023, 2:03 AMClint Wylie
06/06/2023, 2:06 AMinitialValue
would need to be a json object.. i forget if i let json_object
have zero arguments or not, ideally it would work thoughClint Wylie
06/06/2023, 2:06 AMClint Wylie
06/06/2023, 2:09 AMinitialValue
would be json_object()
i think, then the fold
expression would be something like the native fold
expression https://druid.apache.org/docs/latest/misc/math-expr.html#apply-functions to walk over the json_keys(clicks_by_country)
Clint Wylie
06/06/2023, 2:09 AMClint Wylie
06/06/2023, 2:10 AMClint Wylie
06/06/2023, 2:11 AMClint Wylie
06/06/2023, 2:13 AMPaweł Motyka
06/06/2023, 8:11 AMpaths
. If you define output type as a STRING
you get N results (where N is the number of keys in given json).
2. apply json_value
with a value from paths
column
Example here:
DATA IN:
{
"time": "2023-06-05T16:30:00",
"object_id": 1,
"clicks_by_country": {
"US": 200,
"CA": 400,
"JP": 700
}
}
OUTPUT AFTER 1.: (THIS WORKS)
__time object_id paths
2023-06-05T16:30:00 1 $.US
2023-06-05T16:30:00 1 $.CA
2023-06-05T16:30:00 1 $.JP
EXPECTED OUTPUT AFTER 2.: (THIS IS NOT WORKING)
__time object_id paths values
2023-06-05T16:30:00 1 $.US 200
2023-06-05T16:30:00 1 $.CA 400
2023-06-05T16:30:00 1 $.JP 700
In my head, that makes sense and should work, although AFAIK you cannot use the virtual column value on the same level where you defined it. So I am unable to reach it in one query. When I want to introduce another layer of groupby into that query (no biggie, we do this every day), I cannot push the JSON anywhere higher. It's only available in the lowest layer of the queryPaweł Motyka
06/09/2023, 10:47 AMJohn Kowtko
06/09/2023, 1:02 PMwith tmp as (
select distinct string_to_mv(mv_to_string(json_keys(clicks_by_country, '$.'), ','), ',') country
from inline_data
group by 1
)
select tmp.country, json_value("clicks_by_country", '$.' || tmp.country) val
from inline_data
cross join tmp
In the above SQL --
• the contents of the CTE will work by itself and return a single column list of countries from the source data
• I can also run the main query if I use a simple CTE, e.g.:
with tmp as (
select 'US' country
)
select tmp.country, json_value("clicks_by_country", '$.' || tmp.country) val
from inline_data
cross join tmp
... which works.
But my first SQL statement above as-is generates a query unsupported error.
I then tried tricking it by cross joining the rows first before doing the JSON extract on clicks:
with tmp_keys as (
select distinct string_to_mv(mv_to_string(json_keys(clicks_by_country, '$.'), ','), ',') country
from inline_data
group by 1
), tmp as (
select k.country, d.clicks_by_country
from inline_data d cross join tmp_keys k
)
select country, json_value("clicks_by_country", '$.' || country) val
from tmp
But it doesn't like that either. It will work with "select * from tmp" as the final query, that json_value() function just doesn't seem to like operating on a record set.
Given the above tests and logic can you think of a variation that will work?
Thanks. JohnPaweł Motyka
06/12/2023, 11:48 AM{"time": "2023-06-05T16:30:00", "object_id": 1, "clicks_by_country": {[{"country": "US", "clicks": 200, {"country": "CA", "clicks": 400, "country": "JP", "clicks": 700}]}}
The query looks like:
{
"queryType": "groupBy",
"dataSource": {
"type": "unnest",
"base": {
"type": "table",
"name": "test_complex_data_unpacking"
},
"virtualColumn": {
"type": "expression",
"name": "impressions_by_country_flat",
"expression": "\"impressions_by_country\"",
"outputType": "COMPLEX"
}
},
"granularity": "all",
"intervals": "0000/3000",
"dimensions": [
"related_object_id",
"v0"
"v1"
],
"virtualColumns": [ {
"type": "nested-field",
"columnName": "impressions_by_country_flat",
"outputName": "v0",
"expectedType": "STRING",
"path": "$.country"
}
{
"type": "nested-field",
"columnName": "impressions_by_country_flat",
"outputName": "v1",
"expectedType": "LONG",
"path": "$.impressions"
}
]
}
I am not sure how to approach multiple columns that might require unnesting. Still digging.