This message was deleted Apache Druid #troubleshooting

Join Slack

This message was deleted.

# troubleshooting

Slackbot

06/05/2023, 2:41 PM

This message was deleted.

Clint Wylie

06/06/2023, 2:00 AM

off the top of my head i’m not sure if this can be done with SQL right now at least

Clint Wylie

06/06/2023, 2:01 AM

that said, i think it would be possible to make a ‘native’ query that could do it using the undocumented ‘expression’ aggregator

Clint Wylie

06/06/2023, 2:01 AM

which is what makes things like

ARRAY_AGG

work behind the scenes and is able to model most aggregators using native druid expressions

Clint Wylie

06/06/2023, 2:03 AM

well, im not sure if there is a clean way to make an ‘empty’ json object so that might be a bit rough too

Clint Wylie

06/06/2023, 2:03 AM

https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/aggregation/ExpressionLambdaAggregatorFactory.java#L103 lists the parameters it takes

Clint Wylie

06/06/2023, 2:06 AM

the

initialValue

would need to be a json object.. i forget if i let

json_object

have zero arguments or not, ideally it would work though

Clint Wylie

06/06/2023, 2:06 AM

oh neat just tested, zero argument does work

Clint Wylie

06/06/2023, 2:09 AM

ok so

initialValue

would be

json_object()

i think, then the

fold

expression would be something like the native

fold

expression https://druid.apache.org/docs/latest/misc/math-expr.html#apply-functions to walk over the

json_keys(clicks_by_country)

Clint Wylie

06/06/2023, 2:09 AM

oh wait, i just noticed that object_id is not inside the nested column, maybe this won’t work yet either

Clint Wylie

06/06/2023, 2:10 AM

well, i guess could group by object_id and use the expression aggregator to combine the clicks by country might be approximately the right thing

Clint Wylie

06/06/2023, 2:11 AM

the engine cannot currently handle grouping on json blobs directly right now

Clint Wylie

06/06/2023, 2:13 AM

oh, actually i think json_value only works with constant path arguments right now, so my idea wont work

Paweł Motyka

06/06/2023, 8:11 AM

@Clint Wylie yup, I was working with native queries anyway and what I tried so far was: 1. use JSON_PATH as an expression within virtual column

paths

. If you define output type as a

STRING

you get N results (where N is the number of keys in given json). 2. apply

json_value

with a value from

paths

column Example here: DATA IN:

Copy code

{
  "time": "2023-06-05T16:30:00",
  "object_id": 1,
  "clicks_by_country": {
    "US": 200,
    "CA": 400,
    "JP": 700
  }
}

OUTPUT AFTER 1.: (THIS WORKS)

Copy code

__time object_id   paths
2023-06-05T16:30:00         1    $.US
2023-06-05T16:30:00         1    $.CA
2023-06-05T16:30:00         1    $.JP

EXPECTED OUTPUT AFTER 2.: (THIS IS NOT WORKING)

Copy code

__time object_id   paths    values
2023-06-05T16:30:00         1    $.US       200
2023-06-05T16:30:00         1    $.CA       400
2023-06-05T16:30:00         1    $.JP       700

In my head, that makes sense and should work, although AFAIK you cannot use the virtual column value on the same level where you defined it. So I am unable to reach it in one query. When I want to introduce another layer of groupby into that query (no biggie, we do this every day), I cannot push the JSON anywhere higher. It's only available in the lowest layer of the query

Paweł Motyka

06/09/2023, 10:47 AM

Bumping here!

☝️ 1

John Kowtko

06/09/2023, 1:02 PM

@Clint Wylie, I was able to get partway here: This is logically what I thought would work:

Copy code

with tmp as (
select distinct string_to_mv(mv_to_string(json_keys(clicks_by_country, '$.'), ','), ',')  country
  from inline_data
 group by 1
)
 select tmp.country, json_value("clicks_by_country", '$.' || tmp.country)  val
   from inline_data
  cross join tmp

In the above SQL -- • the contents of the CTE will work by itself and return a single column list of countries from the source data • I can also run the main query if I use a simple CTE, e.g.:

Copy code

with tmp as (
select 'US' country
)
 select tmp.country, json_value("clicks_by_country", '$.' || tmp.country)  val
   from inline_data
  cross join tmp

... which works. But my first SQL statement above as-is generates a query unsupported error. I then tried tricking it by cross joining the rows first before doing the JSON extract on clicks:

Copy code

with tmp_keys as (
select distinct string_to_mv(mv_to_string(json_keys(clicks_by_country, '$.'), ','), ',')  country
  from inline_data
 group by 1
), tmp as (
 select k.country, d.clicks_by_country
   from inline_data d cross join tmp_keys k
)
 select country, json_value("clicks_by_country", '$.' || country)  val
   from tmp

But it doesn't like that either. It will work with "select * from tmp" as the final query, that json_value() function just doesn't seem to like operating on a record set. Given the above tests and logic can you think of a variation that will work? Thanks. John

Paweł Motyka

06/12/2023, 11:48 AM

The solution here was to use unnest function with data transformed into array.

Copy code

{"time": "2023-06-05T16:30:00", "object_id": 1, "clicks_by_country": {[{"country": "US", "clicks": 200, {"country": "CA", "clicks": 400, "country": "JP", "clicks": 700}]}}

The query looks like:

Copy code

{
  "queryType": "groupBy",
  "dataSource": {
    "type": "unnest",
    "base": {
      "type": "table",
      "name": "test_complex_data_unpacking"
    },
    "virtualColumn": {
      "type": "expression",
      "name": "impressions_by_country_flat",
      "expression": "\"impressions_by_country\"",
      "outputType": "COMPLEX"
    }
  },
  "granularity": "all",
  "intervals": "0000/3000",
  "dimensions": [
    "related_object_id",
    "v0"
    "v1"
  ],
  "virtualColumns": [    {
      "type": "nested-field",
      "columnName": "impressions_by_country_flat",
      "outputName": "v0",
      "expectedType": "STRING",
      "path": "$.country"
    }
    {
      "type": "nested-field",
      "columnName": "impressions_by_country_flat",
      "outputName": "v1",
      "expectedType": "LONG",
      "path": "$.impressions"
    }
    ]
}

I am not sure how to approach multiple columns that might require unnesting. Still digging.

2 Views

Open in Slack

Previous Next