# getting-started
s
Hi, I'm testing out Pinot as a possible solution for a problem we have. Our data structure is nested, and I would like to use a combination of JSON indexing and complex type configuration. Both of these features seem to work OK in isolation, however if I try to add both of them to the ingestionConfig, only the JSON indexing applies. Is it only possible to use one or the other of these features? (apologies if I have missed that bit of the documentation)
m
can you show me your complex config? Only b/c by default the complex config flattens absolutely all fields at all depths, so it would flatten all JSON structures
s
I cannot show you the exact config, however I can produce a similar one (I'll just make one in a text editor, it's too hard to type into Slack directly)
An event document might look something like:
{
	"...",
	"auditFields": {
		"a": "b",
		"c": "d"
	},
	"results": [
		{
			"name": "result1",
			"value": 33.2
		},
		{
			"name": "result2",
			"value": 45.6
		}
	]
}
The table config looks something like this:
{
	"...",
	"ingestionConfigs":
		"transformConfigs": [
			{
				"columnName": "auditFields_json",
				"transformFunction": "jsonFormat(\"auditFields\")
			}
		],
		"complexTypeConfig": {
			"fieldsToUnnest": [
				"results"
			]
		},
	},
	"..."
}
So in this case the auditFields are not a fixed part of the schema; they are effectively metadata, so I was hoping to put them into a field auditFields_json so that I do not need to model them in the schema.
However, the results are very much part of the schema; I have created fields for them using the '.' notation, i.e. results.name & results.value
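For reference, a schema sketch matching the fields described above might look like this (field names taken from the example; the data types are assumptions, not from the conversation):

```json
{
	"dimensionFieldSpecs": [
		{ "name": "results.name", "dataType": "STRING" },
		{ "name": "auditFields_json", "dataType": "STRING" }
	],
	"metricFieldSpecs": [
		{ "name": "results.value", "dataType": "DOUBLE" }
	]
}
```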
m
and when you include the json config it doesn’t do the unnesting anymore?
s
no, it doesn't seem to
m
ok lemme try to reproduce
s
the 2 configs seem to work when used independently
Sorry, to be clear, it looks like the unnesting works, however the auditFields_json is empty
I guess that might be because as you say "default the complex config flattens absolutely all fields at all depths"
so in my case the auditFields has been flattened into auditFields.* fields
Although I had specified the fieldsToUnnest and "auditFields" is not in that array
m
unnesting only refers to arrays. It's effectively saying: should I create multiple rows/records from one JSON array? Flattening auditFields is different, as it's not an array but an object!
and it does only map one json object to one row
just a flattened row
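To illustrate the behaviour being described (a sketch of the default flatten-everything-plus-unnest output, not actual Pinot output): the example event document above would become rows like:

```json
[
	{ "auditFields.a": "b", "auditFields.c": "d", "results.name": "result1", "results.value": 33.2 },
	{ "auditFields.a": "b", "auditFields.c": "d", "results.name": "result2", "results.value": 45.6 }
]
```

which is consistent with auditFields_json coming out empty: by the time the transform runs, there is no auditFields object left to convert.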
s
yes, so in my case I wanted the unnesting of the array into n rows, however I also wanted to store the auditFields object as JSON, so that I do not need to know all the fields up front at schema creation (as it is metadata for support purposes)
m
annoyingly, I think if you have the complex config enabled it's gonna flatten any maps in there (see https://github.com/apache/pinot/blob/master/pinot-segment-local/src/main/java/org/[…]not/segment/local/recordtransformer/ComplexTypeTransformer.java). So then I was thinking that a workaround would be to store auditFields as a JSON string and then hydrate it in a transform function, but as far as I can tell there isn't a function that converts a JSON string to a JSON object
we should create an issue for it though, as it's a valid use case
s
so as it is currently implemented, I could create schema fields for some of the auditFields, i.e. "auditFields.a" from the example above?
m
yup
s
That would do for my current investigation, however I think it is a valid use case. Without knowing a great deal about Pinot at this point, it feels to me like these transformations/complexTypeConfigs should be a pipeline of transformations applied to the document, where each step only touches the part of the document it is configured to touch
m
so the bit of code that does the flattening does take in a list of columns
but right now it’s hard coded to all of them:
new ArrayList<>(record.getFieldToValueMap().keySet())
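The hard-coded list above could in principle be replaced by a configurable set of fields to flatten. A minimal standalone sketch of that idea (the names `SelectiveFlatten`, `flattenSelected`, and `fieldsToFlatten` are made up for illustration, not Pinot API):

```java
import java.util.*;

// Hypothetical sketch: flatten only the top-level map fields listed in
// fieldsToFlatten into dotted column names, leaving other nested objects
// (e.g. auditFields) intact so a transform could still see them.
public class SelectiveFlatten {
    public static Map<String, Object> flattenSelected(Map<String, Object> record,
                                                      Set<String> fieldsToFlatten) {
        Map<String, Object> out = new HashMap<>();
        for (Map.Entry<String, Object> e : record.entrySet()) {
            Object v = e.getValue();
            if (v instanceof Map && fieldsToFlatten.contains(e.getKey())) {
                // flatten this map into "<parent>.<child>" columns
                for (Map.Entry<?, ?> inner : ((Map<?, ?>) v).entrySet()) {
                    out.put(e.getKey() + "." + inner.getKey(), inner.getValue());
                }
            } else {
                // leave the field untouched (e.g. keep it as a nested object)
                out.put(e.getKey(), v);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> audit = new HashMap<>();
        audit.put("a", "b");
        Map<String, Object> record = new HashMap<>();
        record.put("auditFields", audit);
        record.put("id", 1);
        // with an empty flatten list, auditFields stays a nested map
        Map<String, Object> out = flattenSelected(record, Collections.emptySet());
        System.out.println(out.get("auditFields") instanceof Map); // true
    }
}
```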
s
yes, then it has _fieldsToUnnest
m
yeh so we’d want something similar to that
where it only flattens the fields you provide
s
ok, thanks for your time helping to debug this
m
@Simon J my colleague @Seunghyun was looking at this and wanted to get some input on a couple of things. e.g. what configuration parameter(s) would you want to be added and some examples of what the expected output should be with each configuration?
s
Hi @Mark Needham, sorry I missed this. To me it feels a bit like a pipeline, where the input is the raw event from Kafka, and then a bunch of transformations are applied to it before it is stored according to the schema (indexes applied etc). So one step of the pipeline would be unnesting, another might be transforming. At this point order becomes important.
The user could then order these various stages however they wished. From the PoV of debugging, it might be useful to be able to see the state of the document between each step
So in my case, I would have a first stage that converts the "auditFields" field into JSON and stores it in the auditFields_json field; it would then apply unnesting of the "results" field, which would result in a 1->n record transformation
The steps in my case would also work in the opposite order, however there would likely be a performance hit, as the "auditFields"-to-JSON transformation would happen after the unnesting and so would run n times per input document, rather than once (where n could be any value >= 1)
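A hypothetical ordered-pipeline config along those lines might look like this (none of these keys exist in Pinot; the `ingestionPipeline` and `stage` names are purely illustrative of the idea being proposed):

```json
{
	"ingestionPipeline": [
		{
			"stage": "transform",
			"columnName": "auditFields_json",
			"transformFunction": "jsonFormat(\"auditFields\")"
		},
		{
			"stage": "unnest",
			"field": "results"
		}
	]
}
```

Here the transform runs once per input document before the unnest fans it out into n rows, matching the ordering argument above.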