# getting-started
s
Hi, I'm testing out Pinot as a possible solution for a problem we have. Our data structure is nested, and I would like to use a combination of JSON indexing and complex type configuration. Both of these features seem to work OK in isolation, however if I try to add both of them to the ingestionConfig, only the JSON indexing applies. Is it only possible to use one or the other of these features? (apologies if I have missed that bit of the documentation)
m
can you show me your complex config? Only b/c by default the complex config flattens absolutely all fields at all depths, so it would flatten all JSON structures
s
I cannot show you the exact config, however I can produce a similar one (I'll just make one in a text editor, it's too hard to type into Slack directly)
An event document might look something like:
{
	"...",
	"auditFields": {
		"a": "b",
		"c": "d"
	},
	"results": [
		{
			"name": "result1",
			"value": 33.2
		},
		{
			"name": "result2",
			"value": 45.6
		}
	]
}
The table config looks something like this:
{
	"...",
	"ingestionConfigs":
		"transformConfigs": [
			{
				"columnName": "auditFields_json",
				"transformFunction": "jsonFormat(\"auditFields\")
			}
		],
		"complexTypeConfig": {
			"fieldsToUnnest": [
				"results"
			]
		},
	},
	"..."
}
So in this case the auditFields are not a fixed part of the schema; they are effectively metadata, so I was hoping to put them into a field auditFields_json so that I do not need to model them in the schema.
However, the results are very much part of the schema; I have created fields for them using the '.' notation, i.e. results.name & results.value
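For reference, a schema sketch matching the fields described above might look like this (field names taken from the example; the data types are assumptions, not from the conversation):

```json
{
	"dimensionFieldSpecs": [
		{ "name": "results.name", "dataType": "STRING" },
		{ "name": "auditFields_json", "dataType": "STRING" }
	],
	"metricFieldSpecs": [
		{ "name": "results.value", "dataType": "DOUBLE" }
	]
}
```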
m
and when you include the json config it doesn’t do the unnesting anymore?
s
no, it doesn't seem to
m
ok lemme try to reproduce
s
the 2 configs seem to work when used independently
Sorry, to be clear, it looks like the unnesting works, however the auditFields_json is empty
I guess that might be because as you say "default the complex config flattens absolutely all fields at all depths"
so in my case the auditFields has been flattened into auditFields.* fields
Although I had specified the fieldsToUnnest and "auditFields" is not in that array
m
unnesting only refers to arrays. It's effectively saying: should I create multiple rows/records from one JSON array? Flattening auditFields is different, as it's not an array but an object!
and it does only map one json object to one row
just a flattened row
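To illustrate the behaviour being described (a sketch of the default flatten-everything-plus-unnest output, not actual Pinot output): the example event document above would become rows like:

```json
[
	{ "auditFields.a": "b", "auditFields.c": "d", "results.name": "result1", "results.value": 33.2 },
	{ "auditFields.a": "b", "auditFields.c": "d", "results.name": "result2", "results.value": 45.6 }
]
```

which is consistent with auditFields_json coming out empty: by the time the transform runs, there is no auditFields object left to convert.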
s
yes, so in my case I wanted the unnesting of the array into n rows, however I also wanted to store the auditFields object as JSON, so that I do not need to know all the fields up front at schema creation (as it is metadata for support purposes)
m
annoyingly, I think if you have the complex config enabled it's gonna flatten any maps in there (see https://github.com/apache/pinot/blob/master/pinot-segment-local/src/main/java/org/[…]not/segment/local/recordtransformer/ComplexTypeTransformer.java). So then I was thinking that a workaround would be to store auditFields as a JSON string and then hydrate it in a transform function, but as far as I can tell there isn't a function that converts a JSON string to a JSON object
we should create an issue for it though, as it's a valid use case
s
so as it is currently implemented, I could create schema fields for some of the auditFields, i.e. "auditFields.a" from the example above?
m
yup
s
That would do for my current investigation, however I think it is a valid use case. Without knowing a great deal about Pinot at this point, it feels to me like these transformations/complexTypeConfigs should be a pipeline of transformations applied to the document, where each step only touches the part of the document it is configured to touch
m
so the bit of code that does the flattening does take in a list of columns
but right now it’s hard coded to all of them:
new ArrayList<>(record.getFieldToValueMap().keySet())
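The hard-coded list above could in principle be replaced by a configurable set of fields to flatten. A minimal standalone sketch of that idea (the names `SelectiveFlatten`, `flattenSelected`, and `fieldsToFlatten` are made up for illustration, not Pinot API):

```java
import java.util.*;

// Hypothetical sketch: flatten only the top-level map fields listed in
// fieldsToFlatten into dotted column names, leaving other nested objects
// (e.g. auditFields) intact so a transform could still see them.
public class SelectiveFlatten {
    public static Map<String, Object> flattenSelected(Map<String, Object> record,
                                                      Set<String> fieldsToFlatten) {
        Map<String, Object> out = new HashMap<>();
        for (Map.Entry<String, Object> e : record.entrySet()) {
            Object v = e.getValue();
            if (v instanceof Map && fieldsToFlatten.contains(e.getKey())) {
                // flatten this map into "<parent>.<child>" columns
                for (Map.Entry<?, ?> inner : ((Map<?, ?>) v).entrySet()) {
                    out.put(e.getKey() + "." + inner.getKey(), inner.getValue());
                }
            } else {
                // leave the field untouched (e.g. keep it as a nested object)
                out.put(e.getKey(), v);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> audit = new HashMap<>();
        audit.put("a", "b");
        Map<String, Object> record = new HashMap<>();
        record.put("auditFields", audit);
        record.put("id", 1);
        // with an empty flatten list, auditFields stays a nested map
        Map<String, Object> out = flattenSelected(record, Collections.emptySet());
        System.out.println(out.get("auditFields") instanceof Map); // true
    }
}
```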
s
yes, then it has _fieldsToUnnest
m
yeh so we’d want something similar to that
where it only flattens the fields you provide
s
ok, thanks for your time helping to debug this
m
@Simon J my colleague @Seunghyun was looking at this and wanted to get some input on a couple of things. e.g. what configuration parameter(s) would you want to be added and some examples of what the expected output should be with each configuration?
s
Hi @Mark Needham, sorry I missed this. To me it feels a bit like a pipeline, where the input is the raw event from Kafka, and then a bunch of transformations are applied to it before it is stored according to the schema (indexes applied etc). So one step of the pipeline would be unnesting, another might be transforming. At this point order becomes important.
The user could then order these various stages however they wished. From the PoV of debugging, it might be useful to be able to see the state of the document between each step
So in my case, I would have a first stage that converts the "auditFields" field into JSON and stores it in the auditFields_json field; it would then apply unnesting of the "results" field, which would result in a 1->n record transformation
The steps in my case would also work in the opposite order, however there would likely be a performance hit, as the "auditFields"-to-JSON transformation would happen after the unnesting and so would run n times per input document, rather than once (where n could be any value >= 1)
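A hypothetical ordered-pipeline config along those lines might look like this (none of these keys exist in Pinot; the `ingestionPipeline` and `stage` names are purely illustrative of the idea being proposed):

```json
{
	"ingestionPipeline": [
		{
			"stage": "transform",
			"columnName": "auditFields_json",
			"transformFunction": "jsonFormat(\"auditFields\")"
		},
		{
			"stage": "unnest",
			"field": "results"
		}
	]
}
```

Here the transform runs once per input document before the unnest fans it out into n rows, matching the ordering argument above.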