# getting-started
d
Hello there 👋 I'm developing something that uses Pinot, consuming straight from a new Kafka topic. I was able to run everything I need and it is beautiful (thanks for the work on this project 💪).

Now I'm trying to improve some things in my project and wondered if there is a way to use a schema registry instead of keeping the table schema inside the project itself. What I would like to happen: I have a JSON schema for the topic Pinot will consume from, and instead of manually editing/creating the table schema (as explained here in the docs), I would like Pinot to read the JSON schema from my registry and automagically use it when ingesting.

I'm not sure if the configs
stream.kafka.decoder.prop.schema.registry.rest.url
and
stream.kafka.decoder.prop.schema.registry.schema.name
could help me achieve this.
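For context, those decoder properties normally go in the table config's `streamConfigs` map. A minimal sketch, assuming the Confluent schema-registry Avro decoder (the topic name, registry URL, and schema name below are placeholders); as far as I can tell, these settings configure message decoding, and the Pinot table schema still has to be defined separately:

```json
"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.topic.name": "my-topic",
  "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder",
  "stream.kafka.decoder.prop.schema.registry.rest.url": "http://schema-registry:8081",
  "stream.kafka.decoder.prop.schema.registry.schema.name": "my-topic-value"
}
```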
r
I have been working on JSON schema inference recently, I'm curious how you would rank using a json schema vs inference
I guess if you have the schema already you don't want Pinot to mess around figuring it out. One of the problems with JSON schema is that it allows variant types, so e.g. a field can be a string or a double. This sort of thing can actually be handled better in some ways with inference: if the values for the field are only ever doubles, then it will be inferred as a double; or if the string values have temporal locality, then they can be handled better than creating a sparse column for a handful of values.
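To make the variant-type point concrete, here is a toy sketch of inferring a column type from sampled values. This is not Pinot's actual inference code, and `infer_type` is a made-up helper; it just illustrates the idea that a field only becomes a DOUBLE if every sampled value is numeric, otherwise it falls back to STRING.

```python
def infer_type(values):
    """Infer a toy column type from a sample of JSON values."""
    def is_numeric(v):
        # JSON booleans are not numbers for our purposes.
        if isinstance(v, bool):
            return False
        if isinstance(v, (int, float)):
            return True
        # A string like "3.14" still counts as numeric.
        try:
            float(v)
            return True
        except (TypeError, ValueError):
            return False

    # Only infer DOUBLE when every sampled value is numeric.
    return "DOUBLE" if values and all(is_numeric(v) for v in values) else "STRING"
```

So a field whose samples are `[1, 2.5, "3.0"]` would come out as DOUBLE, while `[1, "abc"]` would fall back to STRING.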
d
What exactly do you mean by work with inference? What I would like to do is to not have to keep the table schema together with my project. I would like to point to a registry URL and a schema name in the table config, and have Pinot find its way from there.
r
> What exactly do you mean by work with inference?
I have been working on a feature to infer schemas from JSON data, which is aimed at use cases where there is no schema for the records, but it could also support your use case reasonably well.
If you had access to such a feature, would you still want to be able to point Pinot at the schema?
d
Probably not; inference would solve my problem. But maybe it would have some trouble dealing with the date format I need to use: RFC3339, which is basically a string with a very specific format...
r
exactly, inferring dates is one of the headaches
d
right now I have my dateTime field with the following config:
```json
{
  "name": "operationDate",
  "dataType": "STRING",
  "format": "1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd'T'HH:mm:ss.SSSZ",
  "granularity": "1:MILLISECONDS"
}
```
A pretty dirty and ugly workaround would be to infer it as a string and query it using the now-available `LIKE` operation. So if I wanted to group data by month, I could try something like `WHERE operationDate LIKE '2021-10%'`... it could work...
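For what it's worth, the prefix trick works because RFC3339 strings with a fixed offset compare lexicographically in the same order as the instants they represent. A quick sketch with made-up sample values:

```python
# RFC3339 timestamps (with a fixed offset, here "Z") sort lexicographically
# in chronological order, so a string prefix match like LIKE '2021-10%'
# effectively filters down to a single month.
rows = [
    "2021-09-30T23:59:59.999Z",
    "2021-10-01T00:00:00.000Z",
    "2021-10-15T12:30:00.000Z",
    "2021-11-01T00:00:00.000Z",
]

# Equivalent of: WHERE operationDate LIKE '2021-10%'
october = [ts for ts in rows if ts.startswith("2021-10")]
print(october)  # only the two October timestamps remain
```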
m
Side note: the schema in Pinot can potentially differ from the upstream one (if you use transforms or derived columns, or don't use some of the columns from upstream).
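As an illustration of that side note, a derived column can be declared via an ingestion transform in the table config. A sketch only: the column name is hypothetical, and the exact quoting of the pattern literal inside the transform function is an assumption here:

```json
"ingestionConfig": {
  "transformConfigs": [
    {
      "columnName": "operationDateMillis",
      "transformFunction": "fromDateTime(operationDate, 'yyyy-MM-dd''T''HH:mm:ss.SSSZ')"
    }
  ]
}
```

A column like this would exist only in Pinot, not in the upstream topic's schema.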
d
That's a good point 🤔