# general
m
Hey everyone, I found that the Pinot text index only supports the standard analyzer. Is there any plan to support custom analyzers, like Elasticsearch does? Or could you give me some advice on how best to support it if we build this feature?
m
Not at the moment. Could you describe the use case where you would need this in Pinot?
k
@User we worked around this limitation by doing the analysis in a Flink workflow, and using the resulting terms in a multi-valued string field that we used for queries (filtering). It doesn’t do true phrases, but we generate both one- and two-term strings, and we do the same analysis on the user query, so it (almost) eliminates any false positives.
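A minimal sketch of that workaround’s analysis step, assuming Lucene’s `StandardAnalyzer` plus `ShingleFilter` from `lucene-analysis-common` (the class and field names are made up); the emitted terms would be written to the multi-valued string column:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Analyzes text and emits both single terms and two-term shingles,
// suitable for storing in a multi-valued string column.
public final class ShingleTerms {
  public static List<String> analyze(String text) throws IOException {
    List<String> terms = new ArrayList<>();
    try (Analyzer analyzer = new StandardAnalyzer()) {
      TokenStream base = analyzer.tokenStream("body", new StringReader(text));
      // Max shingle size 2; unigrams are emitted by default.
      try (TokenStream shingles = new ShingleFilter(base, 2)) {
        CharTermAttribute term = shingles.addAttribute(CharTermAttribute.class);
        shingles.reset();
        while (shingles.incrementToken()) {
          terms.add(term.toString());
        }
        shingles.end();
      }
    }
    return terms;
  }
}
```

Running the same method over the user’s query terms at query time gives the matching behavior described above.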
m
Thanks @User, always appreciate your help.
k
Thanks Ken. For my own understanding, what’s the use case for custom analyzers? It should be easy to make the analyzer pluggable.
m
Thanks for your reply. We have some Chinese fields for which the standard analyzer isn’t suitable.
k
Got it.. feel free to contribute this feature.. we can help you
k
Hi @User - we have a lot of language-specific analyzers (16 total).
k
should be easy to make this configurable, right?
m
Thanks @User. I'd like to contribute this feature. My main idea is to configure each text index field with its own analyzer. The analyzer configuration would be similar to Elasticsearch's custom analyzer:
```json
"analyzer": {
  "my_custom_analyzer": {
    "type": "custom",
    "tokenizer": "standard",
    "char_filter": [
      "html_strip"
    ],
    "filter": [
      "lowercase",
      "asciifolding"
    ]
  }
}
```
I'm considering putting this config in `fieldConfigList`, then passing it through to the classes that use it, like `LuceneTextIndexCreator`, `RealtimeLuceneTextIndexReader`, etc. I've already built support for this in my own Pinot fork, but there are some problems I don't know how to handle:
1. How do we support custom analyzers written by users, e.g. classes extending `org.apache.lucene.analysis.Analyzer`? Where should those classes live in Pinot?
2. Currently I'm changing a lot of classes' methods and constructors, like `IndexCreationContext` and `SegmentGeneratorConfig` in the SPI, `LuceneTextIndexReader`'s constructor, etc. I guess I should validate the configuration as much as possible before instantiation, and try not to pollute the code; for example, Lucene's classes should only exist in the `pinot-segment-local` module.
Could you give me some advice on how to contribute this feature better? If you also think this feature makes sense, I'll take it on and open a new issue on Pinot's GitHub repository. :)
k
Looks good at a high level.. let’s file an issue and continue the discussion there. @User would love to get your input here
k
We created a similar JSON-based format for specifying the analysis chain in our Flink workflow. Format looks like:
```json
"text_en_similarity": {
  "char_filters": [
    "similarity_newline_remapper",
    "similarity_punct_remapper"
  ],
  "tokenizer": "StandardTokenizerFactory",
  "token_filters": [
    "LowerCaseFilterFactory",
    "EnglishPossessiveFilterFactory",
    "currency_remapper",
    "similarity_stemmer_en",
    "similarity_word_delimiter",
    "number_remapper",
    "similarity_stemmer_en"
  ]
},
```
We used the `xxxFactory` names for things that are standard Lucene classes, and other names map to custom tokenizers or filters (either regular filters or char filters), which are then defined in subsequent sections of the JSON. Given what I understand of adding something like this to the configuration, I’d suggest having a new section that defines every required analysis chain, and then in text index field definitions you can optionally provide the name of the analyzer to use. Maybe that’s what @User was already proposing…
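A hypothetical sketch of that dispatch rule (the resolver class and the explicit name table are made up; `TokenFilterFactory.forName` is real Lucene API, with Lucene 9 package names, and the SPI names shown are assumptions):

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilterFactory;

// Hypothetical resolver for the naming rule above: "...Factory" names are
// standard Lucene classes loaded through Lucene's SPI; anything else must
// be a custom filter defined in a later section of the JSON.
public final class TokenFilterResolver {
  // Explicit class-name -> Lucene SPI name table; a real implementation
  // might derive these rather than listing them.
  private static final Map<String, String> SPI_NAMES = Map.of(
      "LowerCaseFilterFactory", "lowercase",
      "EnglishPossessiveFilterFactory", "englishPossessive");

  private final Map<String, TokenFilterFactory> customFilters; // parsed from the JSON

  public TokenFilterResolver(Map<String, TokenFilterFactory> customFilters) {
    this.customFilters = customFilters;
  }

  public TokenFilterFactory resolve(String name, Map<String, String> args) {
    if (name.endsWith("Factory")) {
      // Standard Lucene filter; forName() instantiates it via SPI.
      return TokenFilterFactory.forName(SPI_NAMES.get(name), new HashMap<>(args));
    }
    TokenFilterFactory custom = customFilters.get(name);
    if (custom == null) {
      throw new IllegalArgumentException("Unknown token filter: " + name);
    }
    return custom;
  }
}
```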
k
my only suggestion would be to flatten everything if possible so that it gets read as properties.. I would like Pinot to be able to pass these properties to the analyzer without trying to understand the semantics of each property
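For illustration, the flattened form might look like plain string properties (the key names here are hypothetical, not an existing Pinot convention):

```java
import java.util.Map;

// Hypothetical flattened analyzer config: Pinot would store these as
// opaque field-config properties and pass the whole map through to the
// analyzer factory, which alone interprets them.
final class AnalyzerProperties {
  static final Map<String, String> EXAMPLE = Map.of(
      "analyzer.tokenizer", "standard",
      "analyzer.charFilters", "htmlStrip",
      "analyzer.tokenFilters", "lowercase,asciiFolding");
}
```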
k
Would it make sense to have a new `pinot-text-analyzer` plugin subdir? And then have a standard analyzer as the one plugin?
Also, the analyzers feel more like Pinot FS implementations. You want to be able to instantiate a named analyzer, which can often have a bunch of associated classes and data (stop words, stemming data, etc).
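A hypothetical shape for such a contract, loosely modeled on how PinotFS implementations are instantiated by name (the interface and method names are invented):

```java
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;

// Hypothetical plugin contract: implementations are registered under a
// name and configured with their own properties (stop-word files,
// stemming data, etc.) before producing Analyzer instances.
public interface TextAnalyzerPlugin {
  void init(Map<String, String> config);
  Analyzer newAnalyzer();
}
```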
m
Thanks for your suggestions. I think at first I could support only analyzer configurations built from standard Lucene classes (maybe as a next step we could add an analyzer plugin). As for the configuration format, it seems easier for each text index field to carry its own complete analyzer definition: the configuration then doesn't change the current structure and stays backward compatible, though different fields may end up with duplicate analyzer definitions. So the table config may look like:
```json
{
  "tableName": "transcript_analyzer_1",
  ...

  "fieldConfigList": [
    {
      "name": "text1",
      "encodingType": "RAW",
      "indexType": "TEXT",
      "analyzer": {
        "type": "custom",
        ...
      }
    },
    {
      "name": "text2",
      "encodingType": "RAW",
      "indexType": "TEXT",
      "analyzer": {
        "type": "standard",
        ...
      }
    }
  ]
}
```
k
Lgtm
I don’t think this warrants a plugin because Pinot does not understand this
Do you plan to add a custom one, or use one already in Lucene?
m
I plan to use the custom analyzer support already in Lucene, because it can provide rich functionality through different configurations.
k
Ok.. then this should be simple
k
You also want to think about where supporting files live, and how they get accessed. Most users of Lucene analyzers wind up customizing behavior (stop words, protected words, etc). It would be hard to do this inside the schema, as you’d have to support JSON equivalent fields for all of the files, and re-work the analyzers to load from that versus from files found on the classpath, etc.
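One possible answer for file-based resources: Lucene’s `CustomAnalyzer` can resolve files like stop-word lists relative to a config directory, so supporting files could live beside the table config rather than inside the schema. A minimal sketch (the directory layout and file name are assumptions):

```java
import java.io.IOException;
import java.nio.file.Path;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

// Builds an analyzer whose stop-word list is loaded from a file found
// relative to the given config directory, e.g. <configDir>/stopwords.txt.
public final class StopwordAnalyzer {
  public static Analyzer build(Path configDir) throws IOException {
    return CustomAnalyzer.builder(configDir)
        .withTokenizer("standard")
        .addTokenFilter("lowercase")
        .addTokenFilter("stop", "ignoreCase", "true", "words", "stopwords.txt")
        .build();
  }
}
```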