Monica
03/09/2022, 9:21 AM
Mayank
Ken Krugler
03/09/2022, 6:59 PM
Mayank
Kishore G
Monica
03/10/2022, 3:03 AM
Kishore G
Ken Krugler
03/14/2022, 9:45 PM
Kishore G
Monica
03/17/2022, 9:43 AM
"analyzer": {
  "my_custom_analyzer": {
    "type": "custom",
    "tokenizer": "standard",
    "char_filter": [
      "html_strip"
    ],
    "filter": [
      "lowercase",
      "asciifolding"
    ]
  }
}
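To make concrete what a chain like this does to input text, here is a rough sketch in plain Java. It is only a stdlib approximation of the three stages (strip HTML, tokenize, lowercase and fold accents), not Lucene's actual implementation; the class and method names are made up for illustration.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class AnalyzerChainSketch {
    // char_filter "html_strip": crude tag removal (assumption: no entity decoding here).
    static String stripHtml(String text) {
        return text.replaceAll("<[^>]*>", " ");
    }

    // tokenizer "standard": approximated by splitting on runs of
    // characters that are neither letters nor digits.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // filter "asciifolding": approximated by decomposing characters (NFD)
    // and dropping the combining marks.
    static String foldAscii(String token) {
        return Normalizer.normalize(token, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    // Runs the stages in the same order as the config: char filter, tokenizer, token filters.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String token : tokenize(stripHtml(text))) {
            out.add(foldAscii(token.toLowerCase(Locale.ROOT)));
        }
        return out;
    }

    public static void main(String[] args) {
        // "Caf\u00e9" is "Café" written with a Unicode escape.
        System.out.println(analyze("<b>Caf\u00e9</b> Ol\u00e9!")); // prints [cafe, ole]
    }
}
```

The order matters: char filters see the raw text, the tokenizer sees the filtered text, and token filters see individual tokens.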
I'm considering putting this config in fieldConfigList. I would then pass the configuration to the relevant classes, such as LuceneTextIndexCreator, RealtimeLuceneTextIndexReader, etc. I've already added support for it in my own Pinot feature, but there are some problems I don't know how to handle:
1. How to support custom analyzers written by users (for example, classes extending org.apache.lucene.analysis.Analyzer), and where should these classes live in Pinot?
2. Currently I'm breaking a lot of class methods and constructors, such as IndexCreationContext and SegmentGeneratorConfig in the SPI, LuceneTextIndexReader's constructor, etc. I guess I should validate configurations as much as possible before instantiating anything, and try not to pollute the code. For example, Lucene's classes should only exist in the pinot-segment-local module.
Could you give me some advice on how to contribute this feature better?
If you also think this feature makes sense, I will try to take it on and open a new issue in Pinot's GitHub repository. :)
Kishore G
Ken Krugler
03/17/2022, 4:44 PM
"text_en_similarity": {
  "char_filters": [
    "similarity_newline_remapper",
    "similarity_punct_remapper"
  ],
  "tokenizer": "StandardTokenizerFactory",
  "token_filters": [
    "LowerCaseFilterFactory",
    "EnglishPossessiveFilterFactory",
    "currency_remapper",
    "similarity_stemmer_en",
    "similarity_word_delimiter",
    "number_remapper",
    "similarity_stemmer_en"
  ]
},
We used the xxxFactory names for things that are standard Lucene classes, and other names map to custom tokenizers or filters (either regular filters or char filters), which are then defined in subsequent sections of the JSON. Given what I understand of adding something like this to the configuration, I'd suggest having a new section that defines every required analysis chain, and then in text index field definitions you can optionally provide the name of the analyzer to use. Maybe that's what @User was already proposing…
Kishore G
Ken Krugler
03/17/2022, 5:43 PM
pinot-text-analyzer plugin subdir? And then have a standard analyzer as the one plugin?
Ken Krugler
03/17/2022, 5:45 PM
Monica
03/18/2022, 4:05 AM
{
  "tableName": "transcript_analyzer_1",
  ...
  "fieldConfigList": [
    {
      "name": "text1",
      "encodingType": "RAW",
      "indexType": "TEXT",
      "analyzer": {
        "type": "custom",
        ...
      }
    },
    {
      "name": "text2",
      "encodingType": "RAW",
      "indexType": "TEXT",
      "analyzer": {
        "type": "standard",
        ...
      }
    }
  ]
}
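As a rough sketch of how a per-column analyzer "type" like the ones above could be checked before any index creator or reader is constructed, something along these lines might work. Everything here is hypothetical: TextAnalyzer is a stand-in for org.apache.lucene.analysis.Analyzer so the example runs without Lucene, and the registry class is not an existing Pinot API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

public class FieldAnalyzerResolverSketch {
    // Hypothetical stand-in for org.apache.lucene.analysis.Analyzer,
    // so this sketch compiles without Lucene on the classpath.
    interface TextAnalyzer {
        String describe();
    }

    // Hypothetical registry mapping the "type" value of an analyzer config to a factory.
    private final Map<String, Supplier<TextAnalyzer>> factories = new LinkedHashMap<>();

    void register(String type, Supplier<TextAnalyzer> factory) {
        factories.put(type, factory);
    }

    // Fail fast on unknown types, before anything is instantiated for the column.
    TextAnalyzer resolve(String column, String type) {
        Supplier<TextAnalyzer> factory = factories.get(type);
        if (factory == null) {
            throw new IllegalArgumentException(
                "Unknown analyzer type '" + type + "' for column '" + column + "'");
        }
        return factory.get();
    }

    // Registry pre-populated with the two types used in the example table config.
    static FieldAnalyzerResolverSketch withExampleTypes() {
        FieldAnalyzerResolverSketch resolver = new FieldAnalyzerResolverSketch();
        resolver.register("standard", () -> () -> "standard analyzer");
        resolver.register("custom", () -> () -> "custom analysis chain");
        return resolver;
    }

    public static void main(String[] args) {
        FieldAnalyzerResolverSketch resolver = withExampleTypes();
        // Column name -> analyzer type, mirroring text1 and text2 in the config.
        Map<String, String> analyzerTypeByColumn = new LinkedHashMap<>();
        analyzerTypeByColumn.put("text1", "custom");
        analyzerTypeByColumn.put("text2", "standard");
        for (Map.Entry<String, String> e : analyzerTypeByColumn.entrySet()) {
            System.out.println(e.getKey() + " -> "
                + resolver.resolve(e.getKey(), e.getValue()).describe());
        }
    }
}
```

Keeping the lookup behind a small interface like this is one way to keep Lucene types out of the SPI: only the module that registers the factories would need to depend on Lucene.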
Kishore G
Kishore G
Kishore G
Monica
03/18/2022, 4:20 AM
Kishore G
Ken Krugler
03/18/2022, 3:07 PM