https://pinot.apache.org/ logo
#general
Title
# general
m

Matt

12/02/2020, 10:50 PM
Hi , I have Text index up and running and looks working. However I noticed that some results are not correct for eg:- when I search for
40F916FD-F2A7-2255-FEFB-B43050D8A5EE
. I get results for
81753586-72E1-8DC1-FEFB-08DB16E6A793
&
40F916FD-F2A7-2255-FEFB-B43050D8A5E
. Trying to understand why it is so. Also If I try to search for XML tags like
</ns1:requestControlID>
it throws an error. Is there any setting I can enable to make these searches work?
m

Mayank

12/02/2020, 10:51 PM
@Sidd ^^
m

Matt

12/02/2020, 11:06 PM
I tried to add escape char and the XML tags search looks working. However I still face the first issue where alpha numeric with hypens search providing wrong results..!
s

Sidd

12/02/2020, 11:08 PM
Hi @Matt can you point me to your queries?
m

Matt

12/02/2020, 11:15 PM
@Sidd Thanks please see the query select
Copy code
select DATETIMECONVERT((timemillis/1000), '1:MILLISECONDS:EPOCH', '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss tz(America/New_York)', '1:SECONDS'),log  from local.mytable where TEXT_MATCH(log,'40F916FD-F2A7-2255-FEFB-B43050D8A5EE')
I tried without date formatting and still the same result
Copy code
select timemillis,log  from local.mytable where TEXT_MATCH(log,'40F916FD-F2A7-2255-FEFB-B43050D8A5EE')
another example is
75BC2F92-68A6-9237-6101-B6A778F602B8
matching
65FC22F3-68A6-B63C-31F4-AFE236D5DCFF
and
75BC2F92-68A6-9237-6101-B6A778F602B8
s

Sidd

12/02/2020, 11:31 PM
"-" is used as a word/term separator during index generation. So essentially the original string is tokenized and FEFB is one of the 5 terms that gets indexed Now on the query side, the same text parser and analyzer is used to tokenize the search query and it essentially becomes a multi term query with OR operator
(40F916FD OR F2A7 OR 2255 OR  FEFB OR B43050D8A5EE)
This is the reason why both strings are matching since both of them have the term FEFB You should change the query for Lucene to take it as a phrase query
WHERE TEXT_MATCH(log, '\"40F916FD-F2A7-2255-FEFB-B43050D8A5EE\"')
this will take the search string as one single phrase and will only match documents containing this as is.
@Matt ^^
I have verified this locally
m

Matt

12/02/2020, 11:38 PM
@Sidd, I tried it however the phrase query is providing me with same results
s

Sidd

12/02/2020, 11:40 PM
It worked for me. How are you using the phrase query?
m

Matt

12/02/2020, 11:41 PM
When I changed it to this it started working ,
select timemillis,log from local.mytable where TEXT_MATCH(log,'40F916FD AND F2A7 AND 2255 AND FEFB AND B43050D8A5EE')
s

Sidd

12/02/2020, 11:43 PM
yes this works too and phrase query works as well
WHERE TEXT_MATCH(log, '\"40F916FD-F2A7-2255-FEFB-B43050D8A5EE\"')
enclosing the search string in double quotes
m

Matt

12/02/2020, 11:44 PM
@Sidd I am trying as now
Copy code
select timemillis,log from local.mytable where TEXT_MATCH(log,'\"40F916FD-F2A7-2255-FEFB-B43050D8A5EE\"')
However I stll get the same results
I mean it gives me multiple matches
s

Sidd

12/02/2020, 11:44 PM
can you list a few matches?
m

Matt

12/02/2020, 11:45 PM
sure
so when I try
Copy code
select timemillis,log  from local.mytable where TEXT_MATCH(log,'\"DAE49404-3788-B121-B37A-B901D4891622\"')
I get results matching
2F440BB9-B37A-C808-9DF3-A4BD90954C25
and 
DE5A2A09-3788-A426-F194-A72C82F65C33
 
03D81725-280E-B37A-C22C-BDDAF65E3525
  
DAE49404-3788-B121-B37A-B901D4891622
s

Sidd

12/02/2020, 11:53 PM
let me try to create a unit test out of this data and see what happens
👍 1
So I tried a test using the exact same data • Using the phrase query matches the document exactly once as expected • Using the regular query matches multiple documents due to splitting of terms as explained above Wondering why it is different for you. I had recently fixed a text index metadata related bug that can lead to incorrect results. It is not related to the query type. May be you just happen to hit that bug. This can typically happen based on a certain structure of text index directory. May be we can hop on a call and you can show me the index directory. Also, what are the exact contents for the log column in the rows for 2F440BB9-B37A-C808-9DF3-A4BD90954C25 etc. Is it just this or some other text as well?
m

Matt

12/03/2020, 12:38 AM
@Sidd Thanks for testing it , I am using the docker image from apachepinot/pinot , I will try with the latest one. I assume I might have to reindex?
The logs contain lots of data
eg:-
Copy code
21:44:01,542 INFO filters.ServiceBoundryInitFilter - [2F440BB9-B37A-C808-9DF3-A4BD90954C25] Begin Transaction @ 12/2/20 4:44 PM, associated with session [76FE276697C037D04FEFA621C942BBB1].

21:44:01,544 DEBUG authentication.AnonymousAuthenticationFilter - [2F440BB9-B37A-C808-9DF3-A4BD90954C25] Populated SecurityContextHolder with anonymous token:
02:05:04,574 DEBUG intercept.FilterSecurityInterceptor - [DE5A2A09-3788-A426-F194-A72C82F65C33] RunAsManager did not change Authentication object
s

Sidd

12/03/2020, 12:40 AM
So each log line represents a row for log column?
m

Matt

12/03/2020, 12:40 AM
thats correct
s

Sidd

12/03/2020, 12:42 AM
Ok, my guess is you are probably hitting that bug. So try with latest. Re-indexing is not needed.
m

Matt

12/03/2020, 12:42 AM
great let me try that now...
i tried with the latest image but no luck
s

Sidd

12/03/2020, 5:05 PM
Can we have a call sometime today?
I would like to look at a few things live when you run queries
m

Matt

12/04/2020, 8:40 PM
After disabling the Text index cache i.e. "enableQueryCacheForTextIndex": "false", the searches are returning correct values. Thanks @Sidd for your great support here in providing the solution.
👍 1