Hi , I have Text index up and running and looks wo...
# general
m
Hi , I have Text index up and running and looks working. However I noticed that some results are not correct for eg:- when I search for
40F916FD-F2A7-2255-FEFB-B43050D8A5EE
. I get results for
81753586-72E1-8DC1-FEFB-08DB16E6A793
&
40F916FD-F2A7-2255-FEFB-B43050D8A5E
. Trying to understand why it is so. Also If I try to search for XML tags like
</ns1:requestControlID>
it throws an error. Is there any setting I can enable to make these searches work?
m
@Sidd ^^
m
I tried to add escape char and the XML tags search looks working. However I still face the first issue where alpha numeric with hypens search providing wrong results..!
s
Hi @Matt can you point me to your queries?
m
@Sidd Thanks please see the query select
Copy code
select DATETIMECONVERT((timemillis/1000), '1:MILLISECONDS:EPOCH', '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss tz(America/New_York)', '1:SECONDS'),log  from local.mytable where TEXT_MATCH(log,'40F916FD-F2A7-2255-FEFB-B43050D8A5EE')
I tried without date formatting and still the same result
Copy code
select timemillis,log  from local.mytable where TEXT_MATCH(log,'40F916FD-F2A7-2255-FEFB-B43050D8A5EE')
another example is
75BC2F92-68A6-9237-6101-B6A778F602B8
matching
65FC22F3-68A6-B63C-31F4-AFE236D5DCFF
and
75BC2F92-68A6-9237-6101-B6A778F602B8
s
"-" is used as a word/term separator during index generation. So essentially the original string is tokenized and FEFB is one of the 5 terms that gets indexed Now on the query side, the same text parser and analyzer is used to tokenize the search query and it essentially becomes a multi term query with OR operator
(40F916FD OR F2A7 OR 2255 OR  FEFB OR B43050D8A5EE)
This is the reason why both strings are matching since both of them have the term FEFB You should change the query for Lucene to take it as a phrase query
WHERE TEXT_MATCH(log, '\"40F916FD-F2A7-2255-FEFB-B43050D8A5EE\"')
this will take the search string as one single phrase and will only match documents containing this as is.
@Matt ^^
I have verified this locally
m
@Sidd, I tried it however the phrase query is providing me with same results
s
It worked for me. How are you using the phrase query?
m
When I changed it to this it started working ,
select timemillis,log from local.mytable where TEXT_MATCH(log,'40F916FD AND F2A7 AND 2255 AND FEFB AND B43050D8A5EE')
s
yes this works too and phrase query works as well
WHERE TEXT_MATCH(log, '\"40F916FD-F2A7-2255-FEFB-B43050D8A5EE\"')
enclosing the search string in double quotes
m
@Sidd I am trying as now
Copy code
select timemillis,log from local.mytable where TEXT_MATCH(log,'\"40F916FD-F2A7-2255-FEFB-B43050D8A5EE\"')
However I stll get the same results
I mean it gives me multiple matches
s
can you list a few matches?
m
sure
so when I try
Copy code
select timemillis,log  from local.mytable where TEXT_MATCH(log,'\"DAE49404-3788-B121-B37A-B901D4891622\"')
I get results matching
2F440BB9-B37A-C808-9DF3-A4BD90954C25
and 
DE5A2A09-3788-A426-F194-A72C82F65C33
 
03D81725-280E-B37A-C22C-BDDAF65E3525
  
DAE49404-3788-B121-B37A-B901D4891622
s
let me try to create a unit test out of this data and see what happens
👍 1
So I tried a test using the exact same data • Using the phrase query matches the document exactly once as expected • Using the regular query matches multiple documents due to splitting of terms as explained above Wondering why it is different for you. I had recently fixed a text index metadata related bug that can lead to incorrect results. It is not related to the query type. May be you just happen to hit that bug. This can typically happen based on a certain structure of text index directory. May be we can hop on a call and you can show me the index directory. Also, what are the exact contents for the log column in the rows for 2F440BB9-B37A-C808-9DF3-A4BD90954C25 etc. Is it just this or some other text as well?
m
@Sidd Thanks for testing it , I am using the docker image from apachepinot/pinot , I will try with the latest one. I assume I might have to reindex?
The logs contain lots of data
eg:-
Copy code
21:44:01,542 INFO filters.ServiceBoundryInitFilter - [2F440BB9-B37A-C808-9DF3-A4BD90954C25] Begin Transaction @ 12/2/20 4:44 PM, associated with session [76FE276697C037D04FEFA621C942BBB1].

21:44:01,544 DEBUG authentication.AnonymousAuthenticationFilter - [2F440BB9-B37A-C808-9DF3-A4BD90954C25] Populated SecurityContextHolder with anonymous token:
02:05:04,574 DEBUG intercept.FilterSecurityInterceptor - [DE5A2A09-3788-A426-F194-A72C82F65C33] RunAsManager did not change Authentication object
s
So each log line represents a row for log column?
m
thats correct
s
Ok, my guess is you are probably hitting that bug. So try with latest. Re-indexing is not needed.
m
great let me try that now...
i tried with the latest image but no luck
s
Can we have a call sometime today?
I would like to look at a few things live when you run queries
m
After disabling the Text index cache i.e. "enableQueryCacheForTextIndex": "false", the searches are returning correct values. Thanks @Sidd for your great support here in providing the solution.
👍 1