we found if a STRING column has tab characters in ...
# general
t
we found if a STRING column has tab characters in it (i.e., \t) and used as a group-by key. The query result will return two group-by keys for a result. anyone aware of this bug or walk around it?
n
the group key generator uses tabs as delimiter
i guess we’d have to change delimiter or configure it somehow, to support this
@Jackie wdyt ?
t
yes. found this code here
Copy code
@Override
public void dumpToGroupByResults(LinkedList<GroupByResult> dest) {
  GroupKeyResultPair groupKeyResultPair;
  while ((groupKeyResultPair = _heap.poll()) != null) {
    // Set limit to -1 to prevent removing trailing empty strings
    String[] groupKeys = groupKeyResultPair._groupKey.split(GroupKeyGenerator.DELIMITER, -1);
j
Yes, that’s correct
We can change the delimiter
n
Ting, do you want to submit the fix?
t
I can file an issue first and happy to fix it. As for the fix, I need to check the code again. I do not fully understand why we need to do SPLIT in the first place.
n
look at an implementation of GroupKeyGenerator e.g. DictionaryBasedGroupKeyGenerator. In the getGroupKey method, we append the group key strings, using tab
and so in all places later on, we split with tab. If you trace usages of `
Copy code
GroupKeyGenerator.DELIMITER
you’ll see them all
t
can we not concatenate the column values and then split them later? this is my main question.
👍 1
n
afaik, that would require a significant amount of changes in the core group by functionality
t
changing delimiter will introduce another config and it is hard to predict what special character users put into their Pinot table.
m
Since delimiter is transient per query, what’s the issue with changing the static final? \t seems like a poor choice for delimiter
t
In the end we walk around the issue by doing a trim ops in Presto.
Sorry for the late reply. We tried not to do a rebuild to change the delimiter. Also the table data may have other weird characters which make the new delimiter not working correctly.