Jonathan Meyer
06/13/2021, 10:59 AM
SELECT SUM(value) FROM table WHERE user IN LOOKUP('group', 'user', 'groupId', '<groupId>') ? (which isn't valid)
Basically my goal is to fetch a list of 'users' in the dimTable using a 'group' identifier and filtering on those (in the main table)
Jackie
06/14/2021, 1:40 AM
Use IdSet for Id Filtering design: https://docs.pinot.apache.org/developers/design-documents. This query can be modeled as a subquery and solved with the IN_SUBQUERY function
Jonathan Meyer
06/14/2021, 7:27 AM
Jonathan Meyer
06/14/2021, 9:33 AM
IdSets.create(DataType dataType) ?
b. Creating from a query using SELECT ID_SET(userId) FROM table
  i. With only 2 ids (string / hex data type) in the ID_SET
    1. Default IdSet expectedInsertions value returns an output (serialized IdSet ?) so long, Firefox truncates it and the Pinot UI doesn't display it
    2. Using a smaller value (expectedInsertions:1000) reduces the output size, but it is still very very long
    3. Is that due to the data type of my ids ?
2. Querying:
  ⦠...
Jackie
06/14/2021, 4:11 PM
IN_SUBQUERY function, no need to explicitly create an IdSet
Jonathan Meyer
06/14/2021, 4:12 PM
IdSet when data is not in a table already ?
Jackie
06/14/2021, 4:14 PM
SELECT SUM(value) FROM table1 WHERE IN_SUBQUERY(user, 'SELECT ID_SET(groupId) FROM group WHERE ...') = 1
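As a rough mental model of this kind of query (not Pinot's implementation), the inner ID_SET query collects the ids first, and the outer filter then tests each row for membership, yielding 1 or 0. The sketch below uses a plain Python set in place of a serialized IdSet; the rows and the helper names are made up for illustration.

```python
# Conceptual model only: a Python set stands in for Pinot's serialized IdSet.
# All rows and function names below are hypothetical.

def id_set(rows, column, predicate):
    """Mimic SELECT ID_SET(column) FROM ... WHERE predicate."""
    return {row[column] for row in rows if predicate(row)}

def in_subquery(value, ids):
    """Mimic IN_SUBQUERY(value, '...'): 1 if the value is in the set, else 0."""
    return 1 if value in ids else 0

group_rows = [
    {"user": "u1", "groupId": "g1"},
    {"user": "u2", "groupId": "g1"},
    {"user": "u3", "groupId": "g2"},
]
main_rows = [
    {"user": "u1", "value": 10},
    {"user": "u2", "value": 5},
    {"user": "u3", "value": 7},
]

# Inner query: ids of the users that belong to group g1.
ids = id_set(group_rows, "user", lambda r: r["groupId"] == "g1")
# Outer query: sum the values of the rows whose user is in that set.
total = sum(r["value"] for r in main_rows if in_subquery(r["user"], ids) == 1)
print(total)
```

In this model, an inner query that matches nothing produces an empty set, so a filter on = 1 keeps no rows and a filter on = 0 keeps all of them.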
Jackie
06/14/2021, 4:14 PM
ID_SET()
Jonathan Meyer
06/14/2021, 4:53 PM
Jonathan Meyer
06/14/2021, 4:53 PM
Jonathan Meyer
06/14/2021, 4:54 PM
SELECT entityId, dateString, value FROM mainTable WHERE IN_SUBQUERY(entityId, 'SELECT ID_SET(entityId) FROM idsTable WHERE communityId = comm1') = 1
This returns no result, using = 0 returns all of them
Looks like it works without the WHERE clause (IN_SUBQUERY)
But performance is really impacted, a very simple query with basically no data takes nearly a second to return a result that isn't correct
Jackie
06/14/2021, 5:47 PM
SELECT entityId, dateString, value FROM mainTable WHERE IN_SUBQUERY(entityId, 'SELECT ID_SET(entityId) FROM idsTable WHERE communityId = ''comm1''') = 1
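The difference from the earlier attempt is the quoting: the inner query is itself a string literal, so the single quotes around comm1 are doubled. A minimal Python sketch of building such a filter on the client side follows; the helper name is hypothetical and this is plain string handling, not part of any Pinot client library.

```python
def in_subquery_filter(column: str, inner_query: str) -> str:
    """Wrap inner_query as a SQL string literal inside IN_SUBQUERY(...) = 1.

    Single quotes in the inner query are doubled so that they survive
    being embedded in the outer string literal.
    """
    escaped = inner_query.replace("'", "''")
    return f"IN_SUBQUERY({column}, '{escaped}') = 1"

# The inner query is written with ordinary single quotes around the value.
inner = "SELECT ID_SET(entityId) FROM idsTable WHERE communityId = 'comm1'"
query = (
    "SELECT entityId, dateString, value FROM mainTable WHERE "
    + in_subquery_filter("entityId", inner)
)
print(query)
```

Keeping the escaping in one helper avoids hand-editing nested quotes each time the inner WHERE clause changes.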
Jackie
06/14/2021, 5:47 PM
Jackie
06/14/2021, 5:47 PM
Jonathan Meyer
06/14/2021, 5:51 PM
communityId=comm, both entityId [...]6a and [...]6b should come up, but 6b doesn't
communityId=comm1 only returns [...]6a results as expected
Jonathan Meyer
06/14/2021, 5:54 PM
Jonathan Meyer
06/14/2021, 5:58 PM
Jackie
06/14/2021, 6:04 PM
Jonathan Meyer
06/14/2021, 6:04 PM
Jonathan Meyer
06/14/2021, 6:05 PM
Jackie
06/14/2021, 6:05 PM
Jonathan Meyer
06/14/2021, 6:06 PM
Jackie
06/14/2021, 6:07 PM
Jonathan Meyer
06/14/2021, 6:08 PM
Jackie
06/14/2021, 6:08 PM
Jonathan Meyer
06/14/2021, 6:08 PM
SELECT ID_SET(entityId) FROM idsTable
This query takes "forever" (~500ms), and returns a massive payload
Jonathan Meyer
06/14/2021, 6:09 PM
Jackie
06/14/2021, 6:10 PM
Jonathan Meyer
06/14/2021, 6:10 PM
ResponseSize: 4562277
Jonathan Meyer
06/14/2021, 6:10 PM
Jonathan Meyer
06/14/2021, 6:10 PM
Jackie
06/14/2021, 6:11 PM
Jackie
06/14/2021, 6:11 PM
Jonathan Meyer
06/14/2021, 6:11 PM
Jackie
06/14/2021, 6:12 PM
Jonathan Meyer
06/14/2021, 6:13 PM
Jonathan Meyer
06/14/2021, 6:13 PM
Jonathan Meyer
06/14/2021, 6:14 PM
Jonathan Meyer
06/14/2021, 6:14 PM
Jonathan Meyer
06/14/2021, 6:15 PM
4561917
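A response of roughly 4.5 MB for a handful of ids is consistent with a Bloom filter that is sized up front from its expected number of insertions rather than from the ids actually stored. The sketch below uses the textbook sizing formula m = -n ln(p) / (ln 2)^2. Whether Pinot's IdSet uses exactly this formula is an assumption, and so are the false-positive rate of 0.03 and the 5,000,000 expected insertions used here; only the value 1000 appears in this thread.

```python
import math

def bloom_filter_bytes(expected_insertions: int, fpp: float) -> int:
    """Approximate Bloom filter size in bytes, using the textbook formula."""
    bits = math.ceil(-expected_insertions * math.log(fpp) / (math.log(2) ** 2))
    return math.ceil(bits / 8)

# Assumed parameter values, for illustration only.
large = bloom_filter_bytes(5_000_000, 0.03)
small = bloom_filter_bytes(1_000, 0.03)
print(large, small)
```

Under these assumptions the first filter comes out at several megabytes and the second at under a kilobyte. The size does not depend on how many ids were actually inserted.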
Jackie
06/14/2021, 6:30 PM
Jackie
06/14/2021, 6:30 PM
0.7.1 ?
Jonathan Meyer
06/14/2021, 6:30 PM
Jonathan Meyer
06/14/2021, 6:30 PM
Jonathan Meyer
06/14/2021, 6:31 PM
Jackie
06/14/2021, 6:32 PM
0.7.0 is not a good build
Jackie
06/14/2021, 6:32 PM
Jonathan Meyer
06/14/2021, 6:33 PM
Jonathan Meyer
06/14/2021, 6:33 PM
Jackie
06/14/2021, 6:34 PM
0.7.1 then, we released 0.7.1 to replace 0.7.0 because that release contains some bug
Jonathan Meyer
06/14/2021, 6:35 PM
Jonathan Meyer
06/14/2021, 6:41 PM
Jonathan Meyer
06/14/2021, 6:43 PM
Jonathan Meyer
06/14/2021, 6:44 PM
0.7.1 (non jdk-11)
Trying on latest now
Jackie
06/14/2021, 6:45 PM
Jonathan Meyer
06/14/2021, 6:45 PM
Jackie
06/14/2021, 6:46 PM
entityId is a STRING column, right?
Jonathan Meyer
06/14/2021, 6:46 PM
Jonathan Meyer
06/14/2021, 6:46 PM
FLOAT column too
Jackie
06/14/2021, 6:46 PM
INT column? I cannot reproduce the issue with INT column
Jonathan Meyer
06/14/2021, 6:47 PM
Jackie
06/14/2021, 6:47 PM
IdSet
Jonathan Meyer
06/14/2021, 6:47 PM
Jonathan Meyer
06/14/2021, 6:48 PM
Jackie
06/14/2021, 6:51 PM
Jackie
06/14/2021, 6:51 PM
Jonathan Meyer
06/14/2021, 6:52 PM
Jackie
06/14/2021, 6:52 PM
expectedInsertions to reduce the size of the bloom filter
Jackie
06/14/2021, 6:52 PM
Jonathan Meyer
06/14/2021, 6:53 PM
Jackie
06/14/2021, 6:54 PM
Jonathan Meyer
06/14/2021, 6:55 PM
Jackie
06/14/2021, 6:58 PM
IdSet is implemented as an interface, so we can optimize it to support set for non-integer fields
Jackie
06/14/2021, 6:58 PM
Jonathan Meyer
06/14/2021, 6:59 PM
Jackie
06/14/2021, 6:59 PM
Jonathan Meyer
06/14/2021, 7:00 PM
Jonathan Meyer
06/14/2021, 7:16 PM
Jonathan Meyer
07/02/2021, 7:43 AM
Jackie
07/02/2021, 3:52 PM
Jonathan Meyer
07/02/2021, 3:53 PM
Jonathan Meyer
07/02/2021, 3:53 PM
id
How will I easily replace it ? (i.e. generate a new segment with the same name)
Jackie
07/02/2021, 7:59 PM
refresh use case, where all the segments are usually replaced all together. When Pinot generates segments, it will append a sequence number to the table name as the segment name. If the same data files are provided each time, then all the segments will be replaced properly. Do you think your use case can fit into this model?
Jonathan Meyer
07/02/2021, 8:10 PM
Jackie
07/02/2021, 9:01 PM
Jonathan Meyer
07/02/2021, 10:12 PM
> So basically you need to control the segment name generated from each group.
Yes, exactly
> How do you generate segments now?
Right now it follows the simple strategy (default), with global IDs
Leading to segment names <table>_OFFLINE_<N> (where N is a number I cannot control directly and depends on the number of input files)
Which doesn't really work for my use case
> You should be able to directly set the segment name from the segment generation config
Are you talking about the fixed strategy ? Isn't it only able to handle a single input file ? (given that we provide a single [fixed] segment name)
Jackie
07/02/2021, 10:40 PM
simple strategy should generate segments with name <table>_<N> (no OFFLINE in between)
If you want to control the segment name, a workaround should be using the fixed strategy and generating one segment for each job
Jackie
07/02/2021, 10:42 PM
Jonathan Meyer
07/02/2021, 11:04 PM
fixed type sounds like it should work
refresh type table would work ? Would all previous segments be discarded whenever a new "batch" (i.e. a job with multiple files) of segments is generated ?
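One way to read the suggested workaround: run one ingestion job per group and give each job its own fixed segment name, so that re-running a group's job replaces that group's segment. The sketch below only builds the segment name generator part of a job spec as a Python dict. The key names (segmentNameGeneratorSpec, type, configs, segment.name) are written from memory of Pinot's ingestion job spec and should be checked against the documentation for your version; the table and group names are invented.

```python
def fixed_name_spec(table: str, group_id: str) -> dict:
    """Build a per-group segment name generator spec (key names assumed)."""
    return {
        "segmentNameGeneratorSpec": {
            "type": "fixed",
            "configs": {"segment.name": f"{table}_{group_id}"},
        }
    }

# Hypothetical table and group ids: one job spec fragment per group.
specs = {g: fixed_name_spec("mainTable", g) for g in ["comm1", "comm2"]}
for group, spec in specs.items():
    print(group, spec["segmentNameGeneratorSpec"]["configs"]["segment.name"])
```

Deriving the name from the group id keeps it stable across runs, which is what lets a later job overwrite the segment produced by an earlier one.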