Eric Liu
04/11/2023, 4:26 PMfrancoisa
04/11/2023, 4:35 PMEric Liu
04/11/2023, 5:08 PMEric Liu
04/11/2023, 5:15 PMA
, which has the primary key c
, a boolean field d
(default is false)
, and some other fields
• I have another lower volume stream B
, which also has the same primary key c
and the boolean field d
.
• I want to have a real-time table that uses stream A as the source, and use the stream B
to update the boolean field d
based on the primary key.
Any suggestions how to achieve this?francoisa
04/11/2023, 5:38 PMNavina
04/11/2023, 7:24 PMB
), is the data mostly static, meaning are there going to be multiple updates for the same primary key ? I am trying to understand if this is an actual stream-stream join or more like a data enrichment usecase.Eric Liu
04/11/2023, 8:27 PMB
is also id
based event stream, not static. It can have multiple updates (at most 2 times for the same primary key, e.g. first time an event from B
comes in w/ d
as false, and later another event comes in as true
. The combinations of the possibility are true + false, true + true, or just true). The high volume stream A can be hundreds of GBs or TBs, and the low volume B might be ten to hundreds of GBs, but potentially to grow larger.
Some questions:
• How is the performance look like in general for joins at query time at this scale using either Trino or V2 query engine?
• How far is the V2 query engine from production ready?Navina
04/11/2023, 10:43 PMRong R
04/11/2023, 10:47 PMEric Liu
04/11/2023, 10:49 PMRong R
04/11/2023, 10:53 PMEric Liu
04/11/2023, 11:09 PMEric Liu
04/12/2023, 2:15 AMIf both table are partition by the primary join key it should be fairly fast.Rong, could you clarify a bit about the
primary join key
? In my case, the primary key field of the high volume table is a composite key, let’s say column1
+ column2
. For the low volume table, I can make the primary key be column1
+ column2
as well, or just be column1
by rolling it up. Do you mean I need to partition both tables by
• column1
and column2
• or just column1
since column1
is the primary join key?Eric Liu
04/12/2023, 7:32 AMThe intermediate stages of the multi-stage engine are running purely on heap memory, thus executing a large table join will cause potential out-of-memory errors
• Because of this, the table scan phase for join queries is limited to 10 million rows.Is this just a current limitation, or a long-term thing? Does the
scan phase
mean the rows being scanned is limited, or after the scanning the rows being fetched into the next phase is limited? (still trying to get familiar w/ Pinot’s terms…) If it’s the former one, then the table in my case is much larger than that scale. Would scaling servers help resolve/mitigate the issue?Rong R
04/12/2023, 4:28 PMRong R
04/12/2023, 4:28 PM