Upsert realtime table with a hashing function - Apache Pinot #getting-started

Sonit Rathi

11/12/2022, 10:12 AM

Hi team, I am trying to build an upsert realtime table with a hash function. I have a composite primary key. Do I have to implement the same hash function for partitioning in Kafka, or can I partition by a single column of the composite key?

saurabh dubey

11/14/2022, 5:29 AM

Any one of the columns from the composite key can be used to partition the stream. The idea is simply to make sure that all records belonging to a given composite key always reach the same server. Partitioning the stream by any one of the composite key's columns achieves that. cc: @Kartik Khare

Kartik Khare

11/14/2022, 5:30 AM

Yes, that is correct. Partitioning by any one of the key columns in Kafka should be good enough.
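
Editor's sketch of the point above: if the partition is derived from one column of the composite key, every record sharing the full composite key necessarily shares that column, so it lands on the same partition. The column names (`customer_id`, `order_id`), the partition count, and the CRC32 partitioner are all assumptions for illustration; Kafka's default partitioner uses a different hash.

```python
import zlib

NUM_PARTITIONS = 20  # hypothetical partition count


def partition_for(value: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stand-in partitioner: CRC32 of one key column, modulo partition count."""
    return zlib.crc32(value.encode("utf-8")) % num_partitions


# Hypothetical records; the composite primary key is (customer_id, order_id).
records = [
    {"customer_id": "c1", "order_id": "o1", "amount": 10},
    {"customer_id": "c1", "order_id": "o2", "amount": 15},
    {"customer_id": "c1", "order_id": "o1", "amount": 12},  # update to (c1, o1)
    {"customer_id": "c2", "order_id": "o1", "amount": 7},
]

# Track which partitions each composite key was routed to.
partitions_by_key = {}
for record in records:
    composite_key = (record["customer_id"], record["order_id"])
    partition = partition_for(record["customer_id"])  # only one column is hashed
    partitions_by_key.setdefault(composite_key, set()).add(partition)

# Every composite key should map to exactly one partition.
all_on_one_partition = all(len(p) == 1 for p in partitions_by_key.values())
print(all_on_one_partition)
```

The trade-off is skew: a low-cardinality or uneven partition column concentrates many composite keys on a few partitions.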

Kartik Khare

11/14/2022, 5:31 AM

Also, create a sufficiently large number of Kafka partitions up front, as it is not possible to increase the partition count later for an upsert table.

Sonit Rathi

11/14/2022, 7:21 AM

@Kartik Khare @saurabh dubey Thanks for the confirmation. I have implemented it that way. But the upsert table takes up too much heap memory. Is there any way to move the upsert key map to disk?

Kartik Khare

11/14/2022, 7:22 AM

Not for now. That feature is WIP. How many unique keys do you have, and how much memory is it taking?

Sonit Rathi

11/14/2022, 7:24 AM

I have 5 servers with 32 GB each. The table has a total of 70 million records with a replication of 2. It has consumed about 30-40% of memory on each server.

Kartik Khare

11/14/2022, 7:27 AM

Hmm, what is the primary key you are using? Based on my calculations, the upsert metadata shouldn't take more than 3-4 GB total: 70 million records * (32 + 20) bytes. I have assumed keys average 32 bytes; the value stored is always 20 bytes. You can multiply the result by 2 to account for replication.
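
Editor's sketch of the estimate above, using only the figures quoted in the message (70 million keys, 32-byte average key, 20-byte value, replication of 2). The function name is hypothetical, and the result is a raw payload estimate: JVM object headers and hash map overhead are not included, so real heap usage will be higher.

```python
def upsert_metadata_bytes(num_keys: int, avg_key_bytes: int,
                          value_bytes: int = 20, replication: int = 1) -> int:
    """Back-of-the-envelope size of the upsert key map, ignoring JVM overhead."""
    return num_keys * (avg_key_bytes + value_bytes) * replication


single_copy = upsert_metadata_bytes(70_000_000, avg_key_bytes=32)
with_replication = upsert_metadata_bytes(70_000_000, avg_key_bytes=32, replication=2)

print(f"one copy:     {single_copy / 1e9:.2f} GB")
print(f"2 replicas:   {with_replication / 1e9:.2f} GB")
```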

Sonit Rathi

11/14/2022, 7:34 AM

The primary key is a composite key, so I have used the MURMUR3 hash function for it.

Kartik Khare

11/14/2022, 7:36 AM

In that case, it shouldn't use more than 10 GB total, since we use the 128-bit version of murmur3.
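
Editor's sketch of why a 128-bit hash bounds the key size: whatever the length of the composite key, the stored key becomes a fixed 16 bytes. Python's standard library has no murmur3, so this uses BLAKE2b with a 16-byte digest purely as a 128-bit stand-in; the way the key parts are joined is also an assumption and not how Pinot serializes primary keys.

```python
import hashlib


def hashed_key(*parts: str) -> bytes:
    """Reduce a composite key to 128 bits (stand-in hash, not murmur3)."""
    raw = "\x00".join(parts).encode("utf-8")  # hypothetical serialization
    return hashlib.blake2b(raw, digest_size=16).digest()


short_key = hashed_key("c1", "o1")
long_key = hashed_key("customer-" * 20, "order-" * 20)

# Both digests are 16 bytes, regardless of the input length.
print(len(short_key), len(long_key))

# Raw payload per replica with 16-byte keys and the 20-byte value quoted above.
per_replica_bytes = 70_000_000 * (16 + 20)
print(f"{per_replica_bytes / 1e9:.2f} GB per replica, before JVM overhead")
```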

Sonit Rathi

11/14/2022, 7:39 AM

This is the count of segments per server.

Sonit Rathi

11/14/2022, 7:39 AM

The server with the highest segment count has used 48% of memory, and the one with the lowest has used 17%.

Kartik Khare

11/14/2022, 7:42 AM

Can you also run the following queries on the table and tell me the counts?

SELECT COUNT(*) FROM table

SELECT COUNT(*) FROM table option(skipUpsert = true)

Sonit Rathi

11/14/2022, 7:44 AM

With skipUpsert = true: 83395586. With upsert: 70766256.
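
Editor's sketch of what the two counts imply, assuming the skipUpsert count includes every ingested row and the default count includes only the latest row per primary key. The variable names are hypothetical; the numbers are the ones reported above.

```python
total_rows = 83_395_586   # COUNT(*) with skipUpsert = true
live_rows = 70_766_256    # COUNT(*) with upsert applied (distinct primary keys)

superseded_rows = total_rows - live_rows
superseded_share = superseded_rows / total_rows

print(f"superseded rows: {superseded_rows:,}")
print(f"share of all rows: {superseded_share:.1%}")
```

The key map is sized by the live row count, so the 70 million figure used in the memory estimate matches the number of distinct keys.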

Kartik Khare

11/14/2022, 7:45 AM

Cool. Now can you tell me the following:
• Xmx and direct memory values for the server processes
• Is the 48% memory total memory, heap memory, or off-heap memory?
• The number of partitions in the Kafka topic

Sonit Rathi

11/14/2022, 7:46 AM

I have set Xmx to 24 GB.

Sonit Rathi

11/14/2022, 7:47 AM

There are 20 partitions in Kafka.

Sonit Rathi

11/14/2022, 7:48 AM

Only the server is running in this VM.

saurabh dubey

11/14/2022, 8:01 AM

Can you try

jmap -histo:live <java pid>

inside the server to check the exact memory used by the instances of each class? Specifically, the bytes used by RecordLocation?

Kartik Khare

11/14/2022, 8:01 AM

Pipe it through | head -10 as well. It will be a long list :P

Sonit Rathi

11/14/2022, 10:54 AM

jmap isn't working. I am using Oracle JDK 11. I guess it's not supported there.

Kartik Khare

11/14/2022, 5:22 PM

Does jcmd work?

Abhijeet Kushe

05/24/2023, 6:22 PM

@Sonit Rathi if you have a composite key, then why is murmur3 correct? Asking as we have a composite key too and were not able to find any mention of that in the Pinot docs.
