# getting-started
c
Hi @Seunghyun, thanks for your response. I have successfully connected Superset to Pinot. My problem is knowing if it is possible to dedupe data in offline tables.
s
Do we know why the duplication happened? Does your original data contain the duplicated rows? Pinot’s offline ingestion usually just converts the input data into Pinot segments and loads them into the cluster. Data is not expected to change.
One common scenario for data duplication: with an offline table, when your source data is updated and you re-ingest it to refresh the data in Pinot, you need to generate the new segments with the same segment names as before, otherwise you end up with duplicates. Pinot currently supports data backfill based on the segment name.
If you use a realtime table, you can take a look at https://docs.pinot.apache.org/basics/data-import/dedup
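roughly, that dedup feature boils down to declaring primary key columns in the schema and adding a dedupConfig to the realtime table config, something like this (just a sketch based on that doc, `userId` is a made-up key column):
in the schema:
  "primaryKeyColumns": ["userId"]
in the REALTIME table config:
  "dedupConfig": {
    "dedupEnabled": true,
    "hashFunction": "NONE"
  }
note that dedup simply drops later records that reuse the same primary key, it does not merge changes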
c
Hi!!
The data I’m sending to Pinot are copies of my business entities (let’s say users, for example)
so in the application layer that data changes, and every time it happens a new json is pushed to the pipeline
s
new json = 1 row?
c
at this moment yes
but I need one row per entity, not a new row for every json
s
then you probably need the upsert feature
check the above document
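for reference, full upsert is basically a primary key in the schema plus a flag in the REALTIME table config, roughly like this (sketch only, `userId` is a placeholder for your entity key):
in the schema:
  "primaryKeyColumns": ["userId"]
in the table config:
  "routing": {
    "instanceSelectorType": "strictReplicaGroup"
  },
  "upsertConfig": {
    "mode": "FULL"
  }
the kafka topic also needs to be partitioned by that key, and the row for a given key then always reflects the latest message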
c
upsert is only for realtime tables, right?
s
so whenever there’s a data change, you can emit the change to kafka
c
can I emit partial updates?
or do I need to emit the whole piece of data?
s
check the *Partial upserts* section
we do support partial
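roughly it looks like this in the table config (sketch; `visitCount` and the strategies are just placeholders for your own columns):
  "upsertConfig": {
    "mode": "PARTIAL",
    "defaultPartialUpsertStrategy": "OVERWRITE",
    "partialUpsertStrategies": {
      "visitCount": "INCREMENT"
    }
  }
then each kafka message only needs the primary key plus the columns that changed; if I remember the doc correctly, partial upsert also needs null handling enabled ("nullHandlingEnabled": true) so missing columns can be told apart from real values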
c
👍
Thanks very much for your help!
s
creating 1 Pinot segment for each row is not a scalable model, we will end up having too many segments
upsert is probably the right choice for your application scenario
c
yes, that is what I thought
s
👍
c
Hi again. One more question (a short one, I hope). I managed to get things working with a realtime table and some kafka topics. It’s working as expected but no files are written to deep storage (with offline tables there were files, so deep storage seems to be properly configured). Is this the expected behaviour? Does Pinot send data to deep storage whenever it decides to?
l
By default the RT tables stay in memory, ref: <https://docs.pinot.apache.org/operators/operating-pinot/tuning/realtime>. For other behaviors, you might look into hybrid tables: <https://docs.pinot.apache.org/basics/components/table#hybrid-table>
c
👍
Thanks!
s
• upsert does not work with hybrid tables, so we need to keep the realtime tables only
• for realtime, we first keep data in memory and flush it based on the thresholds
  ◦ e.g.
"streamConfigs" : {
  "realtime.segment.flush.threshold.rows": "0",
  "realtime.segment.flush.threshold.time": "24h",
  "realtime.segment.flush.threshold.segment.size": "150M",
  "streamType": "kafka",
  "stream.kafka.consumer.type": "LowLevel",
  "stream.kafka.topic.name": "ClickStream",
  "stream.kafka.consumer.prop.auto.offset.reset" : "largest"
}
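(in that example, setting flush.threshold.rows to 0 disables the row-count limit, so segments are flushed based on the size/time thresholds instead, i.e. roughly every ~150MB or after 24h at the latest; those completed segments are what end up in deep storage, which also answers your earlier question)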
c
Hi again. I have almost reached my goals with Pinot, but there is one last thing. I have a REALTIME table with a metric field. This realtime table ingests a Kafka topic and has partial upserts configured. Every partial upsert works as expected except for the metric field (an integer configured with the INCREMENT upsert strategy). I have made queries to the table with option(skipUpsert=true) to see the change history, and I see that the value in the metric column is 0 every time, even though I have sent many messages to the kafka topic and got partial upserts
Is there something special with metric upserts?
Even if I create a new row, metric fields are ignored and set to 0
I deleted tables, schemas… everything, rebooted the services and it’s working
👍 1