# getting-started
c
Hi @Seunghyun, thanks for your response. I have successfully connected Superset to Pinot. My problem is knowing if it is possible to dedupe data in offline tables.
s
Do we know why the duplication happened? Does your original data contain the duplicated rows? Pinot’s offline ingestion usually just converts the input data into Pinot segments and loads them into the cluster. Data is not expected to change.
One common scenario for data duplication: with an offline table, when your source data is updated and you re-ingest it to refresh the data in Pinot, you need to generate the new segments with the same segment names as before, otherwise you end up with duplicates. Pinot currently supports data backfill based on the segment name.
If you use a realtime table, you can take a look at https://docs.pinot.apache.org/basics/data-import/dedup
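roughly, that dedup feature boils down to declaring primary key columns in the schema and adding a dedupConfig to the realtime table config, something like this (just a sketch based on that doc, `userId` is a made-up key column):
in the schema:
  "primaryKeyColumns": ["userId"]
in the REALTIME table config:
  "dedupConfig": {
    "dedupEnabled": true,
    "hashFunction": "NONE"
  }
note that dedup simply drops later records that reuse the same primary key, it does not merge changes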
c
Hi!!
The data I’m sending to Pinot are copies of my business entities (let’s say users, for example)
so in the application layer that data changes, and every time it happens a new json is pushed to the pipeline
s
new json = 1 row?
c
at this moment yes
but I need one row per entity, not a new row for every json
s
then you probably need the upsert feature
check the above document
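for reference, full upsert is basically a primary key in the schema plus a flag in the REALTIME table config, roughly like this (sketch only, `userId` is a placeholder for your entity key):
in the schema:
  "primaryKeyColumns": ["userId"]
in the table config:
  "routing": {
    "instanceSelectorType": "strictReplicaGroup"
  },
  "upsertConfig": {
    "mode": "FULL"
  }
the kafka topic also needs to be partitioned by that key, and the row for a given key then always reflects the latest message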
c
upsert is only for realtime tables, right?
s
so whenever there’s a data change, you can emit the change to kafka
c
can I emit partial updates?
or do I need to emit the whole piece of data?
s
check the *Partial upserts* section
we do support partial
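roughly it looks like this in the table config (sketch; `visitCount` and the strategies are just placeholders for your own columns):
  "upsertConfig": {
    "mode": "PARTIAL",
    "defaultPartialUpsertStrategy": "OVERWRITE",
    "partialUpsertStrategies": {
      "visitCount": "INCREMENT"
    }
  }
then each kafka message only needs the primary key plus the columns that changed; if I remember the doc correctly, partial upsert also needs null handling enabled ("nullHandlingEnabled": true) so missing columns can be told apart from real values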
c
👍
Thanks very much for your help!
s
creating 1 Pinot segment for each row is not a scalable model, we will end up having too many segments
upsert is probably the right choice for your application scenario
c
yes, that is what I thought
s
👍
c
Hi again. One more question (a short one, I hope). I managed to get things working with a realtime table and some kafka topics. It’s working as expected but no files are written to deep storage (with offline tables there were files, so deep storage seems to be properly configured). Is this the expected behaviour? Does Pinot send data to deep storage whenever it decides to?
l
By default the RT tables stay in memory, ref: <https://docs.pinot.apache.org/operators/operating-pinot/tuning/realtime>. For other behaviors, you might look into hybrid tables: <https://docs.pinot.apache.org/basics/components/table#hybrid-table>
c
👍
Thanks!
s
• upsert does not work with hybrid tables, so we need to keep the realtime tables only
• for realtime, we first keep data in memory and flush it based on the thresholds
  ◦ e.g.
"streamConfigs" : {
  "realtime.segment.flush.threshold.rows": "0",
  "realtime.segment.flush.threshold.time": "24h",
  "realtime.segment.flush.threshold.segment.size": "150M",
  "streamType": "kafka",
  "stream.kafka.consumer.type": "LowLevel",
  "stream.kafka.topic.name": "ClickStream",
  "stream.kafka.consumer.prop.auto.offset.reset" : "largest"
}
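(in that example, setting flush.threshold.rows to 0 disables the row-count limit, so segments are flushed based on the size/time thresholds instead, i.e. roughly every ~150MB or after 24h at the latest; those completed segments are what end up in deep storage, which also answers your earlier question)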
c
Hi again. I have almost reached my goals with Pinot, but there is one last thing. I have a REALTIME table with a metric field. This realtime table ingests a Kafka topic and has partial upserts configured. Every partial upsert works as expected except for the metric field (an integer configured with the INCREMENT upsert strategy). I have made queries to the table with option(skipUpsert=true) to see the change history, and I see that the value in the metric column is 0 every time, even though I have sent many messages to the kafka topic and got partial upserts
Is there something special with metric upserts?
Even if I create a new row, metric fields are ignored and set to 0
I deleted tables, schemas… everything, rebooted the services and it’s working
👍 1