# feat-partial-upsert
  • j

    Jackie

    04/15/2021, 12:04 AM
    When bootstrapping the table
  • j

    Jackie

    04/15/2021, 12:05 AM
    Actually, the bootstrap could also be done via Kafka
  • j

    Jackie

    04/15/2021, 12:06 AM
    The reason we have to do segment replacement is that Pinot is append-only, but that is not the case with upsert
  • j

    Jackie

    04/15/2021, 12:06 AM
    So actually it might make sense to have everything go through Kafka for an upsert table
  • y

    Yupeng Fu

    04/15/2021, 12:07 AM
    There is usually a 10-20x throughput difference between direct push and going through Kafka
  • y

    Yupeng Fu

    04/15/2021, 12:08 AM
    In our past observations, backfill via Kafka may take days, whereas direct push may take just a couple of hours
  • j

    Jackie

    04/15/2021, 12:12 AM
    The initial bootstrap can be done via direct push, then the updates come through Kafka
  • j

    Jackie

    04/15/2021, 12:13 AM
    The problem here is that we should not change the history for an upsert table. Also, it is better to pay the extra cost at write time instead of query time
  • j

    Jackie

    04/15/2021, 12:17 AM
    Also, if we can make the assumption that a primary key won't get any update after a period of time (say 3 days for an order), we might be able to flush it to an offline table
  • y

    Yupeng Fu

    04/15/2021, 12:29 AM
    Bootstrap makes sense
  • y

    Yupeng Fu

    04/15/2021, 12:30 AM
    If we cannot easily change history, then we shall consider a feature to replace an existing table
  • y

    Yupeng Fu

    04/15/2021, 12:30 AM
    Think from the user's perspective about what the data correction flow should look like
  • j

    Jackie

    04/15/2021, 12:35 AM
    To correct the record for a primary key, simply put the desired record with the current (latest) timestamp
  • j

    Jackie

    04/15/2021, 12:36 AM
    Hmm, for different scenarios, the user might want to put a different timestamp on the update message
  • j

    Jackie

    04/15/2021, 12:36 AM
    But the update logic should be general enough to handle all this custom logic
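
The correction flow discussed above (send the desired record with the latest timestamp) can be illustrated with a minimal sketch. This is not Pinot's implementation; the field names `order_id` and `ts`, the function `apply_upserts`, and the tie-breaking rule are all assumptions made for the example.

```python
from typing import Any, Dict, Iterable

Record = Dict[str, Any]


def apply_upserts(records: Iterable[Record], pk: str = "order_id",
                  ts: str = "ts") -> Dict[Any, Record]:
    """Keep, for each primary key, the record with the greatest timestamp."""
    latest: Dict[Any, Record] = {}
    for rec in records:
        current = latest.get(rec[pk])
        # Assumption: on equal timestamps the later arrival wins.
        if current is None or rec[ts] >= current[ts]:
            latest[rec[pk]] = rec
    return latest


stream = [
    {"order_id": 1, "ts": 100, "status": "CREATED"},
    {"order_id": 1, "ts": 200, "status": "SHIPED"},   # wrong value
    {"order_id": 2, "ts": 150, "status": "CREATED"},
    {"order_id": 1, "ts": 300, "status": "SHIPPED"},  # correction, newest timestamp
]
view = apply_upserts(stream)

# A "correction" carrying an older timestamp does not change the view.
stale = apply_upserts(stream + [{"order_id": 1, "ts": 50, "status": "STALE"}])
```

Under these assumptions a correction only takes effect when its timestamp is at least as new as the record it replaces, which is why the choice of timestamp on the update message matters.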
  • q

    Qiaochu Liu

    05/22/2021, 2:09 AM
    Hello @User, I used the latest master branch and ran the quick-start-streaming.sh demo, and observed the following error. Is it possible there is a bug in the RealtimeQuickStart producer?
    ➜  apache-pinot-incubating-0.8.0-SNAPSHOT-bin git:(master) bin/quick-start-streaming.sh
    ***** Starting Kafka *****
    ***** Starting meetup data stream and publishing to Kafka *****
    ***** Starting Zookeeper, controller, server and broker *****
    May 21, 2021 7:06:37 PM org.glassfish.grizzly.http.server.NetworkListener start
    INFO: Started listener bound to [0.0.0.0:9000]
    May 21, 2021 7:06:37 PM org.glassfish.grizzly.http.server.HttpServer start
    INFO: [HttpServer] Started.
    May 21, 2021 7:06:46 PM org.glassfish.grizzly.http.server.NetworkListener start
    INFO: Started listener bound to [0.0.0.0:8000]
    May 21, 2021 7:06:46 PM org.glassfish.grizzly.http.server.HttpServer start
    INFO: [HttpServer-1] Started.
    May 21, 2021 7:06:53 PM org.glassfish.grizzly.http.server.NetworkListener start
    INFO: Started listener bound to [0.0.0.0:7500]
    May 21, 2021 7:06:53 PM org.glassfish.grizzly.http.server.HttpServer start
    INFO: [HttpServer-2] Started.
    ***** Bootstrap meetupRSVP table *****
    ***** Waiting for 5 seconds for a few events to get populated *****
    ***** Realtime quickstart setup complete *****
    Total number of documents in the table
    Query : select count(*) from meetupRsvp limit 1
    Exception caught: 
    java.lang.NullPointerException: null
    	at org.apache.pinot.tools.Quickstart.prettyPrintResponse(Quickstart.java:113) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-08b909c45e85f9bf8d8659561a2d13b4cc443ebc]
    	at org.apache.pinot.tools.RealtimeQuickStart.execute(RealtimeQuickStart.java:111) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-08b909c45e85f9bf8d8659561a2d13b4cc443ebc]
    	at org.apache.pinot.tools.admin.command.QuickStartCommand.execute(QuickStartCommand.java:147) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-08b909c45e85f9bf8d8659561a2d13b4cc443ebc]
    	at org.apache.pinot.tools.admin.PinotAdministrator.execute(PinotAdministrator.java:166) [pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-08b909c45e85f9bf8d8659561a2d13b4cc443ebc]
    	at org.apache.pinot.tools.admin.PinotAdministrator.main(PinotAdministrator.java:186) [pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-08b909c45e85f9bf8d8659561a2d13b4cc443ebc]
    	at org.apache.pinot.tools.RealtimeQuickStart.main(RealtimeQuickStart.java:50) [pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-08b909c45e85f9bf8d8659561a2d13b4cc443ebc]
  • q

    Qiaochu Liu

    05/22/2021, 2:13 AM
    http://localhost:9000/#/query loaded successfully, but no data showed up in the table meetupRsvp
  • y

    Yupeng Fu

    05/22/2021, 4:12 AM
    Is this from a fresh master checkout?
  • j

    Jackie

    05/23/2021, 5:41 AM
    @User Can you try rebuilding the project and see if the problem still exists?
  • j

    Jackie

    05/23/2021, 5:42 AM
    It should not have a problem because it is part of the CI tests on GitHub
  • q

    Qiaochu Liu

    05/24/2021, 6:55 PM
    @User @User Let me try to reproduce this issue and see if it still exists
  • y

    Yupeng Fu

    07/07/2021, 5:16 PM
    @User @User @User @User @User FYI, I added some context on the need for a segment reader for partial upsert tables https://github.com/apache/incubator-pinot/issues/7036
  • j

    Jackie

    07/07/2021, 6:24 PM
    So basically leveraging partial upsert to merge the incomplete records?
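
To make "merging the incomplete records" concrete, here is a rough sketch of one possible merge rule: fields present in a newer event overwrite the stored value, and missing fields keep what was there. The names `merge_partial` and `apply_partial_upserts`, the use of `None` to mark an absent field, and the overwrite-only rule are assumptions for illustration, not Pinot's actual merge strategies.

```python
from typing import Any, Dict, Iterable, Optional

Record = Dict[str, Any]


def merge_partial(existing: Optional[Record], incoming: Record) -> Record:
    """Overwrite only the fields that the incoming record actually carries."""
    merged = dict(existing or {})
    for field, value in incoming.items():
        if value is not None:  # assumption: None means "not present in this event"
            merged[field] = value
    return merged


def apply_partial_upserts(records: Iterable[Record], pk: str = "order_id",
                          ts: str = "ts") -> Dict[Any, Record]:
    state: Dict[Any, Record] = {}
    # Process in timestamp order so newer events win field by field.
    for rec in sorted(records, key=lambda r: r[ts]):
        state[rec[pk]] = merge_partial(state.get(rec[pk]), rec)
    return state


events = [
    {"order_id": 7, "ts": 1, "customer": "alice", "amount": None},
    {"order_id": 7, "ts": 2, "customer": None, "amount": 42.5},
]
merged = apply_partial_upserts(events)
```

Here the two incomplete events for order 7 end up as a single complete row carrying both `customer` and `amount`.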
  • j

    Jackie

    07/07/2021, 6:25 PM
    Reading the Pinot segment directly returns all the records, not just the ones with the latest timestamp. @User Do you think it suits your use case?
  • y

    Yupeng Fu

    07/07/2021, 6:27 PM
    can the segment reader support upsert?
  • j

    Jackie

    07/07/2021, 6:35 PM
    No, upsert is not handled at the reader level
  • j

    Jackie

    07/07/2021, 6:37 PM
    In order to use Pinot features, or push down filters, the table dump should be modeled as a query IMO
  • j

    Jackie

    07/07/2021, 6:38 PM
    We can potentially use the gRPC streaming server to reduce the memory footprint
  • y

    Yupeng Fu

    07/07/2021, 6:51 PM
    true, you need to read from server to use the upsert metadata
  • y

    Yupeng Fu

    07/07/2021, 6:51 PM
    I wonder if it’s easy to write a Hive ETL to do the compaction: read the latest record for each pk
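
The compaction idea above (keep only the latest record per primary key from a raw dump) can be sketched as a small batch job. In Hive the same shape would typically be expressed with a window function that ranks rows per key by timestamp; the Python below is only an illustration, and the column names `pk`, `ts`, `v` and the function `compact` are assumptions.

```python
from itertools import groupby
from operator import itemgetter
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def compact(rows: Iterable[Row], pk: str = "pk", ts: str = "ts") -> List[Row]:
    """Return one row per primary key: the one with the greatest timestamp."""
    # Sort by (pk, ts) so each key's rows are adjacent and time-ordered.
    ordered = sorted(rows, key=itemgetter(pk, ts))
    # Within each key group, the last row is the most recent one.
    return [list(group)[-1] for _, group in groupby(ordered, key=itemgetter(pk))]


# A raw segment dump contains every version of every key.
dump = [
    {"pk": "a", "ts": 3, "v": "a3"},
    {"pk": "b", "ts": 1, "v": "b1"},
    {"pk": "a", "ts": 1, "v": "a1"},
    {"pk": "b", "ts": 2, "v": "b2"},
]
compacted = compact(dump)
```

This works on the full dump of all record versions, which matches what a direct segment read returns, and it does not need the upsert metadata held by the server.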