# general
  • a

    Alex

    11/16/2019, 9:07 AM
    good question 😕
  • k

    Kishore G

    11/16/2019, 4:32 PM
    Bad query from the Presto connector -> GC -> session timeout -> Kafka consumption stopped
  • k

    Kishore G

    11/16/2019, 4:33 PM
    Kafka consumption not restarting after GC is a bug. We need to look into it.
  • e

    Elon

    11/16/2019, 7:06 PM
    When we go to the servers the query runs fine; it's when we hit the broker that it happens
  • a

    Alex

    11/17/2019, 2:30 AM
    Was it the broker? I could not reproduce it btw, even on a much bigger dataset
  • a

    Alex

    11/18/2019, 4:59 PM
    another question -> during the ingestion load test we noticed that the zookeeper log folder grew to 8 GB. We wrote about 300M messages into Pinot (2 tables), at a rate of 30K messages per second.
  • a

    Alex

    11/18/2019, 5:00 PM
    Log files are pretty big as well -> up to 3 GB. Is this normal behavior? In production, what is the typical size of ZooKeeper's data log folder?
  • a

    Alex

    11/18/2019, 5:00 PM
    Copy code
    zookeeper@pinot-zookeeper-0:/data/log/version-2$ ls -lh
    total 8.9G
    -rw-rw-r-- 1 zookeeper zookeeper  65M Nov 16 01:37 log.100000001
    -rw-rw-r-- 1 zookeeper zookeeper 3.1G Nov 16 19:10 log.200000001
    -rw-rw-r-- 1 zookeeper zookeeper 1.8G Nov 16 23:02 log.20000df77
    -rw-r--r-- 1 zookeeper zookeeper 4.1G Nov 17 21:21 log.300000001
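One common reason a ZooKeeper transaction log directory grows like this is that autopurge is disabled by default, so old txn logs and snapshots are never cleaned up. A minimal zoo.cfg sketch; the retain count and interval below are illustrative assumptions, not values from this thread:
    # zoo.cfg - enable automatic purging of old snapshots and txn logs
    # keep only the 3 most recent snapshots (and their txn logs)
    autopurge.snapRetainCount=3
    # run the purge task every 12 hours (0 disables purging, which is the default)
    autopurge.purgeInterval=12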
  • k

    Kishore G

    11/18/2019, 5:09 PM
    What’s the real-time table config? How often are segments getting created?
  • a

    Alex

    11/18/2019, 5:26 PM
    Copy code
    {
      "tableName": "flattened_orders_hours",
      "tableType": "REALTIME",
      "segmentsConfig": {
        "timeColumnName": "updatedAtHours",
        "timeType": "HOURS",
        "retentionTimeUnit": "DAYS",
        "retentionTimeValue": "365",
        "segmentPushType": "APPEND",
        "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
        "schemaName": "flattened_orders_hours",
        "replication": "1",
        "replicasPerPartition": "1"
      },
      "tenants": {
        "broker": "DefaultTenant",
        "server": "DefaultTenant"
      },
      "tableIndexConfig": {
        "loadMode": "MMAP",
        "invertedIndexColumns": [
          "...",
          "...",
          "...",
          "..."
        ],
        "aggregateMetrics": "true",
        "streamConfigs": {
          "streamType": "kafka",
          "stream.kafka.consumer.type": "highLevel",
          "stream.kafka.topic.name": "flattened-orders-json-seconds",
          "stream.kafka.decoder.class.name": "org.apache.pinot.core.realtime.impl.kafka.KafkaJSONMessageDecoder",
          "stream.kafka.consumer.factory.class.name": "org.apache.pinot.core.realtime.impl.kafka2.KafkaConsumerFactory",
          "stream.kafka.hlc.zk.connect.string": "IP:2181/",
          "stream.kafka.zk.broker.url": "IP:2181/",
          "stream.kafka.broker.list": "IP:9092",
          "stream.kafka.isolation.level": "read_committed",
          "stream.kafka.hlc.bootstrap.server": "IP:9092",
          "realtime.segment.flush.threshold.time": "3600000",
          "realtime.segment.flush.threshold.size": "50000",
          "stream.kafka.consumer.prop.auto.offset.reset": "earliest",
          "stream.kafka.consumer.prop.group.id": "pinot-flattened_orders_hours"
        }
      },
      "metadata": {
        "customConfigs": {}
      }
    }
  • a

    Alex

    11/18/2019, 5:26 PM
    we loaded 90 days of data into a Kafka topic, and then blasted it in a loop into the Pinot cluster.
  • a

    Alex

    11/18/2019, 5:27 PM
    which means the same hour will be written multiple times (is that a good idea for the load test?)
  • k

    Kishore G

    11/18/2019, 5:29 PM
    Copy code
    flush threshold size is 50000, it's too low.
  • k

    Kishore G

    11/18/2019, 5:29 PM
    yes, that's fine
  • a

    Alex

    11/18/2019, 5:30 PM
    Copy code
    50000
    what should it be?
  • k

    Kishore G

    11/18/2019, 5:32 PM
    ideally 150 to 500 MB is the sweet spot.
  • k

    Kishore G

    11/18/2019, 5:32 PM
    what's the current size of the segment?
  • n

    Neha Pawar

    11/18/2019, 5:34 PM
    you could try using segment size threshold instead of rows/time: https://pinot.readthedocs.io/en/latest/tuning_realtime_performance.html#controlling-number-of-rows-in-consuming-segment
  • a

    Alex

    11/18/2019, 5:38 PM
    checked 1 server (running 3 in this setup). segment dir is empty. index dir:
  • a

    Alex

    11/18/2019, 5:38 PM
    Copy code
    4.0K	./consumers
    3.7M	./flattened_orders_hours_REALTIME_1573926928266_0__0__1573996522185/v3
    3.7M	./flattened_orders_hours_REALTIME_1573926928266_0__0__1573996522185
    3.6M	./flattened_orders_hours_REALTIME_1573926928266_0__0__1573939395196/v3
    3.6M	./flattened_orders_hours_REALTIME_1573926928266_0__0__1573939395196
    3.7M	./flattened_orders_hours_REALTIME_1573926928266_0__0__1573993404560/v3
    3.7M	./flattened_orders_hours_REALTIME_1573926928266_0__0__1573993404560
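Besides running du on each server's data directory, segment sizes can usually be read off the controller REST API as well; a hedged sketch, where the host/port and the detailed flag are assumptions for illustration:
    # ask the controller for per-segment and total sizes of the table
    curl "http://CONTROLLER_HOST:9000/tables/flattened_orders_hours/size?detailed=true"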
  • n

    Neha Pawar

    11/18/2019, 5:47 PM
    Copy code
    "realtime.segment.flush.threshold.time": "24h",
    "realtime.segment.flush.threshold.size": "0",
    This should enable the segment size based threshold. By default, the algorithm tries to create segments of about 200 MB.
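Combining Neha's snippet with Kishore's 150 to 500 MB target, the relevant streamConfigs keys could look roughly like this; treat the desired-size property name and the 300M value as assumptions to verify against the docs for your Pinot version, and note Subbu's caveat below that the size-based threshold applies to LLC, not HLC:
    "realtime.segment.flush.threshold.time": "24h",
    "realtime.segment.flush.threshold.size": "0",
    "realtime.segment.flush.desired.size": "300M"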
  • k

    Kishore G

    11/18/2019, 5:48 PM
    Thanks @User. What's the default setting when neither of them is set?
  • s

    Subbu Subramaniam

    11/18/2019, 5:54 PM
    The settings mentioned by @User do not work for HLC. From your segment names, it appears you are using LLC. The only things you can tune there are the number of rows and the time; adjust them according to your use case. It would be interesting to know your reasons for going with HLC, however. We strongly recommend you use LLC, since all the new algorithms etc. have been built for that mode.
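For reference, moving a table from HLC to LLC is mostly a matter of changing the consumer type in streamConfigs and dropping the HLC-only properties; a rough sketch, noting that the exact value accepted for the consumer type varies by Pinot version and should be verified:
    "stream.kafka.consumer.type": "lowlevel",
    "stream.kafka.broker.list": "IP:9092"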
  • a

    Alex

    11/18/2019, 6:08 PM
    @User still exploring different options. LLC has a hard-to-hit requirement:
    Copy code
    Events with higher offsets should be more recent (the offsets of events need not be contiguous)
  • k

    Kishore G

    11/18/2019, 6:13 PM
    @User that’s true with Kafka rt
  • a

    Alex

    11/18/2019, 6:16 PM
    oh, so both HL and LL need this guarantee? What will happen if some events are out of order? Do we need to run some sorting job on the stream before sending to Pinot?
  • k

    Kishore G

    11/18/2019, 6:22 PM
    Kafka guarantees that within a partition offsets are monotonically increasing
  • s

    Subbu Subramaniam

    11/18/2019, 6:30 PM
    @User HL does not know about offsets. It expects the consuming layer to keep track of offsets (or use some other mechanism) to ensure that messages from the stream are consumed exactly once into Pinot. LLC, on the other hand, keeps track of offsets, and ensures that Pinot consumes every message in the partition exactly once. The offset (an int or long) is provided by the underlying stream on a per-partition basis. Kafka provides one.
  • a

    Alex

    11/18/2019, 6:31 PM
    @User got it
  • a

    Alex

    11/18/2019, 6:32 PM
    @User true, but event time is created by message producers, so times can be mixed (messages with higher offsets can have a lower timestamp). Maybe I'm misunderstanding:
    Copy code
    Events with higher offsets should be more recent