# getting-started
  • Kishore G

    07/01/2021, 7:53 PM
    added instructions to quickstart pinot from IDE as well
    👍 1
  • Bruce Ritchie

    07/13/2021, 4:26 PM
    Question on cluster and node sizing. I have 30B rows with ~90 columns (5+ TB of Parquet files) to ingest into Pinot. QPS once ingested is likely < 10/sec. Is there a document outlining sizing recommendations for various node types?
  • Matt Landers

    08/30/2021, 4:10 PM
    set the channel topic: New to Pinot? Start here: https://www.youtube.com/playlist?list=PLihIrF0tCXdeimVCZwuejXb7FkjsyN9_k
    👍 3
  • Luis Fernandez

    09/01/2021, 5:01 PM
    hey friends, I need to compute stats for ads (impressions, click_count, click_spent, etc.) in my current project. My client has many dimensions they may want to slice by (locale, user_id, search query, device, etc.). We currently track all of this data through Kafka, and I was thinking about using Pinot to make it queryable. The user-facing dashboard looks at this data by set time ranges and also custom time ranges. I was wondering if Pinot is a good candidate for this problem. Right now I'm working on a POC with Pinot, so I would appreciate any insights 🙂 thank you!
  • xtrntr

    09/05/2021, 6:27 PM
    Do dimension tables support upsert? I plan to update the dimension table on a daily basis.
  • Kishore G

    09/05/2021, 6:31 PM
    If it’s small enough, use refresh and update the entire table
    👍 1
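    For reference, a refresh-style dimension table is just an OFFLINE table whose segments are replaced wholesale on each push. A minimal sketch of such a table config (the table name is hypothetical; isDimTable applies if the table is used as a lookup dimension table):
    {
      "tableName": "myDimTable",
      "tableType": "OFFLINE",
      "isDimTable": true,
      "segmentsConfig": {
        "schemaName": "myDimTable",
        "segmentPushType": "REFRESH",
        "replication": "1"
      },
      "tenants": {},
      "tableIndexConfig": {},
      "metadata": {}
    }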
  • xtrntr

    09/07/2021, 10:31 PM
    if I wish to use the native Java client but can only have my broker/controller exposed outside of the cluster, is my only option to use
    ConnectionFactory.fromHostList(brokerUrl)
    ? I'm not that familiar with ZK, and I don't see a way in the API to retrieve broker addresses from the zookeeper category of APIs exposed by the controller https://docs.pinot.apache.org/users/clients/java
  • Xiang Fu

    09/08/2021, 1:46 AM
    you need to expose the brokers externally, then use the broker list to query
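    A minimal sketch of that approach with the Java client (broker hostnames and the table name are placeholders); fromHostList talks to the listed brokers directly, so the client needs no Zookeeper access:
    import org.apache.pinot.client.Connection;
    import org.apache.pinot.client.ConnectionFactory;
    import org.apache.pinot.client.ResultSetGroup;

    // Connect straight to the externally exposed brokers, bypassing Zookeeper discovery
    Connection connection = ConnectionFactory.fromHostList("broker-1.example.com:8099", "broker-2.example.com:8099");
    ResultSetGroup resultSetGroup = connection.execute("SELECT COUNT(*) FROM myTable");
    // First result set, first row, first column
    System.out.println(resultSetGroup.getResultSet(0).getString(0, 0));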
  • RZ

    09/16/2021, 10:44 AM
    Hello friends, I want to test the ThirdEye solution for Pinot anomaly detection, so I followed the documentation https://docs.pinot.apache.org/integrations/thirdeye, but I failed to connect to http://localhost:1426/
  • arun muralidharan

    09/21/2021, 3:47 PM
    Thanks in advance.
  • Kamal Chavda

    10/08/2021, 8:11 PM
    When using ingestionConfig > transformConfigs, does the transformFunction HAVE to write to a new column? I would like to transform an existing column from the source and keep the same column name in the Pinot table instead of creating a new column.
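    For context, a transformConfigs sketch (the field names are hypothetical): columnName is the destination column in the Pinot schema, and the transformFunction reads from the source field, which is why the documented examples use a destination name that differs from the source field it reads:
    "ingestionConfig": {
      "transformConfigs": [
        {
          "columnName": "event_time",
          "transformFunction": "fromDateTime(event_time_str, 'yyyy-MM-dd HH:mm:ss')"
        }
      ]
    }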
  • Priyank Bagrecha

    11/02/2021, 6:09 AM
    hello, I am just getting started. I am trying to consume Avro records from a 2.x Kafka stream which doesn't use a schema registry. Does this look correct?
    "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.KafkaAvroMessageDecoder",
    "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory"
    The table status says bad in the cluster manager, and I am trying to figure out what I am missing. Looking at the code on GitHub, it seems like I need to provide a schema for parsing; however, there is a comment saying not to use schema as it will be dropped in a future release. Any pointers will be greatly appreciated. Thanks in advance!
  • Priyank Bagrecha

    11/02/2021, 6:18 AM
    should I use SimpleAvroMessageDecoder? Even that one has the same comment:
    Do not use schema in the implementation, as schema will be removed from the params
  • Priyank Bagrecha

    11/02/2021, 6:34 AM
    I am using version 0.7.1 with Java 8
  • Niteesh Hegde

    11/02/2021, 10:35 AM
    Hi, I am new to Pinot. Can I ingest data into Pinot from Postgres logs?
  • Priyank Bagrecha

    11/02/2021, 6:07 PM
    this one didn't work either. :(
  • Neha Pawar

    11/02/2021, 6:24 PM
    "stream.kafka.decoder.prop.schema" : "<your avro schema here>"
  • Priyank Bagrecha

    11/02/2021, 6:24 PM
    got it. thanks!
  • Priyank Bagrecha

    11/09/2021, 1:14 AM
    Also, what happens when I update star-tree index configs, e.g. adding a new dimension to dimensionsSplitOrder or removing one - what happens to the index and the segments? Same question for functionColumnPairs. I am thinking of treating an edit as adding a new config and dropping the old one.
  • Priyank Bagrecha

    11/09/2021, 7:27 PM
    Does the query console only show limited results for a query? I am wondering why I am seeing only some rows in the results of a query like
    SELECT col1, col2, col3, DISTINCTCOUNT(col4) AS distinct_col4
    FROM   table
    GROUP  BY col1, col2, col3
    The star-tree index config looks like
    "starTreeIndexConfigs": [
          {
            "dimensionsSplitOrder": [
              "col1",
              "col2",
              "col3"
            ],
            "skipStarNodeCreationForDimensions": [],
            "functionColumnPairs": [
              "DISTINCTCOUNT__col4"
            ],
            "maxLeafRecords": 1
          }
        ],
    Can I also add DistinctCountHLL__col4 and DistinctCountThetaSketch__col4 to functionColumnPairs and evaluate the performance of all 3 for this query?
  • Jackie

    11/09/2021, 9:05 PM
    Star-tree only supports distinctcounthll because its intermediate result size is bounded
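    In other words, of the three pairs listed above, only the HLL one can be pre-aggregated in the star-tree. A sketch of the earlier config with the bounded-size pair swapped in (an illustration based on Jackie's note, not a verified config):
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["col1", "col2", "col3"],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": [
          "DISTINCTCOUNTHLL__col4"
        ],
        "maxLeafRecords": 1
      }
    ],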
  • Jackie

    11/09/2021, 9:05 PM
    You need to add a limit to the query, or it defaults to 10
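    For example, the query above with an explicit limit:
    SELECT col1, col2, col3, DISTINCTCOUNT(col4) AS distinct_col4
    FROM   table
    GROUP  BY col1, col2, col3
    LIMIT  1000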
  • Priyank Bagrecha

    11/09/2021, 9:56 PM
    And thank you Jackie!
  • Priyank Bagrecha

    11/15/2021, 9:46 AM
    hello. I started two Pinot clusters, both consuming from the same Kafka cluster and the same topic. They are basically two Pinot tables whose only difference is that the first uses an inverted index on the same set of fields that the second uses for a star-tree index. I created the tables at the same time, so I assume both started consuming from the Kafka topic at the same time. When I issue the same query to both tables one after another, I see that totalDocs is 2x-3x higher for the table with the inverted index compared to the table with the star-tree index. If it matters, I started querying the tables ~5-10 mins after creating them. I also confirmed this by running
    select count(*) from <table_name>
    Is this expected?
  • Priyank Bagrecha

    11/15/2021, 10:11 AM
    I noticed that group.id = (basically empty) in the logs, so maybe both Pinot tables are using the same group id.
  • Priyank Bagrecha

    11/15/2021, 12:05 PM
    I tried using
    "streamConfigs": {
          "streamType": "kafka",
          "stream.kafka.consumer.type": "lowLevel",
          "stream.kafka.topic.name": <topic_name>,
          "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.SimpleAvroMessageDecoder",
          "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
          "stream.kafka.broker.list": <broker_list>,
          "realtime.segment.flush.threshold.size": "0",
          "realtime.segment.flush.threshold.time": "24h",
          "realtime.segment.flush.desired.size": "50M",
          "stream.kafka.consumer.prop.auto.offset.reset": "largest",
          "stream.kafka.consumer.prop.group.id": <group_id>,
          "stream.kafka.decoder.prop.schema": <schema>
        }
    and
    "streamConfigs": {
          "streamType": "kafka",
          "stream.kafka.consumer.type": "highLevel",
          "stream.kafka.topic.name": <topic_name>,
          "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.SimpleAvroMessageDecoder",
          "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
          "stream.kafka.hlc.bootstrap.server": <broker_list>,
          "realtime.segment.flush.threshold.size": "0",
          "realtime.segment.flush.threshold.time": "24h",
          "realtime.segment.flush.desired.size": "50M",
          "stream.kafka.consumer.prop.auto.offset.reset": "largest",
          "stream.kafka.consumer.prop.group.id": <group_id>,
          "stream.kafka.decoder.prop.schema": <schema>
        }
    and
    "streamConfigs": {
          "streamType": "kafka",
          "stream.kafka.consumer.type": "highLevel",
          "stream.kafka.topic.name": <topic_name>,
          "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.SimpleAvroMessageDecoder",
          "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
          "stream.kafka.hlc.bootstrap.server": <broker_list>,
          "realtime.segment.flush.threshold.size": "0",
          "realtime.segment.flush.threshold.time": "24h",
          "realtime.segment.flush.desired.size": "50M",
          "stream.kafka.consumer.prop.auto.offset.reset": "largest",
          "stream.kafka.consumer.prop.hlc.group.id": <group_id>,
          "stream.kafka.decoder.prop.schema": <schema>
        }
    None of those worked. Finally, after looking at the code, I tried
    "streamConfigs": {
            "streamType": "kafka",
            "stream.kafka.consumer.type": "lowLevel",
            "stream.kafka.topic.name": <topic_name>,
            "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.SimpleAvroMessageDecoder",
            "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
            "stream.kafka.broker.list": <broker_list>,
            "stream.kafka.consumer.prop.auto.offset.reset": "largest",
            "stream.kafka.group.id": <group_id>,
            "stream.kafka.decoder.prop.schema": <schema>,
            "realtime.segment.flush.threshold.size": "0",
            "realtime.segment.flush.threshold.time": "24h",
            "realtime.segment.flush.desired.size": "50M"
          },
    That one was able to consume from Kafka, but I don't see it in the list of Kafka consumer groups, and the logs still say group.id is empty. Any help / pointers appreciated.
  • Priyank Bagrecha

    11/15/2021, 12:24 PM
    I also tried
    "streamConfigs": {
            "streamType": "kafka",
            "stream.kafka.consumer.type": "highLevel",
            "stream.kafka.topic.name": <topic_name>,
            "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.SimpleAvroMessageDecoder",
            "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
            "stream.kafka.hlc.bootstrap.server": <broker_list>,
            "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
            "stream.kafka.hlc.group.id": <group_id>,
            "stream.kafka.decoder.prop.schema": <schema>,
            "realtime.segment.flush.threshold.size": "0",
            "realtime.segment.flush.threshold.time": "24h",
            "realtime.segment.flush.desired.size": "50M"
          },
    but it doesn't consume any events from Kafka at all.
  • Neha Pawar

    11/15/2021, 5:09 PM
    @User ^
  • Caesar Yao

    06/06/2023, 2:29 AM
    Hello everyone, does pinot support NFS as the deep store?