Apache Pinot #getting-started

Saoirse Amarteifio

09/11/2021, 1:58 PM

Hello - I'm just getting set up and would like some advice. I have installed Pinot via Helm on EKS and want to (a) ingest Parquet from S3 and (b) stream data from MSK (Kafka 2.2.1 on the same VPC), probably requiring SSL.
1. I added a simple schema and table spec - they look OK.
2. I (think) I configured deep storage for S3.
3. Batch ingestion - really I am interested in any recommended (standalone) way to ingest data from S3 (watching folders on some interval), and I'm not sure if the docs provide what I need. Is there an example of posting directly to the controller?
4. I would like to try with Kafka too; in this case I'm just wondering about the configs (in thread). I am feeling my way through, and if anyone has a sample config for this MSK setup that would be nice to see, but I expect I'll get there with some trial and error.
   a. I was a little put off by the mention of needing to update the pom for Kafka version 2.2.1; I was not sure whether that is actually needed or how I would do that via Helm.

Dan DC

09/14/2021, 3:41 PM

Hey, is there a way to tag servers upon deployment/start-up? I.e. via config files instead of using the API?

09/16/2021, 10:44 AM

Do you have any idea?

Tiger Zhao

09/16/2021, 6:22 PM

Any tips for debugging slow queries? I was stress testing my cluster, and noticed a behavior where when I send a bunch of queries at once, the query latency goes from ~100ms to 4-5 seconds. The latency then stays relatively high for a few minutes after the stress test and then returns back to ~100ms. I also noticed behavior where sometimes a single server would take significantly longer to process a query, which ends up increasing the overall latency by a lot. That one slow server also stays consistently slow for a while, so every query is bottlenecked by that server. Thanks!

xtrntr

09/21/2021, 6:05 AM

hello, I'd like to clarify the usage of dimension tables - can I use columns that are in dimTable but not in factTable to filter in the WHERE clause? https://docs.google.com/document/d/1InWmxbRqwcqIakzvoEWHLxtX4XR9H5L01256EbAUHV8/edit#

Table factTable:
string    uuid
int       metric
timestamp event_time
string    status

Table dimTable:
string uuid
string name 
string country

SELECT
  f.uuid,
  d.name,
  d.country,
  abs(sum(f.metric)) as sum_metric
FROM
  factTable f join dimTable d on f.uuid = d.uuid
WHERE 
  d.country in ('USA')
GROUP BY
  1,
  2,
  3
ORDER BY
  2
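
A possible way to phrase the query above without a JOIN, sketched under assumptions: it assumes the `lookUp` transform function that Pinot documents for dimension tables is available in the deployed version, that `dimTable` is configured as a dimension table keyed on `uuid`, and that `lookUp` is accepted in the WHERE and GROUP BY clauses. None of this is confirmed in the thread; the `lookup` helper is purely illustrative.

```python
# Assumption: Pinot's lookUp(dimTable, dimColumn, joinKeyName, joinKeyValue)
# transform function is available and accepted in SELECT, WHERE and GROUP BY.
def lookup(dim_table, dim_column, join_key, fact_column):
    """Build a lookUp(...) expression that fetches dim_column from dim_table by join_key."""
    return f"lookUp('{dim_table}', '{dim_column}', '{join_key}', {fact_column})"


# Expressions for the two dimension columns used in the original query.
name = lookup("dimTable", "name", "uuid", "uuid")
country = lookup("dimTable", "country", "uuid", "uuid")

# Same shape as the original query, with explicit GROUP BY expressions
# instead of ordinals and the filter applied to the looked-up column.
QUERY = f"""
SELECT uuid, {name} AS name, {country} AS country, abs(sum(metric)) AS sum_metric
FROM factTable
WHERE {country} IN ('USA')
GROUP BY uuid, {name}, {country}
ORDER BY {name}
"""

print(QUERY)
```

Whether the filter on a looked-up column performs acceptably would need to be tested against the actual cluster.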

arun muralidharan

09/21/2021, 3:47 PM

Hello folks, can someone point me to a document about how segments are read from both local storage and deep storage? Can the cluster automatically recover from deep storage when the local segment store is cleared? I basically want to know what the read/write path looks like in the presence and absence of deep storage.

Tiger Zhao

09/22/2021, 9:14 PM

Does Pinot support features like the WITH clause, or views?

Tiger Zhao

10/01/2021, 8:20 PM

Is there a way to view the number of nodes that are generated for a star tree? (I'm exploring various indexing configs and was wondering how different setups affect storage and performance.)

Dan DC

10/07/2021, 2:50 PM

Hey, I've seen somewhere that Pinot has some special columns with metadata about the row segment path and other stuff. I can't seem to find that anywhere, and I wonder if someone could kindly point me to where they are documented.

Neha Pawar

10/08/2021, 8:22 PM

it has to be a new name, you cannot transform a column and put it into the same name
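
A minimal sketch of that rule, assuming it refers to ingestion `transformConfigs` entries of the form `{"columnName": ..., "transformFunction": ...}` in the table config (the thread context is not shown here). The helper name and the sample column names are hypothetical, and the identifier matching is deliberately rough.

```python
import re


def self_referencing_transforms(transform_configs):
    """Return the columns whose transform function reads the same column it writes to."""
    offenders = []
    for cfg in transform_configs:
        target = cfg["columnName"]
        # Rough identifier scan: toEpochDays(ts) -> {"toEpochDays", "ts"}.
        identifiers = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", cfg["transformFunction"]))
        if target in identifiers:
            offenders.append(target)
    return offenders


# Hypothetical configs: the first reuses the source column name, the second uses a new one.
configs = [
    {"columnName": "ts", "transformFunction": "toEpochDays(ts)"},
    {"columnName": "ts_days", "transformFunction": "toEpochDays(ts)"},
]

print(self_referencing_transforms(configs))
```

Running a check like this before posting a table config would flag the first entry, which writes the result back into `ts`.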

👍 1

Saoirse Amarteifio

10/11/2021, 5:12 PM

I'm running my first batch ingestion job from S3 Parquet files - the task was kicked off and the 8 rows of the input sample are read, but then it fails and I'm not sure what the error message is telling me... What is the illegal argument in this context? I did not get any closer by looking at the source for Segment Name Generator...

RecordReader initialized will read a total of 8 records.
at row 0. reading next block
block read in memory in 1 ms. row count = 8
Start building IndexCreator!
Finished records indexing in IndexCreator!
Failed to generate Pinot segment for file - s3://bucket/samples/data/myData/test.parquet
java.lang.IllegalArgumentException: null
        at shaded.com.google.common.base.Preconditions.checkArgument(Preconditions.java:108) ~[pinot-all-0.9.0-SNAPSHOT-jar-with-dependencies.jar:0.9.0-SNAPSHOT-11f8550b9b2881ede4d105416ed970a5dd708463]
        at org.apache.pinot.segment.spi.creator.name.SimpleSegmentNameGenerator.generateSegmentName(SimpleSegmentNameGenerator.java:53) ~[pinot-all-0.9.0-SNAPSHOT-jar-with-dependencies.jar:0.9.0-SNAPSHOT-11f8550b9b2881ede4d105416ed970a5dd708463]

Can anyone suggest, from this error message, what illegal thing I am doing? Adding jobSpec in thread...

Saoirse Amarteifio

10/11/2021, 8:43 PM

When I query Presto and there is a column named with a reserved keyword like timestamp, I cannot seem to submit a query that includes "timestamp", even though the Presto spec suggests that it can be escaped with double quotes. It might be specific to the clients I am using; I have tried a freshly downloaded presto-cli and a Python client, and both result in a PQLParsingError. What to do in this situation? (This is testing the presto-pinot connector, so maybe not a Pinot question for this channel.)

Sharon Akinyi

10/13/2021, 8:18 AM

Hello, I am new to using Apache Pinot. I am trying to learn more about Pinot operators. Would anyone help me understand how they work and how to go about using them?

Courage Noko

10/13/2021, 7:56 PM

hey, I deployed Pinot on Kubernetes. Is there a way to set Google Cloud Storage configs such as pinot.controller.storage.factory.gs.projectId on the server/controller during deployment, or to update these afterwards?

Neha Pawar

11/02/2021, 6:22 PM

if you don't have a schema registry, you need to provide the schema as a config in the stream config @User

Priyank Bagrecha

11/02/2021, 6:25 PM

one more question - do i need to keep port 8098 and 8099 open on server and broker nodes? i am setting everything up manually right now.

Priyank Bagrecha

11/02/2021, 10:34 PM

I finally got it working. Thanks a ton for all the help. Had to wrangle with the schema json a bit but finally victory!

👍 2

tyler dobbs

11/03/2021, 4:54 AM

Been trying to just start Pinot locally in a docker container. I'm using pinot version 0.8.0 and openjdk:11. I'm on a mac. I'm trying to start the cluster by using the pinot admin commands StartZookeeper, StartController, StartBroker and StartServer as shown in the getting started. However, inevitably the controller will go down before I can start the Broker and the Server, with this error: Expiring session 0x100080c84b20005, timeout of 30000ms exceeded. Is there a way to avoid this?

Priyank Bagrecha

11/09/2021, 12:56 AM

Can someone please point me to documentation for the enableDefaultStarTree and enableDynamicStarTreeCreation fields in the table config? I want to understand what a default / dynamic star-tree index means.

Priyank Bagrecha

11/09/2021, 9:56 PM

Oh no theta sketch either?

Priyank Bagrecha

11/11/2021, 9:11 PM

The link for Optimizing Scatter and Gather is broken on https://docs.pinot.apache.org/operators/operating-pinot/tuning

Neha Pawar

11/15/2021, 3:55 PM

You don't need the group id or any of the properties that say "hlc". Your tables might be out of sync because you've set offset criteria "largest". Each table will start consuming from the last message in the topic, so if your rate of events is high, the second table will miss out on events that were emitted between the creation of the first and second tables.
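
The effect described above can be illustrated with a toy model (not Pinot code). It assumes a single partition, no retention-based deletion, and hypothetical offsets; the function name is made up for this sketch.

```python
def consumed_offsets(end_offset_at_creation, final_end_offset, offset_criteria):
    """Offsets a table ends up consuming, given the topic's end offset when it was created."""
    # "smallest" starts from the beginning of the topic, "largest" from its current end.
    start = 0 if offset_criteria == "smallest" else end_offset_at_creation
    return range(start, final_end_offset)


# Hypothetical timeline: table A is created when the topic holds 1000 messages,
# table B when it holds 1500, and both are compared once it holds 2000.
table_a = consumed_offsets(1000, 2000, "largest")
table_b = consumed_offsets(1500, 2000, "largest")

# Messages 1000..1499 arrived between the two table creations, so only A has them.
missed_by_b = len(table_a) - len(table_b)
print(missed_by_b)  # 500
```

With "smallest" in this model, both tables would start at offset 0 and consume the same range regardless of when they were created.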

Priyank Bagrecha

11/15/2021, 9:40 PM

The link for Transform Function in Aggregation Grouping is broken on https://docs.pinot.apache.org/users/user-guide-query/querying-pinot#udf. I am guessing it should be pointing to https://docs.pinot.apache.org/users/user-guide-query/supported-transformations.

xtrntr

11/16/2021, 6:09 AM

will using IdSet with a “NOT IN” clause have any unintended performance impact? e.g. select * from table where userid not in IDSET(...)

Priyank Bagrecha

11/18/2021, 9:02 AM

both https://downloads.apache.org/pinot/apache-pinot-0.8.0/apache-pinot-0.8.0-bin.tar.gz and https://downloads.apache.org/pinot/apache-pinot-incubating-0.7.1/apache-pinot-incubating-0.7.1-bin.tar.gz are returning 404

Priyank Bagrecha

11/19/2021, 7:25 AM

i am noticing that disk on a controller instance starts filling up pretty fast. what can i do to slow it down?

Diana Arnos

11/19/2021, 2:10 PM

Hello there 👋 I'm developing something that uses Pinot, consuming straight from a new Kafka topic. I was able to run everything I need and it is beautiful (thanks for the work on this project 💪). Now I'm trying to improve some things in my project and wondered if there is a way to use a schema registry instead of keeping the table schema inside the project itself.

What I would like to happen: I have a JSON schema for the topic Pinot will consume from, and instead of manually creating/editing the table schema (as explained here in the docs), I would like Pinot to read the JSON schema from my registry and automagically use it when ingesting. I'm not sure if the configs stream.kafka.decoder.prop.schema.registry.rest.url and stream.kafka.decoder.prop.schema.registry.schema.name could help me achieve this.

👋 1

Pavel Stejskal

11/29/2021, 7:29 PM

Hello! I've got a question related to a simple use case. Currently we have a Hadoop cluster for netflow ingestion, ~320 TB of data. Ingestion is from Kafka via a Spark app directly to Hive (external table - simple Parquet files). Searching the stored data is done via Spark. The table is partitioned by hour, but we are still missing indexes. I'd like to replace the current flow with Apache Pinot, but I'm not sure about the segment store. We need to keep HDFS as the data backend, and from the documentation it seems like Pinot needs to store data locally. We're targeting a hybrid table, e.g. keep 1 hour from the real-time Kafka topic and have older data pulled from HDFS. My questions are:
a) The real-time part of the data needs local disks - every Pinot server holds a part of the data from Kafka (consumer in a group), right?
b) Is data older than one hour stored “optimized” and indexed locally, and pushed to HDFS?
c) When I query data, is current data pulled from local segments and older data pulled lazily from HDFS/S3?
d) Is it possible to host a 200 TB table with ~12 columns (half numeric, half strings) on 6 Pinot servers and get some benefit from indexes, i.e. be more efficient than Spark with partition pruning?
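
For question d), a back-of-the-envelope calculation of how much data each server would have to hold. The 200 TB and 6 servers come from the message; the replication factors are hypothetical, and the sketch assumes Pinot's on-disk segment size is comparable to the quoted table size, which may not hold once data is re-encoded and indexed.

```python
TABLE_SIZE_TB = 200  # from the question
NUM_SERVERS = 6      # from the question


def per_server_tb(table_tb, servers, replication=1):
    """Data each server must hold if segments are spread evenly across servers."""
    return table_tb * replication / servers


# Single copy of every segment (hypothetical replication factor of 1).
print(round(per_server_tb(TABLE_SIZE_TB, NUM_SERVERS), 1))  # 33.3

# Three copies of every segment (hypothetical replication factor of 3).
print(round(per_server_tb(TABLE_SIZE_TB, NUM_SERVERS, replication=3), 1))  # 100.0
```

Even with a single replica, each of the 6 servers would need to hold roughly 33 TB under these assumptions.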

Luis Fernandez

12/01/2021, 7:27 PM

has anyone tried to move the segments folder (if you use Google) to BigQuery, to do big data computations that may not be possible in Pinot?