# general

    Zsolt Takacs

    03/26/2021, 9:32 AM
    This part in the documentation about Stream Ingestion with Upsert says that the input stream has to be partitioned by the primary key. Is this a literal requirement, or does it just mean that messages with the same PK must be in the same partition, so it can be achieved with a different key too? i.e. using only a subset of the PK fields, which is coarser partitioning but still keeps all messages of a PK in one partition.
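For context, upsert is enabled via primaryKeyColumns in the schema plus an upsertConfig on the realtime table, roughly like this (table and column names are made up for illustration):

```json
{
  "tableName": "events_REALTIME",
  "tableType": "REALTIME",
  "upsertConfig": { "mode": "FULL" },
  "routing": { "instanceSelectorType": "strictReplicaGroup" }
}
```

The matching schema would declare something like `"primaryKeyColumns": ["userId"]`; the partitioning requirement applies to the columns listed there.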

    Mayank

    03/27/2021, 6:07 PM
    For refresh, you need to ensure that the names and number of segments generated and pushed on each refresh match the names and number of segments already in Pinot.
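One way to keep segment names stable across refreshes is to pin the name generator in the batch ingestion job spec; a sketch (the prefix is illustrative), assuming the same input files are processed in the same order each run:

```yaml
segmentNameGeneratorSpec:
  type: normalizedDate
  configs:
    segment.name.prefix: 'myTable_refresh'
```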

    Kevin P

    03/29/2021, 11:28 PM
    Hello there, I am from France but currently living in Québec. I think I discovered Pinot a few months ago through Kenny Bastani's tweets; sadly I didn't really put Pinot into practice as I didn't find time for it 😞 (I mean, to get further than "Getting started" 😁) But I came to some meetups and read some docs/articles because I am really interested in it! I hope I will find some time to use it for real 🤞
    🍷 4
    👋 4

    Josh Highley

    03/30/2021, 9:27 PM
    If there are multiple controllers, Pinot expects that all of them are configured with the same back-end storage system so that they have a common view of the segments (e.g. NFS). Pinot can use other storage systems such as HDFS or ADLS
    I can't find more info about this. Would any mountable NFS work? S3 for example?
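S3 does work as deep storage via the pinot-s3 plugin rather than as an NFS mount. A controller config sketch based on the S3 deep storage docs (bucket, paths, and region are placeholders):

```properties
controller.data.dir=s3://my-bucket/pinot/controller-data
controller.local.temp.dir=/tmp/pinot-controller-tmp
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.storage.factory.s3.region=us-east-1
pinot.controller.segment.fetcher.protocols=file,http,s3
pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```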

    Akash

    03/31/2021, 10:56 AM
    How does Pinot behave when there is a huge number of tables for events (say 2k)? I have a feeling that ZK can be a bottleneck in this situation.

    Dianjin Wang

    04/01/2021, 3:16 AM
    Hi Pinot community members, I'm from StreamNative, now working on organizing Pulsar Virtual Summit NA 2021. I'm trying to get in touch with the Pinot community and invite you to be a community partner of our Summit. As a community partner, the Pinot logo will be featured on the Pulsar Summit website and in promotional materials for the event, and it will also appear in the Opening Keynote at the Summit. I think Pinot will gain valuable mindshare with a targeted audience. Also, you are welcome to submit talks on Apache Pulsar + Pinot. I'm not sure this is the right place to talk about this. Looking forward to your reply. Thanks a lot!
    🍷 1

    Srini Kadamati

    04/01/2021, 1:25 PM
    hello from the Apache Superset community! 👋 We're hosting a fun event with @User on April 13th on using Trino <> Superset to join data from Pinot and Mongo. Would love to see y'all there! https://preset.io/events/2021-04-13-visualizing-mongodb-and-pinot-data-using-trino
    🍷 1
    🚀 1
    ➿ 2
    🐇 1

    Ting Chen

    04/02/2021, 12:27 AM
    @User for column transformation, https://docs.pinot.apache.org/developers/advanced/ingestion-level-transformations#column-transformation, (1) for a hybrid table, is the ingestionConfig required in both realtime and offline tables (2) can the transformation be applied to existing data? can we reload the table to do it?
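For reference, the linked page configures transforms inside the table config; for a hybrid table the block would need to appear in each table config that should apply it. A sketch (column names and function are illustrative):

```json
"ingestionConfig": {
  "transformConfigs": [
    { "columnName": "fullName", "transformFunction": "concat(firstName, lastName, ' ')" }
  ]
}
```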

    Oguzhan Mangir

    04/02/2021, 1:53 PM
    Can we pass an hdfs path to the jobSpecFile config for reading the job spec, instead of a local path?
    ${SPARK_HOME}/bin/spark-submit \
      --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
      --master "local[2]" \
      --deploy-mode client \
      --conf "spark.driver.extraJavaOptions=-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins -Dlog4j2.configurationFile=${PINOT_DISTRIBUTION_DIR}/conf/pinot-ingestion-job-log4j2.xml" \
      --conf "spark.driver.extraClassPath=${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar" \
      local://${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar \
      -jobSpecFile ${PINOT_DISTRIBUTION_DIR}/examples/batch/airlineStats/sparkIngestionJobSpec.yaml
    like:
    -jobSpecFile hdfs://bucket/pinot-specs/sparkIngestionJobSpec.yaml

    Kevin Vu

    04/02/2021, 4:25 PM
    Hello All, I started looking into Apache Pinot for a company use case. We would like to read rows from Cassandra tables and insert them into Pinot from Apache Flink. From what I have read in the documentation so far, it seems I would have to write a custom batch segment writer. Is there any way I can do this without writing a custom writer, and instead do a push into Pinot, like using JDBC insert statements for example?

    Phil Fleischer

    04/02/2021, 5:42 PM
    hey, i am playing with presto/pinot and when doing simple count queries i’m hitting rowcount maximums, anyone know why the aggregation isn’t delegated to pinot?

    Mike Davis

    04/02/2021, 5:55 PM
    When generating offline segments, are there any recommendations around target segment size?

    Karin Wolok

    04/06/2021, 5:42 PM
    Hello new Pinot members! 🍷 Welcome to the community! 👋 Who are you? What brought you to the community? How did you first hear about Pinot? @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User
    👏 3

    Oguzhan Mangir

    04/06/2021, 6:45 PM
    Hello, I have a few questions about hdfs deep storage. We are working on AliCloud, and we want to use Alibaba Object Storage Service (OSS) as deep storage, configured through the hdfs deep storage support. We are running on Kubernetes. Is there anyone running Pinot on Alibaba Cloud, or can anyone help us with that? Also, I wonder: do the Pinot controllers and servers try to connect to deep storage while the pods are being created? We set the deep storage configs and add the jars after the pod is created.

    Maheshdatta Mishra

    04/08/2021, 11:38 AM
    I am new and trying to see if Pinot is suitable for my use case. I have a requirement where I need different indexes for different teams, to support different query patterns on the same table. I am planning to use an offline table. Is there a way I can generate the segments once for the tables with different indexing patterns?

    Josh Highley

    04/08/2021, 9:14 PM
    when using a replicated ZooKeeper environment, can -zkAddress accept multiple URLs or does it require a load balancer?
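-zkAddress takes a standard ZooKeeper connect string, so a comma-separated list of ensemble members should work without a load balancer; a sketch (hostnames are illustrative):

```shell
bin/pinot-admin.sh StartController -zkAddress zk1:2181,zk2:2181,zk3:2181 -clusterName PinotCluster
```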

    Ricardo Bernardino

    04/09/2021, 8:32 AM
    Hi! We are checking out realtime ingestion with upsert and we have some questions around it:
    • Can we have a retention period of, say, 6 months?
    • Is there a significant impact of this on the upsert logic?
    • If we add new servers, are the partitions correctly spread to the new servers?
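For reference, retention is set in the segmentsConfig section of the table config; a 6-month retention would look something like this (values are illustrative):

```json
"segmentsConfig": {
  "retentionTimeUnit": "DAYS",
  "retentionTimeValue": "180",
  "replication": "3"
}
```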

    Zsolt Takacs

    04/09/2021, 12:20 PM
    I was looking for the HIGHEST_KAFKA_OFFSET_CONSUMED or HIGHEST_STREAM_OFFSET_CONSUMED metric for monitoring stream ingestion lag, and found that it was removed as part of https://github.com/apache/incubator-pinot/issues/5359. Is there an alternative way to monitor the stream ingestion?

    Vaibhav Sinha

    04/09/2021, 1:52 PM
    Hi everyone. I am planning to experiment with Pinot for the user-facing analytics use cases we have. Our scale is not too large (~1M DAU) and we have a small team of 3 engineers working on data engineering for the first time. We primarily use managed services on AWS. With Pinot, one of the concerns is self-managing the infrastructure, and I wanted to know what the experience of others has been in this regard.

    Aaron Wishnick

    04/09/2021, 3:54 PM
    Is pinot-admin.sh the preferred way to upload batch data or is there a REST API for that too?
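There is a controller REST endpoint for segment upload as well; pinot-admin's UploadSegment command is essentially a wrapper around it. A rough sketch, with the caveat that the exact path and form-field name should be confirmed in the controller's Swagger UI (http://localhost:9000/help):

```shell
curl -F segment=@myTable_segment_0.tar.gz "http://localhost:9000/segments"
```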

    Chad Preisler

    04/09/2021, 10:01 PM
    What version of the JDK is required to run Pinot? When building, the tests don’t run correctly with JDK 11 and above. That makes me wonder what JDK I can run with.

    Prashant Kumar

    04/12/2021, 8:55 PM
    Hi all, is it possible in Pinot to create a table from another table with some schema changes, and import partial data into the new one?

    Akash

    04/13/2021, 2:51 PM
    I am trying to start Pinot with hdfs as deep storage, but I'm getting an error while starting the server:
    bin/start-server.sh -zkAddress pinot1.plan:2181,pinot2.plan:2181,pinot3.plan:2181 -configFileName conf/server.conf
    and the server configs are:
    pinot.server.instance.enable.split.commit=true
    pinot.server.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
    hadoop.conf.path=/local/hadoop/etc/hadoop/
    pinot.server.storage.factory.hdfs.hadoop.conf.path=/local/hadoop/etc/hadoop/
    pinot.server.segment.fetcher.protocols=file,http,hdfs
    pinot.server.segment.fetcher.hdfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
    pinot.server.instance.dataDir=/home/akashmishra/hpgraph/apache-pinot-incubating-0.6.0-bin/data/PinotServer/index
    pinot.server.instance.segmentTarDir=/home/akashmishra/hpgraph/apache-pinot-incubating-0.6.0-bin/data/PinotServer/segmentTar

    Akash

    04/13/2021, 2:53 PM
    In the UI documentation it's written:
    -Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-hdfs

    Oguzhan Mangir

    04/13/2021, 3:23 PM
    Hello, we've tried to use AliCloud OSS (like S3 in Amazon) as Pinot deep storage. There is no pinot-oss deep storage plugin right now, but we are able to use OSS as Pinot deep storage via the Pinot hdfs file system plugin. We created documentation for that: https://docs.pinot.apache.org/users/tutorials/use-oss-as-deep-storage-for-pinot

    Aaron Wishnick

    04/13/2021, 3:54 PM
    If I understand right, when I batch ingest a set of parquet files, the job will create a segment for each parquet file and then will upload it all to Pinot? Is that right? If so, are there any guidelines about picking segment sizes for optimal query performance?

    Aaron Wishnick

    04/13/2021, 3:56 PM
    Also, when I run the batch ingestion job I see some debug output about dictionary-encoding the columns, including numeric metric columns. Does that mean it's dictionary-encoding the data in Pinot's internal format? Say I'd like to compute averages and quantiles of these metrics grouped by different dimensions: is dictionary encoding best for that, or should I disable it? Or is what I'm seeing not relevant to query performance?
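For reference: dictionary encoding is per-column in Pinot's segment format, and columns can be opted out of it in the table config if raw storage suits the workload better. A sketch (column names are illustrative):

```json
"tableIndexConfig": {
  "noDictionaryColumns": ["latencyMs", "bytesSent"]
}
```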

    Ting Chen

    04/13/2021, 5:51 PM
    @User @User do you know if JSONPATHARRAY(jsonField, 'jsonPath') can be used in a WHERE clause to find out if the array contains a certain value?
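An aside: with a JSON index in place, JSON_MATCH is the usual way to filter on array containment, though availability depends on the Pinot version; the field and path here are illustrative:

```sql
SELECT COUNT(*) FROM myTable
WHERE JSON_MATCH(jsonField, '"$.tags[*]" = ''pinot''')
```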

    Aaron Wishnick

    04/13/2021, 7:13 PM
    If I've already created a table and batch ingested data, can I add a star-tree index after the fact or do I need to start from scratch?

    Aaron Wishnick

    04/13/2021, 7:29 PM
    If I have SUM and COUNT in the star tree index's functionColumnPairs, will AVG implicitly be able to use the star tree index, or do I need to put AVG in that list too?
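For reference, functionColumnPairs uses a FUNCTION__column syntax in the star-tree definition of the table config; a sketch (dimension and metric names are illustrative):

```json
"starTreeIndexConfigs": [
  {
    "dimensionsSplitOrder": ["country", "browser"],
    "functionColumnPairs": ["SUM__clicks", "COUNT__*"],
    "maxLeafRecords": 10000
  }
]
```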