Apache Pinot #general

Shen Wan

09/10/2020, 9:20 PM

How can we make sure partitions are balanced? E.g., if I partition on a column and some values are much more frequent than others, which functionName should I use? And how should I decide the numPartitions?
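One way to reason about this question is to measure skew offline before settling on a partition count. The sketch below is a minimal illustration in Python, assuming a stand-in partition function (CRC32 of the value modulo the partition count); it is not Pinot's partition function implementation, and `partition_of` and `partition_skew` are hypothetical helpers. It shows that when one value dominates the column, all of its rows land in the same partition whatever the function, so adding partitions raises the skew rather than lowering it.

```python
from collections import Counter
import zlib


def partition_of(value: str, num_partitions: int) -> int:
    # Stand-in partition function (assumption): CRC32 of the value modulo num_partitions.
    return zlib.crc32(value.encode("utf-8")) % num_partitions


def partition_skew(values, num_partitions: int) -> float:
    """Ratio of the largest partition's row count to the ideal even share."""
    counts = Counter(partition_of(v, num_partitions) for v in values)
    ideal = len(values) / num_partitions
    return max(counts.values()) / ideal


# Illustrative data: one value accounts for 90 of 100 rows.
rows = ["US"] * 90 + [f"country-{i}" for i in range(10)]
for n in (2, 4, 8):
    print(n, round(partition_skew(rows, n), 2))
```

A skew close to 1.0 means the partitions are roughly even; values well above 1.0 point to a hot partition.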

Adrian Cole

09/14/2020, 5:06 AM

does anyone have any hints to help me do this faster? I want to use local objects to configure schema instead of the REST API as done in the AddTable command https://github.com/apache/incubator-pinot/issues/5977

Oliver

09/14/2020, 9:59 AM

Dear all, thank you very much for building Apache Pinot - it is indeed a great tool! One question that came up during our evaluation of Pinot is how to handle year over year (or period over period) comparisons in Pinot (or in the Viz Tooling). How would you typically do this? In SQL, one would normally use either window functions (like LAG with OVER (PARTITION BY ...)) or self-joins. Any advice is highly appreciated 🙂 THANKS!

🙏 2

👍 1
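For readers unfamiliar with the comparison being asked about, the arithmetic can be done outside the database once one aggregate per period is available. The Python sketch below assumes per-year totals have already been fetched by a plain GROUP BY query; the `year_over_year` helper and the sample numbers are illustrative assumptions, not a Pinot feature.

```python
def year_over_year(totals):
    """Map each year to its fractional change versus the previous year."""
    changes = {}
    for year in sorted(totals):
        prev = totals.get(year - 1)
        if prev:  # skip years with no (or zero) baseline
            changes[year] = (totals[year] - prev) / prev
    return changes


# Illustrative per-year aggregates (made-up numbers).
totals = {2018: 200.0, 2019: 250.0, 2020: 300.0}
print(year_over_year(totals))
```

This is the same result a LAG over the yearly aggregate would give, computed in the client instead.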

Joey Pereira

09/15/2020, 5:20 AM

👍. Follow up: I set instanceId=pinot-broker-1 and that looks like it did some "fun" things! The server's instance zk state is

{
  "id": "pinot-broker-1",
  "simpleFields": {
    "HELIX_ENABLED": "true",
    "HELIX_ENABLED_TIMESTAMP": "1600146975179",
    "HELIX_HOST": "pinot-broker-1",
    "HELIX_PORT": ""
  },
  "mapFields": {},
  "listFields": {
    "TAG_LIST": [
      "DefaultTenant_BROKER"
    ]
  }
}

Based on a bit of spelunking, it looks like the instanceId has to conform to a strict form of <type>_<hostname>_<port> for internals to work?
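The naming form described in this message can be checked with a small parser. This is a sketch in Python that assumes only the `<type>_<hostname>_<port>` shape quoted above; `parse_instance_id` is a hypothetical helper, not a Pinot or Helix API, and the example ID and port are made up for illustration.

```python
from typing import NamedTuple, Optional


class InstanceId(NamedTuple):
    type: str
    hostname: str
    port: int


def parse_instance_id(instance_id: str) -> Optional[InstanceId]:
    # Take the type from the left and the port from the right,
    # so hostnames that contain underscores still parse.
    type_, sep, rest = instance_id.partition("_")
    hostname, sep2, port = rest.rpartition("_")
    if not (sep and sep2 and type_ and hostname and port.isdigit()):
        return None
    return InstanceId(type_, hostname, int(port))


print(parse_instance_id("Broker_pinot-broker-1_8099"))  # illustrative ID
print(parse_instance_id("pinot-broker-1"))  # no type or port to extract
```

An ID such as pinot-broker-1 yields no port under this form, which would be consistent with the empty HELIX_PORT in the state shown above.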

udk

09/15/2020, 4:24 PM

Hi, I am trying to ingest data into Pinot. The CSV file is about 30 GB. The job has been running for about 5 hours and has not completed yet. Could someone let me know where I can find the logs for this process? I have the Pinot cluster running in Docker containers, with a setup similar to the one described here: https://docs.pinot.apache.org/basics/getting-started/advanced-pinot-setup

Will Briggs

09/18/2020, 5:50 PM

Hi all! I'm doing some prototyping with Pinot for what seems like a perfect use-case (ingest Avro from Kafka into a single, large, table, and perform fast aggregation and filtering on it, for a rolling window of time). Just to get my feet wet, I've started trying to bootstrap the pinot-quickstart on Kubernetes via Docker Desktop. I've noticed that the pinot-broker-0 pod is constantly spamming warnings into the log, because there is no zookeeper running:

2020/09/18 17:48:20.691 WARN [ClientCnxn] [Start a Pinot [BROKER]-SendThread(pinot-zookeeper.pinot-quickstart.svc.cluster.local:2181)] Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused

Buchi Reddy

09/18/2020, 6:24 PM

Some minor feedback on the new QueryConsole UI. 1. The previous version, where results were displayed above the response stats, was much more intuitive. 2. The response stats printed as JSON were much easier to read and understand than a table. It's always a one-row table anyway, so I think this should be JSON instead.

Shen Wan

09/18/2020, 7:14 PM

QQ: can we set pinot data explorer query timeout over 25 seconds?

Buchi Reddy

09/21/2020, 7:12 PM

quick question about table retention: we are observing a behavior where we have 5 days of retention on a table, but when we query, we get some records that are older than 5 days too. Could this be happening because there are segments spanning the 5-day boundary? Do we check the retention in the query path to make sure each record returned is within the retention window?
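The hypothesis in this question can be made concrete. The Python sketch below assumes retention is applied per segment, dropping a segment only once its newest record is past the window; the `Segment` class, `is_purgeable`, and the dates are hypothetical and only illustrate the reasoning, not Pinot's retention code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Segment:
    name: str
    start: datetime  # oldest record in the segment
    end: datetime    # newest record in the segment


def is_purgeable(seg: Segment, now: datetime, retention: timedelta) -> bool:
    # Assumption: a segment is dropped as a whole, and only when its
    # newest record is already older than the retention window.
    return seg.end < now - retention


now = datetime(2020, 9, 21)
retention = timedelta(days=5)
straddling = Segment("seg_a", datetime(2020, 9, 14), datetime(2020, 9, 17))
expired = Segment("seg_b", datetime(2020, 9, 10), datetime(2020, 9, 15))

print(is_purgeable(straddling, now, retention))  # False
print(is_purgeable(expired, now, retention))     # True
```

Under that assumption, seg_a stays queryable even though its oldest rows are 7 days old, so an explicit time filter in the query is what would exclude them.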

Shen Wan

09/22/2020, 8:05 PM

what’s in columns.psf? It’s a file ~300M.

Dharak Kharod

09/23/2020, 2:11 AM

Hi, I had a quick question on multivalued columns: what is the expected behavior when a multivalued column is used as a group-by key? Or is that an invalid pattern that should not be used?

Sashikanth Damaraju

09/23/2020, 10:24 PM

👋 Hi pinot devs. I am using the open source docker images to host a local cluster. I noticed that a schema that used to work for the 0.4.0 version of the images is throwing a 400 bad request with the 0.5.0 version of the images (latest as of today). The schema I'm using is: https://pastebin.pl/view/a509e966 Where should I be looking for any specific stacktraces / logs? docker logs -f <controller_container_id> doesn't show anything relevant to the 400

Raghav

09/24/2020, 4:38 PM

Hi, I'm exploring Pinot as a potential serving layer replacement of HDFS in our org.... Does writing ORC files from S3 (5 TB in batches for now) every day make sense for Pinot?

Adam Haines

09/26/2020, 3:23 PM

Does Pinot offer any support for LDAP security?

Adrian Cole

09/28/2020, 12:08 AM

anyone free to help discuss some technical topics on auto-schema creation? https://github.com/apache/incubator-pinot/pull/6039

Adrian Cole

09/28/2020, 4:40 AM

can I talk about slow startup? 😄

Hunter E

09/28/2020, 7:52 PM

Howdy! Apologies if anyone has asked this before, but we’ve been using Pinot for a little bit at our org and we are now talking about multi-regional availability for our Pinot setup (mainly satisfying queries in other regions). I was curious if anyone here has put some thought or setup into how to run a Pinot cluster in multiple regions (even if it’s a duplicate cluster in another region ingesting a replicated source or something along those lines). Thanks!

SandishKumarHN

09/29/2020, 7:01 PM

Has anyone seen this before?

[INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @ pinot-spark-connector ---
[INFO] /Users/sandishkumarhn/git/incubator-pinot/pinot-connectors/pinot-spark-connector/src/main/scala:-1: info: compiling
[INFO] Compiling 16 source files to /Users/sandishkumarhn/git/incubator-pinot/pinot-connectors/pinot-spark-connector/target/classes at 1601397499238
[ERROR] java.lang.NoClassDefFoundError: scala/reflect/internal/Trees
[INFO] 	at java.lang.Class.getDeclaredMethods0(Native Method)
[INFO] 	at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
[INFO] 	at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
[INFO] 	at java.lang.Class.getMethod0(Class.java:3018)
[INFO] 	at java.lang.Class.getMethod(Class.java:1784)
[INFO] 	at scala_maven_executions.MainHelper.runMain(MainHelper.java:155)
[INFO] 	at scala_maven_executions.MainWithArgsInFile.main(MainWithArgsInFile.java:26)

Prakash Tirumalareddy

10/01/2020, 3:44 AM

Hello everyone, very quick question: I have lots of data in an S3 bucket (Parquet format). Can I use Pinot to retrieve or query a single record based on a query or condition? Is Pinot the right software for this?

Shane Fitzroy

10/01/2020, 5:17 AM

Hi all, I work at an analytics start-up in Sydney, Australia that has just landed series-A investment. We're now re-architecting a particular product for scale and to support more features, going from an MVP monolith with a Postgres DB to a more distributed system. I'm looking to understand how Pinot may fit into a new stack and would like to know, from an operations perspective, how demanding a Pinot deployment might be and how we might have to scale our engineering teams to support such a deployment in production. If the context helps, it's a multi-tenant application that orchestrates some ETL+ML data pipelines with Spark/Databricks and the Gurobi optimiser. Essentially, the outputs are "results" datasets (Parquet files on Azure Data Lake) comprising 50M~100M rows and 30~40 columns. We're aiming to support at least 500 users spread over several customers/tenants. We expect concurrent users to be generating/experimenting with new datasets regularly throughout the day (hourly). One of the web applications will be a front-end with a UI "data-grid" where users will want to perform exploratory/interactive analysis on results (aggregations, group-by, filtering, search, counting, etc.) at reasonably low latency. On paper, Pinot looks like a great fit, but is it overkill for us? How many engineers would it take to support a deployment for our volume of data/ingestion? Note that ZooKeeper is not an existing part of our stack yet. Sorry for the wall of text. Any advice or experience from others here would be greatly appreciated. Cheers.

Igor Lema

10/01/2020, 1:19 PM

Hi everyone, I’m looking to leverage Pinot in a simple analytics use case: allowing distinct counts, funnel analysis and anomaly detection on user click events from our app. Currently, at our somewhat large company, we are ingesting 100~200K events/second across 300 different (but defined) schemas; the biggest schema should have 40 columns, but the majority have fewer than 20. In this mix, at least 10% are late events and duplicates (2TB a day). Currently we have more than 500 users querying this data in an exploratory/interactive fashion in our own front-end. With Pinot we hope to achieve sub-minute latency. Pinot looks like the perfect fit for this use case, since there is no need to join events, but my main doubt is how big this infrastructure should be to support this volume, and how hard it is going to be to support a deployment of this size. I’m planning on deploying with K8s, using S3 as the segment store for Pinot. I also don’t need the Offline Server or any batch ingestion job.

Wojtek Sznapka

10/05/2020, 7:18 AM

Hi guys, I'm looking into adopting Pinot in our organisation and it looks like a good fit! The problem I'm trying to solve is to move away from using BigQuery for daily Superset dashboards and make use of Pinot in user-facing apps. I have pre-cubed data coming from Spark/Snappy data in Kafka and want to use it as a source in Pinot. The only problem is that the data is "append-only" (it comes from Debezium), so we have all create and update records in one stream (let's say a user bets on sport: his ticket is created with the state "accepted", then the ticket changes state to winning or non-winning and we get another record in Kafka). In BQ we use row_number over (partition by ticket_number order by source.lsn DESC), which numbers rows with 1 as the newest, and then we look for row number = 1 in a sub-query (BQ docs). How would you solve this in Pinot? I didn't find windowing/analytic functions. Thanks in advance!
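The BigQuery pattern described here keeps, for each ticket_number, the row with the highest lsn. The Python sketch below spells out those semantics as a pre-processing or client-side step; `latest_per_key` and the sample events are illustrative assumptions, and this is not a Pinot feature.

```python
def latest_per_key(records, key="ticket_number", order="lsn"):
    """Keep only the record with the highest `order` value for each `key`."""
    latest = {}
    for rec in records:
        current = latest.get(rec[key])
        if current is None or rec[order] > current[order]:
            latest[rec[key]] = rec
    return list(latest.values())


# Illustrative change stream: ticket T1 is created, then updated.
events = [
    {"ticket_number": "T1", "lsn": 100, "state": "accepted"},
    {"ticket_number": "T2", "lsn": 101, "state": "accepted"},
    {"ticket_number": "T1", "lsn": 105, "state": "winning"},
]
for row in latest_per_key(events):
    print(row["ticket_number"], row["state"])
```

This is equivalent to filtering on row number = 1 after ordering each ticket's rows by lsn descending.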

Prakash Tirumalareddy

10/05/2020, 2:23 PM

hello there, getting the following exception..

Caused by: java.lang.IllegalArgumentException: Parameter 'Bucket' must not be null

I am using 0.5.0

GenerationJobRunner, segmentTarPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner, segmentUriPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner}
includeFileNamePattern: glob:**/*.parquet
inputDirURI: s3://edp-pinot-data/nem13/
jobType: SegmentCreationAndUriPush
outputDirURI: s3://edp-pinot-segments/nem13/segments
overwriteOutput: true
pinotClusterSpecs:
- {controllerURI: 'http://localhost:9000'}
pinotFSSpecs:
- {className: org.apache.pinot.spi.filesystem.LocalPinotFS, configs: null, scheme: file}
- className: org.apache.pinot.plugin.filesystem.S3PinotFS
  configs: {region: ap-southeast-2}
  scheme: s3
pushJobSpec: {pushAttempts: 1, pushParallelism: 1, pushRetryIntervalMillis: 1000, segmentUriPrefix: 's3://edp-pinot-segments', segmentUriSuffix: null}
recordReaderSpec: {className: org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader, configClassName: null, configs: null, dataFormat: parquet}
segmentNameGeneratorSpec: null
tableSpec: {schemaURI: 'http://localhost:9000/tables/nem13/schema', tableConfigURI: 'http://localhost:9000/tables/nem13', tableName: nem13}

Am I missing anything? Please help!!!

Neha Pawar

10/07/2020, 5:15 PM

Hey folks! This Data Council event is happening tomorrow, where @User and Pete Soderling will discuss Apache Pinot and where it fits in the analytics landscape. Please tune in at 9:30 AM PDT. Link to register: https://www.eventbrite.com/e/dc-thurs-apache-pinot-w-kishore-gopalakrishna-tickets-122050330825

🍷 13

👏 13

Vinu Thomas

10/07/2020, 5:54 PM

how is Pinot different from Druid? Or Kylin?

Vinu Thomas

10/09/2020, 10:41 AM

how do you start contributing code to pinot??

👍 2

Buchi Reddy

10/09/2020, 5:45 PM

quick question: Does pinot support custom aggregation functions as UDFs? For example, when I'm aggregating results of a group, can I pick the latest record by timestamp?

Cesar

10/09/2020, 10:46 PM

Hey folks, I'm doing some GC profiling of Pinot 0.4.0 and 0.5.0 and I'm noticing a huge difference between the memory allocation rates of these two versions. Everything in my setup is fixed, except the version of Pinot used. The Pinot 0.4.0 server process is allocating on average 4.4GB/s while the Pinot 0.5.0 server is allocating less than 1GB/s on average. Does such a huge difference make sense to you? Do you know of a code change that could have caused such a big impact? I'm using Ubuntu 20, Java 1.8 with G1GC -Xmx12G. I'm using the TPCH data set and JMeter to send 1M 'select * from tpch_lineitem' queries to Pinot.

alec goldis

10/11/2020, 5:16 PM

newbie question: I am looking to move away from SSAS, but most of my data consumers are using Excel. What is the user experience in Excel when connecting to Pinot?

Adrian Cole

10/12/2020, 5:05 AM

suggestion to whoever is in build eng. nix the random JDKs?