https://pinot.apache.org/
# general
  • b

    balci

    06/24/2021, 10:25 PM
    Curious if there are folks from Slack here. Maybe Slack would be kind enough to donate a Pro tier for Pinot Slack workspace so the message history could be kept for a bit longer.
    👍 1
    ➕ 4
  • s

    sriramdas sivasai

    06/25/2021, 5:24 AM
    It was a great event yesterday. Many of my colleagues missed attending it. Is there anywhere we can get the recordings?
    👍 4
  • h

    Hemavathi

    06/25/2021, 8:18 AM
    Hi, we tried to use Azure Blob Storage for inputDirURI and outputDirURI while submitting the Spark job that pushes data into the Pinot server, and ran into the issue below:
    Exception thrown while calling mkdir (uri=wasbs://test@dev1.blob.core.windows.net/test/segments/, errorStatus=409) com.azure.storage.file.datalake.models.DataLakeStorageException: Status code 409, {"error":{"code":"EndpointUnsupportedAccountFeatures","message":"This endpoint does not support BlobStorageEvents or SoftDelete. Please disable these account features if you would like to use this endpoint."}}
    It works fine if we disable the "SoftDelete" option, but we need the soft-delete feature in our blob storage. It seems the Pinot code supports only ADLS Gen2, which is created on top of blob storage, so the blob credentials work when creating the ADLS Gen2 object. However, SoftDelete and BlobStorageEvents are unsupported features on an ADLS Gen2 object, and therefore we get the above error. Is it possible to support Azure Blob Storage with SoftDelete and BlobStorageEvents in Pinot?
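(Context for readers hitting the same 409: Pinot's Azure plugin talks to the Data Lake Gen2 endpoint, and it is that endpoint, not Pinot, that rejects accounts with SoftDelete or BlobStorageEvents enabled. For reference, the Gen2 setup that does work is configured roughly as below; property names are recalled from the Pinot docs, so double-check them against your version.)

```
pinot.controller.storage.factory.class.adl2=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
pinot.controller.storage.factory.adl2.accountName=<account>
pinot.controller.storage.factory.adl2.accessKey=<access-key>
pinot.controller.storage.factory.adl2.fileSystemName=<container>
pinot.controller.segment.fetcher.protocols=file,http,adl2
pinot.controller.segment.fetcher.adl2.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```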
  • j

    John

    06/28/2021, 5:38 PM
    Hi everyone, I am trying to integrate Kerberized Hadoop with Pinot, using the configuration below.
    Executables:
        export HADOOP_HOME=/usr/hdp/2.6.3.0-235/hadoop
        export HADOOP_VERSION=2.7.3.2.6.3.0-235
        export HADOOP_GUAVA_VERSION=11.0.2
        export HADOOP_GSON_VERSION=2.2.4
        export GC_LOG_LOCATION=/home/hdfs/Pinot/pinotGcLog
        export PINOT_VERSION=0.7.1
        export PINOT_DISTRIBUTION_DIR=/home/hdfs/apache-pinot-incubating-0.7.1-bin
        export HADOOP_CLIENT_OPTS="-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins -Dlog4j2.configurationFile=${PINOT_DISTRIBUTION_DIR}/conf/pinot-ingestion-job-log4j2.xml"
        export SERVER_CONF_DIR=/home/hdfs/apache-pinot-incubating-0.7.1-bin/bin
        export ZOOKEEPER_ADDRESS=<ZOOKEEPER_ADDRESS>
        export CLASSPATH_PREFIX="${HADOOP_HOME}/hadoop-hdfs/hadoop-hdfs-${HADOOP_VERSION}.jar:${HADOOP_HOME}/hadoop-annotations-${HADOOP_VERSION}.jar:${HADOOP_HOME}/hadoop-auth-${HADOOP_VERSION}.jar:${HADOOP_HOME}/hadoop-common-${HADOOP_VERSION}.jar:${HADOOP_HOME}/lib/guava-${HADOOP_GUAVA_VERSION}.jar:${HADOOP_HOME}/lib/gson-${HADOOP_GSON_VERSION}.jar"
        export JAVA_OPTS="-Xms4G -Xmx16G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -Xloggc:${GC_LOG_LOCATION}/gc-pinot-server.log"
    controller.conf:
        controller.data.dir=<fs.defaultFS>/user/hdfs/controller_segment
        controller.local.temp.dir=/home/hdfs/Pinot/pinot_tmp/
        controller.zk.str=<ZOOKEEPER_ADDRESS>
        controller.enable.split.commit=true
        controller.access.protocols.http.port=9000
        controller.helix.cluster.name=PinotCluster
        pinot.controller.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
        pinot.controller.storage.factory.hdfs.hadoop.conf.path=/usr/hdp/2.6.3.0-235/hadoop/conf
        pinot.controller.segment.fetcher.protocols=file,http,hdfs
        pinot.controller.segment.fetcher.hdfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
        pinot.controller.segment.fetcher.hdfs.hadoop.kerberos.principle='hdfs@HDFSSITHDP.COM'
        pinot.controller.segment.fetcher.hdfs.hadoop.kerberos.keytab='/home/hdfs/hdfs.keytab'
        pinot.controller.storage.factory.hdfs.hadoop.kerberos.principle='hdfs@HDFSSITHDP.COM'
        pinot.controller.storage.factory.hdfs.hadoop.kerberos.keytab='/home/hdfs/hdfs.keytab'
        controller.vip.port=9000
        controller.port=9000
        pinot.set.instance.id.to.hostname=true
        pinot.server.grpc.enable=true
    Kerberos information:
        kinit -V -k -t /home/hdfs/hdfs.keytab hdfs@HDFSSITHDP.COM
        Using default cache: /tmp/krb5cc_57372
        Using principal: hdfs@HDFSSITHDP.COM
        Using keytab: /home/hdfs/hdfs.keytab
        Authenticated to Kerberos v5
    Error message:
        END: Invoking TASK controller pipeline for event ResourceConfigChange::15fc3764_TASK for cluster PinotCluster, took 278 ms
        START AsyncProcess: TASK::TaskGarbageCollectionStage
        END AsyncProcess: TASK::TaskGarbageCollectionStage, took 0 ms
        Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
        Trying to authenticate user 'hdfs@HDFSSITHDP.COM' with keytab '/home/hdfs/hdfs.keytab'..
        Could not instantiate file system for class org.apache.pinot.plugin.filesystem.HadoopPinotFS with scheme hdfs
        java.lang.RuntimeException: Failed to authenticate user principal ['hdfs@HDFSSITHDP.COM'] with keytab ['/home/hdfs/hdfs.keytab']
            at org.apache.pinot.plugin.filesystem.HadoopPinotFS.authenticate(HadoopPinotFS.java:258) ~[pinot-hdfs-0.7.1-shaded.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
        Caused by: java.io.IOException: Login failure for 'hdfs@HDFSSITHDP.COM' from keytab '/home/hdfs/hdfs.keytab': javax.security.auth.login.LoginException: Unable to obtain password from user.
            at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:962) ~[pinot-orc-0.7.1-shaded.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
            at org.apache.pinot.plugin.filesystem.HadoopPinotFS.authenticate(HadoopPinotFS.java:254) ~[pinot-hdfs-0.7.1-shaded.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
            ... 15 more
        Caused by: javax.security.auth.login.LoginException: Unable to obtain password from user
            at com.sun.security.auth.module.Krb5LoginModule.promptForPass(Krb5LoginModule.java:901) ~[?:1.8.0_241]
            at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:764) ~[?:1.8.0_241]
            at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617) ~[?:1.8.0_241]
            at org.apache.pinot.plugin.filesystem.HadoopPinotFS.authenticate(HadoopPinotFS.java:254) ~[pinot-hdfs-0.7.1-shaded.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
            ... 15 more
        Failed to start a Pinot [CONTROLLER] at 21.954 since launch
        java.lang.RuntimeException: java.lang.RuntimeException: Failed to authenticate user principal ['hdfs@HDFSSITHDP.COM'] with keytab ['/home/hdfs/hdfs.keytab']
            at org.apache.pinot.spi.filesystem.PinotFSFactory.register(PinotFSFactory.java:58) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
    P.S. I am executing this as the hdfs user, and the owner of the keytab file is also hdfs. I have also given 777 access to the hdfs.keytab file. Could someone kindly suggest what the issue is here? I have read multiple blogs, and everywhere the advice is that it is caused by a wrong principal/keytab combination, the user not having access, needing to give 777 access to the file, or needing to try a different user. I have tried all of these options, but nothing has worked so far.
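(A debugging sketch for anyone hitting the same error: "Unable to obtain password from user" means the JAAS Krb5LoginModule found no usable key for that exact principal in the keytab and fell back to prompting. Two checks worth running with standard MIT Kerberos tooling are below. Also note that the single quotes around the principal and keytab values in controller.conf may be passed through literally by the properties parser, in which case 'hdfs@HDFSSITHDP.COM' with quotes would never match the principal stored in the keytab; trying the values unquoted is worth a shot.)

```
# Show the principals (and encryption types) actually stored in the keytab;
# the configured principal must match character for character, realm case included.
klist -ket /home/hdfs/hdfs.keytab

# Confirm the user that launches the Pinot JVM can read the keytab.
ls -l /home/hdfs/hdfs.keytab
```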
  • z

    Zsolt Takacs

    06/29/2021, 6:49 AM
    The docs say:
    Pinot only allows adding new columns to the schema. In order to drop a column, change the column name or data type, a new table has to be created.
    Does this also mean that moving a field, e.g. from dimensions to metrics, is not possible?
  • k

    Karin Wolok

    06/29/2021, 12:56 PM
    👋 Heyyyyy to all our new community members! 👋 Welcome! We're happy to have you! Curious who you are and what brought you to Pinot? 🍷 @User @User @User @User @User @User @User @User @User @User @User @User
    🙌 1
  • l

    Luis Muñiz

    06/29/2021, 1:00 PM
    I asked a question in #C01H1S9J5BJ but I'm not sure that was the appropriate channel
  • l

    Luis Muñiz

    06/29/2021, 1:13 PM
    Why is Pinot not represented in the Wikipedia comparison page of OLAP servers? https://en.wikipedia.org/wiki/Comparison_of_OLAP_servers
  • n

    N. Mert Aydin

    06/29/2021, 2:20 PM
    Hello everyone. It’s so awesome to be here with you. Looking forward to jumping in after a quick review of threads 😉
    👋 3
    ❤️ 1
  • s

    Sajjan Kumar

    06/30/2021, 5:24 AM
    Hey Karin, I was looking for near real-time reporting tech for our system and Pinot seems great. We may need other tools along with Pinot as well. So it's good to be here, guys.
  • k

    Kewei Shang

    07/01/2021, 4:15 PM
    Hi @User, a question about the real-time table's upsert feature: would it be possible to use another column (e.g. an int column or the Kafka offset) rather than the event-time column (configured by
    timeColumnName
    in
    segmentsConfig
    in the table config) to decide which record is the latest version when the primary key is the same? I'm asking because Kafka Streams JOIN produces many records with the same event-time value (in our case, the
    last_update
    column), so in Pinot the last version is chosen randomly among these records with the same event-time. Thanks
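(Note for later readers: newer Pinot releases added exactly this knob, an upsert comparison column that overrides the time column when resolving the latest record for a primary key. A sketch of the table-config fragment is below, assuming a hypothetical kafka_offset column ingested from the stream; verify the field name against the docs for your Pinot version.)

```
"upsertConfig": {
  "mode": "FULL",
  "comparisonColumn": "kafka_offset"
}
```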
  • k

    Kishore G

    07/01/2021, 6:28 PM
    Hello all, we need your input as we think about the big features to add to Pinot. We have avoided implementing joins in Pinot and have always referred folks to Presto/Spark to achieve joins on top of Pinot. However, we are seeing contributions from Uber on lookup join and requests from users to support native joins in Pinot. Is this something that would benefit existing users of Pinot? How do you handle joins?
    • 1️⃣ We don't need it, since we pre-join the data before pushing it to Pinot
    • 2️⃣ We use Presto/Trino and we are happy with Presto/Trino
    • 3️⃣ We would LOVE to see Pinot support JOIN
    Please vote
    1️⃣ 1
    3️⃣ 9
    2️⃣ 2
  • l

    Liran Brimer

    07/03/2021, 7:00 PM
    Hi everyone, on the homepage you write that the DB is suitable when working on immutable data. However, I see that you support upserts as well: https://docs.pinot.apache.org/basics/data-import/upsert So I'm a bit confused about whether it fits mutable table data or not.
  • s

    Soumya Dey

    07/04/2021, 7:29 PM
    Hi everyone, we are evaluating Apache Pinot for our analytical use case. We have encountered some scenarios for which we haven't found a proper justification yet. Please help us understand the reasoning behind them and how to address them:
    1. Why is inserting into a Pinot table via the Presto connector not supported, when almost all other SQL commands are supported?
    2. Why is updating records with an update query not allowed on a Pinot table via Presto?
    3. If we want to replicate the same set of data values in a Pinot table, how do we do it at present without Kafka ingestion? E.g. we want to multiply the existing 1M records via insert into TableA ( select * from TableA ). Since the Presto connector does not allow inserting into a table and Pinot itself doesn't support subqueries, neither of those two options is available.
    4. If we make a mistake in a column name during schema creation and later update the schema, will the previously ingested values for that column be picked up automatically? E.g. a realtime table has a column called "NAME" that was supposed to be "name". The previously ingested Kafka stream data has values for the "name" attribute, so after the schema change will Pinot automatically update the values for all rows, or do we need to retrofit the "name" values again? If we need to retrofit, what is the best possible way?
    5. Can a single query read from both REALTIME and OFFLINE tables? As subqueries and joins are not supported directly by Pinot, is there any way we can achieve that?
  • t

    troywinter

    07/05/2021, 3:34 AM
    How do I efficiently compute a distinct count for column a that depends on the value of column b, like this:
    select FLOOR(10000*count(distinct case when action = 'pay' then user_id else '' end) / count(distinct case when action = 'browse' then user_id else '' end))/100 from "capi_trace" where company_id = 'aaaaa' and __time >= '1622554026000' and __time <= '1625146026000' and shop_id in ('xxxxx')
    Is it possible without using case when inside distinct count?
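(One possible rewrite, assuming a Pinot version new enough to support FILTER clauses on aggregations; a sketch worth verifying, not a guaranteed-supported form:)

```
select FLOOR(10000 *
         DISTINCTCOUNT(user_id) FILTER (WHERE action = 'pay') /
         DISTINCTCOUNT(user_id) FILTER (WHERE action = 'browse')) / 100
from "capi_trace"
where company_id = 'aaaaa'
  and __time >= '1622554026000' and __time <= '1625146026000'
  and shop_id in ('xxxxx')
```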
  • s

    Soumya Dey

    07/05/2021, 4:42 AM
    Hi everyone, I want to know about the compression ratio in Apache Pinot. For example, if I have a 10GB JSON file containing records with 100 columns, how much memory is required to save it on the Pinot server (considering there will be only 1 replica)? Also, in-memory segments get flushed to the segment store once a threshold is reached, so how much storage should be provisioned for the deep store in the controller?
  • e

    Eugene Ramirez

    07/05/2021, 7:32 AM
    Hi all, What is the timeline for Pinot 0.8? Do you have the roadmap and timelines written somewhere?
    ➕ 2
  • k

    Ken Krugler

    07/06/2021, 6:03 PM
    We generate OFFLINE segments via Hadoop, and sometimes these are updates to existing segments. In that case we want the segment names to match exactly (so that it’s an update). For most segments this is fine, as we partition by month. But there are cases where we also sub-partition by a non-date field. In this situation I don’t see a way to leverage the
    SegmentNameGenerator
    interface to give us a deterministic name. If we could key off of the input (CSV) file name then it would be easy, as we’ve got full control over that. Any ideas?
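(For context, the batch ingestion job spec does expose a segmentNameGeneratorSpec, sketched below from memory of the docs, though as far as I know none of the built-in types key off the input file name, which may be exactly why a custom SegmentNameGenerator is needed here. Worth double-checking against the docs for your version.)

```
segmentNameGeneratorSpec:
  # built-in types include 'simple', 'normalizedDate', and 'fixed'
  type: fixed
  configs:
    segment.name: myTable_2021_06
```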
  • j

    Josh Highley

    07/06/2021, 6:36 PM
    If a table exists for multiple tenants, is it possible to restrict query results to a single tenant?
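(Worth clarifying for readers: a Pinot tenant isolates resources, i.e. which brokers and servers host a table, not rows within a shared table. Row-level restriction is typically enforced by the querying application adding a mandatory filter, sketched here with a hypothetical tenant_id column:)

```
select * from accounts where tenant_id = 'TenantA' limit 10
```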
  • k

    Karin Wolok

    07/07/2021, 4:58 PM
    👋 Welcome to all the new 🍷 Pinot members! We're happy to have you! Can you tell us a little about who you are and what brought you here? 😃 @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User
    🙌 1
    ❤️ 1
  • j

    Josh Highley

    07/07/2021, 7:45 PM
    I want to confirm this: if I create a table 'accounts' for TenantA, I cannot create another table 'accounts' for TenantB. Is that correct? I would have to create unique names 'accounts_tenantA' and 'accounts_tenantB'?
  • s

    sriramdas sivasai

    07/08/2021, 11:33 AM
    Hello everyone, I have a doubt about the storage and query side of Pinot. Suppose we have 6 months of data as Pinot segments in deep storage (500GB in size) and I want to run an aggregate query over the last 6 months of data.
    1. Does my offline data server need 500GB of memory (RAM) to process the query, or will queries work efficiently even with 100GB of RAM and 500GB of storage?
    2. Also, will my query work if I don't have 500GB of storage?
    3. Is the memory required for loading a segment file from disk the same as the size of the file? I ask because loading a compressed file into memory can blow up RAM usage 3-4x. Also, if I want to read a single record from the previous 6 months, will it do on-demand segment loading from deep storage?
  • l

    Liran Brimer

    07/08/2021, 12:55 PM
    Hi everyone, we are evaluating Pinot and one of our requirements is to be able to encrypt our clients' data on disk (in memory it can be decrypted). Is such a thing possible? If so, we may also need to encrypt with a different encryption key per client (each client's data encrypted with a unique key dedicated to that client). Is there a way to achieve that? Thank you so much.
  • c

    Carlos DomĂ­nguez

    07/08/2021, 9:44 PM
    Hi guys!
    I have a question regarding Kafka integration with Pinot
    If I'm using a secured Kafka cluster with SASL_SSL, is there any way of configuring that and using those credentials? Or is there another way of setting up security from Pinot to Kafka for data ingestion?
    Thanks in advance!
    👋 2
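(For readers with the same question: Kafka client security settings can generally be passed through streamConfigs in the table config. A hedged sketch for SASL_SSL with the PLAIN mechanism is below; the exact pass-through behavior is worth verifying against the Kafka plugin docs for your Pinot version.)

```
"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.topic.name": "my-topic",
  "security.protocol": "SASL_SSL",
  "sasl.mechanism": "PLAIN",
  "sasl.jaas.config": "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"<user>\" password=\"<password>\";"
}
```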
  • s

    sunil

    07/09/2021, 7:09 AM
    Hi guys, can we integrate Apache Ranger into Apache Pinot?
  • r

    Ryan Clark

    07/09/2021, 4:44 PM
    I am trying to plug in an AWS Kinesis data stream, but I am unable to get authenticated because we use temporary credentials, which requires an AWS_SESSION_TOKEN to be given. Is there a way to give a session token along with the key and secret? Here is part of my table config.
    "streamConfigs": {
              "streamType": "kinesis",
              "stream.kinesis.topic.name": "stream-name",
              "region": "us-east-1",
              "accessKey": "sdfsdf",
              "secretKey": "sdfdsf",
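(One avenue that may help: if accessKey and secretKey are left out, AWS SDK clients typically fall back to the default credentials provider chain, and that chain does honor AWS_SESSION_TOKEN alongside AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the environment. A hedged sketch; whether the Kinesis plugin actually falls back to the default chain should be verified for your Pinot version.)

```
"streamConfigs": {
  "streamType": "kinesis",
  "stream.kinesis.topic.name": "stream-name",
  "region": "us-east-1"
}
```

with AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN exported in the environment of the Pinot server process.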
  • l

    Liran Brimer

    07/11/2021, 8:11 AM
    Hi, does Pinot have a limit on the number of tables? Would it support millions of tables?
  • r

    RK

    07/11/2021, 2:16 PM
    What could be the possible reason for a bad segment? Whenever I push segments from HDFS to a Pinot table, the job executes without any error and the segments get created in Pinot, but their status shows BAD. I have started all the Pinot components (server, broker, controller) on different nodes, and I am executing this Hadoop ingestion job on a separate node where my Hadoop client is installed. Is it mandatory to also start a Pinot server on the same node where I am executing the Hadoop ingestion job?
  • s

    sriramdas sivasai

    07/11/2021, 7:59 PM
    Hi everyone, we are evaluating Druid and Pinot for one of our use cases. Just to understand, are the
    star-tree index
    in Pinot and
    roll-up
    in Druid the same? Or is there a difference?
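(They are related but not identical: Druid roll-up pre-aggregates at ingestion time and drops the raw rows, while Pinot's star-tree index builds pre-aggregations alongside the raw data, so detail queries still work. A star-tree definition in the table config looks roughly like this; field names are recalled from the Pinot docs and the dimension/metric names are placeholders, so double-check before use.)

```
"tableIndexConfig": {
  "starTreeIndexConfigs": [{
    "dimensionsSplitOrder": ["country", "browser"],
    "skipStarNodeCreationForDimensions": [],
    "functionColumnPairs": ["SUM__impressions", "COUNT__*"],
    "maxLeafRecords": 10000
  }]
}
```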
  • y

    Yupeng Fu

    07/12/2021, 6:30 PM
    partial upsert will be released in 0.8