Apache Pinot #general

Monica

03/09/2022, 8:11 AM

Hey everyone, is there any configuration that lets the inverted index, bloom filter, etc. persist in a segment? If so, will the server use less memory when reading the inverted index for a segment, e.g. can the server hold only the inverted index or bloom filter for a segment in memory?
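For the index question above, a minimal sketch of the table-config fragment that declares these indexes, built as a Python dict and printed as JSON. It assumes the usual `tableIndexConfig` keys (`invertedIndexColumns`, `bloomFilterColumns`, `loadMode`); the column name `memberId` and the helper name are hypothetical, and whether memory-mapped loading lowers memory use enough depends on the workload.

```python
import json


def build_table_index_config(inverted_cols, bloom_cols, load_mode="MMAP"):
    """Return a tableIndexConfig fragment as a plain dict (illustrative helper)."""
    if load_mode not in ("MMAP", "HEAP"):
        raise ValueError("load_mode must be 'MMAP' or 'HEAP'")
    return {
        "tableIndexConfig": {
            # Columns that get an inverted index when the segment is built.
            "invertedIndexColumns": list(inverted_cols),
            # Columns that get a bloom filter when the segment is built.
            "bloomFilterColumns": list(bloom_cols),
            # MMAP maps segment files from disk instead of loading them on heap.
            "loadMode": load_mode,
        }
    }


if __name__ == "__main__":
    # "memberId" is a hypothetical column name.
    fragment = build_table_index_config(["memberId"], ["memberId"])
    print(json.dumps(fragment, indent=2))
```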

Pavel Stejskal

03/09/2022, 8:13 AM

Hello! What's an efficient way to filter records on multi-valued columns, e.g. List<String> with n=4, where we want to filter all documents by one value in the value set? Are the forward and inverted indexes efficient here? Or is it better to split the multi-valued column into several columns? What's the recommended design for: 1. fixed N, 2. variable N (e.g. 1 to 10)? Thank you
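As a conceptual illustration of the question above (not Pinot's actual implementation), an inverted index over a multi-valued column can be thought of as a map from each distinct value to the documents that contain it. In this sketch, filtering by one value is a single lookup regardless of how many values each document holds, so fixed and variable N behave the same way. All names and data here are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each distinct value to the sorted doc IDs whose value list contains it."""
    index = defaultdict(set)
    for doc_id, values in enumerate(docs):
        for value in values:
            index[value].add(doc_id)
    return {value: sorted(ids) for value, ids in index.items()}


def filter_docs(index, value):
    """Return the doc IDs matching `value`; empty list if the value is unknown."""
    return index.get(value, [])


# Hypothetical multi-valued column: each inner list is one document's values.
docs = [["red", "blue"], ["green"], ["blue", "green", "red", "black"], []]
index = build_inverted_index(docs)
print(filter_docs(index, "blue"))  # [0, 2]
```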

Monica

03/09/2022, 9:21 AM

Hey everyone, I found that the Pinot text index only supports the standard analyzer. Is there any plan to support custom analyzers, like Elasticsearch does? Or could you give me some advice on how best to implement it if we build this feature ourselves?

Bordin Suwannatri

03/10/2022, 5:12 AM

Hi, I'm trying to use HDFS with Pinot, following the document --> https://docs.pinot.apache.org/basics/getting-started/hdfs-as-deepstorage but it's not working.

###my config##

####controller config##
pinot.service.role=CONTROLLER
pinot.cluster.name=pinot-uat
controller.host=pinot-uat01
controller.data.dir=hdfs://path/in/hdfs/for/controller/segment
controller.local.temp.dir=/tmp/pinot/data/controller
controller.zk.str=172.19.131.116:2181,172.19.131.117:2181,172.19.131.118:2181
controller.enable.split.commit=true
controller.access.protocols.http.port=9000
controller.helix.cluster.name=pinot-uat
pinot.controller.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
pinot.controller.storage.factory.hdfs.hadoop.conf.path=/etc/hadoop/conf
pinot.controller.segment.fetcher.protocols=file,http,hdfs
pinot.controller.segment.fetcher.hdfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
pinot.controller.segment.fetcher.hdfs.hadoop.kerberos.principle=hdptest@TRUE.CARE
pinot.controller.segment.fetcher.hdfs.hadoop.kerberos.keytab=/data/apache-pinot/keytab/hdptest.keytab
controller.vip.host=pinotuat.true.care
controller.vip.port=9000
controller.port=9000
pinot.set.instance.id.to.hostname=true
pinot.server.grpc.enable=true

########Executable
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_VERSION=2.6.0-cdh5.16.2
export HADOOP_GUAVA_VERSION=11.0.2
export HADOOP_GSON_VERSION=2.2.4
export GC_LOG_LOCATION=/data/apache-pinot/logs/
export PINOT_VERSION=0.8.0
export PINOT_DISTRIBUTION_DIR=/data/apache-pinot
export SERVER_CONF_DIR=/data/apache-pinot/conf
export ZOOKEEPER_ADDRESS=172.19.131.116:2181,172.19.131.117:2181,172.19.131.118:2181
export CLASSPATH_PREFIX="${HADOOP_HOME}/client/hadoop-hdfs-${HADOOP_VERSION}.jar:${HADOOP_HOME}/client/hadoop-annotations-${HADOOP_VERSION}.jar:${HADOOP_HOME}/client/hadoop-auth-${HADOOP_VERSION}.jar:${HADOOP_HOME}/client/hadoop-common-${HADOOP_VERSION}.jar:${HADOOP_HOME}/client/guava-${HADOOP_GUAVA_VERSION}.jar:${HADOOP_HOME}/client/gson-${HADOOP_GSON_VERSION}.jar"
export JAVA_OPTS="-Xms8G -Xmx12G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:${GC_LOG_LOCATION}/gc-pinot-controller.log"
${PINOT_DISTRIBUTION_DIR}/bin/start-controller.sh -configFileName ${SERVER_CONF_DIR}/pinot-controller.conf

###########error log####
2022/03/10 12:06:41.771 INFO [StartControllerCommand] [main] Executing command: StartController -configFileName /data/apache-pinot/conf/pinot-controller.conf
2022/03/10 12:06:41.843 INFO [StartServiceManagerCommand] [main] Executing command: StartServiceManager -clusterName pinot-uat -zkAddress 172.19.131.116:2181,172.19.131.117:2181,172.19.131.118:2181 -port -1 -bootstrapServices []
2022/03/10 12:06:41.843 INFO [StartServiceManagerCommand] [main] Starting a Pinot [SERVICE_MANAGER] at 0.012s since launch
2022/03/10 12:06:41.847 INFO [StartServiceManagerCommand] [main] Started Pinot [SERVICE_MANAGER] instance [ServiceManager_poc-pinot01_-1] at 0.016s since launch
2022/03/10 12:06:41.848 INFO [StartServiceManagerCommand] [main] Starting a Pinot [CONTROLLER] at 0.016s since launch
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/usr/lib/hadoop/hadoop-auth-2.6.0-cdh5.16.2.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/htrace/core/Tracer$Builder
    at org.apache.hadoop.fs.FsTracer.get(FsTracer.java:42)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2803)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:98)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2853)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2835)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:387)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:186)
    at org.apache.pinot.plugin.filesystem.HadoopPinotFS.init(HadoopPinotFS.java:65)
    at org.apache.pinot.spi.filesystem.PinotFSFactory.register(PinotFSFactory.java:52)
    at org.apache.pinot.spi.filesystem.PinotFSFactory.init(PinotFSFactory.java:72)
    at org.apache.pinot.controller.BaseControllerStarter.initPinotFSFactory(BaseControllerStarter.java:518)
    at org.apache.pinot.controller.BaseControllerStarter.setUpPinotController(BaseControllerStarter.java:358)
    at org.apache.pinot.controller.BaseControllerStarter.start(BaseControllerStarter.java:308)
    at org.apache.pinot.tools.service.PinotServiceManager.startController(PinotServiceManager.java:123)
    at org.apache.pinot.tools.service.PinotServiceManager.startRole(PinotServiceManager.java:93)
    at org.apache.pinot.tools.admin.command.StartServiceManagerCommand.lambda$startBootstrapServices$0(StartServiceManagerCommand.java:233)
    at org.apache.pinot.tools.admin.command.StartServiceManagerCommand.startPinotService(StartServiceManagerCommand.java:285)
    at org.apache.pinot.tools.admin.command.StartServiceManagerCommand.startBootstrapServices(StartServiceManagerCommand.java:232)
    at org.apache.pinot.tools.admin.command.StartServiceManagerCommand.execute(StartServiceManagerCommand.java:182)
    at org.apache.pinot.tools.admin.command.StartControllerCommand.execute(StartControllerCommand.java:149)
    at org.apache.pinot.tools.admin.PinotAdministrator.execute(PinotAdministrator.java:166)
    at org.apache.pinot.tools.admin.PinotAdministrator.main(PinotAdministrator.java:186)
    at org.apache.pinot.tools.admin.PinotController.main(PinotController.java:35)
Caused by: java.lang.ClassNotFoundException: org.apache.htrace.core.Tracer$Builder
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
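The NoClassDefFoundError above names org/apache/htrace/core/Tracer$Builder, which suggests that no jar on the controller's classpath provides that class. A minimal diagnostic sketch, assuming a Unix-style classpath string such as the CLASSPATH_PREFIX in this message: it scans each listed jar and reports which ones contain the class. The helper names are hypothetical. The class is often shipped in an htrace-core4 jar, but that is an assumption to verify against the local Hadoop distribution.

```python
import os
import zipfile
from pathlib import Path


def class_entry(class_name):
    """Convert a dotted class name to its path inside a jar."""
    return class_name.replace(".", "/") + ".class"


def jars_containing(class_name, classpath):
    """Return the jars on `classpath` that contain `class_name`."""
    entry = class_entry(class_name)
    hits = []
    for item in classpath.split(os.pathsep):
        path = Path(item)
        # Skip directories, missing files, and non-jar entries.
        if path.suffix != ".jar" or not path.is_file():
            continue
        try:
            with zipfile.ZipFile(path) as jar:
                if entry in jar.namelist():
                    hits.append(str(path))
        except zipfile.BadZipFile:
            continue  # Not a readable archive; ignore it.
    return hits


if __name__ == "__main__":
    classpath = os.environ.get("CLASSPATH_PREFIX", "")
    print(jars_containing("org.apache.htrace.core.Tracer$Builder", classpath))
```

An empty list means none of the listed jars provide the class, so a jar that does would need to be added to the classpath.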

Bordin Suwannatri

03/10/2022, 10:14 AM

error --> ERROR [PinotFSFactory] [main] Could not instantiate file system for class org.apache.pinot.plugin.filesystem.HadoopPinotFS with scheme hdfs

sunny

03/11/2022, 1:48 AM

Hello! I'm working on a Pinot POC. Does Pinot have query audit logging? I found the query log in the controller and broker logs (only the query, not the user), but I can't find any config or docs about query audit logging in Pinot.

abhinav wagle

03/14/2022, 9:25 PM

We did a basic POC of Pinot on bare-metal EC2 hosts on AWS (and see that Pinot will be an excellent fit for our use cases) and are now at the stage of working toward a production setup. I wanted to check with the community: are there any gotchas in running the cluster via Kubernetes (Helm) vs. dedicated EC2 hosts? Does Pinot have any performance benefit when running on dedicated EC2 instances vs. Kubernetes pods at scale?

Naga Aravind

03/15/2022, 3:15 AM

Hi all, I am currently working on a Pinot POC to solve OLAP needs in our org. Our architecture team is worried about using ZooKeeper/Helix as metadata storage. Let's say we want to run Pinot for 5 years in production: at some point ZooKeeper will run out of disk space, since it stores all segment metadata, right? [Druid uses a SQL database (MySQL/PostgreSQL) to store this kind of metadata.] Please suggest the right solution to this problem. I'm also curious how other big companies (Uber, LinkedIn, etc.) run Pinot in their production setups.

Bordin Suwannatri

03/15/2022, 11:12 AM

Hi all, I'm doing a Pinot POC. Does Pinot support real-time tables with Kafka over SASL_SSL (Kerberos authentication + certificate)?

Diana Arnos

03/16/2022, 8:21 AM

Hello again! 😅 We are trying some production environment setups and I'm having trouble identifying the optimal configuration. Can you point me to some resources? I also need to find out how much storage I need to set up for the Controller, but I couldn't find anything about that in the docs. I tried running with 1G (the default value) and 10G, but neither was enough. Segments are uploaded to Controller storage, right? My schema, table configs and Helm chart configs are in the thread.

Saumya Upadhyay

03/16/2022, 10:00 AM

👋 Hi everyone! I am new to Apache Pinot and am using it for real-time ingestion from a Kafka topic. We are using Confluent Kafka with Schema Registry and an Avro schema. I am able to connect to the Kafka topic: my table is created successfully and is in a healthy state, but queries return no records. How can we check whether there is an issue on the consuming side? I cannot see any errors from the Swagger debug table API either.

Nizar Hejazi

03/16/2022, 4:58 PM

Hi everyone, can Apache Pinot support a list-of-lists data type (a multi-valued column whose values are also lists)? Thanks.

Grace Lu

03/16/2022, 7:12 PM

Hi team, I would like some suggestions on what the Pinot batch ingestion story looks like in a production environment. Ideally we want to use Spark cluster mode for ingestion in production, but we ran into lots of issues when submitting jobs in a distributed fashion to our production Spark clusters on YARN. Currently we only have Spark local mode and Pinot standalone ingestion working for batch data, but we are worried this will not be sustainable for ingesting larger production tables. What do people generally use for ingesting Pinot data in production? I'm asking because I don't see much documentation or discussion around using the Spark generation job with YARN master and cluster deploy mode. Besides, we are on Hadoop 2.9.1, Spark 2.4.6 on YARN, and Pinot 0.9.2, and I'm also interested to know whether anyone has successfully set up cluster-mode batch ingestion with a similar Hadoop/Spark environment 👀.

pranay

03/16/2022, 8:04 PM

Can someone guide me on where to start reading the code? I want to understand the architecture and the codebase as well.

Monica

03/17/2022, 8:56 AM

Hi everyone, I found that if a column doesn't have a text index and I use the TEXT_MATCH function on it as a predicate expression, like this:

select * from transcript where TEXT_MATCH(firstName, 'firstName*') limit 10

it throws an exception like this:

[
  {
    "message": "QueryExecutionError:\njava.lang.NullPointerException\n\tat org.apache.pinot.core.operator.filter.TextMatchFilterOperator.getNextBlock(TextMatchFilterOperator.java:45)\n\tat org.apache.pinot.core.operator.filter.TextMatchFilterOperator.getNextBlock(TextMatchFilterOperator.java:30)\n\tat org.apache.pinot.core.operator.BaseOperator.nextBlock(BaseOperator.java:49)\n\tat org.apache.pinot.core.operator.DocIdSetOperator.getNextBlock(DocIdSetOperator.java:62)",
    "errorCode": 200
  },
  {
    "message": "QueryExecutionError:\njava.lang.NullPointerException\n\tat org.apache.pinot.core.operator.filter.TextMatchFilterOperator.getNextBlock(TextMatchFilterOperator.java:45)\n\tat org.apache.pinot.core.operator.filter.TextMatchFilterOperator.getNextBlock(TextMatchFilterOperator.java:30)\n\tat org.apache.pinot.core.operator.BaseOperator.nextBlock(BaseOperator.java:49)\n\tat org.apache.pinot.core.operator.DocIdSetOperator.getNextBlock(DocIdSetOperator.java:62)",
    "errorCode": 200
  }
]

Can the TEXT_MATCH function only be used on columns with a text index? In Presto, if a predicate expression can't be pushed down to the connector, a filter operator is added on top of it. So maybe it would be better for users writing PQL if Pinot supported this too?

Luis Fernandez

03/17/2022, 2:57 PM

Hey friends, I'm using the Grafana dashboard that is provided in the Pinot docs, and I was wondering what's a healthy difference between "Table query latency" and "Server total query time per table". In my case, for the same table at p99, the broker side reports an average of 55ms but the server side reports an average of 18ms, so I was wondering whether that difference is healthy or something I should look into more closely.

Diana Arnos

03/18/2022, 11:11 AM

Hey there, I'm trying to run the config recommendation engine and I don't understand how to specify the number of Kafka partitions that we already have. There is no example of the parameter name in the docs. I tried

"partitionRuleParams": {
    "KAFKA_NUM_MESSAGES_PER_SEC_PER_PARTITION": 0.7,
    "KAFKA_NUM_PARTITIONS": 128
  },

But KAFKA_NUM_PARTITIONS is not recognized. How can I tell it the current number of Kafka partitions we have?

Romeo

03/19/2022, 2:26 PM

Hi there, how do I pass taints and tolerations as params to the Helm chart? I've been googling the syntax but can't find the correct way to do it. An example would be much appreciated, thanks.

Nizar Hejazi

03/19/2022, 11:01 PM

Hey there, I want to write a transformation to convert a datetime column that has string values (e.g. '2022-03-19T11:00:18.789Z') or nulls into timestamps. Using the built-in function (FromDateTime) throws java.lang.NullPointerException when the value is null:

{
  "columnName": "updatedAt_timestamp",
  "transformFunction": "FromDateTime(updatedAt, 'yyyy-MM-dd''T''HH:mm:ss.SSS''Z''')"
}

I tried using a Groovy script as follows, but I see this exception in the logs: MissingPropertyException: No such property: DateTimeFormat for class: Script1

{
  "columnName": "col_timestamp",
  "transformFunction": "Groovy({col == null ? null : DateTimeFormat.forPattern('yyyy-MM-dd\\'T\\'HH:mm:ss.SSS\\'Z\\'').withZone(DateTimeZone.forID(DateTimeZone.UTC.getID())).parseMillis(adminAccessGrantedOn)}, col)"
},

Do I need to import the Joda-Time classes into Groovy? Can I write a multi-line Groovy script as an ingestion transform? Is there any other workaround for dealing with nulls in the built-in FromDateTime function? (I can submit a PR to update the date-time functions to handle nulls.) Please note that I have "nullHandlingEnabled" set to true.
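To make the intended behavior concrete, here is a minimal Python sketch of a null-safe conversion from the string format described above to epoch milliseconds. It only illustrates the semantics the transform is after (null in, null out). It is not a Pinot or Groovy transform, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
# Matches values such as 2022-03-19T11:00:18.789Z (always UTC, literal "Z").
PATTERN = "%Y-%m-%dT%H:%M:%S.%fZ"


def to_epoch_millis(value):
    """Parse a UTC timestamp string into epoch millis; pass None through."""
    if value is None:
        return None
    parsed = datetime.strptime(value, PATTERN).replace(tzinfo=timezone.utc)
    # Integer division of timedeltas avoids floating-point rounding.
    return (parsed - EPOCH) // timedelta(milliseconds=1)


print(to_epoch_millis("2022-03-19T11:00:18.789Z"))  # 1647687618789
print(to_epoch_millis(None))  # None
```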

Diogo Baeder

03/19/2022, 11:08 PM

Hi there folks! I've just published an article on my personal blog about YouGov starting to use Apache Pinot. I can't yet publish it on a more appropriate YouGov public engineering blog, so I opted to host it on my own site for now. Here it is: https://diogobaeder.com.br/articles/2022/03/19/pinot-in-yougov.html

🙏 1

🎉 1

🍷 3

Saumya Upadhyay

03/21/2022, 1:21 PM

Hi all, is there any way to dump the whole message from a topic into one column? My use case is that we have a very complex Avro schema, and creating a schema and mapping every field is very tedious work. I have done some transformations using transformConfigs, but doing this for every topic and schema is going to be very complex.

Saumya Upadhyay

03/22/2022, 7:11 AM

Thanks @User, I tried the AvroSchemaToPinotSchema utility but it gives an error:

Caused by: java.lang.IllegalStateException: Not one field in the RECORD schema at shaded.com.google.common.base.Preconditions.checkState

This error doesn't make sense, as the schema is correct and works with our messages.

👀 1

rajeh kalluri

03/22/2022, 5:29 PM

Hi Gang, this is Raj from Austin TX. I am a Cloud and Streaming Data enthusiast. Wanted to learn more about Pinot, and felt this is a good spot to be in the know.

👋 5

abhinav wagle

03/22/2022, 7:35 PM

Hi All. Is there a tool actively used by the community for benchmarking query performance of Pinot Cluster?

Nizar Hejazi

03/23/2022, 1:33 AM

Hey team, I see that support for storing decimals as byte[] and the SumPrecision function was added in Pinot 0.6.0. We use Pinot through Presto/Trino. I think the byte[] type will be mapped to Presto's varbinary. I cannot find a built-in Presto function equivalent to Pinot's bytesToBigDecimal, so we may need to define a Presto UDF. Do you plan to look into inferring the Presto decimal type from Pinot byte[] (if Pinot can somehow indicate to Presto that a byte[] stores a decimal), similar to the pinot.infer-date-type-in-schema and pinot.infer-timestamp-type-in-schema settings in the Presto Pinot connector? Also, have you considered adding a first-class decimal data type to Pinot? Thanks.

Paul-Armand Verhaegen

03/23/2022, 9:16 AM

Hi all, I'm Paul-Armand Verhaegen. I'm the Data Domain and Data Specialty Architect for a News Publisher (we operate in a couple of European countries). Interested in basically anything with data (science and engineering), electronics, making stuff, math, hard problems, crypto tech, also organisational things related to data mesh. Here to learn about Pinot, what it can do for us, and which datasketches are useful in our RT dashboards.

Prashant Pandey

03/23/2022, 10:11 AM

Hi team 🙂. We have a use-case where we’d like to coalesce small segments to larger ones. However, it’s a realtime table and we use RT2OFF to move segments to offline servers periodically. Is it possible to use the minion merge rollup task to merge the segments residing on OFFLINE servers (although the docs explicitly mention that it only supports OFFLINE tables)? Thanks 🙂

Varun Mukundhan

03/24/2022, 1:24 PM

Hi folks, I am new to Pinot, so I'd like to apologize in advance for the noob questions that may be incoming: 1. Is there any major performance difference between PQL and SQL? For example, we have a use case where we need top X aggregations. I can do this with "top X" in PQL or "order by <aggregation> desc LIMIT X" in SQL. Which one do you recommend? 2. What are the differences between metrics and dimensions? I can see that aggregation queries are allowed on non-string dimensions as well.

❤️ 1
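For reference, the "order by <aggregation> desc LIMIT X" form asked about above amounts to: aggregate per group, sort by the aggregate in descending order, keep the first X. A small Python sketch of those semantics follows. It says nothing about the performance of either query language, and the field names and data are hypothetical.

```python
from collections import Counter


def top_x(rows, group_key, metric_key, x):
    """Sum `metric_key` per `group_key` and return the X largest groups."""
    totals = Counter()
    for row in rows:
        totals[row[group_key]] += row[metric_key]
    # Sort by descending total; break ties on the key so output is deterministic.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:x]


# Hypothetical rows with one dimension ("country") and one metric ("clicks").
rows = [
    {"country": "US", "clicks": 5},
    {"country": "DE", "clicks": 3},
    {"country": "US", "clicks": 2},
    {"country": "FR", "clicks": 4},
]
print(top_x(rows, "country", "clicks", 2))  # [('US', 7), ('FR', 4)]
```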

Mourad DLIA

03/24/2022, 5:27 PM

Hi team, we want to paginate over a table, but the offsets keep changing as new events arrive. Is there a way to ignore new events during pagination?
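One common way to get stable pages, sketched here in plain Python rather than as a Pinot query: record an upper bound on the event time when pagination starts and apply it as a filter on every page, so events that arrive later are never visible. The field names `ts` and `id` are hypothetical, and the approach assumes that late-arriving events do not carry timestamps at or below the recorded bound.

```python
def fetch_page(events, snapshot_ts, offset, limit):
    """Return one page of events no newer than `snapshot_ts`, in a fixed order."""
    visible = [e for e in events if e["ts"] <= snapshot_ts]
    # A total order (timestamp, then id) keeps offsets stable between calls.
    visible.sort(key=lambda e: (e["ts"], e["id"]))
    return visible[offset:offset + limit]


events = [{"id": 1, "ts": 100}, {"id": 2, "ts": 110}, {"id": 3, "ts": 120}]
snapshot_ts = 120  # Upper bound fixed when pagination starts.

page1 = fetch_page(events, snapshot_ts, offset=0, limit=2)
events.append({"id": 4, "ts": 130})  # A new event arrives between pages.
page2 = fetch_page(events, snapshot_ts, offset=2, limit=2)

print([e["id"] for e in page1], [e["id"] for e in page2])  # [1, 2] [3]
```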

abhinav wagle

03/24/2022, 7:00 PM

Hi, I am a little confused by the note mentioned here: https://docs.pinot.apache.org/basics/getting-started/kubernetes-quickstart

NOTE: Please specify StorageClass based on your cloud vendor. For Pinot Server, please don't mount blob store like AzureFile/GoogleCloudStorage/S3 as the data serving file system.
Only use Amazon EBS/GCP Persistent Disk/Azure Disk style disks.