Nicolas Hennebelle
01/27/2023, 3:54 PM
Steven Hall
01/27/2023, 5:47 PM
transcripts_bucket = "s3://transcript-parquet/1/"
transcript_df.write.mode("overwrite").parquet(transcripts_bucket)
Then I run a Spark job to transform the data into segments and import them into Pinot:
spark_args = {
    'master': 'spark://spark-master:7077',
    'deploy_mode': 'cluster',
    'name': 'segments-from-parquet',
    'class': 'org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand',
    'executor_memory': '2G',
    'executor_cores': '1',
    'total_executor_cores': '2',
    'verbose': True,
    'conf': [f"spark.driver.extraJavaOptions='{EXTRA_JAVA_OPTIONS}'"],
    'main_file_args': '-jobSpecFile=/home/job_specs/transcript_job_spec.yml'
}
main_file = f'{PINOT_DISTRIBUTION_DIR}/lib/pinot-all-{PINOT_VERSION}-jar-with-dependencies.jar'
app = SparkJob(main_file, **spark_args)
app.submit()
All columns populate as expected. The schema is:
{
  "schemaName": "transcript_indexed",
  "dimensionFieldSpecs": [
    {
      "name": "studentID",
      "dataType": "INT"
    },
    {
      "name": "firstName",
      "dataType": "STRING"
    },
    {
      "name": "lastName",
      "dataType": "STRING"
    },
    {
      "name": "gender",
      "dataType": "STRING"
    },
    {
      "name": "subject",
      "dataType": "STRING"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "score",
      "dataType": "FLOAT"
    }
  ],
  "timeFieldSpec": {
    "incomingGranularitySpec": {
      "name": "examTime",
      "dataType": "LONG",
      "timeType": "MILLISECONDS"
    }
  }
}
If I partition my data by subject when I write the parquet files, I get an unexpected outcome: the subject field in the Pinot segments is null.
transcripts_bucket = "s3://transcript-parquet/1/"
transcript_df.write.mode("overwrite").partitionBy("subject").parquet(transcripts_bucket)
Are we thinking about this incorrectly, in a way that Pinot does not support? Alternatively, is there some change we need to make in the configs to work with data-lake data that is normally partitioned?
The data on MinIO (my stand-in S3 service) looks like this once partitioned by subject: one subject=<value>/ directory per subject, each containing the parquet part files.
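A possible explanation and workaround, as an untested sketch: Spark's partitionBy() moves the partition column into those directory names and removes it from the parquet files themselves, so the segment-generation job never sees a subject column and Pinot falls back to null. Keeping a copy of the column inside the files sidesteps that; the duplicated subject_partition column name below is made up for illustration.

from pyspark.sql.functions import col

transcripts_bucket = "s3://transcript-parquet/1/"

# Keep "subject" inside the data files and let a copy drive the directory layout,
# so the Pinot ingestion job still finds a populated subject column.
(transcript_df
    .withColumn("subject_partition", col("subject"))  # copy used only for partitioning
    .write.mode("overwrite")
    .partitionBy("subject_partition")
    .parquet(transcripts_bucket))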
vishal
01/27/2023, 6:03 PM
Steven Hall
01/27/2023, 6:39 PM
Arthur Zhou
01/27/2023, 10:41 PM
{
  "REALTIME": {
    "tableName": "howler_ad_mainst_battlestation_order_updates_REALTIME",
    "tableType": "REALTIME",
    "segmentsConfig": {
      "schemaName": "howler_ad_mainst_battlestation_order_updates",
      "replication": "1",
      "replicasPerPartition": "1",
      "timeColumnName": "time",
      "minimizeDataMovement": false
    },
    "tenants": {
      "broker": "DefaultTenant",
      "server": "DefaultTenant",
      "tagOverrideConfig": {}
    },
    "tableIndexConfig": {
      "invertedIndexColumns": [],
      "noDictionaryColumns": [],
      "streamConfigs": {
        "streamType": "kafka",
        "stream.kafka.topic.name": "howler_ad_mainst_battlestation_order_updates",
        "stream.kafka.broker.list": "confluent-broker.roles.service.robinhood:9092",
        "stream.kafka.consumer.type": "lowlevel",
        "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
        "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
        "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
        "realtime.segment.flush.threshold.rows": "0",
        "realtime.segment.flush.threshold.time": "24h",
        "realtime.segment.flush.segment.size": "100M"
      },
      "rangeIndexColumns": [],
      "rangeIndexVersion": 2,
      "autoGeneratedInvertedIndex": false,
      "createInvertedIndexDuringSegmentGeneration": false,
      "sortedColumn": [],
      "bloomFilterColumns": [],
      "loadMode": "MMAP",
      "onHeapDictionaryColumns": [],
      "varLengthDictionaryColumns": [],
      "enableDefaultStarTree": false,
      "enableDynamicStarTreeCreation": false,
      "aggregateMetrics": false,
      "nullHandlingEnabled": false,
      "optimizeDictionaryForMetrics": false,
      "noDictionarySizeRatioThreshold": 0
    },
    "metadata": {},
    "quota": {},
    "routing": {
      "instanceSelectorType": "strictReplicaGroup"
    },
    "query": {},
    "upsertConfig": {
      "mode": "FULL",
      "hashFunction": "NONE"
    },
    "ingestionConfig": {},
    "isDimTable": false
  }
}
Peter Pringle
01/28/2023, 10:01 AM
Josh Black
01/28/2023, 7:26 PM
Yanwen
01/30/2023, 6:06 PM
Sathish Kumar
01/30/2023, 6:50 PM
Mayank
Subbu Subramaniam
01/31/2023, 2:15 AM
ME: Can you explain pinot to me
CG: Sure! Pinot is a type of wine grape that is used to make a variety of wines, including Pinot Noir, Pinot Grigio, and Pinot Blanc.
Pinot Noir is a red wine grape variety that is known ...
(goes on for half page)
ME: But I thought pinot is a database
CG: I apologize for any confusion. Pinot is a real-time distributed OLAP datastore, which is a type of database designed for real-time analytics. ...
(goes on a bit)
ME: Then why did you say it is a wine grape?
CG: I apologize for any confusion, I made a mistake. Pinot is indeed a database technology, not a type of wine grape. I apologize for any confusion my previous message may have caused.
ME: OK, please note it for the next person who asks this question.
CG: As a language model, I do not have the ability to remember previous interactions or to store information for future use. I apologize for any confusion my previous response may have caused. Each time someone interacts with me, I generate responses based on the input given to me at that specific moment. I am only able to provide information based on the knowledge that I have been trained on, which has a cutoff date of 2021. Please let me know if you have any other question, I'll do my best to assist you.
[This last part is, I think, an "I give up" answer. I got it a few times today.]
vishal
02/01/2023, 7:32 AM
Nickel Fang
02/03/2023, 9:48 AM
select region, clock,
cast(sum(error_count)/sum(request_count)>=0.2 as int) as "critical_status",
cast(sum(error_count)/sum(request_count) between 0.1 and 0.2 as int) as "warning_status"
from system_health where
clock>=1675239600000 and clock<1675240200000
group by region, clock limit 500000
If I want to define the threshold for "critical_status" as sum(error_count)/sum(request_count) >= 0.2 or sum(error_count) >= 5, how can I write the correct SQL?
cast(sum(error_count)/sum(request_count)>=0.2 or sum(error_count)>=5 as int) as "critical_status"
it doesn’t work.
Thanks!
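An untested suggestion rather than a confirmed answer: folding both conditions into a CASE expression over the aggregates may work where casting the compound boolean does not, depending on your Pinot version. The broker address below is a placeholder for illustration.

import requests

# Untested sketch: express the compound critical threshold with CASE WHEN
# over the aggregates instead of cast(<bool> or <bool> as int).
BROKER = "http://localhost:8099"  # placeholder broker address

sql = """
select region, clock,
  case when sum(error_count)/sum(request_count) >= 0.2
         or sum(error_count) >= 5
       then 1 else 0 end as critical_status,
  case when sum(error_count)/sum(request_count) >= 0.1
        and sum(error_count)/sum(request_count) < 0.2
       then 1 else 0 end as warning_status
from system_health
where clock >= 1675239600000 and clock < 1675240200000
group by region, clock
limit 500000
"""

resp = requests.post(f"{BROKER}/query/sql", json={"sql": sql})
resp.raise_for_status()
print(resp.json()["resultTable"]["rows"][:10])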
Uday Vallamsetty
02/04/2023, 5:19 AM
Ankit Sultana
02/06/2023, 11:42 PM
select * from table where col1 >= 10 and col2 = 'foo' and col3 = 'bar' and col4 < 20
Assuming col1 and col4 have range indices and col2 and col3 have inverted indices, would the indices be picked based on the order of the filter conditions?
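One way to check this empirically rather than guess, sketched here with the broker address as a placeholder and the table/column names taken from the question: run the query through EXPLAIN PLAN and look at which filter operator shows up for each predicate.

import requests

# Sketch, not a confirmed answer: inspect the query plan to see which index
# each predicate uses (e.g. FILTER_RANGE_INDEX vs FILTER_INVERTED_INDEX).
BROKER = "http://localhost:8099"  # placeholder broker address

query = (
    "EXPLAIN PLAN FOR "
    "select * from table where col1 >= 10 and col2 = 'foo' "
    "and col3 = 'bar' and col4 < 20"
)

resp = requests.post(f"{BROKER}/query/sql", json={"sql": query})
resp.raise_for_status()

# Each row of the result names a plan operator node; the filter operators
# show the index chosen per predicate.
for row in resp.json()["resultTable"]["rows"]:
    print(row)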
Lakshmanan Velusamy
02/07/2023, 1:21 AM
piby
02/07/2023, 8:41 AM
Rohit Anilkumar
02/07/2023, 5:29 PM
Sharayu Gandhi
02/07/2023, 11:23 PM
Hassan Ait Brik
02/08/2023, 4:36 PM
Richard Walker
02/08/2023, 10:14 PM
pinot-client-go
necessarily if there's a lower-level option...
abhinav wagle
02/09/2023, 10:25 PM
chandarasekaran m
02/10/2023, 11:26 AM
Abhijeet Kushe
02/13/2023, 4:19 PM
Abhijeet Kushe
02/13/2023, 4:20 PM
Weixiang Sun
02/14/2023, 4:47 AM
Shriniket
02/14/2023, 11:48 AM
Irtisaam
02/14/2023, 1:38 PM
Aamish Tariq
02/16/2023, 8:34 AM
vishal
02/16/2023, 1:02 PM
I am using the /segments/{tableName}/startReplaceSegments API in the Swagger UI to replace one segment with another, but I am not able to do it. I am passing segmentsFrom as the segments we want to replace with a new one, and segmentsTo as the other segment, but before that I have to create and upload that new segment. How do I create and upload it? I tried using a segment from the same table, but I get the error below:
Input:
{
  "segmentsTo": ["tmptab__4__1__20230214T1305Z"],
  "segmentsFrom": ["tmptab__0__0__20230214T0611Z"]
}
Error:
{
  "code": 500,
  "error": "Any segments from 'segmentsTo' should not be available in the table at this point. (tableName = 'tmptab_REALTIME', segmentsFrom = '[tmptab__0__0__20230214T0611Z]', segmentsTo = '[tmptab__4__1__20230214T1305Z]', segmentsFromTable = '[tmptab__0__1__20230214T1305Z, tmptab__2__1__20230214T1305Z, tmptab__3__1__20230214T1305Z, tmptab__0__0__20230214T0611Z, tmptab__1__1__20230214T1305Z, tmptab__4__1__20230214T1305Z, tmptab__5__1__20230214T1305Z, tmptab__2__0__20230214T0611Z, tmptab__1__0__20230214T0611Z, tmptab__4__0__20230214T0611Z, tmptab__3__0__20230214T0611Z, tmptab__5__0__20230214T0611Z]')"
}
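The error itself says what is going wrong: the names in segmentsTo must not already exist in the table when startReplaceSegments is called; they are new segments you still have to build (for example with the LaunchDataIngestionJobCommand shown earlier in the thread) and upload between the start and end calls. A rough, untested sketch of that sequence follows; the controller URL, file path and new segment name are made-up placeholders, and the exact endpoint and parameter names should be verified against your cluster's Swagger page.

import requests

# All names below are illustrative placeholders; verify endpoints/params in Swagger.
CONTROLLER = "http://localhost:9000"
TABLE = "tmptab"

# 1. Start the replacement: segmentsTo lists NEW segment names not yet in the table.
start = requests.post(
    f"{CONTROLLER}/segments/{TABLE}/startReplaceSegments",
    params={"type": "REALTIME"},
    json={
        "segmentsFrom": ["tmptab__0__0__20230214T0611Z"],
        "segmentsTo": ["tmptab_replacement_segment_0"],  # hypothetical new segment name
    },
)
start.raise_for_status()
# Field name taken from memory of the lineage API; confirm in your response payload.
lineage_entry_id = start.json()["segmentLineageEntryId"]

# 2. Build the new segment (e.g. with an ingestion job) and upload its tar.gz.
with open("tmptab_replacement_segment_0.tar.gz", "rb") as f:
    upload = requests.post(
        f"{CONTROLLER}/v2/segments",
        params={"tableName": TABLE},
        files={"file": f},
    )
upload.raise_for_status()

# 3. Finish the replacement so queries atomically switch from the old segment to the new one.
end = requests.post(
    f"{CONTROLLER}/segments/{TABLE}/endReplaceSegments",
    params={"type": "REALTIME", "segmentLineageEntryId": lineage_entry_id},
)
end.raise_for_status()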
@saurabh dubey @Xiang Fu @Tim Santos @Xiaobing @Jackie @Kartik Khare