Apache Pinot #general

Tiger Zhao

12/08/2021, 5:54 PM

Is there a way to set up the access configurations to easily limit tables to certain users? It looks like right now we can only limit users to certain tables?

Suraj

12/08/2021, 9:06 PM

Hello - we are exploring storing metrics at higher granularities by rolling up the data at lower granularities. Ex: 1s metrics rolled up and stored at 1 min granularity. Does Pinot support percentile aggregations?

Nicholas Yu

12/09/2021, 4:20 AM

hello friends, i’m looking for information around running spark batch ingestion jobs using AWS EMR. thanks

👍 1

Ty Brooks

12/09/2021, 11:08 PM

In the docs, there are references to the “Filesystem backend” and “Deep Storage”… are those meant to be conceptually synonymous?

Map

12/10/2021, 2:37 PM

Is there a way or an API to get the latest offset consumed for a real-time table/segment?

Lars-Kristian Svenøy

12/10/2021, 5:02 PM

Hey guys. Regarding https://nvd.nist.gov/vuln/detail/CVE-2021-44228 (The Log4j vulnerability) when can we expect a release of Pinot to mitigate that? I see you just recently merged a PR to deal with it: https://github.com/apache/pinot/pull/7889

Diogo Baeder

12/11/2021, 6:02 PM

Hi folks, I've got a question about publishing events for Pinot realtime tables. I have a situation where I have tons of analytics logs backed up, and I want to send all of that to Pinot, and also start sending logs in realtime. I'm setting my table thresholds to 24h for time and 200M for size, however it's not clear how I can set up the tables so that I get a cleaner "1 day of data per segment" kind of deal. Should I perhaps use hybrid tables, where I would publish the old logs to the offline table and live logs to the realtime table? What do you recommend I do in this case, where the logs from my backups are uploaded out of order?

Diogo Baeder

12/12/2021, 2:34 PM

One more question, folks: when it comes to segments of ~200M in size, what segment storage technology would you recommend using when running a cluster in AWS? HDFS? S3? EFS mounted?

Ashish

12/12/2021, 9:07 PM

Is there any way to extract more than one field from a json column? jsonextractscalar only allows one field at a time. So, if I do select jsonextractscalar(jsonColumn, 'field1'), jsonextractscalar(jsonColumn, 'field2'), will it result in parsing the json document twice for each doc/row?

Ashish

12/13/2021, 12:52 AM

There does not seem to be a way to exclude properties in the json path expression used by jsonextractscalar. I guess the only way is to write my own jsonextractscalar that calls the json parser's delete(propertiesToDelete).read(propertiesToFetch). Is my understanding right? Any other suggestions?

Xiang Fu

12/13/2021, 9:53 PM

@here

Hello Community,

We are pleased to announce that Apache Pinot 0.9.1 is released!

Apache Pinot is a realtime distributed OLAP datastore, designed to answer OLAP queries with low latency use-cases.

This is a bug fix release that includes the upgrade to the latest log4j library, v2.15.0. This is our response to CVE-2021-44228.

The release can be downloaded at <https://pinot.apache.org/download>

The release note is available at <https://docs.pinot.apache.org/basics/releases/0.9.1>

Additional resources -
Project website: <https://pinot.apache.org>
Getting started: <https://docs.pinot.apache.org/getting-started>
Pinot developer blogs: <https://medium.com/apache-pinot-developer-blog>
Intro to Pinot Video: <https://www.youtube.com/watch?v=T70jTTYhYyM>

Join Pinot Community -
Twitter: <https://twitter.com/ApachePinot>
Meetup: <https://www.meetup.com/apache-pinot/>
Slack channel: <https://communityinviter.com/apps/apache-pinot/apache-pinot>

Best Regards,

Apache Pinot Team

❤️ 5

👍 22

Weixiang Sun

12/14/2021, 5:29 PM

We are working on offline segment ingestion. Currently we are using TarPush, but its problem is that the controller needs to get involved in the data path by downloading the segment. Just curious, how does metadata push prevent the controller from getting involved in the data path?

Chris Theodore Jayakumar

12/14/2021, 11:30 PM

Hello folks, what are the recommended system specs for each of the services required for a Pinot cluster? Is there a formula to calculate this based on the size of the data?

🍷 1

Xiang Fu

12/15/2021, 8:32 AM

Hello @here,

We are pleased to announce that Apache Pinot 0.9.2 is released!

Apache Pinot is a realtime distributed OLAP datastore, designed to answer OLAP queries with low latency use-cases.

This is a bug fix release that contains:
- Upgrade log4j to 2.16.0 to fix CVE-2021-45046 (#7903)
- Upgrade swagger-ui to 3.23.11 to fix CVE-2019-17495 (#7902)
- Fix the bug that RealtimeToOfflineTask failed to progress with large time bucket gaps (#7814)

The release can be downloaded at https://pinot.apache.org/download

The release note is available at https://docs.pinot.apache.org/basics/releases/0.9.2

Additional resources -
Project website: https://pinot.apache.org
Getting started: https://docs.pinot.apache.org/getting-started
Pinot developer blogs: https://medium.com/apache-pinot-developer-blog
Intro to Pinot Video: https://www.youtube.com/watch?v=T70jTTYhYyM

Join Pinot Community -
Twitter: https://twitter.com/ApachePinot
Meetup: https://www.meetup.com/apache-pinot/
Slack channel: https://communityinviter.com/apps/apache-pinot/apache-pinot

Best Regards,

Apache Pinot Team

🙌 17

Map

12/16/2021, 5:05 PM

Hi, what would be the easiest way to clean up all the Pinot configs for a cluster in Zookeeper?

Jeff Moszuti

12/16/2021, 8:21 PM

I'd like to try out tag-based instance assignment. Which file do I need to edit to set the TAG_LIST for a server?

Ashish

12/16/2021, 10:03 PM

PQL support is being deprecated, but is the PQL result format going to be supported for SQL queries? The PQL format seems to be more efficient for group by/aggregate queries.

Weixiang Sun

12/18/2021, 1:30 AM

What is the difference between dimensionFieldSpecs and metricFieldSpecs? When should we use them?

Prashant Pandey

12/20/2021, 6:37 AM

We are planning to migrate Pinot to a new kafka cluster. Our plan is to point Pinot to the new endpoints, and update segment.realtime.startOffset of each CONSUMING segment to 0, and restart the servers. Do we need to take care of anything else?

Slackbot

12/21/2021, 3:19 PM

This message was deleted.

Evan Galpin

12/21/2021, 10:15 PM

nvm, I think I found my answer in code[1]: Yes, all values in the MV column (array) are taken into account. It would be interesting to be able to filter at that level as well. Ex. an equality check to count only elements in the column equal to an input value:

COUNTMATCHMV(my_column, "foo")

where a string MV column containing:

["foo", "bar", "foo", "baz"]

might return 2. Thoughts on the feasibility? [1] https://github.com/apache/pinot/blob/f8c7e1fc8603f4091e418f3841dcb6bc2d75d5d8/pino[…]core/query/aggregation/function/CountMVAggregationFunction.java
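
To make the proposal concrete, here is a minimal Python sketch of the semantics described above. `count_match_mv` and `count_mv` are hypothetical helpers that only model the behavior on in-memory lists; neither is Pinot code, and COUNTMATCHMV is the suggested function, not an existing one.

```python
from typing import Iterable, Sequence


def count_mv(rows: Iterable[Sequence[str]]) -> int:
    """Count every element of the multi-value column across all rows."""
    return sum(len(values) for values in rows)


def count_match_mv(rows: Iterable[Sequence[str]], target: str) -> int:
    """Count only the elements equal to `target` across all rows.

    Hypothetical model of the proposed COUNTMATCHMV(my_column, "foo").
    """
    return sum(1 for values in rows for value in values if value == target)


# One row whose MV column holds the array from the message above.
rows = [["foo", "bar", "foo", "baz"]]
print(count_mv(rows))               # 4
print(count_match_mv(rows, "foo"))  # 2
```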

Anshu Jalan

12/22/2021, 5:10 AM

In rollup, it's mentioned in the doc as (perform metrics aggregations across common dimensions + time), so will it treat all dimension and time columns as the primary key to aggregate the metrics? Also, in dedup, what is meant by duplicate rows (which columns are used)?
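
One reading of the quoted doc line, sketched in Python under the assumption that rollup groups rows by every dimension plus the time column and aggregates each metric (SUM is assumed here). The `rollup` helper and the column names are made up for illustration; this is not Pinot code, and it does not settle which columns dedup compares.

```python
from collections import defaultdict


def rollup(rows, dimensions, time_column, metrics):
    """Group rows by (dimensions + time) and sum each metric column."""
    key_columns = list(dimensions) + [time_column]
    totals = defaultdict(lambda: {m: 0 for m in metrics})
    for row in rows:
        # Rows sharing the same dimension values and time value collapse
        # into one output row.
        key = tuple(row[c] for c in key_columns)
        for m in metrics:
            totals[key][m] += row[m]
    return [dict(zip(key_columns, key), **agg) for key, agg in totals.items()]


rows = [
    {"country": "US", "minute": 1, "clicks": 3},
    {"country": "US", "minute": 1, "clicks": 4},
    {"country": "US", "minute": 2, "clicks": 5},
]
print(rollup(rows, ["country"], "minute", ["clicks"]))
```

Under this reading, the first two rows share the same dimension and time values, so they merge into a single row with their `clicks` summed, while the third row stays separate.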

Sunil Chaurasia

12/22/2021, 7:36 AM

Hey guys, I am Sunil. My organisation is planning to use Pinot for some of our use cases; we are currently in a POC phase. I would like to get some information about benchmarking, if anyone in this group has done it. I would also like to know your opinion on taking the managed service vs. self-managed. It would be really great if anyone could help me with this.

Prashant Pandey

12/22/2021, 12:13 PM

Hi folks, we wanted to change our table names (from camel case to snake case) in Pinot. For this, we supplied the existing table configs to the create table api with the changed table name (all other configs remained unchanged), and disabled the old tables. But we observed that the new tables contained quite old data (that wasn't present in kafka). For example, our kafka retention is 2h but the new table still contained data as old as 6h! Is there some sort of data migration happening from old segments to new segments?

Anshu Jalan

12/24/2021, 9:19 AM

As per the design doc: UpsertConfig can also include customMergeStrategies if Groovy mergers is enabled.

{
   "upsertConfig":{
      "mode":"PARTIAL",
      "globalUpsertStategy": "OVERWRITE",
      "customMergeStrategies":{
         "field3":"Groovy({firstName+' '+lastName}, firstName, lastName)"
      }
   }
}

So are these customMergeStrategies executed before or after transformConfigs?

Priyank Bagrecha

12/30/2021, 1:19 AM

wget <https://downloads.apache.org/pinot/apache-pinot-0.9.3/apache-pinot-0.9.3-bin.tar.gz>

seems to be timing out. tried locally as well as from aws ec2 instances.

Vinod Adwani

12/30/2021, 8:41 AM

Hi folks! I am facing some issues in Kafka stream ingestion in pinot. Pinot is able to connect to Kafka but not able to consume any records or create segments. Can someone please help me?

sample_kafka_message.txt table_schema.json table_config.json

Syed Akram

12/30/2021, 9:00 AM

Hi folks, when can we expect a Pinot release with log4j 2.17.1? @User https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44832

Abhishek Kedia

01/03/2022, 10:36 AM

Hi everyone, my team is facing an error reading data from Confluent Kafka into Pinot. Does anyone here have experience with Confluent Kafka -> Pinot? We would appreciate any help here.

xtrntr

01/04/2022, 7:45 AM

2nd question: if I plan to use a lookup table for joins, I can only use it for decorating the query results. If I use lookup joins in the WHERE clause, queries will be very slow because the lookup join cannot benefit from indexing. Is my understanding correct here?