Apache Pinot #general

Chundong Wang

02/15/2021, 6:41 PM

I’m wondering if there’s a way other than Groovy to get a filter like “past 7 days” to work? Found this question back in 2017 about

select count(*) as cnt from  log where date >= DATE_SUB(NOW(),INTERVAL 1 HOUR);
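
One approach that avoids Groovy is to compute the cutoff on the client and send it as a literal. This is a minimal Python sketch, not a confirmed answer from this thread. It assumes a hypothetical column `ts_millis` that stores epoch milliseconds; that name is not from the question.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def cutoff_millis(now: datetime, days: int = 7) -> int:
    """Return the epoch-millisecond timestamp `days` before `now` (tz-aware)."""
    return (now - timedelta(days=days) - EPOCH) // timedelta(milliseconds=1)


def past_days_query(now: datetime, days: int = 7) -> str:
    # `ts_millis` is a hypothetical column name used only for illustration.
    return (
        "SELECT COUNT(*) AS cnt FROM log "
        f"WHERE ts_millis >= {cutoff_millis(now, days)}"
    )


if __name__ == "__main__":
    print(past_days_query(datetime.now(timezone.utc)))
```

The literal is recomputed each time the query is built, so the window slides with the caller's clock rather than the server's.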

Karin Wolok

02/16/2021, 3:53 PM

Online Meetup starting in 3 hours : Advanced Pinot Features: Upsert and JSON Indexing https://www.meetup.com/apache-pinot/events/275731277/

👍 5

Elon

02/17/2021, 1:37 AM

Hi, the meetup today was great! Wanted to know when the meetup slides will be available. We have some users very interested in upsert.

➕ 5

Nick Bowles

02/19/2021, 3:11 AM

So based off of the docs, since Pinot doesn’t have a specific date time format, and dates are converted to either strings, longs, or ints, does this hinder performance in any way? If it does, are there plans to add support for a datetime format?

vmarchaud

02/22/2021, 1:31 PM

Hey, question: is there any target date / milestone for the 0.7.0 release? Thanks

Shawn Peng

02/23/2021, 1:09 AM

Hi, I’m trying to build a query for data within the past 7 days, but Pinot is throwing an error for

DATETRUNC('hour', second(now()), 'SECONDS')

, is this expected?
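
For reference, this is the arithmetic that truncating epoch seconds to the hour is expected to perform, as a small Python sketch. It may be worth checking whether `second(now())` yields a second-of-minute value (0 to 59) rather than epoch seconds; that reading is an assumption and is not confirmed in this thread. If it holds, the truncation input would not be a timestamp at all.

```python
SECONDS_PER_HOUR = 3600


def trunc_to_hour(epoch_seconds: int) -> int:
    """Round an epoch-seconds timestamp down to the start of its hour."""
    return epoch_seconds - epoch_seconds % SECONDS_PER_HOUR


if __name__ == "__main__":
    # A real epoch-seconds value keeps its date and hour.
    print(trunc_to_hour(1614042540))
    # A second-of-minute value (0-59) always collapses to 0.
    print(trunc_to_hour(42))
```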

Karin Wolok

02/23/2021, 1:28 AM

🎉 We officially passed 1K slack members!!! 🎉 🥳 👋 Welcome to the newbie Pinot community members who brought us over the edge! 🍷 Would love to know what brought you here and what you're working on. 😃 @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User

🍷 3

🎉 5

👋 7

ayush sharma

02/23/2021, 10:42 PM

Hi people, I am facing an issue with starting ThirdEye on top of Pinot. I have Pinot successfully set up and running. Now I am trying to run ThirdEye on top of it using the apachepinot/thirdeye Docker image. After running the docker command below, I get an error stating

Database may be already in use

Please find the attached log file. Any help is appreciated!

docker run \
    --network=pinot-demo \
    --name thirdeye \
    -p 1426:1426 \
    -p 1427:1427 \
    -d apachepinot/thirdeye:latest

Nick Bowles

02/26/2021, 5:35 PM

I put in a request for Gitbook access, if someone could check on that so I can start contributing to the docs I would appreciate it 🙂

Ken Krugler

02/26/2021, 6:00 PM

Really interesting article about Uber doing schema-agnostic log aggregations…but they went with ClickHouse, not Pinot?!? https://eng.uber.com/logging/

Vince Vinci

02/27/2021, 2:46 AM

Hi, not sure if this has been asked before: is there a way for Pinot to aggregate data into 15-minute / hourly buckets in a new table and remove the data from the raw table? And if late data arrives in the raw table, can it easily be added back into the aggregated table? We want to reduce the storage required, and we wouldn't need the data for long (we can keep it for 30 / 90 days).
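
To make the requirement concrete, here is a conceptual Python sketch of a 15-minute rollup that can absorb late rows. It only illustrates the bucketing and merge logic being asked about and makes no claim about what Pinot supports; the `(epoch_seconds, value)` row shape is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

BUCKET_SECONDS = 15 * 60


def bucket_start(epoch_seconds: int) -> int:
    """Start of the 15-minute bucket that contains the timestamp."""
    return epoch_seconds - epoch_seconds % BUCKET_SECONDS


def roll_up(
    rows: Iterable[Tuple[int, int]],
    totals: Optional[Dict[int, int]] = None,
) -> Dict[int, int]:
    """Sum (epoch_seconds, value) rows per bucket.

    Pass the previous result as `totals` to merge late-arriving rows
    into buckets that were already aggregated.
    """
    merged = defaultdict(int, totals or {})
    for ts, value in rows:
        merged[bucket_start(ts)] += value
    return dict(merged)


if __name__ == "__main__":
    first = roll_up([(0, 1), (899, 2), (900, 5)])
    print(roll_up([(10, 4)], first))  # late row lands in the first bucket
```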

Anupam Mukherjee

03/02/2021, 7:23 AM

Hi, I am from Cisco. We have recently decided to evaluate Apache Pinot for our cloud-based analytics project. However, during the evaluation I got stuck on one of our non-functional requirements: backup and restore. Can you please suggest how we can take periodic backups of Pinot to S3 for disaster recovery purposes?

Josh Highley

03/02/2021, 3:55 PM

Do lowlevel realtime tables support ingestionConfig-transformConfig ?

Alex

03/02/2021, 11:18 PM

and what about Upserts?

Josh Highley

03/03/2021, 2:55 PM

Ingesting JSON data into a realtime table. A field in the JSON is a JSON string with leading spaces but is always numeric data otherwise:

{ "account":"      123", .....}

If my realtime table defines the account column as DOUBLE, then the record loads with no issue -- the spaces appear to be ignored. However, if I define the column as INT then the record does not load. More troublesome, I can't find any error messages in any of the logs -- I would expect some kind of error message?
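
One way to sidestep the problem is to normalize the field before it reaches Kafka. A minimal Python sketch follows, assuming the producer can be changed; the function name is hypothetical. As a possible explanation (a guess, not confirmed here): Java's `Double.parseDouble` ignores surrounding whitespace while `Integer.parseInt` rejects it, which would match the DOUBLE vs INT behavior described.

```python
import json


def normalize_account(raw_message: str) -> str:
    """Strip whitespace from the string-typed `account` field and emit it as an integer.

    Raises ValueError if the stripped value is not a valid integer, so bad
    records fail loudly at the producer instead of being dropped silently.
    """
    record = json.loads(raw_message)
    account = record.get("account")
    if isinstance(account, str):
        record["account"] = int(account.strip())
    return json.dumps(record)


if __name__ == "__main__":
    print(normalize_account('{"account": "      123"}'))
```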

Josh Highley

03/04/2021, 1:45 AM

When streaming data via Kafka to a realtime table, does it have to be 1 record per message or is there a way to put multiple records in a single message?

troywinter

03/05/2021, 3:48 AM

Does Pinot support changing the name of an existing column in the schema? I tried changing a column name, but got the following exception on query:

[
  {
    "errorCode": 500,
    "message": "MergeResponseError:\nData schema mismatch between merged block: [time_to_hour(LONG),age_decade(STRING),age_level(STRING),city(STRING),company_id(STRING),company_name(STRING),count_impression(LONG),count_in(LONG),count_passby(LONG),create_time(LONG),day(STRING),day_in_week(STRING),district(STRING),gate_id(STRING),gender(STRING),holiday_id(STRING),holiday_name(STRING),hour(STRING),is_holiday(STRING),month(STRING),province(STRING),region(STRING),shop_id(STRING),shop_name(STRING),temperature(STRING),temperature_id(STRING),total_duration(LONG),total_impression_duration(LONG),weather_cate_id(STRING),weather_cate_name(STRING),year(STRING)] and block to merge: [time_to_hour(LONG),age_decade(STRING),age_level(STRING),city(STRING),company_id(STRING),company_name(STRING),count_impression(LONG),count_in(LONG),count_passby(LONG),create_time(LONG),day(STRING),day_in_week(STRING),district(STRING),gate_id(STRING),gender(STRING),holiday_id(STRING),holiday_name(STRING),hour(STRING),is_holiday(STRING),month(STRING),province(STRING),region(STRING),shop_id(STRING),shop_name(STRING),temperature(STRING),temperature_id(STRING),total_duration(LONG),total_impression_duraion(LONG),weather_cate_id(STRING),weather_cate_name(STRING),year(STRING)], drop block to merge"
  }
]
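
The two column lists in the error differ in a single name: `total_impression_duration` in the merged block versus `total_impression_duraion` in the block to merge. A small Python sketch for spotting such differences, assuming the `name(TYPE),name(TYPE)` layout shown in the message; the helper names are hypothetical.

```python
import re
from typing import List, Tuple


def columns(block: str) -> List[str]:
    """Extract column names from a 'name(TYPE),name(TYPE)' list."""
    return re.findall(r"(\w+)\(\w+\)", block)


def schema_diff(merged: str, to_merge: str) -> Tuple[List[str], List[str]]:
    """Return (only in merged block, only in block to merge), each sorted."""
    a, b = set(columns(merged)), set(columns(to_merge))
    return sorted(a - b), sorted(b - a)


if __name__ == "__main__":
    merged = "total_duration(LONG),total_impression_duration(LONG),year(STRING)"
    to_merge = "total_duration(LONG),total_impression_duraion(LONG),year(STRING)"
    print(schema_diff(merged, to_merge))
```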

Pankaj Thakkar

03/05/2021, 6:54 AM

If we extend a table schema in Pinot to add new columns (so it does not break backward compatibility), do we have to backfill data, or can Pinot use null/default values to handle the older segments?

👍 1

ayush sharma

03/05/2021, 7:36 PM

How to ingest Data into pinot on kubernetes using native batch ingestion? Hi, I am trying to ingest csv data into pinot deployed on kubernetes using LaunchDataIngestionJob arg. I have verified that the table has been created on pinot and the job-spec and csv data are present on the node. This is my job-spec file

apiVersion: batch/v1
kind: Job
metadata:
  name: pinot-case-offline-ingestion
  namespace: my-pinot-kube
spec:
  template:
    spec:
      containers:
        - name: pinot-load-case-offline
          image: apachepinot/pinot:0.3.0-SNAPSHOT
          args: ["LaunchDataIngestionJob", "-jobSpecFile", "/opt/data/table-configs/case_history/job-spec.yml"]
          volumeMounts:
            - name: mount-data
              mountPath: /opt/data
      restartPolicy: OnFailure
      volumes:
        - name: mount-data
          hostPath:
            path: /opt/data
  backoffLimit: 100

After applying this job to the node, nothing happens, and this is the log of the pod:

SegmentGenerationJobSpec: 
!!org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec
excludeFileNamePattern: null
executionFrameworkSpec: {extraConfigs: null, name: standalone, segmentGenerationJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner,
  segmentTarPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner,
  segmentUriPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner}
includeFileNamePattern: glob:**/*.csv
inputDirURI: /opt/data/csv_data/case_prod_data
jobType: SegmentCreationAndTarPush
outputDirURI: /pinot-segments/case_history
overwriteOutput: true
pinotClusterSpecs:
- {controllerURI: 'http://192.168.49.2:30892/'}
pinotFSSpecs:
- {className: org.apache.pinot.spi.filesystem.LocalPinotFS, configs: null, scheme: file}
pushJobSpec: null
recordReaderSpec:
  className: org.apache.pinot.plugin.inputformat.csv.CSVRecordReader
  configClassName: org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig
  configs: {delimiter: '|', multiValueDelimiter: ''}
  dataFormat: csv
segmentNameGeneratorSpec:
  configs: {segment.name.prefix: case_history, exclude.sequence.id: 'true'}
  type: normalizedDate
tableSpec: {schemaURI: null, tableConfigURI: null, tableName: case_history}

Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
Initializing PinotFS for scheme file, classname org.apache.pinot.spi.filesystem.LocalPinotFS

Am I ingesting the data incorrectly ?

Jai

03/09/2021, 2:06 PM

What is this Apache Pinot all about?

Manish Bhoge

03/10/2021, 3:00 PM

I'm trying to set up the Docker image of Pinot, and to set this up I'm doing the Maven build:

# Build Pinot
$ mvn clean install -DskipTests -Pbin-dist

But it is failing with the error below. Any idea what causes it? [ERROR] Failed to execute goal org.apache.maven.plugins:maven-shade-plugin:3.2.1:shade (default) on project pinot-yammer: Execution default of goal org.apache.maven.plugins:maven-shade-plugin:3.2.1:shade failed: Plugin org.apache.maven.plugins:maven-shade-plugin:3.2.1 or one of its dependencies could not be resolved: The following artifacts could not be resolved: org.apache.maven.shared:maven-artifact-transfer:jar:0.10.0, org.ow2.asm:asm:jar:7.0: Could not transfer artifact org.apache.maven.shared:maven-artifact-transfer:jar:0.10.0 from/to central (https://repo.maven.apache.org/maven2): Connect to repo.maven.apache.org:443 [repo.maven.apache.org/151.101.12.215] failed: Connection timed out (Connection timed out) -> [Help 1]

Josh Highley

03/10/2021, 3:08 PM

what's the difference between

bin/pinot-admin.sh StartServer

and

bin/start-server.sh

? Which way should be used?

Ken Krugler

03/11/2021, 11:47 PM

If we want to get the total number of groups for a group by, I assume currently we have to do a separate distinctcount or distinctcounthll, right? But if the group by uses multiple columns, what’s the best approach to getting this total group count?
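
For a multi-column group by, the total group count is the number of distinct value combinations across those columns. The Python sketch below only illustrates that definition on in-memory rows; it is not a Pinot query, and the column names in the usage example are hypothetical.

```python
from typing import Dict, Iterable, Sequence


def total_groups(rows: Iterable[Dict[str, object]], group_by: Sequence[str]) -> int:
    """Count distinct combinations of the group-by column values."""
    return len({tuple(row[col] for col in group_by) for row in rows})


if __name__ == "__main__":
    rows = [
        {"city": "A", "gender": "f"},
        {"city": "A", "gender": "m"},
        {"city": "A", "gender": "f"},
        {"city": "B", "gender": "f"},
    ]
    print(total_groups(rows, ["city", "gender"]))
```

Treating the combination as one composite key is what makes a single distinct count work across several columns.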

Anupam Mukherjee

03/12/2021, 11:30 AM

Hi, we will be installing a Pinot cluster in AWS on top of EKS. We know that EKS provides multi-AZ (three Availability Zones) HA within a specific region. I would like to understand whether an EKS-based Pinot cluster will by default be fault tolerant and highly available within the region in case of an AZ failure. I know that Pinot servers have segment replicas and replica groups, which provide HA within the cluster in case of server failure. But what happens if the controller has an issue, multiple servers are corrupted, or the cluster (on EKS) as a whole goes down? Given that the servers will use EBS as the data-serving file system (and EBS multi-AZ replication/sync will be ON), will EKS by default bring up a replacement node such as a Controller or Server (or even Broker)? Net-net, can we expect 100% service availability for Pinot on EKS in any region? Or do we need to set up another Pinot cluster on EKS in another AZ, i.e. a minimum of two Pinot clusters (on EKS) in two AZs within a region? Please suggest.

Ravikumar Maddi

03/12/2021, 4:25 PM

I have a question: is it possible to store nested JSON data in a Pinot table? Avro supports nested entities (JSON) by using the record type in the Avro schema. Like Avro, does the Pinot table configuration support nested JSON entities (e.g. an Account JSON that contains an embedded Address JSON)?

Ravikumar Maddi

03/12/2021, 4:28 PM

I have gone through the Pinot documentation, which says Pinot supports Avro, but I am not able to find any samples or sample code for it. Can you help by pointing me to some code that uses Pinot and Avro together?

ayush sharma

03/12/2021, 7:18 PM

Hi all, I am writing this to explain the loop of problems that we are facing while working on an architecture with Superset (v1.0.1), Pinot (latest docker image) and Presto (starburstdata/presto:350-e.3 docker image). Working around a problem in one framework causes a problem in another. I do not know which community can help me solve this, hence posting it in both.

So far: we have successfully pushed 1 million records into a Pinot table and would like to build charts in Superset over it.

Problem #1: We connected Superset to Pinot successfully and were able to build SQL Lab queries, only to find out that Superset does not support exploring SQL Lab virtual data as a chart if the connected database is Apache Pinot (the "Explore" button is disabled). Please let me know if this can be solved or if we interpreted it incorrectly, as that would solve the whole problem at once. As a workaround, we learned that a Superset-Presto connection would enable the Explore button, and we had planned to use Presto anyway. So we set up Presto on top of Pinot.

Problem #2: We found that Presto cannot aggregate more than 50k Pinot records, throwing the error

Segment query returned '50001' rows per split, maximum allowed is '50000' rows. with query "SELECT * FROM pinot_table  LIMIT 50001"

Presto cannot even query something like this:

presto:default> select count(*) from pinot.default.pinot_table;

Even if we increase the 50k limit in Presto's pinot.properties

pinot.max-rows-per-split-for-segment-queries

to 1 million, the Presto server crashes, stating that heap memory was exceeded. As a workaround, we learned that we can make Pinot do the aggregations and feed the aggregated result to Presto, which in turn feeds Superset for the charts, by writing the aggregation logic inside a Presto sub-query like:

presto:default> select * from pinot.default."select count(*) from pinot_table"

This returns the expected result.

Problem #3: We found that, although we can make Pinot do the aggregations, we cannot use Pinot's supported transformation functions listed here inside the Presto sub-query. The query

select datetrunc('day', epoch_ms_col, 'milliseconds') from pinot_table limit 10

works fine in Pinot, but does not work when embedded in Presto as a sub-query like below:

presto:default> select * from pinot.default."select datetrunc('day', epoch_ms_col, 'milliseconds') from pinot_table limit 10";
Query failed: Column datetrunc('day',epoch_ms_col,'milliseconds') not found in table default.select datetrunc('day', epoch_ms_col, 'milliseconds') from pinot_table limit 10

I do not know if we are doing something wrong while querying/implementing, or have missed some useful config setting that can solve our problem. The SQL Lab query that we want to run against Pinot, and eventually use to build a chart, looks like:

SELECT 
    day_of_week(epoch_ms_col),
    count(*)
from pinot_table
group by day_of_week(epoch_ms_col)

Any help is really appreciated !!!

Ravikumar Maddi

03/13/2021, 8:32 AM

@All -- A few questions: 1. Is a Pinot Kafka connector with Avro possible? 2. If so, is any detailed documentation available online? I have been searching for a day with no luck. Need help 🙂

Ravikumar Maddi

03/15/2021, 8:28 AM

@All -- I created a flattened JSON from a heavily nested JSON file. How can I create a Pinot schema for the flattened JSON? Are any samples available?

Ravikumar Maddi

03/16/2021, 12:40 AM

Pinot - Not able to start ZooKeeper. I am starting the Pinot components, and as the first step I am trying to start ZooKeeper with this command:

bin/pinot-admin.sh StartZookeeper -zkPort 2181

But after some time I get this:

zookeeper state changed (SyncConnected)
Waiting for keeper state SyncConnected
Terminate ZkClient event thread.
Session: 0x1000014c3150000 closed
Start zookeeper at localhost:2181 in thread main
EventThread shut down for session: 0x1000014c3150000
Unable to read additional data from client sessionid 0x1000014c3150005, likely client has closed socket
Unable to read additional data from client sessionid 0x1000014c3150004, likely client has closed socket
Unable to read additional data from client sessionid 0x1000014c3150002, likely client has closed socket
Expiring session 0x1000014c3150004, timeout of 30000ms exceeded
Expiring session 0x1000014c3150005, timeout of 30000ms exceeded
Expiring session 0x1000014c3150002, timeout of 30000ms exceeded
Unable to read additional data from client sessionid 0x1000014c3150009, likely client has closed socket
Unable to read additional data from client sessionid 0x1000014c315000a, likely client has closed socket
Unable to read additional data from client sessionid 0x1000014c3150007, likely client has closed socket
Expiring session 0x1000014c315000a, timeout of 30000ms exceeded

I restarted the server based on suggestions found online. Still no luck. Need help 🙂