# general
  • t

    Tymm

    12/21/2020, 8:19 AM
    Hello, is it possible to use flink to sink data into pinot?
    k
    y
    • 3
    • 2
  • t

    Tymm

    12/24/2020, 6:32 AM
Hello, I'm running Pinot on Docker, and am creating and pushing new data/segments (from CSV files) into Pinot every minute. I've noticed that the time to push a segment into Pinot grows as the segment size grows, to the point where it takes more than a minute to push a new segment. How can I make the segment push faster? Thanks.
    k
    v
    • 3
    • 4
  • m

    Mark.Tang

    12/28/2020, 1:49 AM
    Hi, Is there any doc/post detailing comparison of Pinot with Kylin somewhere? Thanks.
    k
    • 2
    • 4
  • c

    Chundong Wang

    12/29/2020, 6:01 PM
    Hi team, when we tried to run the query below,
    SELECT facility_name as key_col, COUNT(*) as val_col
    FROM enriched_station_orders_v1_OFFLINE
    WHERE created_at_seconds BETWEEN 1606756268 AND 1609175468
    AND (facility_organization_id <> 'ac56d23b-a6a2-4c49-8412-a0a0949fb5ef') 
    GROUP BY key_col
    ORDER BY val_col DESC
    LIMIT 5
    We get exceptions on the pinot-server like the following (the index number seems to vary):
    Caught exception while processing and combining group-by order-by for index: 1
    However if we change from
    facility_organization_id <> 'ac56d23b-a6a2-4c49-8412-a0a0949fb5ef'
    to
    facility_organization_id = 'ac56d23b-a6a2-4c49-8412-a0a0949fb5ef'
    there's no such exception. Or if we switch to
    facility_id
    instead of
    facility_name
    it doesn't throw an exception either. Have you seen this issue before?
    m
    k
    j
    • 4
    • 38
  • w

    Will Briggs

    12/30/2020, 10:07 PM
    I apologize in advance for my ignorant question, but I’m struggling conceptually a bit with how to handle dateTime column definitions in my table schema and segmentsConfig. I have a millisecond-level epoch field on my incoming realtime data (creatively named
    eventTimestamp
    ). I would like to maintain this when querying / filtering my records at the individual event level. However, I would also like to define an hourly derived timestamp to be used for pre-aggregating with a star tree index. My segments config looks like this:
    "segmentsConfig": {
            "timeColumnName": "eventTimestamp",
            "timeType": "MILLISECONDS",
            "retentionTimeUnit": "HOURS",
            "retentionTimeValue": "48",
            "segmentPushType": "APPEND",
            "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
            "schemaName": "mySchema",
            "replication": "1",
            "replicasPerPartition": "1"
          },
    My star tree index looks like this:
    "starTreeIndexConfigs": [{
              "dimensionsSplitOrder": [
                "dimension1",
                "dimension2"
              ],
              "skipStarNodeCreationForDimensions": [
              ],
              "functionColumnPairs": [
                "SUM__metric1",
                "SUM__metric2",
                "SUM__metric3",
                "DISTINCT_COUNT_HLL__dimension3",
                "DISTINCT_COUNT_HLL__dimension4"
              ],
              "maxLeafRecords": 10000
            }],
    And my dateTimeFieldSpecs:
    "dateTimeFieldSpecs": [
            {
              "name": "eventTimestamp",
              "dataType": "LONG",
              "format": "1:MILLISECONDS:EPOCH",
              "granularity": "1:HOUR",
              "dateTimeType": "PRIMARY"
            }
          ],
    Can anyone confirm that this is the correct approach? Should I be using an ingestion transformation of
    toEpochHoursRounded
    instead, and specifying that as a DERIVED dateTimeField in the dateTimeFieldSpecs configuration, and manually adding that to the dimensionsSplitOrder of my star tree index?
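    [Editor's note] The derived-column approach described above can be sketched as two config fragments; the hourly column name eventTimestampHours and the toEpochHours transform here are illustrative assumptions, not settings confirmed in this thread. The schema would carry both dateTime fields:

    ```json
    "dateTimeFieldSpecs": [
      {
        "name": "eventTimestamp",
        "dataType": "LONG",
        "format": "1:MILLISECONDS:EPOCH",
        "granularity": "1:MILLISECONDS"
      },
      {
        "name": "eventTimestampHours",
        "dataType": "LONG",
        "format": "1:HOURS:EPOCH",
        "granularity": "1:HOURS"
      }
    ]
    ```

    and the table config would derive the hourly column at ingestion time, so it can then be listed in the star-tree dimensionsSplitOrder:

    ```json
    "ingestionConfig": {
      "transformConfigs": [
        {
          "columnName": "eventTimestampHours",
          "transformFunction": "toEpochHours(eventTimestamp)"
        }
      ]
    }
    ```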
    x
    j
    • 3
    • 33
  • c

    Chethan UK

    01/04/2021, 1:30 PM
    Thread: Please drop the logo of your company here if you are using Pinot. I will update this section in the docs: https://pinot.apache.org/#who-uses
    🍷 1
    k
    • 2
    • 1
  • j

    Jinwei Zhu

    01/04/2021, 7:25 PM
    Hi team, I'm new to Pinot and trying to get logs for troubleshooting using logz. I deployed Pinot using k8s, and want to confirm: do the logs of the different components exist in their corresponding pods, like gc-pinot-broker.log and pinotBroker.log? What's the difference between them? How do I change the log levels? Is the log seen in kubectl logs the same as the one inside the pod's log file?
    d
    • 2
    • 6
  • k

    Kishore G

    01/05/2021, 1:01 AM
    It automatically sorts it while ingesting from Kafka
    m
    • 2
    • 3
  • m

    Mayank

    01/05/2021, 1:50 AM
    The expectation is to have the partition function used by producer with the one defined in Pinot
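    [Editor's note] A minimal sketch of the Pinot side of that alignment in the table config; the column name memberId, the Murmur function, and the partition count are illustrative assumptions:

    ```json
    "tableIndexConfig": {
      "segmentPartitionConfig": {
        "columnPartitionMap": {
          "memberId": {
            "functionName": "Murmur",
            "numPartitions": 4
          }
        }
      }
    }
    ```

    The Kafka producer must partition the topic with the same function and partition count for the mapping to hold.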
    w
    • 2
    • 5
  • w

    Will Briggs

    01/05/2021, 4:46 AM
    Sorry to be a never-ending fount of questions, folks… is it expected / necessary to create a rangeIndex on dateTime fields, or are those automatically indexed efficiently? Likewise, should I add dateTime fields to the noDictionaryColumns list?
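    [Editor's note] For reference, a range index is opted into per column in the table config; a minimal sketch, assuming the eventTimestamp column discussed earlier in the channel:

    ```json
    "tableIndexConfig": {
      "rangeIndexColumns": [
        "eventTimestamp"
      ]
    }
    ```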
    m
    k
    • 3
    • 18
  • m

    Mark.Tang

    01/06/2021, 2:13 AM
    Hi Team, I have seen that in 0.4.0, Pinot implemented the initial version of the theta-sketch based distinct count aggregation function, utilizing the Apache DataSketches library. Compared with the latest Druid release, which also includes a DataSketches extension (Theta sketch, Tuple sketch, Quantiles sketch, HLL sketch), does Pinot have any plan to implement sketches other than the Theta sketch? Thanks.
    m
    • 2
    • 13
  • o

    Oguzhan Mangir

    01/06/2021, 12:05 PM
    Hello, does Pinot support upsert for offline tables, or does it only support it for realtime tables? For example, when late data arrives after the real-time segment is flushed, can Pinot update it?
    m
    y
    • 3
    • 3
  • m

    Mahesh Yeole

    01/06/2021, 9:45 PM
    Hello, do we have any Pinot benchmarks we can refer to?
    k
    • 2
    • 5
  • j

    Jinwei Zhu

    01/06/2021, 10:21 PM
    Hi, is it possible to monitor Pinot DB metrics with Wavefront instead of Prometheus and Grafana? Are there any docs I can refer to? Thanks
    k
    • 2
    • 9
  • m

    Mark.Tang

    01/07/2021, 6:24 AM
    Hi, Team, a streaming app often does the following:
    1. Read local files into Kafka using Flume
    2. Do ETL transformation on the Kafka topic using Flink
    3. Push data from Flink into LinkedIn's Pinot
    So I am not doing a direct map from Kafka to a Pinot table as in https://docs.pinot.apache.org/basics/data-import/pinot-stream-ingestion; any suggestion or example would help me, thanks!
    k
    • 2
    • 6
  • m

    Mark.Tang

    01/07/2021, 9:58 AM
    Hi team, Uber made a contribution about schema inference that saves a lot of manual effort. I think this capability is important when landing in production. So, is there any plan to add this capability to the 2021 roadmap, or has it already been implemented? Thanks! (https://eng.uber.com/operating-apache-pinot/)
    👍 3
    y
    • 2
    • 3
  • j

    Jinwei Zhu

    01/11/2021, 11:01 PM
    Hi @User, I'm working with @User and trying to use our new Pinot Kinesis support. I'd like to know: do we have any images built with that? With just the branch, we cannot use it directly. Thanks
    n
    x
    d
    • 4
    • 55
  • j

    Jackie

    01/13/2021, 7:24 PM
    Yes, please add the
    enableDynamicStarTreeCreation
    into your index config, see https://docs.pinot.apache.org/configuration-reference/table#table-index-config for more details
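    [Editor's note] A minimal sketch of where that flag sits, per the linked configuration reference (alongside the existing star-tree settings in the table config):

    ```json
    "tableIndexConfig": {
      "enableDynamicStarTreeCreation": true
    }
    ```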
    👍 1
    a
    • 2
    • 4
  • n

    Neha Pawar

    01/13/2021, 9:37 PM
    @User we have the reload status API already. It works only for offline tables so far. You can check it out in the cluster manager on the table details page. @User is working on adding the API support for realtime tables.
    👍 1
    a
    • 2
    • 5
  • a

    Amit Chopra

    01/13/2021, 11:39 PM
    A few questions on sorted index:
    1. I was trying to create a sorted index on a STRING column, but it was not working. Then I tried it on an INT column and it worked. Is sorted index only supported on INT (or LONG) types?
    2. I see isSorted = true in the metadata.properties file for the event time as well as the metric column, though I did not enable a sorted index for those. What does this imply? I remember it being mentioned that only one column can be used as the sorted index.
    3. Related to the above: if most queries will have time in the WHERE clause, should we add a sorted index on the time field? Or is it more beneficial to add a sorted index on a field (used often to filter) other than the time field?
    m
    • 2
    • 16
  • y

    Yupeng Fu

    01/14/2021, 1:17 AM
    Hey, the new cluster management UI is very convenient and powerful (e.g. delete table)… is there a plan to add access control to it?
    m
    s
    k
    • 4
    • 7
  • m

    Mahesh Yeole

    01/14/2021, 3:33 AM
    I am trying to fetch PARQUET files from S3 and load them into Pinot. I am using an offline table. I am running this command with my job spec:
    ./bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile examples/batch/metrics/ingestionJobSpec.yaml
    I am seeing the following errors, any idea how to solve this issue?
    Jan 13, 2021 6:34:24 PM WARNING: org.apache.parquet.CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr
    org.apache.parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-mr using format: (.+) version ((.) )?\(build ?(.)\)
    	at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
    	at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
    	at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
    	at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:567)
    	at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:544)
    	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:431)
    	at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:238)
    	at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:234)
    	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadP
    Failed to generate Pinot segment for file - s3://cdca-metrics-prod-us-east-1-eedr/eedr/events/event_date=2021-01-12/event_hour=12/20210112_235508_00031_tgepm_5672f969-021f-4dfd-a0ad-c209aaf7e84d
    java.lang.IllegalArgumentException: INT96 not yet implemented.
    	at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:251) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-125402b4b3595d61fcc702ba57143d927b00fe7f]
    	at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:236) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-125402b4b3595d61fcc702ba57143d927b00fe7f]
    	at org.apache.parquet.schema.PrimitiveType$PrimitiveTypeName$7.convert(PrimitiveType.java:222) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-125402b4b3595d61fcc702ba57143d927b00fe7f]
    	at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:235) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-125402b4b3595d61fcc702ba57143d927b00fe7f]
    	at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:215) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-125402b4b3595d61fcc702ba57143d927b00fe7f]
    	at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:209) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-125402b4b3595d61fcc702ba57143d927b00fe7f]
    	at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:124) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-125402b4b3595d61fcc702ba57143d927b00fe7f]
    x
    • 2
    • 16
  • s

    Sean Chen

    01/14/2021, 9:13 AM
    Hi team, is there a limit on number of znodes per parent node in ZK today?
    k
    • 2
    • 3
  • s

    Sean Chen

    01/15/2021, 4:39 AM
    Hi team, when should I set
    exclude.sequence.id
    ? Is it used just for naming the segment? If I create 3 segments, each with a unique name but having the same time-range, can I set
    exclude.sequence.id
    to true all the time?
    x
    • 2
    • 10
  • s

    Sean Chen

    01/15/2021, 11:49 AM
    I see. There is an explicit reload command
    n
    m
    • 3
    • 4
  • a

    Amit Chopra

    01/15/2021, 4:53 PM
    Hi, I have a question around broker / server pruning. I have 2 servers and 4 segments. The mapping is:
    • server-0
    1. metrics_OFFLINE_26835599_26835666_3
    2. metrics_OFFLINE_26835733_26835799_2
    • server-1
    1. metrics_OFFLINE_26835799_26835866_0
    2. metrics_OFFLINE_26835666_26835733_1
    When I run a query like
    select device, count(device) as aggreg from metrics where eventTime > 26835599 and eventTime < 26835626 group by device order by aggreg desc limit 10
    I see:
    • numServersQueried = 2
    • numServersResponded = 2
    • numSegmentsQueried = 4
    • numSegmentsProcessed = 1
    • numSegmentsMatched = 1
    Questions:
    1. Given the above query, the eventTime range falls within a single segment, metrics_OFFLINE_26835599_26835666_3, so I was expecting numServersQueried to be 1 (instead of 2). Do I need to set something up for broker pruning to take effect?
    2. Similarly, I was expecting numSegmentsQueried to be 1 (instead of 4).
    3. I always see numSegmentsProcessed and numSegmentsMatched with the same value. What is the difference between the two? I looked at https://docs.pinot.apache.org/users/api/querying-pinot-using-standard-sql/response-format, but it wasn't super clear to me from reading there.
    s
    j
    • 3
    • 9
  • k

    Ken Krugler

    01/15/2021, 4:59 PM
    Hi @User - I think you want to check out partitioning on https://docs.pinot.apache.org/operators/operating-pinot/tuning/routing, as a way of avoiding sending the query to all servers (with broker-side pruning).
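    [Editor's note] A minimal sketch of the routing side of that setup; it assumes a matching segmentPartitionConfig on the partitioned column, as described on the linked tuning page:

    ```json
    "routing": {
      "segmentPrunerTypes": [
        "partition"
      ]
    }
    ```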
    a
    m
    j
    • 4
    • 16
  • t

    troywinter

    01/18/2021, 6:33 AM
    Hi, I’m getting an error when using lookup on a local cluster, does anyone know how to solve it?
    [
      {
        "errorCode": 200,
        "message": "QueryExecutionError:\norg.apache.pinot.core.query.exception.BadQueryRequestException: Caught exception while initializing transform function: lookup\n\tat org.apache.pinot.core.operator.transform.function.TransformFunctionFactory.get(TransformFunctionFactory.java:207)\n\tat org.apache.pinot.core.operator.transform.TransformOperator.<init>(TransformOperator.java:56)\n\tat org.apache.pinot.core.plan.TransformPlanNode.run(TransformPlanNode.java:52)\n\tat org.apache.pinot.core.plan.SelectionPlanNode.run(SelectionPlanNode.java:83)\n\tat org.apache.pinot.core.plan.CombinePlanNode.run(CombinePlanNode.java:100)\n\tat org.apache.pinot.core.plan.InstanceResponsePlanNode.run(InstanceResponsePlanNode.java:33)\n\tat org.apache.pinot.core.plan.GlobalPlanImplV0.execute(GlobalPlanImplV0.java:45)\n\tat org.apache.pinot.core.query.executor.ServerQueryExecutorV1Impl.processQuery(ServerQueryExecutorV1Impl.java:294)\n\tat org.apache.pinot.core.query.executor.ServerQueryExecutorV1Impl.processQuery(ServerQueryExecutorV1Impl.java:215)\n\tat org.apache.pinot.core.query.executor.QueryExecutor.processQuery(QueryExecutor.java:60)\n\tat org.apache.pinot.core.query.scheduler.QueryScheduler.processQueryAndSerialize(QueryScheduler.java:157)\n\tat org.apache.pinot.core.query.scheduler.QueryScheduler.lambda$createQueryFutureTask$0(QueryScheduler.java:141)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)"
      }
    ]
    k
    l
    x
    • 4
    • 7
  • t

    troywinter

    01/18/2021, 3:18 PM
    Another question regarding using HDFS as Pinot deep storage: I have put hadoop-client-3.1.1.3.1.0.0-78.jar, hadoop-common-3.1.1.3.1.0.0-78.jar, hadoop-hdfs-3.1.1.3.1.0.0-78.jar, and hadoop-hdfs-client-3.1.1.3.1.0.0-78.jar in the pinot controller’s classpath, but the controller still reports class not found for org/apache/hadoop/fs/FSDataInputStream. What other jars should I include? Below is the stack trace for this error:
    2021/01/18 10:26:32.704 INFO [ControllerStarter] [main] Initializing PinotFSFactory
    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
    	at java.lang.Class.getDeclaredConstructors0(Native Method)
    	at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671)
    	at java.lang.Class.getConstructor0(Class.java:3075)
    	at java.lang.Class.getConstructor(Class.java:1825)
    	at org.apache.pinot.spi.plugin.PluginManager.createInstance(PluginManager.java:295)
    	at org.apache.pinot.spi.plugin.PluginManager.createInstance(PluginManager.java:264)
    	at org.apache.pinot.spi.plugin.PluginManager.createInstance(PluginManager.java:245)
    	at org.apache.pinot.spi.filesystem.PinotFSFactory.register(PinotFSFactory.java:53)
    	at org.apache.pinot.spi.filesystem.PinotFSFactory.init(PinotFSFactory.java:74)
    	at org.apache.pinot.controller.ControllerStarter.initPinotFSFactory(ControllerStarter.java:481)
    	at org.apache.pinot.controller.ControllerStarter.setUpPinotController(ControllerStarter.java:329)
    	at org.apache.pinot.controller.ControllerStarter.start(ControllerStarter.java:287)
    	at org.apache.pinot.tools.service.PinotServiceManager.startController(PinotServiceManager.java:116)
    	at org.apache.pinot.tools.service.PinotServiceManager.startRole(PinotServiceManager.java:91)
    	at org.apache.pinot.tools.admin.command.StartServiceManagerCommand.lambda$startBootstrapServices$0(StartServiceManagerCommand.java:234)
    	at org.apache.pinot.tools.admin.command.StartServiceManagerCommand.startPinotService(StartServiceManagerCommand.java:286)
    	at org.apache.pinot.tools.admin.command.StartServiceManagerCommand.startBootstrapServices(StartServiceManagerCommand.java:233)
    	at org.apache.pinot.tools.admin.command.StartServiceManagerCommand.execute(StartServiceManagerCommand.java:183)
    	at org.apache.pinot.tools.admin.command.StartControllerCommand.execute(StartControllerCommand.java:130)
    	at org.apache.pinot.tools.admin.PinotAdministrator.execute(PinotAdministrator.java:162)
    	at org.apache.pinot.tools.admin.PinotAdministrator.main(PinotAdministrator.java:182)
    Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
    	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    	... 21 more
    And below are the startup opts:
    JAVA_OPTS	-Xms256M -Xmx1G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -Xloggc:/opt/pinot/gc-pinot-controller.log -Dlog4j2.configurationFile=/opt/pinot/conf/pinot-controller-log4j2.xml -Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-hdfs -classpath /opt/hadoop-lib/hadoop-common-3.1.1.3.1.0.0-78.jar:/opt/hadoop-lib/hadoop-client-3.1.1.3.1.0.0-78.jar:/opt/hadoop-lib/hadoop-hdfs-3.1.1.3.1.0.0-78.jar:/opt/hadoop-lib/hadoop-hdfs-client-3.1.1.3.1.0.0-78.jar
    k
    • 2
    • 6
  • d

    Davide Berdin

    01/18/2021, 9:54 PM
    Hello everybody! fantastic project 🚀 I’m totally in love with Apache Pinot ❤️ keep up the great work!
    👋 4
    🍷 4
    k
    m
    • 3
    • 2