# getting-started
  • Amit Chopra (12/11/2020, 4:50 PM)
    Then I changed the config as described in https://docs.pinot.apache.org/operators/operating-pinot/decoupling-controller-from-the-data-path. Now segments are no longer being written to S3. I do see segments being created, as they show up in the query browser, but they show up with status BAD. Can someone help point out what is wrong with the configuration?
    controller.conf:
    Copy code
    controller.helix.cluster.name=pinot-quickstart
    controller.port=9000
    controller.enable.split.commit=true
    controller.allow.hlc.tables=false
    controller.data.dir=/tmp/pinot-tmp-data/
    controller.local.temp.dir=/tmp/pinot-tmp-data/
    pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
    pinot.controller.storage.factory.s3.region=us-west-2
    pinot.controller.segment.fetcher.protocols=file,http,s3
    pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
    controller.zk.str=pinot-zookeeper:2181
    pinot.set.instance.id.to.hostname=true
    server.conf:
    Copy code
    pinot.server.netty.port=8098
    pinot.server.instance.enable.split.commit=true
    pinot.server.adminapi.port=8097
    pinot.server.instance.dataDir=/tmp/pinot-tmp/server/index
    pinot.server.instance.segment.store.uri=s3://pinot-quickstart-s3/pinot-data/pinot-s3-example/controller-data
    pinot.server.instance.segmentTarDir=/tmp/pinot-tmp/server/segmentTars
    pinot.server.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
    pinot.server.storage.factory.s3.region=us-west-2
    pinot.server.segment.fetcher.protocols=file,http,s3
    pinot.server.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
    pinot.set.instance.id.to.hostname=true
    pinot.server.instance.realtime.alloc.offheap=true
    table config:
    Copy code
    {
      "REALTIME": {
        "tableName": "demo1_REALTIME",
        "tableType": "REALTIME",
        "segmentsConfig": {
          "timeType": "MILLISECONDS",
          "schemaName": "demo1",
          "timeColumnName": "mergedTimeMillis",
          "retentionTimeUnit": "DAYS",
          "retentionTimeValue": "60",
          "replication": "1",
          "replicasPerPartition": "1",
          "completionConfig": {
            "completionMode": "DOWNLOAD"
          },
          "peerSegmentDownloadScheme": "http"
        },
        "tenants": {
          "broker": "DefaultTenant",
          "server": "DefaultTenant"
        },
        "tableIndexConfig": {
          "streamConfigs": {
            "streamType": "kafka",
            "stream.kafka.consumer.type": "lowlevel",
            "stream.kafka.topic.name": "demo1",
            "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
            "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
            "stream.kafka.zk.broker.url": "z-1.pinot-quickstart-msk-d.9sahwk.c7.kafka.us-west-2.amazonaws.com:2181,z-3.pinot-quickstart-msk-d.9sahwk.c7.kafka.us-west-2.amazonaws.com:2181,z-2.pinot-quickstart-msk-d.9sahwk.c7.kafka.us-west-2.amazonaws.com:2181",
            "stream.kafka.broker.list": "b-2.pinot-quickstart-msk-d.9sahwk.c7.kafka.us-west-2.amazonaws.com:9092,b-1.pinot-quickstart-msk-d.9sahwk.c7.kafka.us-west-2.amazonaws.com:9092",
            "realtime.segment.flush.threshold.time": "10m",
            "realtime.segment.flush.threshold.size": "10000",
            "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
          },
          "enableDefaultStarTree": false,
          "enableDynamicStarTreeCreation": false,
          "loadMode": "MMAP",
          "autoGeneratedInvertedIndex": false,
          "createInvertedIndexDuringSegmentGeneration": false,
          "aggregateMetrics": false,
          "nullHandlingEnabled": false
        },
        "metadata": {
          "customConfigs": {}
        }
      }
    }
  • Ting Chen (12/11/2020, 8:00 PM)
    In our setup, controller.data.dir points to the deep store and is also consistent with the server upload destination.
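    As a reference point, a minimal sketch of that pairing, assuming an S3 deep store (the bucket and paths are illustrative, not from this thread):
    Copy code
    # controller.conf -- controller.data.dir points at the deep store
    controller.data.dir=s3://my-deepstore-bucket/pinot-segments
    controller.local.temp.dir=/tmp/pinot-controller-tmp
    pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS

    # server.conf -- the server upload destination matches controller.data.dir
    pinot.server.instance.segment.store.uri=s3://my-deepstore-bucket/pinot-segments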
  • Mahesh Yeole (12/14/2020, 9:04 PM)
    @User @User I see a lot of files written to S3 under the same timestamp, but I also see errors on both the controller and the server. On the cluster manager console, the segment keeps showing CONSUMING. We are trying to use the split commit feature, but even with split.commit set to true for both controller and server, the server error shows "isSplitCommitType":false.
    Error in server logs:
    Copy code
    [LLRealtimeSegmentDataManager_pullRequestMergedEventsAwsMskDemo__0__1__20201214T1851Z] [pullRequestMergedEventsAwsMskDemo__0__1__20201214T1851Z] CommitEnd failed with response {"isSplitCommitType":false,"streamPartitionMsgOffset":null,"buildTimeSec":-1,"status":"FAILED","offset":-1}
    Error in controller logs:
    Copy code
    [SegmentCompletionFSM_pullRequestMergedEventsAwsMskDemo__0__1__20201214T1851Z] [grizzly-http-server-1] Caught exception while committing segment file for segment: pullRequestMergedEventsAwsMskDemo__0__1__20201214T1851Z
    java.io.IOException: software.amazon.awssdk.services.s3.model.NoSuchKeyException: The specified key does not exist. (Service: S3, Status Code: 404, Request ID: E62169F11317304B, Extended Request ID: 3dlRY25FjPWIVJsA82PfQnhwlyp/26Nw1VM2xZCzlqEUvNSIXpFSexbvMewbLTR3ZuaDSHE6rq8=)
    This is my controller.conf:
    Copy code
    controller.helix.cluster.name=pinot-cluster
    controller.port=9000
    controller.local.temp.dir=/var/pinot/controller/data
    controller.data.dir=s3://pinot-cluster-segment-s3/pinot-data/pinot-s3-example/controller-data/
    controller.zk.str=pinot-zookeeper:2181
    pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
    pinot.controller.storage.factory.s3.region=us-west-2
    pinot.controller.segment.fetcher.protocols=file,http,s3
    pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
    controller.allow.hlc.tables=false
    controller.enable.split.commit=true
    pinot.set.instance.id.to.hostname=true
    This is my server.conf:
    Copy code
    pinot.server.netty.port=8098
    pinot.server.adminapi.port=8097
    pinot.server.instance.dataDir=/var/pinot/server/data/index
    pinot.server.instance.segmentTarDir=/var/pinot/server/data/segment
    pinot.set.instance.id.to.hostname=true
    pinot.server.instance.realtime.alloc.offheap=true
    pinot.server.instance.segment.store.uri=s3://pinot-cluster-segment-s3/pinot-data/pinot-s3-example/controller-data/
    pinot.server.instance.enable.split.commit=true
    pinot.server.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
    pinot.server.storage.factory.s3.region=us-west-2
    pinot.server.segment.fetcher.protocols=file,http,s3
    pinot.server.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
  • Amit Chopra (01/13/2021, 6:46 PM)
    ok, thanks. Let me try and see what difference it makes
  • Jackie (01/13/2021, 6:53 PM)
    Do you have it configured explicitly? The config key is pinot.server.query.executor.num.groups.limit
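    If it isn't set yet, a one-line sketch of how it would go into server.conf (the value here is purely illustrative, not a recommendation):
    Copy code
    pinot.server.query.executor.num.groups.limit=200000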
  • Zac Farrell (01/20/2021, 8:12 PM)
    Hey folks - I'm trying to get the JDBC client working but am running into an issue:
    Copy code
    java.lang.NoClassDefFoundError: org/apache/pinot/client/JsonAsyncHttpPinotClientTransportFactory
    I've tried running both v0.6.0 (latest) and 0.5.0 (the version called out in the docs), but both produce the same error. I've also tried compiling the jar from source, as well as including it as an explicit dependency in Maven. Any help is appreciated, thanks!
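    The missing class ships in the pinot-java-client module, so one hedged guess is that the JDBC artifact was pulled in without it. A sketch of the Maven dependencies under that assumption (the version is illustrative):
    Copy code
    <dependency>
      <groupId>org.apache.pinot</groupId>
      <artifactId>pinot-jdbc-client</artifactId>
      <version>0.6.0</version>
    </dependency>
    <!-- assumed needed: provides org.apache.pinot.client.JsonAsyncHttpPinotClientTransportFactory -->
    <dependency>
      <groupId>org.apache.pinot</groupId>
      <artifactId>pinot-java-client</artifactId>
      <version>0.6.0</version>
    </dependency>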
  • Mohit Singh (05/23/2021, 2:42 PM)
    Hello everyone, I am trying to ingest data from a Kafka topic into Apache Pinot, but I don't see any data loaded. Am I missing anything in the config related to Avro? Schema:
    Copy code
    {
      "schemaName": "test_schema",
      "dimensionFieldSpecs": [
        {
          "name": "client_id",
          "dataType": "STRING"
        },
        {
          "name": "master_property_id",
          "dataType": "INT"
        },
        {
          "name": "business_unit",
          "dataType": "STRING"
        },
        {
          "name": "error_info_str",
          "dataType": "STRING"
        }
      ],
      "dateTimeFieldSpecs": [
        {
          "name": "timestamp",
          "dataType": "LONG",
          "format": "1:MILLISECONDS:EPOCH",
          "granularity": "1:MILLISECONDS"
        }
      ]
    }
    Table:
    Copy code
    {
      "REALTIME": {
        "tableName": "test_schema_REALTIME",
        "tableType": "REALTIME",
        "segmentsConfig": {
          "schemaName": "test_schema",
          "replication": "1",
          "replicasPerPartition": "1",
          "timeColumnName": "timestamp"
        },
        "tenants": {
          "broker": "DefaultTenant",
          "server": "DefaultTenant",
          "tagOverrideConfig": {}
    },
    "tableIndexConfig": {
      "bloomFilterColumns": [],
      "noDictionaryColumns": [],
      "onHeapDictionaryColumns": [],
      "varLengthDictionaryColumns": [],
      "enableDefaultStarTree": false,
      "enableDynamicStarTreeCreation": false,
      "aggregateMetrics": false,
      "nullHandlingEnabled": false,
      "invertedIndexColumns": [],
      "rangeIndexColumns": [],
      "autoGeneratedInvertedIndex": false,
      "createInvertedIndexDuringSegmentGeneration": false,
      "sortedColumn": [],
      "loadMode": "MMAP",
      "streamConfigs": {
        "streamType": "kafka",
        "stream.kafka.topic.name": "TestTopic",
        "stream.kafka.broker.list": "localhost:9092",
        "stream.kafka.consumer.type": "lowlevel",
        "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
        "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder",
        "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
        "schema.registry.url": "http://localhost:8081",
            "realtime.segment.flush.threshold.rows": "0",
            "realtime.segment.flush.threshold.time": "24h",
            "realtime.segment.flush.segment.size": "100M"
          }
        },
        "metadata": {},
        "quota": {},
        "routing": {},
        "query": {},
        "ingestionConfig": {
          "transformConfigs": [
            {
              "columnName": "error_info_str",
              "transformFunction": "json_format(error_info)"
            }
          ]
        },
        "isDimTable": false
      }
    }
    Kafka Avro Schema:
    Copy code
    {
      "type": "record",
      "name": "TestRecord",
      "namespace": "com.test.ns",
      "fields": [
        {
          "name": "client_id",
          "type": [
            "null",
            "string"
          ]
        },
        {
          "name": "master_property_id",
          "type": "int"
        },
        {
          "name": "business_unit",
          "type": [
            "null",
            "string"
          ]
        },
        {
          "name": "error_info",
          "type": {
            "type": "record",
            "name": "ErrorInfo",
            "fields": [
              {
                "name": "code",
                "type": [
                  "null",
                  "string"
                ]
              },
              {
                "name": "description",
                "type": [
                  "null",
                  "string"
                ]
              }
            ]
          }
        },
        {
          "name": "timestamp",
          "type": [
            "null",
            "long"
          ],
          "default": null
        }
      ]
    }
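    One thing worth double-checking against the Pinot docs: the Confluent schema registry URL is usually passed to the decoder as a decoder property rather than a bare schema.registry.url key. A sketch of the streamConfigs entry, assuming that key applies to this Pinot version:
    Copy code
    "stream.kafka.decoder.prop.schema.registry.rest.url": "http://localhost:8081"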
  • Kaushik Ranganath (06/07/2021, 4:01 AM)
    When I do a kubectl get all -n pinot-quickstart, I see this has brought up classic load balancers exposing both the broker and the controller on TCP ports. When I make a curl/browser request to the DNS names, I expect to see the UI for the broker and the Swagger UI for the controller, but the request eventually times out without bringing up the UI. I am a beginner in AWS networking, but the security groups created by these setup instructions (which I have followed exactly) allow TCP requests from 0.0.0.0/0. Any inputs on bringing up the UIs for the broker and controller are much appreciated!
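    While the load balancer side is being debugged, a port-forward can at least confirm the pods themselves are serving; a sketch, assuming the quickstart's default controller service name:
    Copy code
    kubectl port-forward service/pinot-controller 9000:9000 -n pinot-quickstart
    # then open http://localhost:9000 for the controller UI / Swagger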
  • Kamal Chavda (07/09/2021, 3:24 PM)
    Hello! I am trying to figure out if there are any restrictions on column naming conventions for the schema file. Can names be snake_case, or do they have to be camelCase?
  • Bruce Ritchie (07/09/2021, 6:27 PM)
    Hello all. Quick question before I install Pinot: is it possible to alter a table to change the indices on columns after data is loaded?
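    For context, the usual flow described in the docs is to update the table config and then trigger a segment reload through the controller REST API; a sketch, with the host and table name illustrative:
    Copy code
    # upload the revised table config containing the new index settings
    curl -X PUT "http://localhost:9000/tables/myTable" \
      -H "Content-Type: application/json" -d @table-config.json

    # ask servers to reload segments so the new indices get built
    curl -X POST "http://localhost:9000/segments/myTable/reload"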
  • Bruce Ritchie (07/09/2021, 6:56 PM)
    For batch ingestion in standalone mode, where is the work being performed?
  • Bruce Ritchie (08/01/2021, 6:31 PM)
    Is it possible to add S3 deep storage after a cluster has ingested data? The documentation I found uses an S3 URL for the controller.data.dir property, which in my POC currently points to a directory on the controller's filesystem.
  • xtrntr (08/10/2021, 2:30 AM)
    hello, I'm just curious how queries run faster when you issue the exact same query a second time. How does caching work in Pinot? If it matters, I'm using an inverted index with no partition pruning (only 1 broker running). The broker log looks something like this:
    Copy code
    Processed requestId=34,table=sorted_events_OFFLINE,segments(queried/processed/matched/consuming)=198/198/198/-1,schedulerWaitMs=0,reqDeserMs=0,totalExecMs=426,resSerMs=0,totalTimeMs=426,minConsumingFreshnessMs=-1,broker=Broker_172.26.0.4_8099,numDocsScanned=259467,scanInFilter=619119085,scanPostFilter=259467,sched=fcfs
    
    Slow query: request handler processing time: 427, send response latency: 3, total time to handle request: 430
    Processed requestId=35,table=events_OFFLINE,segments(queried/processed/matched/consuming)=198/198/118/-1,schedulerWaitMs=0,reqDeserMs=5,totalExecMs=221,resSerMs=0,totalTimeMs=226,minConsumingFreshnessMs=-1,broker=Broker_172.26.0.4_8099,numDocsScanned=657,scanInFilter=346815,scanPostFilter=657,sched=fcfs
    also, i’m wondering what is considered prompts
    "Slow query: …"
    to show up in logs? does this mean pinot is suggesting that some optimization is possible to speed up my queries?
  • Tiger Zhao (08/16/2021, 3:19 PM)
    Hi, I'm trying to batch ingest a lot of data in some ORC files. What is the recommended way of doing this? I'm currently using the SegmentCreationAndMetadataPush job with the command-line interface.
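    For reference, a minimal sketch of a standalone SegmentCreationAndMetadataPush job spec for ORC input; the bucket, table, and controller details are illustrative:
    Copy code
    executionFrameworkSpec:
      name: 'standalone'
      segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
      segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentMetadataPushJobRunner'
    jobType: SegmentCreationAndMetadataPush
    inputDirURI: 's3://my-bucket/orc-input/'
    includeFileNamePattern: 'glob:**/*.orc'
    outputDirURI: 's3://my-bucket/pinot-segments/myTable/'
    overwriteOutput: true
    pinotFSSpecs:
      - scheme: s3
        className: org.apache.pinot.plugin.filesystem.S3PinotFS
        configs:
          region: us-west-2
    recordReaderSpec:
      dataFormat: 'orc'
      className: 'org.apache.pinot.plugin.inputformat.orc.ORCRecordReader'
    tableSpec:
      tableName: 'myTable'
    pinotClusterSpecs:
      - controllerURI: 'http://localhost:9000'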
  • xtrntr (08/20/2021, 5:14 AM)
    Am I missing something? If you have
    Copy code
    # table1
    s3://bucket/pinot-segments/table1/

    # table2
    s3://bucket/pinot-segments/table2/
    do you need to tell the controller where the segments for each table go? I only see controller.data.dir
  • Tiger Zhao (08/20/2021, 2:50 PM)
    When using S3 as the deep store with SegmentCreationAndMetadataPush, should the controller.data.dir (from the controller conf) be the same as the outputDirURI (from the ingestion job spec)?
  • Tiger Zhao (08/24/2021, 9:26 PM)
    Is there a way to enable/disable individual segments?
  • Tiger Zhao (08/25/2021, 2:18 PM)
    What does the process look like for changing the table config for an existing table with segments in deepstore?
  • Tiger Zhao (08/26/2021, 7:16 PM)
    By default, is the broker supposed to limit queries to 10 results?
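    As far as the docs describe, Pinot does apply a default LIMIT 10 when a query doesn't specify one, so the cap has to be raised explicitly; for example (table name illustrative):
    Copy code
    SELECT * FROM myTable LIMIT 1000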
  • Thiago Pereira (08/28/2021, 12:51 PM)
    Does anyone have a good tutorial to help me?
  • J K (08/31/2021, 2:04 PM)
    I'm having issues running the basic scripts in the 0.8.0 Apache Pinot release on Windows 7 with Java 8, using git-bash as the terminal. I'm following this link for setup: https://docs.pinot.apache.org/v/release-0.4.0/basics/getting-started/running-pinot-locally It seems like it cannot find the Java class files correctly. I currently only have JAVA_HOME set, pointing to JDK 8. I noticed in the 0.3.0 release notes there was something regarding Java 8 (see attached image). Is there something special I need to do to get this to work?
  • Luis Fernandez (08/31/2021, 3:02 PM)
    question: if we have to scale Pinot servers horizontally (where data is stored), do we rebalance the segments within those server hosts? How does that work?
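    For background, segments are not moved automatically when servers are added; the controller's rebalance endpoint redistributes them. A sketch, with the host and table name illustrative:
    Copy code
    curl -X POST "http://localhost:9000/tables/myTable/rebalance?type=REALTIME"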
  • Tiger Zhao (09/01/2021, 5:39 PM)
    Is there a way to specify the SegmentPush job to only push a single segment instead of a directory?
  • Tiger Zhao (09/02/2021, 9:49 PM)
    Is Pinot able to efficiently run queries that use REGEXP_LIKE? I'm not sure if there is any indexing or pre-aggregation that would make that fast.
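    Plain REGEXP_LIKE is evaluated as a scan over the column values, so one option described in the docs is a Lucene text index queried via TEXT_MATCH instead; a sketch of the fieldConfigList entry (the column name is illustrative):
    Copy code
    "fieldConfigList": [
      {
        "name": "logLine",
        "encodingType": "RAW",
        "indexType": "TEXT"
      }
    ]
    Queries would then use TEXT_MATCH(logLine, '...') with Lucene query syntax rather than REGEXP_LIKE.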
  • Luis Fernandez (09/07/2021, 4:38 PM)
    when we insert data into Pinot, how is replication achieved? Is it when a segment is completed that the data is made available to other nodes?
  • xtrntr (09/08/2021, 2:52 AM)
    that's what I did, but it says in the document:
    "This should only be used in standalone setups or for POC."
  • Tiger Zhao (09/08/2021, 2:53 PM)
    If I set enableDefaultStarTree=true, is it possible to also specify extra aggregations in functionColumnPairs or change the maxLeafRecords (or any other config)? I think having it automatically generate and sort the dimensionsSplitOrder list is very helpful, but I also want to add more aggregations on top of the default.
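    For comparison, explicitly configured star-trees go under tableIndexConfig.starTreeIndexConfigs, where functionColumnPairs and maxLeafRecords can be set directly; a sketch with illustrative column names:
    Copy code
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["country", "deviceType"],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": ["SUM__clicks", "PERCENTILE_EST__latencyMs"],
        "maxLeafRecords": 10000
      }
    ]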
  • Tiger Zhao (09/08/2021, 10:17 PM)
    Is pinot able to do PERCENTILEs and PERCENTILE aggregates in the star tree on columns with NULL values?
  • sina (09/10/2021, 6:51 AM)
    Hi team, I am working on a project for realtime speed test calculation. I get the speed test data from devices via Kafka ingestion. Once the data is in Pinot, the following calculations need to be performed:
    - Select the peak-hour data (7 pm to 11 pm).
    - The data arrives with varying timestamps; the average speed needs to be calculated for each hour between 7 pm and 11 pm every day, e.g. 7-8 pm average, 8-9 average, 9-10 average, and 10-11 pm average. (The average for each hour should be available as soon as that 1-hour window is completed.)
    - The 4 averages need to be stored in another table, so we would have 4 sample data points per day.
    - From that second table, the past 14 days of data need to be selected, and the 3rd-worst speed should be reported and stored in another table.
    Both of these two tables would be my reports. The question is whether Pinot is a suitable platform for these sorts of calculations. What would be the best way to run ETL jobs or tasks to run the queries that do the calculations? I have already done this with InfluxDB, but I would like to design/implement this with Pinot. Note that I also have other use cases with the same data where the data needs to be reported in realtime. Thank you in advance for your help.
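    For what it's worth, the hourly peak-hour averages map onto a single Pinot query; a sketch, assuming epoch-millis timestamps, UTC hours, and illustrative table/column names:
    Copy code
    SELECT DATETIMECONVERT(ts, '1:MILLISECONDS:EPOCH', '1:HOURS:EPOCH', '1:HOURS') AS hourBucket,
           AVG(speed) AS avgSpeed
    FROM speed_tests
    WHERE hour(ts) BETWEEN 19 AND 22
    GROUP BY DATETIMECONVERT(ts, '1:MILLISECONDS:EPOCH', '1:HOURS:EPOCH', '1:HOURS')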
  • Tiger Zhao (09/10/2021, 8:24 PM)
    Does Pinot automatically delete the generated indices from the servers after deleting a segment? I'm running into an issue where I delete a segment through the REST API, but it leaves the index files under PinotServer/index behind. The indices have built up over time from my test tables, and now the servers are out of disk space.