# general

    RK

    05/31/2021, 5:45 AM
    Hello everyone, I am creating a hybrid table which can ingest data from a Kafka topic (streaming) as well as from an HDFS location (batch ingestion). I am aware of the stream ingestion process for Kafka topics and have created multiple realtime tables. Now I am creating a hybrid table, since for one of the Kafka topics the data is also available at an HDFS location. I am going through the documents, but in offline-config-table.json I couldn't find any property where we pass the HDFS source location. Kindly suggest the process to ingest from HDFS into the same table.
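(Editor's note: in Pinot, the batch source location does not go in the offline table config at all; it goes in the ingestion job spec passed to the ingestion job. A trimmed, illustrative sketch, assuming Avro files on HDFS; paths, table name, and controller address are placeholders, and a real spec needs the full executionFrameworkSpec:)

```yaml
executionFrameworkSpec:
  name: 'standalone'
jobType: SegmentCreationAndTarPush
inputDirURI: 'hdfs://namenode:8020/data/my_topic/'        # batch source location
includeFileNamePattern: 'glob:**/*.avro'
outputDirURI: 'hdfs://namenode:8020/pinot/segments/my_topic/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
recordReaderSpec:
  dataFormat: 'avro'
  className: org.apache.pinot.plugin.inputformat.avro.AvroRecordReader
tableSpec:
  tableName: 'myTable'
pinotClusterSpecs:
  - controllerURI: 'http://pinot-controller:9000'
```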

    Sávio Salvarino Teles de Oliveira

    06/01/2021, 12:24 AM
    Hi! I have two dimensions (customers and sellers) with a fact table of order data. We would like to aggregate the order data by customers and sellers, e.g. aggregate order amount. We would like to use the star-tree index, but the customer can change at any time (name, address, etc.), and the Pinot documentation says that upsert is not supported with the star-tree index (https://docs.pinot.apache.org/basics/data-import/upsert#limitations…). What would be the best solution using Pinot?

    Kaustabh Ganguly

    06/01/2021, 2:24 PM
    I'm a fresh CS grad and just exploring things. I am new to streaming data, Kafka, and Pinot. I want to merge batched data and streaming data and use Pinot on top of it. My solution is to use Kafka Connect, as it's an ideal solution for merging batched and streaming data into topics & partitions. So my pipeline basically uses Kafka for merging and then Pinot for streaming from Kafka. Is there a better solution that comes to anyone's mind? Please correct me if there's any fallacy in my logic.

    Mayank

    06/01/2021, 3:27 PM
    The incoming null gets translated into the default null value and stored in Pinot. So in your example, “default” will be stored
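(Editor's note: for reference, the default null value is configured per field in the table schema; a minimal fragment, with an illustrative field name:)

```json
{
  "dimensionFieldSpecs": [
    {
      "name": "status",
      "dataType": "STRING",
      "defaultNullValue": "default"
    }
  ]
}
```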

    Ken Krugler

    06/02/2021, 12:27 AM
    My ops guy is setting up Docker containers, and wants to know why the base Pinot Dockerfile has
    VOLUME ["${PINOT_HOME}/configs", "${PINOT_HOME}/data"]
    since he sees that there’s nothing being stored in the /data directory. Any input?

    troywinter

    06/02/2021, 3:17 AM
    I’m getting slow regexp_like performance: for 0.3 billion rows, it is costing nearly 2 secs to match a prefix for a column, but in Druid the same data using the like operator returned instantly. Is there any config I can apply to speed up this kind of query?
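(Editor's note: two formulations worth benchmarking for a prefix match; table and column names are placeholders. Pinot also supports the LIKE operator, which may be cheaper than a general regex:)

```sql
-- regexp_like with an anchored prefix pattern
SELECT COUNT(*) FROM myTable WHERE regexp_like(myCol, '^abc');

-- LIKE with a trailing wildcard
SELECT COUNT(*) FROM myTable WHERE myCol LIKE 'abc%';
```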

    Lakshmanan Velusamy

    06/03/2021, 5:53 AM
    Hello Community, We are trying to add user defined scalar functions. Can these functions be used in star tree index for pre-aggregation ?

    Jonathan Meyer

    06/03/2021, 11:51 AM
    Hello 🙂 Is there a whirlwind tour of Pinot's code base available somewhere ? Some pointers on where to start ?

    Sávio Salvarino Teles de Oliveira

    06/03/2021, 3:16 PM
    Hello. What happens during upsert in real-time ingestion when the primary key and event time are both equal? The documentation says: "When two records of the same primary key are ingested, the record with the greater event time (as defined by the time column) is used." But when there is a tie, what happens?

    Pankaj Thakkar

    06/05/2021, 9:56 PM
    Thanks for the links @User; @User awesome job on the segment lifecycle videos!

    Santhosh CT

    06/08/2021, 3:59 AM
    Hi. We have a use case to store incoming user events, with multiple dimensions that we want to query on. We want to use S3 as deep storage. We also have requirements like: the last half hour of data will be queried frequently, like a hot shard. Can we use Pinot for this use case? How can we model the data optimally for this kind of use case? Do we have data retention support, where data older than some threshold can be removed?

    Jai Patel

    06/08/2021, 11:44 PM
    For an upsert table I have the order columns: timeColumnName set to my updated_at timestamp. It used to be created_at when I was using an offline-only table. I believe this is the correct change. My question is about the sorted-column index: do I need to change it too? For my use case I generally still want to sort on created_at. But does upsert require the sorted column to be the same as the time column?
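(Editor's note: for reference, the sorted column is configured independently of the time column, under tableIndexConfig; a fragment, with the field name taken from the question:)

```json
{
  "tableIndexConfig": {
    "sortedColumn": ["created_at"]
  }
}
```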

    Alon Burg

    06/09/2021, 11:00 AM
    Is there a way to query the result of the star-tree index for time periods? I guess this type of query is probably executed by ThirdEye?

    RK

    06/09/2021, 4:06 PM
    Is there any way to increase server memory when starting a Pinot server in cluster mode? I have my servers on 2 different nodes; whenever I refresh my Superset dashboard it fires some queries to Pinot and fetches data from the servers. One of my servers automatically shows a dead state; when I checked the log it says there is insufficient memory for the Java Runtime Environment to continue, and the server stopped working there. Is there any way to resolve this issue? @User @User @User

    Pedro Silva

    06/09/2021, 4:43 PM
    Hello, what is the difference between segmentsConfig.replication and segmentsConfig.replicasPerPartition for a realtime table?
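(Editor's note: for reference, both settings live side by side under segmentsConfig in the table config; per the docs of this era, replication applies to offline/completed segments while replicasPerPartition governs realtime consuming segments. A hypothetical fragment:)

```json
{
  "segmentsConfig": {
    "timeColumnName": "ts",
    "replication": "3",
    "replicasPerPartition": "3"
  }
}
```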

    Map

    06/09/2021, 9:13 PM
    For stream ingestion with Kafka, only JSON format is currently supported, right? The input formats listed here https://docs.pinot.apache.org/basics/data-import/pinot-input-formats are only for batch ingestion?

    Alon Burg

    06/10/2021, 9:18 AM
    In the article "Pinot: Realtime OLAP for 530 Million Users" it says:
    "At Linkedin, business events are published in Kafka streams and are ETL'ed onto HDFS. Pinot supports near-realtime data ingestion by reading events directly from Kafka [19] as well as data pushes from offline systems like Hadoop. As such, Pinot follows the lambda architecture [23], transparently merging streaming data from Kafka and offline data from Hadoop. As data on Hadoop is a global view of a single hour or day of data as opposed to a direct stream of events, it allows for the generation of more optimal segments and aggregation of records across the time window."
    Is there a general rule of thumb of when should I keep raw events in Pinot vs aggregated data?

    Carl

    06/10/2021, 11:39 AM
    Hi, does current Pinot python client support basic auth for querying Pinot? Is there an example showing how to pass the auth header with python client? Thanks.
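(Editor's note: even without client-library support, the broker's HTTP query endpoint can be called directly with a standard Basic auth header. A minimal sketch using only the Python standard library; the broker URL and credentials are placeholders:)

```python
import base64
import json
import urllib.request

# Hypothetical broker query endpoint -- substitute your own.
BROKER = "http://localhost:8099/query/sql"

def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return f"Basic {token}"

def query_pinot(sql: str, username: str, password: str) -> dict:
    """POST a SQL query to the Pinot broker with basic auth and return the JSON response."""
    req = urllib.request.Request(
        BROKER,
        data=json.dumps({"sql": sql}).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": basic_auth_header(username, password),
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```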

    RK

    06/10/2021, 6:07 PM
    Is there any way to query a Pinot table directly from Superset without using Presto as middleware? I.e., to access a Pinot table through Superset I am using the Pinot Presto connector, then in Superset I use this catalog to connect to the table, so whenever I fire queries from Superset they go to Pinot via Presto. Since I am not using any joins in the queries, I believe I can also connect Superset and Pinot directly without Presto as middleware, and I think queries would be faster that way. @User kindly suggest.

    Carl

    06/10/2021, 9:13 PM
    By default, Pinot returns 10 rows for a select * query. Is there a way to change or remove this default limit?
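(Editor's note: the usual workaround is an explicit LIMIT clause; table name is a placeholder:)

```sql
SELECT * FROM myTable LIMIT 1000;
```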

    RK

    06/11/2021, 8:17 AM
    Hi everyone, how can I convert a string into a double in Pinot for use with the sum function? I have tried these 2 queries but am getting this error. In my Pinot schema file the data type for this field is string, so I am not able to take the sum. While writing the transformation in the config file I assigned the default value null for this field, so it has both non-null values and nulls where data is not available. I guess it is not able to cast the null values into double/decimal. Is there any way to ignore nulls? I have tried "where gross_amount is not null" but it didn't work. Kindly suggest.
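(Editor's note: a hedged sketch of the two pieces usually involved; the column name is taken from the question, and the sentinel value to exclude depends on what the transform actually writes for missing data:)

```sql
-- cast the string column to double, excluding the sentinel 'null' values
SELECT SUM(CAST(gross_amount AS DOUBLE))
FROM myTable
WHERE gross_amount <> 'null';
```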

    Gagandeep Singh

    06/11/2021, 2:47 PM
    Hello guys 👋 For my studies, I am talking about Pinot and its architecture. I created an activity diagram demonstrating the query process within the cluster, but I think some things are missing. I read the original paper and based the diagram on the query-process section, but unfortunately I was not fully able to illustrate it. I would appreciate it if some experts could give me some feedback. Thank you very much!

    Aaron Wishnick

    06/11/2021, 3:45 PM
    I'm trying to understand the difference between Segment URI Push and Segment Metadata Push. I was using Segment URI Push and I filled up the disk on the controller. That seems to make sense to me since the controller had to download all the segments. A couple related questions: 1. If I use metadata push, my understanding is that the controller will direct one of the servers to download the segment instead, is that right? 2. Does that mean the controller will use less disk in that case? 3. Is the final state after URI Push and Metadata Push different? I'd assume in both cases, you should end up with segments distributed across servers, is that right? So I'm just curious why the controller's disk filled up, is it supposed to clean up and isn't doing that, or is this behavior expected?

    Punish Garg

    06/11/2021, 5:07 PM
    Hello team, I wanted to understand one thing: does Pinot provide the capability to overwrite segment data (like we overwrite partitioned data in a Hive table)?

    Ashish

    06/13/2021, 2:53 AM
    Is this issue resolved - https://github.com/apache/incubator-pinot/issues/5261?

    Hamza

    06/14/2021, 3:01 PM
    Hello, I'm running Pinot in K8s and I have one job that creates my table and schema, and another job that does the ingestion from GCS storage. This last job creates segments and stores them in a GCS bucket. Is there a way for later runs to load these segments directly from the folder without recreating them?

    Jai Patel

    06/14/2021, 10:55 PM
    What’s the option called to disable upsert?

    Chundong Wang

    06/14/2021, 11:01 PM
    Is there any document on how theta-sketch columns should be generated? In the Pinot doc of DistinctCountThetaSketch it mentions thetaSketchColumn. Is that column supposed to be serialized binary (hex string, I suppose) of the Theta Sketch framework?
    UpdateSketch sketch2 = UpdateSketch.builder().build();
    for (int key = 50000; key < 150000; key++) sketch2.update(key);
    FileOutputStream out2 = new FileOutputStream("ThetaSketch2.bin");
    out2.write(sketch2.compact().toByteArray()); // or hexString()

    Pedro Silva

    06/15/2021, 11:23 AM
    Hello, has anyone successfully configured Pinot to work with Trino in a Kubernetes environment? Following their documentation, they mention that "The Pinot broker and server must be accessible via DNS as Pinot returns hostnames and not IP addresses." Does this mean the actual pods or the services? Can someone share what their configurations look like? I've tried the Trino slack unsuccessfully...

    Milan Bracke

    06/16/2021, 7:42 AM
    Hi! Is there a way to write a where clause to match entries that do not match a given regular expression? Using "not" just results in an error message.
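(Editor's note: since Pinot's regexp_like uses Java regex syntax, one workaround worth trying is expressing the negation inside the pattern itself via a negative lookahead; the pattern and names below are illustrative:)

```sql
-- match rows whose column does NOT contain 'forbidden'
SELECT COUNT(*) FROM myTable
WHERE regexp_like(myCol, '^(?!.*forbidden).*$');
```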