#general

ayush sharma

03/12/2021, 7:18 PM
Hi all, I am writing this to explain the loop of problems we are facing with an architecture built on Superset (v1.0.1), *Pinot* (latest Docker image), and Presto (starburstdata/presto:350-e.3 Docker image). Working around a problem in one framework causes a problem in another. I do not know which community can help us solve this, hence posting it in both.

So far: we have successfully pushed 1 million records into a Pinot table and would like to build Superset charts on top of it.

Problem #1: We connected Superset to Pinot successfully and were able to build SQL Lab queries, only to find out that Superset does not support exploring SQL Lab virtual data as a chart when the connected database is Apache Pinot (the "Explore" button is disabled). Please let me know if this can be solved, or whether we interpreted it incorrectly, as that would solve the whole problem at once. To work around it, we learned that a Superset-Presto connection would enable the Explore button, and we had planned to deploy Presto anyway. So we implemented Presto on top of Pinot.

Problem #2: We found that Presto cannot aggregate more than 50k Pinot records, throwing the error
Segment query returned '50001' rows per split, maximum allowed is '50000' rows. with query "SELECT * FROM pinot_table  LIMIT 50001"
Presto cannot even run a simple query like this:
presto:default> select count(*) from pinot.default.pinot_table;
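For reference, this limit is configured in the Presto catalog's pinot.properties file, which the connector reads at startup. A minimal sketch of such a catalog file; the file path and controller address below are illustrative, not taken from this setup:

# etc/catalog/pinot.properties (illustrative path)
connector.name=pinot
# illustrative controller address
pinot.controller-urls=pinot-controller:9000
# the 50k-rows-per-split limit from the error above
pinot.max-rows-per-split-for-segment-queries=50000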
Even if we increase this 50k limit (pinot.max-rows-per-split-for-segment-queries) to 1 million, the Presto server crashes, stating that heap memory was exceeded. To work around it, we learned that we can make Pinot do the aggregations and feed the aggregated result to Presto, which in turn feeds Superset to visualize the charts, by writing the aggregation logic inside a Presto subquery like:
presto:default> select * from pinot.default."select count(*) from pinot_table"
This returns the expected result.

Problem #3: We found that, though we can make Pinot do the aggregations, we cannot use the supported transformation functions of Pinot (listed here) inside the Presto subquery. The query
select datetrunc('day', epoch_ms_col, 'milliseconds') from pinot_table limit 10
works fine in Pinot, but when embedded in Presto as a subquery like below, it does not work:
presto:default> select * from pinot.default."select datetrunc('day', epoch_ms_col, 'milliseconds') from pinot_table limit 10";
Query failed: Column datetrunc('day',epoch_ms_col,'milliseconds') not found in table default.select datetrunc('day', epoch_ms_col, 'milliseconds') from pinot_table limit 10
I do not know if we are doing something wrong while querying/implementing, or have missed some useful config setting that could solve our problem. The SQL Lab query we want to run against Pinot, eventually using the result to make a chart, is:
SELECT
    day_of_week(epoch_ms_col),
    count(*)
FROM pinot_table
GROUP BY day_of_week(epoch_ms_col)
Any help is really appreciated!

Kishore G

03/12/2021, 7:24 PM
Thanks Ayush. Looking into it. Can you file a GitHub issue? This is super detailed and can be helpful for other developers.

Xiang Fu

03/12/2021, 9:03 PM
For #3, @Elon do you know if it’s supported?

Elon

03/12/2021, 9:21 PM
Hey, I wanted to try some things out; I might have a PR to fix that if I can't find a workaround.
Already have one open that I can add to: https://github.com/trinodb/trino/pull/7161

Ron Kitay

03/16/2021, 6:29 PM
This thread might be relevant for my question - if I want to use Pinot to extract a large amount of data, what are the limitations? E.g., if I want to do something like:
SELECT * FROM table WHERE creationTime >= x AND creationTime < y
And save that output to a file (or files), e.g. with the Spark connector. What are the limits here? If the result is 2 TB of data, will that be supported?
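For illustration, a minimal sketch of that kind of bulk extraction through the Pinot Spark connector, assuming its DataFrame read path; the table name, option values, and output path are placeholders, and option names may differ by connector version:

# Sketch only: assumes the pinot-spark-connector jar is on the Spark classpath
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pinot-extract").getOrCreate()

# Read the Pinot table; whether the time filter is pushed down to Pinot
# (versus applied in Spark) depends on the connector version.
df = (spark.read.format("pinot")
      .option("table", "myTable")      # placeholder table name
      .option("tableType", "OFFLINE")  # placeholder table type
      .load()
      .where("creationTime >= 1609459200000 AND creationTime < 1612137600000"))

df.write.parquet("/tmp/pinot_export")  # placeholder output path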

Elon

03/16/2021, 6:31 PM
Not sure about the Spark connector. With the Trino connector there is a PR that will enable inserting directly into a table: https://github.com/trinodb/trino/pull/7162. As long as each worker has enough memory to hold the segments, you should be fine, i.e. worker_memory * workers > segment_memory * segments.
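For example (illustrative numbers only): 10 workers with 32 GB each give 320 GB on the left-hand side, which satisfies the inequality for 1,000 segments of roughly 200 MB each, i.e. about 200 GB on the right.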

Xiang Fu

01/26/2022, 6:29 PM
@Lvszn Peng