#general

ayush sharma

03/12/2021, 7:18 PM
Hi all, I am writing this to explain the loop of problems we are facing with an architecture built on Superset (v1.0.1), *Pinot* (latest Docker image), and Presto (starburstdata/presto:350-e.3 Docker image). Working around a problem in one framework causes a problem in another. I do not know which community can help us solve this, hence posting it in both.

So far: we have successfully pushed 1 million records into a Pinot table and would like to build Superset charts on top of it.

Problem #1: We connected Superset to Pinot successfully and were able to build SQL Lab queries, only to find out that Superset does not support exploring SQL Lab virtual data as a chart when the connected database is Apache Pinot (the "Explore" button is disabled). Please let me know if this can be solved, or whether we interpreted it incorrectly, as that would solve the whole problem at once. To work around it, we learned that a Superset-Presto connection would enable the Explore button, and we had planned to deploy Presto anyway. So we implemented Presto on top of Pinot.

Problem #2: We found that Presto cannot aggregate more than 50k Pinot records, throwing the error
Segment query returned '50001' rows per split, maximum allowed is '50000' rows. with query "SELECT * FROM pinot_table  LIMIT 50001"
Presto cannot even run a simple query like this:
presto:default> select count(*) from pinot.default.pinot_table;
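For reference, this limit is configured in the Presto catalog's pinot.properties file, which the connector reads at startup. A minimal sketch of such a catalog file; the file path and controller address below are illustrative, not taken from this setup:

# etc/catalog/pinot.properties (illustrative path)
connector.name=pinot
# illustrative controller address
pinot.controller-urls=pinot-controller:9000
# the 50k-rows-per-split limit from the error above
pinot.max-rows-per-split-for-segment-queries=50000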
Even if we increase this 50k limit (pinot.max-rows-per-split-for-segment-queries) to 1 million, the Presto server crashes, stating that heap memory was exceeded. To work around it, we learned that we can make Pinot do the aggregations and feed the aggregated result to Presto, which in turn feeds Superset to visualize the charts, by writing the aggregation logic inside a Presto subquery like:
presto:default> select * from pinot.default."select count(*) from pinot_table"
This returns the expected result.

Problem #3: We found that, though we can make Pinot do the aggregations, we cannot use the supported transformation functions of Pinot (listed here) inside the Presto subquery. The query
select datetrunc('day', epoch_ms_col, 'milliseconds') from pinot_table limit 10
works fine in Pinot, but when embedded in Presto as a subquery like below, it does not work:
presto:default> select * from pinot.default."select datetrunc('day', epoch_ms_col, 'milliseconds') from pinot_table limit 10";
Query failed: Column datetrunc('day',epoch_ms_col,'milliseconds') not found in table default.select datetrunc('day', epoch_ms_col, 'milliseconds') from pinot_table limit 10
I do not know if we are doing something wrong while querying/implementing, or have missed some useful config setting that could solve our problem. The SQL Lab query we want to run against Pinot, eventually using the result to make a chart, is:
SELECT
    day_of_week(epoch_ms_col),
    count(*)
FROM pinot_table
GROUP BY day_of_week(epoch_ms_col)
Any help is really appreciated!

Kishore G

03/12/2021, 7:24 PM
Thanks Ayush. Looking into it. Can you file a GitHub issue? This is super detailed and can be helpful for other developers.

Xiang Fu

03/12/2021, 9:03 PM
For #3, @Elon do you know if it’s supported?

Elon

03/12/2021, 9:21 PM
Hey, I wanted to try some things out; I might have a PR to fix that if I can't find a workaround.
Already have one open that I can add to: https://github.com/trinodb/trino/pull/7161

Ron Kitay

03/16/2021, 6:29 PM
This thread might be relevant for my question - if I want to use Pinot to extract a large amount of data, what are the limitations? E.g., if I want to do something like:
SELECT * FROM table WHERE creationTime >= x AND creationTime < y
And save that output to a file (or files), e.g. with the Spark connector. What are the limits here? If the result is 2 TB of data, will that be supported?
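For illustration, a minimal sketch of that kind of bulk extraction through the Pinot Spark connector, assuming its DataFrame read path; the table name, option values, and output path are placeholders, and option names may differ by connector version:

# Sketch only: assumes the pinot-spark-connector jar is on the Spark classpath
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pinot-extract").getOrCreate()

# Read the Pinot table; whether the time filter is pushed down to Pinot
# (versus applied in Spark) depends on the connector version.
df = (spark.read.format("pinot")
      .option("table", "myTable")      # placeholder table name
      .option("tableType", "OFFLINE")  # placeholder table type
      .load()
      .where("creationTime >= 1609459200000 AND creationTime < 1612137600000"))

df.write.parquet("/tmp/pinot_export")  # placeholder output path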

Elon

03/16/2021, 6:31 PM
Not sure about the Spark connector. With the Trino connector there is a PR that will enable inserting directly into a table: https://github.com/trinodb/trino/pull/7162. As long as each worker has enough memory to hold the segments, you should be fine, i.e. worker_memory * workers > segment_memory * segments.
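For example (illustrative numbers only): 10 workers with 32 GB each give 320 GB on the left-hand side, which satisfies the inequality for 1,000 segments of roughly 200 MB each, i.e. about 200 GB on the right.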

Xiang Fu

01/26/2022, 6:29 PM
@Lvszn Peng