# general
l
Hi fellows, have you seen any documentation about using PySpark to ingest into an offline table? If yes, would you be able to send me the link?
m
I personally have not, but am also curious if someone has tried it.
l
I really want to build something around this. I have a couple of people who want to try offline tables with Spark, but they only have a PySpark skillset, no Java or Scala 😞 any thoughts?
k
can you share some more detail on how you envision the workflow?
if it makes it easier, we can file a GitHub issue and continue the discussion there
l
Hi @Kishore G, today we're loading data for one of our customers following this process
basically, we're using PySpark to harmonize the data and deliver it in Parquet format, so I can use file-system ingestion to put the data into Pinot
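roughly something like this (a minimal sketch; the paths, columns, and job-spec file name are placeholders, not our real pipeline):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("harmonize-for-pinot").getOrCreate()

# Harmonize the raw data (illustrative transformations only).
raw = spark.read.json("s3://our-lake/raw/events/")
harmonized = (raw
    .select("event_id", "event_time", "payload")
    .dropDuplicates(["event_id"]))

# Deliver as Parquet so Pinot's file-system batch ingestion can pick it up.
harmonized.write.mode("overwrite").parquet("s3://our-lake/pinot-staging/events/")

# Then, outside of Spark, run Pinot's standalone batch ingestion job
# against that directory:
#   bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile ingestion-job-spec.yaml
```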
I'm wondering if we could unlock the following use case. I'm 100% sure it would open the door for data engineers to test out Spark + Pinot using SQL + PySpark!
the proposed workflow would be this:
• enable the capability to connect from PySpark straight to the Pinot cluster (see the sketch after this list)
• that would eliminate a lot of the extra work we're doing right now: compressing Parquet, managing the lifecycle of files arriving on the lake, handling schema evolution, and working around the inability to load a DataFrame into Pinot using offline tables
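from PySpark, the write side could look something like this (purely hypothetical — no "pinot" sink format or these options exist today; it's just the API shape I'd love to have):
```python
# HYPOTHETICAL: write a DataFrame straight into a Pinot offline table.
# The "pinot" write format and all of these options are the proposal,
# not an existing API.
(harmonized.write
    .format("pinot")
    .option("table", "events")
    .option("tableType", "offline")
    .option("controllerUrl", "http://pinot-controller:9000")
    .mode("append")
    .save())
```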
not only would that reduce the hops, it would also speed up overall pipeline execution. If you want, I can file the GitHub issue if you give me an example.
k
will read this and get back to you
l
alright, see if it makes sense and let me know