# getting-started
Hello! I’ve got a question related to a simple use case. Currently we have a Hadoop cluster for netflow ingestion, ~320 TB of data. Ingestion is from Kafka via a Spark app directly into Hive (an external table of simple Parquet files). Searching the stored data is done via Spark. The table is partitioned by hour, but we’re still missing indexes. I’d like to replace the current flow with Apache Pinot, but I’m not sure about the segment store. We need to keep HDFS as the data backend, and from the documentation it seems like Pinot needs to store data locally. We’re targeting a hybrid table, e.g. keep 1 hour of data from the real-time Kafka topics, with older data pulled from HDFS. My questions are:
a) The real-time part of the data needs local disks, and every Pinot server holds a part of the data from Kafka (as a consumer in the group), right?
b) Hour+1 data is stored “optimized” and indexed locally, then pushed to HDFS?
c) When I query the data, is current data pulled from local segments while older data is pulled lazily from HDFS/S3?
d) Is it possible to host a 200 TB table with ~12 columns (half numeric, half strings) on ~6 Pinot servers and still get some benefit from indexes, i.e. be more efficient than Spark with partition pruning?
Hi @User, welcome to the community:
a) As of now, Pinot serving nodes store a local copy on the attached disk (both real-time and offline). The persistent storage can be HDFS/S3 or a similar deep store. For real-time, each Pinot server is assigned a subset of the topic’s partitions to consume and store.
b) Real-time nodes periodically flush the in-memory index to the persistent store (HDFS in your case). But note that each node also needs to keep a copy on local disk for serving.
c) No, all data is currently local to the serving nodes.
d) 200 TB in what format? As I mentioned, serving nodes need local storage to serve the data.
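To make (a)/(b) concrete, here is a minimal sketch of pointing the deep store at HDFS via the Pinot HDFS filesystem plugin. All paths, hostnames, and ports below are placeholder assumptions; check them against your cluster and the Pinot HDFS plugin docs:

```properties
# controller.conf (sketch) -- segments are backed up to HDFS,
# but servers still serve from local disk
controller.data.dir=hdfs://namenode.example.com:8020/pinot/controller/data
controller.local.temp.dir=/tmp/pinot/controller

# register the HDFS PinotFS implementation for the hdfs:// scheme
pinot.controller.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
pinot.controller.storage.factory.hdfs.hadoop.conf.path=/etc/hadoop/conf

# allow segments to be fetched over the hdfs protocol
pinot.controller.segment.fetcher.protocols=file,http,hdfs
pinot.controller.segment.fetcher.hdfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcherFactory
```

The servers need an analogous `pinot.server.storage.factory.*` / `pinot.server.segment.fetcher.*` setup so they can download segments from the deep store when loading them locally.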
If your concern is the cost of local storage, you can also explore tiered storage: https://docs.pinot.apache.org/operators/operating-pinot/tiered-storage, which lets you offload cold data to nodes with cheaper (HDD) local disks.
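A hedged sketch of what the tiered-storage table config can look like, per the docs linked above. The tier name, age threshold, and server tag are made-up examples; note the cold tier still lives on Pinot servers (tagged onto cheaper HDD nodes), not on HDFS:

```json
{
  "tableName": "netflow_OFFLINE",
  "tierConfigs": [
    {
      "name": "coldTier",
      "segmentSelectorType": "time",
      "segmentAge": "24h",
      "storageType": "pinot_server",
      "serverTag": "Cold_OFFLINE"
    }
  ]
}
```

Segments older than `segmentAge` are relocated to servers carrying the `Cold_OFFLINE` tag, so hot data can sit on SSD nodes and cold data on HDD nodes.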
Thank you for the clarification. The 200 TB is compressed Parquet. So the deep store is strictly for backup/replication purposes and cannot be used for any kind of lazy loading of data for querying, right? Even if I’m OK with the latency. Also, if I have a server with 12 disks in JBOD and 12 dirs/mounts, can the Pinot server split segments evenly across all the drives?
I see, so the Pinot segments might end up around the same size. As of now, the deep store is strictly for backup purposes. And currently the dataDir in the Pinot server is a single directory, so segments cannot be spread across multiple mounts.
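For illustration, the single-directory layout referred to above looks roughly like this in the server config (paths are example assumptions):

```properties
# server.conf (sketch) -- dataDir takes one directory, not a list of mounts
pinot.server.instance.dataDir=/data/pinot/server/index
pinot.server.instance.segmentTarDir=/data/pinot/server/segmentTar
```

Since only one directory is accepted, spreading a JBOD across all 12 disks would have to happen below Pinot, e.g. by presenting the disks as a single mount (RAID0/LVM), which is a generic OS-level workaround rather than a Pinot feature.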
@Mayank Are there any updates on this, i.e. using HDFS for tiered storage? I can see docs for S3 and GCP, but none on configuring HDFS as a queryable tier.
HDFS is only supported as a deep store, not a queryable path tiered storage yet.
ok thanks!