# troubleshooting
m
A few related questions about Pinot ingesting data from a Google Cloud bucket.

Background: We were able to read in ~3 years of data from our production system that Pinot stored in 9,800 .avro files (each is about 400 MB, so about 4 TB in total). That adds up to 2.5 billion transactions, and we were interested in measuring how the system would perform with 100 billion transactions present. Over this 3-day weekend, I created 39 copies of the .avro files, each with a different multiple of the 3-year time shift, so that all 39 copies had unique values for the timestamp (the table in question has just one dimension column: that timestamp). My loop created new gzipped Avro files with a shifted timestamp for each record, and uploaded the 39 * 9800 .avro.gz files into the Google Cloud bucket that the minion processes ingest from.

We're only seeing a very small number of error messages, and yet very little of the generated data is present: Pinot is only seeing ~4.4 billion transactions (not even one full copy, much less 39 full copies), and it's not clear how to track down how far along the ingestion process is (inside of Pinot). Is there a way to see a list of still-pending .avro files, i.e. files Pinot has seen were uploaded but whose contents it hasn't started to import yet? (It may just be that I don't yet know where to look.)
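As far as I know there is no direct "still-pending files" view; the closest proxy is the state of the minion tasks themselves. Below is a minimal sketch, assuming the ingestion runs as SegmentGenerationAndPushTask minion tasks, that the controller address is a placeholder, and that the task endpoints match recent Pinot controller APIs (the controller's Swagger UI is the authoritative reference if any path 404s):

```python
from collections import Counter

import requests

# Assumptions: controller address and minion task type; adjust to your cluster.
CONTROLLER = "http://pinot-controller:9000"
TASK_TYPE = "SegmentGenerationAndPushTask"

# Map of task name -> state (IN_PROGRESS, COMPLETED, FAILED, ...).
# Endpoint path is based on recent Pinot controller APIs; verify in Swagger.
resp = requests.get(f"{CONTROLLER}/tasks/{TASK_TYPE}/taskstates")
resp.raise_for_status()
states = resp.json()

# Quick summary: how many tasks are still running vs. done vs. failed.
print(Counter(states.values()))

# Inspect the non-completed tasks; in my understanding the generated task
# configs include the input file URIs each task was assigned, which is the
# closest thing to a "pending files" list.
for name, state in states.items():
    if state != "COMPLETED":
        cfg = requests.get(f"{CONTROLLER}/tasks/task/{name}/config").json()
        print(name, state, cfg)
```

Comparing the number of COMPLETED tasks against the number of uploaded files also gives a rough progress estimate.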
h
Just want to double-check: the ingestion is finished, right? Nothing in progress?
✔️ 1
One thing that concerns me is the number of files: we have ~400k files in total. The batch ingestion task lists all the files when it tries to generate tasks (before running those tasks). I am afraid that we cannot list them correctly (either a partial result or a timeout).
We might need to improve the file-listing logic to avoid this ^ problem if it's the root cause
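One way to sanity-check the listing outside of Pinot is to count the objects directly. A minimal sketch using the google-cloud-storage Python client, assuming the bucket and prefix implied by the inputDirURI shown below:

```python
from google.cloud import storage  # pip install google-cloud-storage

# Bucket name and prefix inferred from the inputDirURI mentioned below;
# adjust to the real layout.
client = storage.Client()
blobs = client.list_blobs("pinot-ingestion", prefix="transaction/")

# Count only the generated gzipped Avro files; a complete listing should
# come back with roughly 39 * 9800 such objects.
count = sum(1 for b in blobs if b.name.endswith(".avro.gz"))
print(count)
```

If this count matches what was uploaded but Pinot's task generation still misses files, that would point at the listing inside the task generator rather than at the bucket itself.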
One thing we might try, to check whether that ^ is the problem, is to (1) split the files into folders (say 39 folders, so that each run only ingests one folder), and (2) specify a single folder per ingestion run instead of ingesting all the files at once, i.e., change

```
"inputDirURI": "gs://pinot-ingestion/transaction",
```

to finer-granularity folders (see the sketch below)
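For example, the per-folder variant might look like the fragment below; the folder name copy-01 is hypothetical, and the includeFileNamePattern / inputFormat keys are the commonly documented SegmentGenerationAndPushTask options, so verify them against the actual table config:

```json
"task": {
  "taskTypeConfigsMap": {
    "SegmentGenerationAndPushTask": {
      "inputDirURI": "gs://pinot-ingestion/transaction/copy-01/",
      "includeFileNamePattern": "glob:**/*.avro.gz",
      "inputFormat": "avro"
    }
  }
}
```

Ingesting one folder at a time, and moving inputDirURI to the next folder only after the previous batch of tasks completes, would also make it easier to spot where files start being dropped.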
m
This is a good idea; I will consult with my teammates to see if it will work for us.