# troubleshooting
d
Hey folks, I noticed something strange while testing batch ingestion. Apart from the normal segments I expect in the tables I'm using, I end up with extra segments whose names seem to include the batch run timestamp. And if I re-run the ingestion on the same input files, instead of the segments staying as they were (since there are no new files to ingest), I get even more of those oddly named segments. Is this expected? The row counts don't change, and neither does each table's disk size; it's really just the number of segments that keeps growing.
My table uses `DATE` columns, and each input file has data for one day, so I end up with expected segments containing those dates in their names. But the unexpected segments use millisecond timestamps in their names instead. Not sure why.
m
Do you have any minion jobs set up?
d
Not currently, no. This happens every time I run the ingestion job - if I just leave the tables be, the segments don't change.
m
What ingestion mechanism are you using and what’s the config look like?
d
I'm using the admin command to ingest the files, and here's an example job I have:
```yaml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/sensitive-data/outputs/weights'
includeFileNamePattern: 'glob:**/*.json'
outputDirURI: '/tmp/data/segments/weights'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'json'
  className: 'org.apache.pinot.plugin.inputformat.json.JSONRecordReader'
tableSpec:
  tableName: 'weights'
  schemaURI: 'http://localhost:9000/tables/weights/schema'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
```
Some of the input files have no relevant data: they contain valid JSON, but just an empty list. I wonder if that's what's causing the issue.
m
Try `count(*)` grouped by `$segmentName` to see how many records are in each segment, specifically the ones you don't expect.
d
Should I use a literal `$segmentName`? I've never done a query like that...
m
Yes
d
Ah, cool! Nice to know that exists 🙂
m
You can also filter on the segment name using that to check a specific segment.
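The queries suggested above might look something like this (a sketch; `weights` is the table name from the job spec, `$segmentName` is Pinot's built-in virtual column, and the segment name in the filter is hypothetical):

```sql
-- Count rows per segment to spot the empty ones
SELECT $segmentName, COUNT(*)
FROM weights
GROUP BY $segmentName;

-- Inspect one specific segment (segment name is made up for illustration)
SELECT COUNT(*)
FROM weights
WHERE $segmentName = 'weights_2024-01-01_2024-01-01_0';
```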
d
Very interesting, and the result is what I expected: only the segments with expected names have rows in them. After I re-run the ingestion, the result is the same: rows only in the expected segments. So my hypothesis seems correct; those segments are being created without data, referring to input files that have "no data" (just an empty list).
Yeah, that's it: I checked the metadata for the segments, and the unexpected ones are referring to input files with empty lists.
Is there a way to make the ingestion skip such files instead of creating empty segments?
m
Why do you have empty files? I think empty segment pushing was used as a workaround to advance the time boundary, if I'm not wrong.
d
Ah, got it... alright, that's fine too; I can just not create those empty files in the first place. I just wanted to know if there was a way to avoid ending up with those segments, but of course it makes more sense not to have such files at all.
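As an aside, one way to avoid feeding such files to the job is to pre-filter the input directory before ingestion. A minimal sketch (the helper names are my own, not part of Pinot; the glob pattern mirrors the `includeFileNamePattern` from the job spec):

```python
import json
from pathlib import Path


def has_rows(path):
    """Return True if the JSON file holds a non-empty list of records."""
    with open(path) as f:
        return bool(json.load(f))  # an empty list ([]) is falsy


def ingestable_files(input_dir, pattern="**/*.json"):
    """List only the JSON files under input_dir that actually contain records."""
    return [p for p in Path(input_dir).glob(pattern) if has_rows(p)]
```

Files that fail the `has_rows` check could be moved aside (or simply never written), so the ingestion job only ever sees inputs with real data.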
Hey, just some final feedback on this issue: now that I'm no longer generating empty files, the issue is gone. The segments stay the same no matter how many times I ingest the data, provided the input files are the same. So, all good now 🙂
👍 1