# troubleshooting

Diogo Baeder

04/16/2022, 10:12 PM
Hey folks, I noticed something strange while testing batch ingestion: apart from the normal segments I expect to be in the tables I'm using, I end up with extra segments whose names seem to be built from the batch run timestamp. And if I run the ingestion again on the same input files, instead of the segments being kept as they were (since there's no new file to ingest), I end up with even more of those strangely named segments. Is this expected? The row counts don't change, and neither does the disk size taken by each table; it's really just the number of segments that somehow increases.
My table uses `DATE` columns, and each input file has data for one day, so I end up with the expected segments containing those dates as part of their names. But the unexpected segments use millisecond timestamps in their names instead. Not sure why.
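For concreteness, a hypothetical illustration of the two naming patterns being described here (actual names depend on the configured segment name generator; these examples are made up):

```
weights_2022-04-01_2022-04-01_0         <- expected: date-based name
weights_1650146523000_1650146523000_0   <- unexpected: epoch-millis-based name
```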

Mayank

04/16/2022, 10:13 PM
Do you have any minion jobs set up?

Diogo Baeder

04/16/2022, 10:15 PM
Not currently, no. This happens every time I run the ingestion job - if I just leave the tables be, the segments don't change.

Mayank

04/16/2022, 10:17 PM
What ingestion mechanism are you using, and what does the config look like?

Diogo Baeder

04/16/2022, 10:19 PM
I'm using the admin command to ingest the files, and here's an example job I have:
```yaml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/sensitive-data/outputs/weights'
includeFileNamePattern: 'glob:**/*.json'
outputDirURI: '/tmp/data/segments/weights'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'json'
  className: 'org.apache.pinot.plugin.inputformat.json.JSONRecordReader'
tableSpec:
  tableName: 'weights'
  schemaURI: 'http://localhost:9000/tables/weights/schema'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
```
Some of the input files have no relevant data: they contain valid JSON, but just an empty list. I wonder if that's what's causing the issue.
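For context, a standalone job spec like the one above is typically launched with the Pinot admin script. A minimal sketch, assuming the spec is saved at a hypothetical path:

```bash
# Launch the standalone batch ingestion job described by the spec file.
bin/pinot-admin.sh LaunchDataIngestionJob \
  -jobSpecFile /path/to/weights-job-spec.yaml
```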

Mayank

04/16/2022, 10:23 PM
Try a count(*) grouped by $segmentName to see how many records are in each segment, specifically the ones you don't expect.
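A minimal sketch of that query against the `weights` table from the job spec above ($segmentName is one of Pinot's built-in virtual columns; the LIMIT is just a safety net):

```sql
-- Rows per segment. A segment with zero rows simply won't appear in the
-- result, which is itself a strong hint that it's empty.
SELECT $segmentName, COUNT(*) AS numRows
FROM weights
GROUP BY $segmentName
ORDER BY numRows DESC
LIMIT 1000
```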

Diogo Baeder

04/16/2022, 10:23 PM
Should I use a literal `$segmentName`? I've never done a query like that...

Mayank

04/16/2022, 10:24 PM
Yes

Diogo Baeder

04/16/2022, 10:24 PM
Ah, cool! Nice to know that exists 🙂

Mayank

04/16/2022, 10:26 PM
You can also filter on the segment name using that column, to check a specific segment.
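For example, to inspect one suspicious segment directly (the segment name below is hypothetical; substitute one of the oddly named ones):

```sql
-- Count the rows in one specific segment via the virtual column.
SELECT COUNT(*)
FROM weights
WHERE $segmentName = 'weights_1650146523000_1650146523000_0'
```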

Diogo Baeder

04/16/2022, 10:30 PM
Very interesting, and the result is what I expected: only the segments with expected names have rows in them, and after I re-run the ingestion the SELECT result stays the same, with rows only in the expected segments. So I guess my hypothesis is correct: it seems those segments are being created without data, referring to input files that have "no data" (just an empty list).
Yeah, that's it: I checked the metadata for the segments, and the unexpected ones are referring to input files with empty lists.
Is there a way to make the ingestion skip such files instead of creating empty segments?

Mayank

04/16/2022, 10:33 PM
Why do you have empty files in the first place? If I'm not wrong, pushing empty segments was used as a workaround to advance the time boundary.

Diogo Baeder

04/16/2022, 10:34 PM
Ah, got it... alright, that's fine too; I can just not create those empty files in the first place. I mainly wanted to know whether there was a way to avoid ending up with them as segments, but of course it makes more sense not to produce such files at all.
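If avoiding the empty files at the source ever isn't an option, one hedged workaround is to filter them out of the input directory before launching the job. A sketch, assuming jq is available and using illustrative paths:

```bash
# Move input files whose top-level JSON value is an empty list out of the
# ingestion directory, so the job never creates segments for them.
mkdir -p /sensitive-data/outputs/weights-empty
for f in /sensitive-data/outputs/weights/*.json; do
  if jq -e 'length == 0' "$f" > /dev/null; then
    mv "$f" /sensitive-data/outputs/weights-empty/
  fi
done
```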
Hey, just some final feedback on this issue: now that I'm avoiding generating empty files, the problem is gone; the segments stay the same no matter how many times I ingest the data, provided the input files are the same. So, all good now 🙂
👍 1