# troubleshooting
d
Hey folks, I noticed something strange while testing batch ingestion. Apart from the normal segments I expect in the tables I'm using, I end up with extra segments whose names seem to include the batch run timestamp. And if I re-run the ingestion on the same input files, instead of the segments staying as they were (since there are no new files to ingest), I get even more of those oddly named segments. Is this expected? The row counts don't change, and neither does each table's disk size; it's really just the number of segments that keeps growing.
My table uses `DATE` columns, and each input file has data for one day, so I end up with expected segments containing those dates in their names. But the unexpected segments use millisecond timestamps in their names instead. Not sure why.
m
Do you have any minion jobs set up?
d
Not currently, no. This happens every time I run the ingestion job - if I just leave the tables be, the segments don't change.
m
What ingestion mechanism are you using and what’s the config look like?
d
I'm using the admin command to ingest the files, and here's an example job I have:
```yaml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/sensitive-data/outputs/weights'
includeFileNamePattern: 'glob:**/*.json'
outputDirURI: '/tmp/data/segments/weights'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'json'
  className: 'org.apache.pinot.plugin.inputformat.json.JSONRecordReader'
tableSpec:
  tableName: 'weights'
  schemaURI: 'http://localhost:9000/tables/weights/schema'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
```
Some of the input files have no relevant data: they contain valid JSON, but just an empty list. I wonder if that's what's causing the issue.
m
Try `count(*)` grouped by `$segmentName` to see how many records are in each segment, specifically the ones you don't expect.
d
Should I use a literal `$segmentName`? I've never done a query like that...
m
Yes
d
Ah, cool! Nice to know that exists 🙂
m
You can also filter on the segment name using that to check a specific segment.
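The queries suggested above might look something like this (a sketch; `weights` is the table name from the job spec, `$segmentName` is Pinot's built-in virtual column, and the segment name in the filter is hypothetical):

```sql
-- Count rows per segment to spot the empty ones
SELECT $segmentName, COUNT(*)
FROM weights
GROUP BY $segmentName;

-- Inspect one specific segment (segment name is made up for illustration)
SELECT COUNT(*)
FROM weights
WHERE $segmentName = 'weights_2024-01-01_2024-01-01_0';
```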
d
Very interesting, and the result is what I expected: only the segments with expected names have rows in them. After I re-run the ingestion, the result is the same: rows only in the expected segments. So my hypothesis seems correct; those segments are being created without data, referring to input files that have "no data" (just an empty list).
Yeah, that's it: I checked the metadata for the segments, and the unexpected ones are referring to input files with empty lists.
Is there a way to make the ingestion skip such files instead of creating empty segments?
m
Why do you have empty files? I think empty segment pushing was used as a workaround to advance the time boundary, if I'm not wrong.
d
Ah, got it... alright, that's fine too; I can just not create those empty files in the first place. I just wanted to know if there was a way to avoid ending up with those segments, but of course it makes more sense not to have such files at all.
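As an aside, one way to avoid feeding such files to the job is to pre-filter the input directory before ingestion. A minimal sketch (the helper names are my own, not part of Pinot; the glob pattern mirrors the `includeFileNamePattern` from the job spec):

```python
import json
from pathlib import Path


def has_rows(path):
    """Return True if the JSON file holds a non-empty list of records."""
    with open(path) as f:
        return bool(json.load(f))  # an empty list ([]) is falsy


def ingestable_files(input_dir, pattern="**/*.json"):
    """List only the JSON files under input_dir that actually contain records."""
    return [p for p in Path(input_dir).glob(pattern) if has_rows(p)]
```

Files that fail the `has_rows` check could be moved aside (or simply never written), so the ingestion job only ever sees inputs with real data.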
Hey, just some final feedback on this issue: now that I'm no longer generating empty files, the issue is gone. The segments stay the same no matter how many times I ingest the data, provided the input files are the same. So, all good now 🙂
👍 1