#general

Sukesh Boggavarapu

03/29/2022, 11:58 AM
I've got one more question regarding lookup tables. I have created an offline dimension-only table with 3 records

Mark Needham

03/29/2022, 12:50 PM
if you navigate to that table, can you see what segments are listed?

Sukesh Boggavarapu

03/29/2022, 2:52 PM
There are 4 segments listed each with configuration like
{
  "custom.map": "{\"input.data.file.uri\":\"file:/data/customers.csv\"}",
  "segment.crc": "3320463979",
  "segment.creation.time": "1648122612684",
  "segment.index.version": "v3",
  "segment.name": "merchants_OFFLINE_0",
  "segment.offline.download.url": "http://172.28.0.4:9000/segments/merchants/merchants_OFFLINE_0",
  "segment.offline.push.time": "1648122613365",
  "segment.table.name": "merchants",
  "segment.total.docs": "100",
  "segment.type": "OFFLINE"
}
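
A quick way to see which input file each segment was built from is to decode the `custom.map` field, which is a JSON string nested inside the segment metadata. The following is a minimal sketch, assuming the metadata has been copied out of the UI as shown above; the helper name `input_file_of` is made up for illustration and is not part of Pinot.

```python
import json


def input_file_of(segment_metadata: dict) -> str:
    """Return the source file URI recorded in a segment's custom.map field."""
    # custom.map is itself a JSON document stored as a string,
    # so it has to be decoded separately from the outer metadata.
    custom_map = json.loads(segment_metadata.get("custom.map", "{}"))
    return custom_map.get("input.data.file.uri", "")


# Trimmed copy of the metadata pasted above.
segment = {
    "custom.map": "{\"input.data.file.uri\":\"file:/data/customers.csv\"}",
    "segment.name": "merchants_OFFLINE_0",
    "segment.table.name": "merchants",
}

print(segment["segment.table.name"], "<-", input_file_of(segment))
```

For the segment above this shows that a segment in the `merchants` table was generated from `customers.csv`, which is the kind of mismatch worth checking on the other three segments.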

Mark Needham

03/29/2022, 2:53 PM
what are the others called?
I expect some of them are probably invalid
but I dunno how they got there

Sukesh Boggavarapu

03/29/2022, 2:53 PM
FYI, I am running this locally in Docker.
I tried with a couple of offline tables...every table has some invalid data like this

Mark Needham

03/29/2022, 2:54 PM
in your ingestion file - it might be that the input directory has multiple CSV files
and it's created one segment per file

Sukesh Boggavarapu

03/29/2022, 2:55 PM
Oh, it is possible that I have multiple CSV files in the directory... but shouldn't it read only the one with the relevant file name?

Mark Needham

03/29/2022, 2:55 PM
it should do

Sukesh Boggavarapu

03/29/2022, 2:55 PM
I had the configuration like this:
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/data'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/opt/pinot/data/members'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'members'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'

Mark Needham

03/29/2022, 2:55 PM
*.csv will match every CSV file though
includeFileNamePattern: 'glob:**/*.csv'

Sukesh Boggavarapu

03/29/2022, 2:56 PM
This is for an offline table called "members"

Mark Needham

03/29/2022, 2:56 PM
so it creates one segment per CSV file under /data/
I guess there are 4 files?
for other tables
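
To illustrate why the wildcard pattern picks up files meant for other tables, here is a rough sketch that applies the two patterns from this thread to a list of paths. It uses Python's `fnmatch`, which is only an approximation of the glob matching the ingestion job performs (for example, `*` in `fnmatch` also crosses `/`). Apart from `customers.csv` and `members.csv`, the file names are hypothetical, and `matching_files` is an illustrative helper, not a Pinot function.

```python
from fnmatch import fnmatchcase


def matching_files(paths, include_pattern):
    """Return the paths selected by a 'glob:...' pattern (approximation only)."""
    # Drop the 'glob:' syntax prefix used in the job spec, if present.
    prefix = "glob:"
    if include_pattern.startswith(prefix):
        glob = include_pattern[len(prefix):]
    else:
        glob = include_pattern
    return [p for p in paths if fnmatchcase(p, glob)]


# Hypothetical contents of /data; only customers.csv and members.csv
# are named in the conversation.
files = [
    "/data/customers.csv",
    "/data/members.csv",
    "/data/merchants.csv",
    "/data/orders.csv",
]

print(matching_files(files, "glob:**/*.csv"))        # every CSV under /data
print(matching_files(files, "glob:**/members.csv"))  # only the members file
```

With the wildcard pattern every file is selected, so a job for one table would build one segment per file. This only shows what the patterns select in principle; it does not reproduce how Pinot reads the files.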

Sukesh Boggavarapu

03/29/2022, 2:56 PM
I see... I did try specifying a specific file name in that property, like:
includeFileNamePattern: 'glob:**/members.csv'
I believe it failed to read it...

Mark Needham

03/29/2022, 2:57 PM
oh

Sukesh Boggavarapu

03/29/2022, 2:57 PM
But I can try once more to confirm that

Mark Needham

03/29/2022, 2:57 PM
it might be a bug somewhere if it's not reading the pattern properly

Sukesh Boggavarapu

03/29/2022, 2:58 PM
I will also try reading from a directory with only one CSV file and confirm if that works without any invalid data.
Yeah... I will post the stack trace in case of an error
Sorry for the delay in response. So, I created a separate folder for schema, table config and data files and ran the ingestion jobs.
docker exec -it pinot-controller /opt/pinot/bin/pinot-admin.sh AddTable -tableConfigFile /config/members/members_table.json -schemaFile /config/members/members_schema.json -exec
docker exec -it pinot-controller /opt/pinot/bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /config/members/members_job-spec.yml
The data file has 8 records, but now the table has 24 records with duplicate data along with invalid data
Data for the query select * from member where merchant_id=123:
You can see the data is saved twice along with that null domain_id row.
My job spec has input dir and file name pattern as
inputDirURI: '/data/members'
includeFileNamePattern: 'glob:**/*.csv'
and there is only one file inside /data/members, which is members.csv, with 8 records
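
One way to follow up on the duplicate rows is to check how many segments currently back the table, since segments pushed by earlier job runs may still be there. The sketch below assumes the controller exposes a segment listing at `/segments/<tableName>` and that the response is a JSON list of objects mapping a table type to segment names; both the response shape and the sample segment names are assumptions, so check them against your controller before relying on this.

```python
import json
from urllib.request import urlopen


def segment_names(response_body: str) -> list:
    """Flatten a segment-list response into a plain list of segment names."""
    names = []
    # Assumed shape: [{"OFFLINE": ["seg_a", "seg_b"]}, ...]
    for entry in json.loads(response_body):
        for segments in entry.values():
            names.extend(segments)
    return names


def fetch_segment_names(controller_uri: str, table: str) -> list:
    """Ask the controller for the table's segments (needs a running cluster)."""
    with urlopen(f"{controller_uri}/segments/{table}") as resp:
        return segment_names(resp.read().decode("utf-8"))


# Hypothetical response body, used here so the sketch runs without a cluster.
sample = '[{"OFFLINE": ["members_OFFLINE_0", "members_OFFLINE_1"]}]'
print(segment_names(sample))
```

Against the local setup in this thread the call would be `fetch_segment_names("http://localhost:9000", "members")`. More than one segment for a single 8-record input file would suggest that older segments are contributing the extra rows.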