Apache Pinot #general: I got one more question regarding lookup tables

# general

Sukesh Boggavarapu

03/29/2022, 11:58 AM

I got one more question regarding lookup tables. I have created an offline dimension-only table with 3 records.

Mark Needham

03/29/2022, 12:50 PM

if you navigate to that table, can you see what segments are listed?

Sukesh Boggavarapu

03/29/2022, 2:52 PM

There are 4 segments listed, each with a configuration like this:

Sukesh Boggavarapu

03/29/2022, 2:52 PM

{
  "custom.map": "{\"input.data.file.uri\":\"file:/data/customers.csv\"}",
  "segment.crc": "3320463979",
  "segment.creation.time": "1648122612684",
  "segment.index.version": "v3",
  "segment.name": "merchants_OFFLINE_0",
  "segment.offline.download.url": "http://172.28.0.4:9000/segments/merchants/merchants_OFFLINE_0",
  "segment.offline.push.time": "1648122613365",
  "segment.table.name": "merchants",
  "segment.total.docs": "100",
  "segment.type": "OFFLINE"
}
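
Note: the custom.map field in the metadata above records the input file a segment was built from, so it can help trace where an unexpected segment came from. A minimal Python sketch, assuming the metadata for each segment has been copied out as a JSON string (the function name source_file_of is hypothetical, not part of Pinot):

```python
import json


def source_file_of(segment_metadata_json):
    """Return the input file URI recorded in a segment's custom.map, or None."""
    metadata = json.loads(segment_metadata_json)
    # custom.map is itself a JSON document stored as a string, so decode it again.
    custom_map = json.loads(metadata.get("custom.map") or "{}")
    return custom_map.get("input.data.file.uri")


# Trimmed copy of the metadata posted above.
example = json.dumps({
    "custom.map": "{\"input.data.file.uri\":\"file:/data/customers.csv\"}",
    "segment.name": "merchants_OFFLINE_0",
    "segment.table.name": "merchants",
})

print(source_file_of(example))
```

For the segment shown above, the recorded input file is customers.csv even though the segment belongs to the merchants table, which would be consistent with the job picking up more files than intended.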

Mark Needham

03/29/2022, 2:53 PM

what are the others called?

Mark Needham

03/29/2022, 2:53 PM

I expect some of them are probably invalid

Mark Needham

03/29/2022, 2:53 PM

but I dunno how they got there

Sukesh Boggavarapu

03/29/2022, 2:53 PM

FYI, I am running this locally in Docker.

Sukesh Boggavarapu

03/29/2022, 2:53 PM

[attachment: image.png]

Sukesh Boggavarapu

03/29/2022, 2:54 PM

I tried with a couple of offline tables...every table has some invalid data like this

Mark Needham

03/29/2022, 2:54 PM

in your ingestion file - it might be that the input directory has multiple CSV files

Mark Needham

03/29/2022, 2:54 PM

and it's created one segment per file

Sukesh Boggavarapu

03/29/2022, 2:55 PM

Oh... It is possible that I have multiple CSV files in the directory, but shouldn't it read only the one with the relevant file name?

Mark Needham

03/29/2022, 2:55 PM

it should do

Sukesh Boggavarapu

03/29/2022, 2:55 PM

I had the configuration like this:

Sukesh Boggavarapu

03/29/2022, 2:55 PM

executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/data'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/opt/pinot/data/members'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'members'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'

Mark Needham

03/29/2022, 2:55 PM

*.csv will match every CSV file though

Mark Needham

03/29/2022, 2:56 PM

includeFileNamePattern: 'glob:**/*.csv'

Sukesh Boggavarapu

03/29/2022, 2:56 PM

This is for an offline table called "members"

Mark Needham

03/29/2022, 2:56 PM

so it creates one segment per CSV file under /data/

Mark Needham

03/29/2022, 2:56 PM

I guess there are 4 files?

Mark Needham

03/29/2022, 2:56 PM

for other tables
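
Note: the effect described above can be reproduced outside Pinot. The sketch below is only an approximation: the glob: prefix in the job spec follows Java's path-matcher syntax, while this uses Python's pathlib, and the helper names and the merchants.csv file are hypothetical. It shows that a recursive *.csv pattern returns every CSV file in a shared directory, whereas an exact file name returns just one.

```python
import tempfile
from pathlib import Path


def matching_files(input_dir, filename_pattern="*.csv"):
    """Return every file under input_dir (recursively) whose name matches the pattern."""
    return sorted(p for p in Path(input_dir).rglob(filename_pattern) if p.is_file())


def demo():
    # Build a throwaway directory that stands in for a shared /data folder.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in ("members.csv", "merchants.csv", "customers.csv"):
            (root / name).write_text("id\n1\n")
        all_csv = [p.name for p in matching_files(root)]
        only_members = [p.name for p in matching_files(root, "members.csv")]
        return all_csv, only_members


all_csv, only_members = demo()
print(all_csv)       # every CSV file would become an input for the job
print(only_members)  # only the intended file
```

Under this reading, a job for one table that points at a shared directory with a wildcard pattern would build one segment from each file it finds, including files meant for other tables.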

Sukesh Boggavarapu

03/29/2022, 2:56 PM

I see... I did try specifying a specific file name in that property, like:

includeFileNamePattern: 'glob:**/members.csv'

Sukesh Boggavarapu

03/29/2022, 2:57 PM

I believe it failed to read it...

Sukesh Boggavarapu

03/29/2022, 2:57 PM

But I can try once more to confirm that

Mark Needham

03/29/2022, 2:57 PM

it might be a bug somewhere if it's not reading the pattern properly

Sukesh Boggavarapu

03/29/2022, 2:58 PM

I will also try reading from a directory with only one CSV file and confirm whether that works without any invalid data.

Sukesh Boggavarapu

03/29/2022, 2:58 PM

Yeah... I will post the stack trace in case of an error

Sukesh Boggavarapu

04/06/2022, 2:40 PM

Sorry for the delay in response. So, I created a separate folder for schema, table config and data files and ran the ingestion jobs.

Sukesh Boggavarapu

04/06/2022, 2:40 PM

docker exec -it pinot-controller /opt/pinot/bin/pinot-admin.sh AddTable -tableConfigFile /config/members/members_table.json -schemaFile /config/members/members_schema.json -exec

Sukesh Boggavarapu

04/06/2022, 2:41 PM

docker exec -it pinot-controller /opt/pinot/bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /config/members/members_job-spec.yml

Sukesh Boggavarapu

04/06/2022, 2:42 PM

The data file has 8 records, but now the table has records with duplicate data along with invalid data

Sukesh Boggavarapu

04/06/2022, 2:42 PM

[attachment: image.png]

Sukesh Boggavarapu

04/06/2022, 2:43 PM

Data for the query:

select * from member where merchant_id=123

Sukesh Boggavarapu

04/06/2022, 2:44 PM

You can see the data is saved twice along with that null domain_id row.

Sukesh Boggavarapu

04/06/2022, 3:05 PM

My job spec has the input dir and file name pattern as:

inputDirURI: '/data/members'
includeFileNamePattern: 'glob:**/*.csv'

Sukesh Boggavarapu

04/06/2022, 3:06 PM

and there is only one file inside that /data/members, which is members.csv, with 8 records
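
Note: one way to rule out the data file itself as the source of the duplicate and null rows is to check it for repeated or blank lines before ingesting. A minimal Python sketch under stated assumptions: the helper name summarize_csv and the sample rows are made up for illustration (only the merchant_id and domain_id column names come from this thread), and it says nothing about what Pinot does with such rows.

```python
import csv
import io
from collections import Counter


def summarize_csv(text):
    """Count data records, blank rows, and exact duplicate rows in CSV text."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = [tuple(r) for r in reader]
    # A row is blank when it has no cells or every cell is empty/whitespace.
    data = [r for r in rows if any(cell.strip() for cell in r)]
    blank_rows = len(rows) - len(data)
    duplicates = {r: n for r, n in Counter(data).items() if n > 1}
    return {
        "header": header,
        "records": len(data),
        "blank_rows": blank_rows,
        "duplicates": duplicates,
    }


# Made-up sample: one repeated row and one row with only empty cells.
sample = "merchant_id,domain_id\n123,1\n123,2\n123,1\n,\n"
print(summarize_csv(sample))
```

To check a real file, the same function can be called with Path("members.csv").read_text(). If the file reports 8 records, no blank rows, and no duplicates, the extra rows were more likely introduced during or after ingestion.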
