# troubleshooting
s
I am getting an error while uploading S3 data
```
Failed to generate Pinot segment for file s3:xxx/xxx/1234.csv
Illegal character in scheme name at index 2: table_OFFLINE_2021-02-01 09:39:00.000_2021-02-01 11:59:00.000_2.tar.gz
at java.net.URI.create(URI.java:852) ~[?:1.8.0_282]
	at java.net.URI.resolve(URI.java:1036) ~[?:1.8.0_282]
	at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.lambda$run$0(SegmentGenerationJobRunner.java:212) ~[pinot-batch-ingestion-standalone-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-162d0e61b6b1c3d51f915f7ad3e151a4fb24110a]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_282]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_282]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_282]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_282]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
```
x
what’s your inputDirURI and outputDirURI?
seems there is a space in the segment name
not sure if that’s allowed in a URI
s
```
inputDirURI: 's3://bucket/pinot/table/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: 's3://bucket/pinot/output_table/'
```
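(For reference, a minimal sketch of how these fields typically sit in a standalone ingestion job spec; the recordReaderSpec below is an assumption based on the CSV input and the CSVRecordReader that shows up in the logs later, not copied from the actual spec.)
```
# sketch: S3 input/output plus an assumed CSV record reader
jobType: SegmentCreationAndMetadataPush
inputDirURI: 's3://bucket/pinot/table/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: 's3://bucket/pinot/output_table/'
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
```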
x
that should be ok
I think the issue is the segment name
s
is that space because of the datetime field?
x
I think so
s
format is "yyyy-MM-dd HHmmss.SSS"
x
ah i c
s
is this format not allowed with pinot ingestion?
x
the time format itself is fine; it's just that we put the min/max time values in the segment name, and that's where the space that triggered the bug comes from
in your ingestion job, can you change the segmentNameGenerator?
s
what should I use for segmentNameGenerator?
x
just try this:
```
segmentNameGeneratorSpec:
  type: simple
  configs:
```
meanwhile we will fix this bug
s
ok sure
also, for controllerURI do I need to put the AWS Load Balancer URI?
x
if you run it from the k8s cluster, then you can use the service name
pinot-controller:9000
if it's outside the k8s cluster, like running the job from your laptop, then the AWS LB is required
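(A sketch of where that address goes in the job spec, assuming the standard tableSpec / pinotClusterSpecs sections; the table name is a placeholder, and as the later messages show, the http:// scheme needs to be spelled out.)
```
# in-cluster: the k8s service name works, but include the scheme
tableSpec:
  tableName: 'table'    # placeholder table name
  schemaURI: 'http://pinot-controller:9000/tables/table/schema'
  tableConfigURI: 'http://pinot-controller:9000/tables/table'
pinotClusterSpecs:
  - controllerURI: 'http://pinot-controller:9000'
```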
s
I am running it from the pinot-server pod; with the same value it gave an error
```
Failed to read from Schema URI - 'pinot-controller:9000
```
tried this
```
segmentNameGeneratorSpec:
  type: simple
  configs:
```
and got the same error:
```
Illegal character in scheme name at index 2: tabloe_OFFLINE_2021-02-01 082800.000_2021-02-01 105900.000_1.tar.gz
```
x
http://pinot-controller:9000 ?
s
this URI worked, but I get the same error: Illegal character in scheme name at index 2:
x
hmm
can you try this:
```
segmentNameGeneratorSpec:
  type: fixed
  configs:
    segment.name: myTable_segment_0
```
s
Also, I am adding a field as BOOLEAN but it is converted to STRING and the value is null, and INT fields are showing negative values
x
Pinot internally uses STRING to store BOOLEAN
I think that is the null value?
s
yes, it's showing null
and in the schema I see the format as STRING
Also, I have uploaded data to Pinot and I can see the tar files in S3, but in the query editor only the 10000 files which I uploaded initially are showing up
is this expected, that it would have some delay? because there was no error on the console from the script
x
hmm
so you mean you have 10000 csv files?
how many segments and total documents?
s
no, I mean for testing I tried with 3 CSVs and it uploaded 10000 records; now I ran on a bunch more CSVs
now it is not showing new records
still showing 10000 records and 1 segment
x
ic
ah
for this you need to run 3 jobs, each with a different segment.name
and point each one to a single file
Pinot uses the segment name to distinguish segments
and segments will overwrite each other if the segment name is the same
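(A sketch of what one of those per-file runs might look like, reusing the includeFileNamePattern field from the spec above; the file name and segment name here are placeholders.)
```
# job N of 3: match a single input file and give it a unique fixed segment name
includeFileNamePattern: 'glob:**/part_1.csv'
segmentNameGeneratorSpec:
  type: fixed
  configs:
    segment.name: table_segment_1
```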
s
I need to run this a number of times equal to the number of folders in S3, with different values for configs: segment.name: ae_consent_flags_segment_0?
is this correct?
x
yes
each file should become one segment
you can use inputPattern to pick just the one file inside a directory
s
the inputPattern option in the job spec? I didn't find it in the documentation
is there any option to set this automatically, to use 1 segment for each file?
x
not for fixed segmentNameGenerator
what’s your first segmentNameGenerator config?
s
```
segmentNameGeneratorSpec:
  type: fixed
  configs:
    segment.name: table_segment_0
```
I think "exclude.sequence.id" would work?
x
no
it’s for a different usage
s
ok
x
have you tried this
```
# segmentNameGeneratorSpec: defines how to init a SegmentNameGenerator.
segmentNameGeneratorSpec:
  type: normalizedDate
  configs:
    exclude.sequence.id: true
```
also, can I see your table config?
did you set
```
"segmentPushType": "APPEND",
```
s
yes
```
{
  "OFFLINE": {
    "tableName": "name_OFFLINE",
    "tableType": "OFFLINE",
    "segmentsConfig": {
      "timeColumnName": "uploaded",
      "segmentPushFrequency": "HOURLY",
      "segmentPushType": "APPEND",
      "schemaName": "name",
      "replication": "1",
      "replicasPerPartition": "1"
    },
    "tenants": {
      "broker": "DefaultTenant",
      "server": "DefaultTenant"
    },
    "tableIndexConfig": {
      "invertedIndexColumns": [],
      "rangeIndexColumns": [],
      "autoGeneratedInvertedIndex": false,
      "createInvertedIndexDuringSegmentGeneration": false,
      "sortedColumn": [],
      "bloomFilterColumns": [],
      "loadMode": "MMAP",
      "noDictionaryColumns": [],
      "onHeapDictionaryColumns": [],
      "varLengthDictionaryColumns": [],
      "enableDefaultStarTree": false,
      "enableDynamicStarTreeCreation": false,
      "aggregateMetrics": false,
      "nullHandlingEnabled": false
    },
    "metadata": {},
    "quota": {},
    "routing": {},
    "query": {},
    "ingestionConfig": {},
    "isDimTable": false
  }
}
```
x
ok
this should be good
s
ok, let me try what you shared
x
can you use this normalizedDate type?
s
yes trying with that
I cleaned my S3 output folder and ran again; tar files are created and there's no error in the script, but in the query editor the doc count is still the same, 10834
x
hmm
how many segments were created in your output S3 directory?
s
24 tar files in the output dir, 1 corresponding to 1 CSV
now it loaded when I clicked on reload segments; this reload segments didn't work last time
is there a way to auto-reload?
we have hierarchical S3 folders; I just ran for the 2nd-to-last level, where the folders contain the files
x
can you check table idealstates
from the log, do you see all the segments are pushed ?
s
logs just show
```
2021/02/11 10:33:29.244 WARN [SegmentIndexCreationDriverImpl] [pool-2-thread-1] Using class: org.apache.pinot.plugin.inputformat.csv.CSVRecordReader to read segment, ignoring configured file format: AVRO
2021/02/11 10:33:30.040 WARN [SegmentIndexCreationDriverImpl] [pool-2-thread-1] Using class: org.apache.pinot.plugin.inputformat.csv.CSVRecordReader to
```
x
can you try to change
jobType: SegmentCreationAndMetadataPush
to
jobType: SegmentMetadataPush
then rerun it
it will just push segments from the output directory to pinot
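(A minimal sketch of that change, with the other fields left as they were; the push-only job reads the already-built segment tars from outputDirURI.)
```
# metadata-push-only run: no segment regeneration, just push what is in outputDirURI
jobType: SegmentMetadataPush
outputDirURI: 's3://bucket/pinot/output_table/'
pinotClusterSpecs:
  - controllerURI: 'http://pinot-controller:9000'
```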
s
ok