# troubleshooting
s
I am getting an error while uploading S3 data
```
Failed to generate Pinot segment for file s3:xxx/xxx/1234.csv
Illegal character in scheme name at index 2: table_OFFLINE_2021-02-01 09:39:00.000_2021-02-01 11:59:00.000_2.tar.gz
at java.net.URI.create(URI.java:852) ~[?:1.8.0_282]
	at java.net.URI.resolve(URI.java:1036) ~[?:1.8.0_282]
	at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.lambda$run$0(SegmentGenerationJobRunner.java:212) ~[pinot-batch-ingestion-standalone-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-162d0e61b6b1c3d51f915f7ad3e151a4fb24110a]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_282]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_282]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_282]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_282]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
```
x
what’s your inputDirURI and outputDirURI?
seems there is a space in the segment name
not sure if that’s allowed in a URI
s
```
inputDirURI: 's3://bucket/pinot/table/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: 's3://bucket/pinot/output_table/'
```
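(For reference, a minimal sketch of how these fields typically sit in a standalone ingestion job spec; the recordReaderSpec below is an assumption based on the CSV input and the CSVRecordReader that shows up in the logs later, not copied from the actual spec.)
```
# sketch: S3 input/output plus an assumed CSV record reader
jobType: SegmentCreationAndMetadataPush
inputDirURI: 's3://bucket/pinot/table/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: 's3://bucket/pinot/output_table/'
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
```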
x
that should be ok
I think the issue is the segment name
s
is that space because of the datetime field?
x
I think so
s
format is "yyyy-MM-dd HHmmss.SSS"
x
ah i c
s
is this format not allowed with pinot ingestion?
x
the time format itself is fine; it's just that we put the min/max time values in the segment name, and that's where the space that triggered the bug comes from
in your ingestion job, can you change the segmentNameGenerator?
s
what should I use for segmentNameGenerator?
x
just try this:
```
segmentNameGeneratorSpec:
  type: simple
  configs:
```
meanwhile we will fix this bug
s
ok sure
also, for controllerURI do I need to put the AWS Load Balancer URI?
x
if you run it from the k8s cluster, then you can use the service name
pinot-controller:9000
if it's outside the k8s cluster, like running the job from your laptop, then the AWS LB is required
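(A sketch of where that address goes in the job spec, assuming the standard tableSpec / pinotClusterSpecs sections; the table name is a placeholder, and as the later messages show, the http:// scheme needs to be spelled out.)
```
# in-cluster: the k8s service name works, but include the scheme
tableSpec:
  tableName: 'table'    # placeholder table name
  schemaURI: 'http://pinot-controller:9000/tables/table/schema'
  tableConfigURI: 'http://pinot-controller:9000/tables/table'
pinotClusterSpecs:
  - controllerURI: 'http://pinot-controller:9000'
```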
s
I am running it from the pinot-server pod; with the same value it gave an error
```
Failed to read from Schema URI - 'pinot-controller:9000
```
tried this
```
segmentNameGeneratorSpec:
  type: simple
  configs:
```
and got the same error:
```
Illegal character in scheme name at index 2: tabloe_OFFLINE_2021-02-01 082800.000_2021-02-01 105900.000_1.tar.gz
```
x
http://pinot-controller:9000 ?
s
this URI worked, but I get the same error: Illegal character in scheme name at index 2:
x
hmm
can you try this:
```
segmentNameGeneratorSpec:
  type: fixed
  configs:
    segment.name: myTable_segment_0
```
s
Also, I am adding a field as BOOLEAN but it is converted to STRING and the value is null, and INT fields are showing negative values
x
Pinot internally uses STRING to store BOOLEAN
I think that is the null value?
s
yes, it's showing null
and in the schema I see the format as STRING
Also, I have uploaded data to Pinot and I can see the tar files in S3, but in the query editor only the 10000 files which I uploaded initially are showing up
is this expected, that it would have some delay? because there was no error on the console from the script
x
hmm
so you mean you have 10000 csv files?
how many segments and total documents?
s
no, I mean for testing I tried with 3 CSVs and it uploaded 10000 records; now I ran on a bunch more CSVs
now it is not showing new records
still showing 10000 records and 1 segment
x
ic
ah
for this you need to run 3 jobs, each with a different segment.name
and point each one to a single file
Pinot uses the segment name to distinguish segments
and segments will overwrite each other if the segment name is the same
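(A sketch of what one of those per-file runs might look like, reusing the includeFileNamePattern field from the spec above; the file name and segment name here are placeholders.)
```
# job N of 3: match a single input file and give it a unique fixed segment name
includeFileNamePattern: 'glob:**/part_1.csv'
segmentNameGeneratorSpec:
  type: fixed
  configs:
    segment.name: table_segment_1
```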
s
I need to run this a number of times equal to the number of folders in S3, with different values for configs: segment.name: ae_consent_flags_segment_0?
is this correct?
x
yes
each file should become one segment
you can use inputPattern to pick just the one file inside a directory
s
the inputPattern option in the job spec? I didn't find it in the documentation
is there any option to set this automatically, to use 1 segment for each file?
x
not for fixed segmentNameGenerator
what’s your first segmentNameGenerator config?
s
```
segmentNameGeneratorSpec:
  type: fixed
  configs:
    segment.name: table_segment_0
```
I think "exclude.sequence.id" would work?
x
no
it’s for a different usage
s
ok
x
have you tried this
```
# segmentNameGeneratorSpec: defines how to init a SegmentNameGenerator.
segmentNameGeneratorSpec:
  type: normalizedDate
  configs:
    exclude.sequence.id: true
```
also, can I see your table config?
did you set
```
"segmentPushType": "APPEND",
```
s
yes
```
{
  "OFFLINE": {
    "tableName": "name_OFFLINE",
    "tableType": "OFFLINE",
    "segmentsConfig": {
      "timeColumnName": "uploaded",
      "segmentPushFrequency": "HOURLY",
      "segmentPushType": "APPEND",
      "schemaName": "name",
      "replication": "1",
      "replicasPerPartition": "1"
    },
    "tenants": {
      "broker": "DefaultTenant",
      "server": "DefaultTenant"
    },
    "tableIndexConfig": {
      "invertedIndexColumns": [],
      "rangeIndexColumns": [],
      "autoGeneratedInvertedIndex": false,
      "createInvertedIndexDuringSegmentGeneration": false,
      "sortedColumn": [],
      "bloomFilterColumns": [],
      "loadMode": "MMAP",
      "noDictionaryColumns": [],
      "onHeapDictionaryColumns": [],
      "varLengthDictionaryColumns": [],
      "enableDefaultStarTree": false,
      "enableDynamicStarTreeCreation": false,
      "aggregateMetrics": false,
      "nullHandlingEnabled": false
    },
    "metadata": {},
    "quota": {},
    "routing": {},
    "query": {},
    "ingestionConfig": {},
    "isDimTable": false
  }
}
```
x
ok
this should be good
s
ok, let me try what you shared
x
can you use this normalizedDate type?
s
yes trying with that
I cleaned my S3 output folder and ran again; tar files are created and there's no error in the script, but in the query editor the doc count is still the same, 10834
x
hmm
how many segments were created in your output S3 directory?
s
24 tar files in the output dir, 1 corresponding to 1 CSV
now it loaded when I clicked on reload segments; this reload segments didn't work last time
is there a way to auto-reload?
we have hierarchical S3 folders; I just ran for the 2nd-to-last level, where the folders contain the files
x
can you check table idealstates
from the log, do you see all the segments are pushed ?
s
logs just show
```
2021/02/11 10:33:29.244 WARN [SegmentIndexCreationDriverImpl] [pool-2-thread-1] Using class: org.apache.pinot.plugin.inputformat.csv.CSVRecordReader to read segment, ignoring configured file format: AVRO
2021/02/11 10:33:30.040 WARN [SegmentIndexCreationDriverImpl] [pool-2-thread-1] Using class: org.apache.pinot.plugin.inputformat.csv.CSVRecordReader to
```
x
can you try to change
jobType: SegmentCreationAndMetadataPush
to
jobType: SegmentMetadataPush
then rerun it
it will just push segments from the output directory to pinot
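(A minimal sketch of that change, with the other fields left as they were; the push-only job reads the already-built segment tars from outputDirURI.)
```
# metadata-push-only run: no segment regeneration, just push what is in outputDirURI
jobType: SegmentMetadataPush
outputDirURI: 's3://bucket/pinot/output_table/'
pinotClusterSpecs:
  - controllerURI: 'http://pinot-controller:9000'
```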
s
ok