Xiang Fu
➜ cat examples/batch/jsontype/ingestionJobSpec.yaml
# executionFrameworkSpec: Defines the execution framework that runs the ingestion job.
executionFrameworkSpec:
# name: execution framework name
name: 'standalone'
# segmentGenerationJobRunnerClassName: class name that implements the org.apache.pinot.spi.batch.ingestion.runner.SegmentGenerationJobRunner interface.
segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
# segmentTarPushJobRunnerClassName: class name that implements the org.apache.pinot.spi.batch.ingestion.runner.SegmentTarPushJobRunner interface.
segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
# segmentUriPushJobRunnerClassName: class name that implements the org.apache.pinot.spi.batch.ingestion.runner.SegmentUriPushJobRunner interface.
segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
# jobType: Pinot ingestion job type.
# Supported job types are:
# 'SegmentCreation'
# 'SegmentTarPush'
# 'SegmentUriPush'
# 'SegmentCreationAndTarPush'
# 'SegmentCreationAndUriPush'
jobType: SegmentCreationAndTarPush
# inputDirURI: Root directory of input data, expected to have a scheme configured in PinotFS.
inputDirURI: 'examples/batch/jsontype/rawdata'
# includeFileNamePattern: include file name pattern; glob patterns are supported.
# Sample usage:
# 'glob:*.avro' will include only the avro files directly under the inputDirURI, not in subdirectories;
# 'glob:**/*.avro' will include all the avro files under inputDirURI recursively.
includeFileNamePattern: 'glob:**/*.json'
# excludeFileNamePattern: exclude file name pattern; glob patterns are supported.
# Sample usage:
# 'glob:*.avro' will exclude only the avro files directly under the inputDirURI, not in subdirectories;
# 'glob:**/*.avro' will exclude all the avro files under inputDirURI recursively.
# excludeFileNamePattern: ''
# outputDirURI: Root directory of output segments, expected to have a scheme configured in PinotFS.
outputDirURI: 'examples/batch/jsontype/segments'
# overwriteOutput: Overwrite output segments if they already exist.
overwriteOutput: true
# pinotFSSpecs: defines all related Pinot file systems.
pinotFSSpecs:
- # scheme: used to identify a PinotFS.
# E.g. local, hdfs, dbfs, etc
scheme: file
# className: Class name used to create the PinotFS instance.
# E.g.
# org.apache.pinot.spi.filesystem.LocalPinotFS is used for local filesystem
# org.apache.pinot.plugin.filesystem.AzurePinotFS is used for Azure Data Lake
# org.apache.pinot.plugin.filesystem.HadoopPinotFS is used for HDFS
className: org.apache.pinot.spi.filesystem.LocalPinotFS
# recordReaderSpec: defines the record reader.
recordReaderSpec:
# dataFormat: Record data format, e.g. 'avro', 'parquet', 'orc', 'csv', 'json', 'thrift' etc.
dataFormat: 'json'
# className: Corresponding RecordReader class name.
# E.g.
# org.apache.pinot.plugin.inputformat.avro.AvroRecordReader
# org.apache.pinot.plugin.inputformat.csv.CSVRecordReader
# org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader
# org.apache.pinot.plugin.inputformat.json.JSONRecordReader
# org.apache.pinot.plugin.inputformat.orc.ORCRecordReader
# org.apache.pinot.plugin.inputformat.thrift.ThriftRecordReader
className: 'org.apache.pinot.plugin.inputformat.json.JSONRecordReader'
# configClassName: corresponding RecordReaderConfig class name; it is mandatory for the CSV and Thrift file formats.
# E.g.
# org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig
# org.apache.pinot.plugin.inputformat.thrift.ThriftRecordReaderConfig
configClassName:
# configs: used to initialize the RecordReaderConfig class; required for the CSV and Thrift data formats.
configs:
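# E.g. for a CSV input (illustrative values only, not part of this JSON job;
# keys assume org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig):
#   configs:
#     fileFormat: 'default'
#     header: 'name,age'
#     delimiter: ','
#     multiValueDelimiter: ';'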
# tableSpec: defines table name and where to fetch corresponding table config and table schema.
tableSpec:
# tableName: Table name
tableName: 'myTable'
# schemaURI: defines where to read the table schema, supports PinotFS or HTTP.
# E.g.
# hdfs://path/to/table_schema.json
# http://localhost:9000/tables/myTable/schema
schemaURI: 'http://localhost:9000/tables/myTable/schema'
# tableConfigURI: defines where to read the table config.
# Supports using PinotFS or HTTP.
# E.g.
# hdfs://path/to/table_config.json
# http://localhost:9000/tables/myTable
# Note that the API to read the Pinot table config directly from the Pinot controller wraps it in a JSON object.
# The real table config is the object under the field 'OFFLINE'.
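# For illustration, a response from that endpoint has the shape (abridged):
# {"OFFLINE": {"tableName": "myTable_OFFLINE", "tableType": "OFFLINE", ...}}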
tableConfigURI: 'http://localhost:9000/tables/myTable'
# pinotClusterSpecs: defines the Pinot Cluster Access Point.
pinotClusterSpecs:
- # controllerURI: used to fetch table/schema information and data push.
# E.g. http://localhost:9000
controllerURI: 'http://localhost:9000'
# pushJobSpec: defines segment push job related configuration.
pushJobSpec:
# pushAttempts: number of attempts for the push job; the default is 1, meaning no retry.
pushAttempts: 2
# pushRetryIntervalMillis: retry wait in milliseconds; defaults to 1000 (1 second).
pushRetryIntervalMillis: 1000
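(For context: the job reads JSON records from data.json. The exact file contents are an assumption here; judging from the column stats in the log below (4 documents, name ranging Pete to Pete3, age 23 to 26, and a nested subjects list flattened into subjects_name and subjects_grade), a record plausibly looks like this:)
{"name": "Pete", "age": 24, "subjects": [{"name": "maths", "grade": "A"}, {"name": "maths", "grade": "B--"}]}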
Xiang Fu
➜ bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile examples/batch/jsontype/ingestionJobSpec.yaml
SegmentGenerationJobSpec:
!!org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec
cleanUpOutputDir: false
excludeFileNamePattern: null
executionFrameworkSpec: {extraConfigs: null, name: standalone, segmentGenerationJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner,
segmentMetadataPushJobRunnerClassName: null, segmentTarPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner,
segmentUriPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner}
includeFileNamePattern: glob:**/*.json
inputDirURI: examples/batch/jsontype/rawdata
jobType: SegmentCreationAndTarPush
outputDirURI: examples/batch/jsontype/segments
overwriteOutput: true
pinotClusterSpecs:
- {controllerURI: 'http://localhost:9000'}
pinotFSSpecs:
- {className: org.apache.pinot.spi.filesystem.LocalPinotFS, configs: null, scheme: file}
pushJobSpec: {pushAttempts: 2, pushParallelism: 1, pushRetryIntervalMillis: 1000,
segmentUriPrefix: null, segmentUriSuffix: null}
recordReaderSpec: {className: org.apache.pinot.plugin.inputformat.json.JSONRecordReader,
configClassName: null, configs: null, dataFormat: json}
segmentCreationJobParallelism: 0
segmentNameGeneratorSpec: null
tableSpec: {schemaURI: 'http://localhost:9000/tables/myTable/schema', tableConfigURI: 'http://localhost:9000/tables/myTable',
tableName: myTable}
tlsSpec: null
Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
Creating an executor service with 1 threads(Job parallelism: 0, available cores: 16.)
Initializing PinotFS for scheme file, classname org.apache.pinot.spi.filesystem.LocalPinotFS
Submitting one Segment Generation Task for file:/Users/xiangfu/workspace/pinot-dev/pinot-distribution/target/apache-pinot-incubating-0.8.0-SNAPSHOT-bin/apache-pinot-incubating-0.8.0-SNAPSHOT-bin/examples/batch/jsontype/rawdata/data.json
Initialized FunctionRegistry with 119 functions: [fromepochminutesbucket, arrayunionint, codepoint, mod, sha256, year, yearofweek, upper, arraycontainsstring, arraydistinctstring, bytestohex, tojsonmapstr, trim, timezoneminute, sqrt, togeometry, normalize, fromepochdays, arraydistinctint, exp, jsonpathlong, yow, toepochhoursrounded, lower, toutf8, concat, ceil, todatetime, jsonpathstring, substr, dayofyear, contains, jsonpatharray, arrayindexofint, fromepochhoursbucket, arrayindexofstring, minus, arrayunionstring, toepochhours, toepochdaysrounded, millisecond, fromepochhours, arrayreversestring, dow, doy, min, toepochsecondsrounded, strpos, jsonpath, tosphericalgeography, fromepochsecondsbucket, max, reverse, hammingdistance, stpoint, abs, timezonehour, toepochseconds, arrayconcatint, quarter, md5, ln, toepochminutes, arraysortstring, replace, strrpos, jsonpathdouble, stastext, second, arraysortint, split, fromepochdaysbucket, lpad, day, toepochminutesrounded, fromdatetime, fromepochseconds, arrayconcatstring, base64encode, ltrim, arraysliceint, chr, sha, plus, base64decode, month, arraycontainsint, toepochminutesbucket, startswith, week, jsonformat, sha512, arrayslicestring, fromepochminutes, remove, dayofmonth, times, hour, rpad, arrayremovestring, now, divide, bigdecimaltobytes, floor, toepochsecondsbucket, toepochdaysbucket, hextobytes, rtrim, length, toepochhoursbucket, bytestobigdecimal, toepochdays, arrayreverseint, datetrunc, minute, round, dayofweek, arrayremoveint, weekofyear] in 733ms
Using class: org.apache.pinot.plugin.inputformat.json.JSONRecordReader to read segment, ignoring configured file format: AVRO
Finished building StatsCollector!
Collected stats for 4 documents
Using fixed length dictionary for column: subjects_grade, size: 20
Created dictionary for STRING column: subjects_grade with cardinality: 5, max length in bytes: 4, range: A to B--
Using fixed length dictionary for column: subjects_name, size: 5
Created dictionary for STRING column: subjects_name with cardinality: 1, max length in bytes: 5, range: maths to maths
Using fixed length dictionary for column: name, size: 20
Created dictionary for STRING column: name with cardinality: 4, max length in bytes: 5, range: Pete to Pete3
Created dictionary for LONG column: age with cardinality: 4, range: 23 to 26
Start building IndexCreator!
Finished records indexing in IndexCreator!
Finished segment seal!
Converting segment: /var/folders/kp/v8smb2f11tg6q2grpwkq7qnh0000gn/T/pinot-4226d743-ee31-417a-806a-2c4752a21343/output/myTable_OFFLINE_0 to v3 format
v3 segment location for segment: myTable_OFFLINE_0 is /var/folders/kp/v8smb2f11tg6q2grpwkq7qnh0000gn/T/pinot-4226d743-ee31-417a-806a-2c4752a21343/output/myTable_OFFLINE_0/v3
Deleting files in v1 segment directory: /var/folders/kp/v8smb2f11tg6q2grpwkq7qnh0000gn/T/pinot-4226d743-ee31-417a-806a-2c4752a21343/output/myTable_OFFLINE_0
Computed crc = 3500070607, based on files [/var/folders/kp/v8smb2f11tg6q2grpwkq7qnh0000gn/T/pinot-4226d743-ee31-417a-806a-2c4752a21343/output/myTable_OFFLINE_0/v3/columns.psf, /var/folders/kp/v8smb2f11tg6q2grpwkq7qnh0000gn/T/pinot-4226d743-ee31-417a-806a-2c4752a21343/output/myTable_OFFLINE_0/v3/index_map, /var/folders/kp/v8smb2f11tg6q2grpwkq7qnh0000gn/T/pinot-4226d743-ee31-417a-806a-2c4752a21343/output/myTable_OFFLINE_0/v3/metadata.properties]
Driver, record read time : 3
Driver, stats collector time : 0
Driver, indexing time : 12
Tarring segment from: /var/folders/kp/v8smb2f11tg6q2grpwkq7qnh0000gn/T/pinot-4226d743-ee31-417a-806a-2c4752a21343/output/myTable_OFFLINE_0 to: /var/folders/kp/v8smb2f11tg6q2grpwkq7qnh0000gn/T/pinot-4226d743-ee31-417a-806a-2c4752a21343/output/myTable_OFFLINE_0.tar.gz
Size for segment: myTable_OFFLINE_0, uncompressed: 5.87K, compressed: 1.62K
Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner
Initializing PinotFS for scheme file, classname org.apache.pinot.spi.filesystem.LocalPinotFS
Start pushing segments: [/Users/xiangfu/workspace/pinot-dev/pinot-distribution/target/apache-pinot-incubating-0.8.0-SNAPSHOT-bin/apache-pinot-incubating-0.8.0-SNAPSHOT-bin/examples/batch/jsontype/segments/myTable_OFFLINE_0.tar.gz]... to locations: [org.apache.pinot.spi.ingestion.batch.spec.PinotClusterSpec@6304101a] for table myTable
Pushing segment: myTable_OFFLINE_0 to location: http://localhost:9000 for table myTable
Sending request: http://localhost:9000/v2/segments?tableName=myTable to controller: 192.168.86.73, version: Unknown
Response for pushing table myTable segment myTable_OFFLINE_0 to location http://localhost:9000 - 200: {"status":"Successfully uploaded segment: myTable_OFFLINE_0 of table: myTable"}
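(To verify the push, a minimal sketch, assuming the controller's /sql query endpoint is reachable at the same address:)
curl -H "Content-Type: application/json" -X POST \
  -d '{"sql": "SELECT name, age FROM myTable LIMIT 10"}' \
  http://localhost:9000/sql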