Sumit Khaitan (11/01/2022, 12:23 PM):
Can Pinot directly read and ingest those minutely files from Azure Blob Storage, or does there have to be a Spark/ETL pipeline that ingests the data into Pinot?

Luis Fernandez (11/01/2022, 2:20 PM)

Mayank

Seunghyun (11/11/2022, 4:03 PM)

Sumit Khaitan (11/13/2022, 8:36 AM)

Seunghyun (11/14/2022, 6:55 PM)

Sumit Khaitan (11/15/2022, 12:15 PM):
2022/11/15 11:11:14.605 WARN [PinotFSSegmentFetcher] [HelixTaskExecutor-message_handle_thread_5] Caught exception while fetching segment from: adl2:/testing/SegmentCreationAndMetadataPush/output/testing_SegmentCreationAndMetadataPush_input_SegmentCreationAndMetadataPush_json.tar.gz to: /var/pinot/server/data/index/table_OFFLINE/tmp/tmp-testing_SegmentCreationAndMetadataPush_input_SegmentCreationAndMetadataPush_json-f4d9175f-ac16-4354-94aa-2cc466987e0c/testing_SegmentCreationAndMetadataPush_input_SegmentCreationAndMetadataPush_json.tar.gz
com.azure.storage.blob.models.BlobStorageException: Status code 400, "<?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidUri</Code><Message>The requested URI does not represent any resource on the server.
RequestId:ff0f8bb8-101e-0008-70e2-f82663000000
Time:2022-11-15T11:11:14.6025637Z</Message><UriPath>https://AZURE_STORAGE_ACCOUNT.blob.core.windows.net/$root/testing/SegmentCreationAndMetadataPush/output/testing_SegmentCreationAndMetadataPush_input_SegmentCreationAndMetadataPush_json.tar.gz</UriPath></Error>"
at jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:?]
at jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:?]
at jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:?]
at java.lang.reflect.Constructor.newInstance(Constructor.java:490) ~[?:?]
at com.azure.core.http.rest.RestProxy.instantiateUnexpectedException(RestProxy.java:390) ~[pinot-adls-0.12.0-SNAPSHOT-shaded.jar:0.12.0-SNAPSHOT-78504b941331681a8b5bd27e37c176a97e5bbcca]
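The `$root` segment in the failing URL is Azure's special root container, which the Blob endpoint addresses when no container name appears in the path. A sketch of how such a URL is composed, assuming that convention (`blob_url` is an illustrative helper, not part of the Azure SDK):

```python
def blob_url(account: str, path: str, container: str = "") -> str:
    """Compose an Azure Blob endpoint URL. When no container name is
    given, the special $root container is addressed (illustrative
    helper, not the Azure SDK)."""
    return (f"https://{account}.blob.core.windows.net/"
            f"{container or '$root'}/{path.lstrip('/')}")

# With an empty container this reproduces the shape of the failing URI above:
print(blob_url("AZURE_STORAGE_ACCOUNT",
               "testing/SegmentCreationAndMetadataPush/output/"
               "testing_SegmentCreationAndMetadataPush_input_"
               "SegmentCreationAndMetadataPush_json.tar.gz"))
```

So a URL containing `/$root/` suggests the container (fileSystemName) was never supplied when the client was built.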
Job spec file:
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentUriPushJobRunner'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentMetadataPushJobRunner'
  extraConfigs:
    stagingDir: adl2:///testing/SegmentCreationAndMetadataPush/staging/
jobType: SegmentCreationAndMetadataPush
inputDirURI: adl2://testing/SegmentCreationAndMetadataPush/input/
includeFileNamePattern: 'glob:**/*.json'
outputDirURI: adl2:///testing/SegmentCreationAndMetadataPush/output/
overwriteOutput: true
pinotFSSpecs:
  - scheme: adl2
    className: org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
    configs:
      accountName: 'AZURE_ACCOUNT_NAME'
      accessKey: 'AZURE_ACCOUNT_KEY'
      fileSystemName: 'AZURE_CONTAINER'
recordReaderSpec:
  dataFormat: 'json'
  className: 'org.apache.pinot.core.data.readers.JSONRecordReader'
tableSpec:
  tableName: 'TABLE_NAME'
  schemaURI: 'http://CONTROLLER_IP:9000/tables/TABLE_NAME/schema'
  tableConfigURI: 'http://CONTROLLER_IP:9000/tables/TABLE_NAME'
segmentNameGeneratorSpec:
  type: inputFile
  configs:
    file.path.pattern: '.+/(.+)\.gz'
    segment.name.template: '\${filePathPattern:\1}'
    segment.name.prefix: 'batch'
    exclude.sequence.id: true
pinotClusterSpecs:
  - controllerURI: 'http://CONTROLLER_IP:9000'
pushJobSpec:
  pushParallelism: 2
  pushAttempts: 2
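Note that the job spec mixes two URI forms: `adl2:///…` (three slashes, empty authority) for stagingDir and outputDirURI, but `adl2://testing/…` (two slashes) for inputDirURI, where a generic URI parser treats `testing` as the authority rather than the first path component. A quick stdlib sketch of the difference (the `adl2` scheme is opaque to the parser; whether Pinot's filesystem plugin cares about the authority is a separate question):

```python
from urllib.parse import urlparse

# Three slashes: empty authority; the whole path survives intact.
triple = urlparse("adl2:///testing/SegmentCreationAndMetadataPush/staging/")
print(triple.netloc)  # ''
print(triple.path)    # '/testing/SegmentCreationAndMetadataPush/staging/'

# Two slashes: 'testing' is parsed as the authority, not part of the path.
double = urlparse("adl2://testing/SegmentCreationAndMetadataPush/input/")
print(double.netloc)  # 'testing'
print(double.path)    # '/SegmentCreationAndMetadataPush/input/'
```

If the plugin resolves paths from `URI.getPath()`, the two-slash form silently drops `testing` from the path.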
Extra config in pinot-server.conf
pinot.server.instance.enable.split.commit=true
pinot.server.storage.factory.class.adl2=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
pinot.server.storage.factory.adl2.accountName=AZURE_ACCOUNT_NAME
pinot.server.storage.factory.adl2.accessKey=AZURE_ACCOUNT_KEY
pinot.controller.storage.factory.adl2.fileSystemName=AZURE_CONTAINER
pinot.server.segment.fetcher.protocols=file,http,adl2
pinot.server.segment.fetcher.adl2.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
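One detail worth double-checking in the server config above: the fileSystemName line carries a `pinot.controller.` prefix inside pinot-server.conf, so the server-side `pinot.server.storage.factory.adl2.fileSystemName` may effectively be unset. A small hypothetical lint (not a Pinot tool) that flags property keys whose component prefix doesn't match the file's role:

```python
def find_misprefixed_keys(conf_text: str, expected_prefix: str) -> list[str]:
    """Flag pinot.* property keys whose component prefix doesn't match
    the expected one for this config file (hypothetical helper)."""
    flagged = []
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key = line.split("=", 1)[0]
        # Keys like pinot.controller.* inside a server conf are suspicious.
        if key.startswith("pinot.") and not key.startswith(expected_prefix):
            flagged.append(key)
    return flagged

server_conf = """\
pinot.server.storage.factory.class.adl2=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
pinot.server.storage.factory.adl2.accountName=AZURE_ACCOUNT_NAME
pinot.controller.storage.factory.adl2.fileSystemName=AZURE_CONTAINER
pinot.server.segment.fetcher.protocols=file,http,adl2
"""
print(find_misprefixed_keys(server_conf, "pinot.server."))
# ['pinot.controller.storage.factory.adl2.fileSystemName']
```

A server-side ADLSGen2PinotFS initialized without a fileSystemName would be consistent with the `$root` container showing up in the fetch error above.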
Extra config in pinot-controller.conf
controller.data.dir=adl2://testing/deep-store/
controller.local.temp.dir=/var/pinot/controller/data-temp
controller.enable.split.commit=true
pinot.controller.storage.factory.class.adl2=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
pinot.controller.storage.factory.adl2.accountName=AZURE_ACCOUNT_NAME
pinot.controller.storage.factory.adl2.accessKey=AZURE_ACCOUNT_KEY
pinot.controller.storage.factory.adl2.fileSystemName=AZURE_CONTAINER
pinot.controller.segment.fetcher.protocols=file,http,adl2
pinot.controller.segment.fetcher.adl2.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher

Sumit Khaitan (11/15/2022, 12:24 PM)

Seunghyun (11/15/2022, 5:42 PM):
ADLS Gen2 connector to have the access? This won't work well because we currently don't have official support for ABS at the moment. For testing purposes, please try provisioning ADLS Gen2 storage and hooking it up as the deep storage.
Assuming that you have already done this, can you search for the following string?
ADLSGen2PinotFS is initialized
https://github.com/apache/pinot/blob/master/pinot-plugins/pinot-file-system/pinot-[…]in/java/org/apache/pinot/plugin/filesystem/ADLSGen2PinotFS.java
Also, from your exception log, this part is suspicious:
https://AZURE_STORAGE_ACCOUNT.blob.core.windows.net/$root/testing/SegmentCreationAndMetadataPush/output/testing_SegmentCreationAndMetadataPush_input_SegmentCreationAndMetadataPush_json.tar.gz
It looks like AZURE_STORAGE_ACCOUNT needs to be resolved to the environment-variable value that you passed.

Seunghyun (11/15/2022, 5:42 PM)

Sumit Khaitan (11/15/2022, 6:20 PM):
https://AZURE_STORAGE_ACCOUNT.blob.core.windows.net/$root/testing/SegmentCreationAndMetadataPush/output/testing_SegmentCreationAndMetadataPush_input_SegmentCreationAndMetadataPush_json.tar.gz
Regarding this: AZURE_STORAGE_ACCOUNT is actually getting resolved correctly; I have just replaced the name here. What looks suspicious to me is that it's not resolving the container name from fileSystemName and is using the $root container instead.
"Assuming that you have already done this, can you search for the following string?" -> In pinot-server?
@Seunghyun

Sumit Khaitan (11/15/2022, 6:25 PM):
pinot.server.instance.enable.split.commit=true
pinot.server.storage.factory.class.adl2=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
pinot.server.storage.factory.adl2.accountName=AZURE_ACCOUNT_NAME
pinot.server.storage.factory.adl2.accessKey=AZURE_ACCOUNT_KEY
pinot.controller.storage.factory.adl2.fileSystemName=AZURE_CONTAINER
pinot.server.segment.fetcher.protocols=file,http,adl2
pinot.server.segment.fetcher.adl2.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher

Sumit Khaitan (11/15/2022, 6:28 PM)

Seunghyun (11/15/2022, 6:33 PM)

Seunghyun (11/15/2022, 6:33 PM)

Sumit Khaitan (11/15/2022, 6:37 PM)

Sumit Khaitan (11/15/2022, 6:57 PM)

Seunghyun (11/15/2022, 8:18 PM)

Seunghyun (11/15/2022, 8:19 PM)

Seunghyun (11/15/2022, 9:14 PM)

Seunghyun (11/15/2022, 10:08 PM)

Sumit Khaitan (11/16/2022, 6:58 AM)

Sumit Khaitan (11/16/2022, 8:31 AM)

Seunghyun (11/16/2022, 8:42 AM)

Seunghyun (11/16/2022, 8:46 AM)

Sumit Khaitan (11/16/2022, 8:48 AM)

Sumit Khaitan (11/16/2022, 8:49 AM):
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentUriPushJobRunner'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentMetadataPushJobRunner'
  extraConfigs:
    stagingDir: adl2:///testing/SegmentCreationAndMetadataPush/staging/
jobType: SegmentCreationAndMetadataPush
inputDirURI: adl2://testing/SegmentCreationAndMetadataPush/input/
includeFileNamePattern: 'glob:**/*.json'
outputDirURI: adl2:///testing/SegmentCreationAndMetadataPush/output/
overwriteOutput: true
pinotFSSpecs:
  - scheme: adl2
    className: org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
    configs:
      accountName: 'AZURE_ACCOUNT_NAME'
      accessKey: 'AZURE_ACCOUNT_KEY'
      fileSystemName: 'AZURE_CONTAINER'
recordReaderSpec:
  dataFormat: 'json'
  className: 'org.apache.pinot.core.data.readers.JSONRecordReader'
tableSpec:
  tableName: 'TABLE_NAME'
  schemaURI: 'http://CONTROLLER_IP:9000/tables/TABLE_NAME/schema'
  tableConfigURI: 'http://CONTROLLER_IP:9000/tables/TABLE_NAME'
segmentNameGeneratorSpec:
  type: inputFile
  configs:
    file.path.pattern: '.+/(.+)\.gz'
    segment.name.template: '\${filePathPattern:\1}'
    segment.name.prefix: 'batch'
    exclude.sequence.id: true
pinotClusterSpecs:
  - controllerURI: 'http://CONTROLLER_IP:9000'
pushJobSpec:
  pushParallelism: 2
  pushAttempts: 2

Seunghyun (11/16/2022, 8:57 AM)

Sumit Khaitan (11/16/2022, 8:59 AM)

Seunghyun (11/16/2022, 3:49 PM)

Sumit Khaitan (11/16/2022, 4:45 PM):
controller.data.dir=adl2://testing/deep-store/
pinot.set.instance.id.to.hostname=true
controller.task.scheduler.enabled=true
controller.local.temp.dir=/var/pinot/controller/data-temp
controller.enable.split.commit=true
pinot.controller.storage.factory.class.adl2=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
pinot.controller.storage.factory.adl2.accountName=AZURE_ACCOUNT_NAME
pinot.controller.storage.factory.adl2.accessKey=AZURE_ACCOUNT_KEY
pinot.controller.storage.factory.adl2.fileSystemName=AZURE_CONTAINER
pinot.controller.segment.fetcher.protocols=file,http,adl2
pinot.controller.segment.fetcher.adl2.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher

Seunghyun (11/16/2022, 5:13 PM):
adl2://testing/deep-store/

Seunghyun (11/16/2022, 5:14 PM):
PROPERTYSTORE/SEGMENTS/<table_name>/<segment_name>. From there, can you check what segment.download.url is shown? Also, let's check the Pinot controller logs to see the destination path.

Seunghyun (11/16/2022, 5:18 PM):
"Using segment download URI" from the controller log? This line should include the final location.

Seunghyun (11/16/2022, 5:25 PM):
URI push or Metadata push with deepStoreCopy

Sumit Khaitan (11/17/2022, 4:45 AM):
segment.download.url
{
  "id": "file_json",
  "simpleFields": {
    "segment.crc": "2114409700",
    "segment.creation.time": "1668658910795",
    "segment.download.url": "http://pinot-controller-0.pinot-controller-headless.pinot-quickstart.svc.cluster.local:9000/segments/TABLE_NAME/file_json",
    "segment.end.time": "1668657462000",
    "segment.end.time.raw": "1668657462",
    "segment.index.version": "v3",
    "segment.push.time": "1668658970305",
    "segment.size.in.bytes": "457141",
    "segment.start.time": "1668657407000",
    "segment.start.time.raw": "1668657407",
    "segment.time.unit": "MILLISECONDS",
    "segment.total.docs": "11997"
  },
  "mapFields": {
    "custom.map": {
      "input.data.file.uri": "adl2:/logs/2022-11-17-03/file.json"
    }
  },
  "listFields": {}
}

Sumit Khaitan (11/17/2022, 4:48 AM):
2022/11/17 04:22:55.699 INFO [PinotSegmentUploadDownloadRestletResource] [jersey-server-managed-async-executor-1] Using segment download URI: http://pinot-controller-0.pinot-controller-headless.pinot-quickstart.svc.cluster.local:9000/segments/TABLE_NAME/file_json for segment: /var/pinot/controller/data-temp/fileUploadTemp/tmp-3d44c1eb-d983-4ac5-b77e-22269e931881 of table: TABLE_NAME_OFFLINE

Seunghyun (11/17/2022, 5:06 AM):
Copied segment:

Seunghyun (11/17/2022, 5:06 AM):
LOGGER.info("Copied segment: {} of table: {} to final location: {}", segmentName, tableNameWithType,
    finalSegmentLocationURI);

Seunghyun (11/17/2022, 5:07 AM)

Seunghyun (11/17/2022, 5:07 AM)

Sumit Khaitan (11/17/2022, 5:54 AM):
Copied segment: logs_2022-11-17-04_file_json of table: TABLE_NAME_OFFLINE to final location: file:/var/pinot/controller/data,adl2://testing/deep-store//TABLE_NAME/file_json
But I am not able to find anything at adl2://testing/deep-store//TABLE_NAME/file_json. Can this be because of the // before the table name?

Seunghyun (11/17/2022, 7:36 AM):
deep-store//TABLE_NAME/file_json
The // looks a bit suspicious... but you said that servers are downloading files correctly.

Seunghyun (11/17/2022, 7:37 AM):
LOGGER.error("Could not move segment {} from table {} to permanent directory", segmentName, tableNameWithType,
    e);

Seunghyun (11/17/2022, 7:40 AM):
adl2://testing/deep-store/ -> adl2://testing/deep-store

Seunghyun (11/17/2022, 7:41 AM)

Sumit Khaitan (11/17/2022, 8:21 AM):
LOGGER.error("Could not move segment {} from table {} to permanent directory", segmentName, tableNameWithType,
    e);
No, this log is not there.

Sumit Khaitan (11/17/2022, 8:21 AM):
adl2://testing/deep-store/ -> adl2://testing/deep-store
Sure, will try this one.

Sumit Khaitan (11/17/2022, 8:22 AM):
"looks a bit suspicious... but you said that servers are downloading files correctly" -> Yes, I am able to query the segments from the Pinot UI, hence assuming that servers are correctly downloading the segments.
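Following up on the trailing-slash suggestion above: joining a base directory URI that already ends in `/` with further path components is exactly what produces `deep-store//TABLE_NAME`. A minimal sketch (hypothetical helper names, not Pinot code):

```python
def join_segment_uri(base: str, table: str, segment: str) -> str:
    """Naive join: reproduces the double slash when base ends with '/'."""
    return base + "/" + table + "/" + segment

print(join_segment_uri("adl2://testing/deep-store/", "TABLE_NAME", "file_json"))
# adl2://testing/deep-store//TABLE_NAME/file_json

def join_segment_uri_safe(base: str, table: str, segment: str) -> str:
    """Strip any trailing slash from the base before joining."""
    return base.rstrip("/") + "/" + table + "/" + segment

print(join_segment_uri_safe("adl2://testing/deep-store/", "TABLE_NAME", "file_json"))
# adl2://testing/deep-store/TABLE_NAME/file_json
```

Whether the `//` actually breaks lookups depends on how the filesystem plugin normalizes paths, but dropping the trailing slash from controller.data.dir sidesteps the question entirely.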