# troubleshooting
s
Hi team. I am new to Pinot and have a use case. We have per-minute files arriving in Azure Blob Storage and want to load those files into Pinot.
Can Pinot directly read and ingest those per-minute files from Azure Blob Storage, or does there have to be a Spark/ETL pipeline that ingests the data into Pinot?
l
To my knowledge, yes - you would need some ETL process that gets that data from there into Pinot, either through an offline batch job, or maybe by throwing those files into Kafka and ingesting them into Pinot.
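For reference, a minimal sketch of the Kafka route mentioned above - the streamConfigs fragment of a realtime table config. The topic, broker, and flush-threshold values are placeholders, not something used in this thread:
```
"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.topic.name": "minutely-files-topic",
  "stream.kafka.broker.list": "kafka-broker:9092",
  "stream.kafka.consumer.type": "lowlevel",
  "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
  "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
  "realtime.segment.flush.threshold.rows": "100000"
}
```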
m
@Seunghyun ^^
s
@Sumit Khaitan We do have an ADLS Gen2 connector but no ABS connector available at the moment. Is it possible for you to create the ADLS Gen2 storage and make the files available there? Once the data is available, you can set up the ingestion pipeline using a Spark/ETL flow. We provide sample jobs for the different platforms (standalone/MapReduce/Spark, etc.).
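For reference, once a batch ingestion job spec is written, the job can be kicked off with the bundled admin script - a minimal sketch assuming a standalone run and a hypothetical spec path (the Spark flow instead submits the same spec via spark-submit with the Pinot batch-ingestion plugin jars):
```
# standalone execution of the batch ingestion job described by the spec file
bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /path/to/ingestion-job-spec.yaml
```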
s
Thanks @Seunghyun. I was able to set up an ingestion pipeline that reads data from Azure Blob Storage using the ADLS Gen2 connector and loads it into an offline table.
s
@Sumit Khaitan Glad that it worked. The ADLS Gen2 connector uses both the ABS and ADLS clients for different operations, so some functionality may not fully work if you use the ADLS Gen2 connector directly against ABS. The data ingestion pipeline is probably fine since you tested it, but if you use ABS as deep storage through the ADLS connector, it may not work fully. To address these issues, we have a long-term plan to add a dedicated ABS connector. Please stay tuned 🙂
s
@Seunghyun I actually ran into an issue. I have configured an Azure Storage Account as deep storage using the ADLS Gen2 connector. I am getting this error in the server when it's trying to download the segment from the Azure Storage Account:
2022/11/15 11:11:14.605 WARN [PinotFSSegmentFetcher] [HelixTaskExecutor-message_handle_thread_5] Caught exception while fetching segment from: adl2:/testing/SegmentCreationAndMetadataPush/output/testing_SegmentCreationAndMetadataPush_input_SegmentCreationAndMetadataPush_json.tar.gz to: /var/pinot/server/data/index/table_OFFLINE/tmp/tmp-testing_SegmentCreationAndMetadataPush_input_SegmentCreationAndMetadataPush_json-f4d9175f-ac16-4354-94aa-2cc466987e0c/testing_SegmentCreationAndMetadataPush_input_SegmentCreationAndMetadataPush_json.tar.gz
com.azure.storage.blob.models.BlobStorageException: Status code 400, "<?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidUri</Code><Message>The requested URI does not represent any resource on the server.
RequestId:ff0f8bb8-101e-0008-70e2-f82663000000
Time:2022-11-15T11:11:14.6025637Z</Message><UriPath>https://AZURE_STORAGE_ACCOUNT.blob.core.windows.net/$root/testing/SegmentCreationAndMetadataPush/output/testing_SegmentCreationAndMetadataPush_input_SegmentCreationAndMetadataPush_json.tar.gz</UriPath></Error>"
	at jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:?]
	at jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:?]
	at jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:?]
	at java.lang.reflect.Constructor.newInstance(Constructor.java:490) ~[?:?]
	at com.azure.core.http.rest.RestProxy.instantiateUnexpectedException(RestProxy.java:390) ~[pinot-adls-0.12.0-SNAPSHOT-shaded.jar:0.12.0-SNAPSHOT-78504b941331681a8b5bd27e37c176a97e5bbcca]
Job spec file:
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentUriPushJobRunner'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentMetadataPushJobRunner'
  extraConfigs:
    stagingDir: adl2:///testing/SegmentCreationAndMetadataPush/staging/
jobType: SegmentCreationAndMetadataPush
inputDirURI: adl2://testing/SegmentCreationAndMetadataPush/input/
includeFileNamePattern: 'glob:**/*.json'
outputDirURI: adl2:///testing/SegmentCreationAndMetadataPush/output/
overwriteOutput: true
pinotFSSpecs:
- scheme: adl2
  className: org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
  configs:
    accountName: 'AZURE_ACCOUNT_NAME'
    accessKey: 'AZURE_ACCOUNT_KEY'
    fileSystemName: 'AZURE_CONTAINER'
recordReaderSpec:
  dataFormat: 'json'
  className: 'org.apache.pinot.core.data.readers.JSONRecordReader'
tableSpec:
  tableName: 'TABLE_NAME'
  schemaURI: 'http://CONTROLLER_IP:9000/tables/TABLE_NAME/schema'
  tableConfigURI: 'http://CONTROLLER_IP:9000/tables/TABLE_NAME'
segmentNameGeneratorSpec:
  type: inputFile
  configs:
    file.path.pattern: '.+/(.+)\.gz'
    segment.name.template: '\${filePathPattern:\1}'
    segment.name.prefix: 'batch'
    exclude.sequence.id: true
pinotClusterSpecs:
- controllerURI: 'http://CONTROLLER_IP:9000'
pushJobSpec:
  pushParallelism: 2
  pushAttempts: 2
Extra config in pinot-server.conf:
pinot.server.instance.enable.split.commit=true
pinot.server.storage.factory.class.adl2=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
pinot.server.storage.factory.adl2.accountName=AZURE_ACCOUNT_NAME
pinot.server.storage.factory.adl2.accessKey=AZURE_ACCOUNT_KEY
pinot.controller.storage.factory.adl2.fileSystemName=AZURE_CONTAINER
pinot.server.segment.fetcher.protocols=file,http,adl2
pinot.server.segment.fetcher.adl2.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
Extra config in pinot-controller.conf:
controller.data.dir=adl2://testing/deep-store/
controller.local.temp.dir=/var/pinot/controller/data-temp
controller.enable.split.commit=true
pinot.controller.storage.factory.class.adl2=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
pinot.controller.storage.factory.adl2.accountName=AZURE_ACCOUNT_NAME
pinot.controller.storage.factory.adl2.accessKey=AZURE_ACCOUNT_KEY
pinot.controller.storage.factory.adl2.fileSystemName=AZURE_CONTAINER
pinot.controller.segment.fetcher.protocols=file,http,adl2
pinot.controller.segment.fetcher.adl2.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
Also, to add on: when I use the SegmentCreationAndTarPush job type it works fine, but in that case the segments are stored on the controller and not in the Azure storage account.
s
Did you provision Azure Blob Storage and try to use the ADLS Gen2 connector to access it? That won't work well, because we currently don't have official support for ABS. For testing purposes, please provision ADLS Gen2 storage and hook that up as the deep storage. Assuming you have already done this, can you search for the following string?
ADLSGen2PinotFS is initialized
https://github.com/apache/pinot/blob/master/pinot-plugins/pinot-file-system/pinot-[…]in/java/org/apache/pinot/plugin/filesystem/ADLSGen2PinotFS.java
Also, from your exception log, this part is suspicious:
https://AZURE_STORAGE_ACCOUNT.blob.core.windows.net/$root/testing/SegmentCreationAndMetadataPush/output/testing_SegmentCreationAndMetadataPush_input_SegmentCreationAndMetadataPush_json.tar.gz
It looks like AZURE_STORAGE_ACCOUNT needs to be resolved to the environment variable value that you passed.
@Sumit Khaitan ^^
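A quick way to run that search on a Kubernetes deployment - the pod and namespace names below are assumptions based on the hostnames that appear later in this thread:
```
kubectl logs pinot-server-0 -n pinot-quickstart | grep "ADLSGen2PinotFS is initialized"
```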
s
https://AZURE_STORAGE_ACCOUNT.blob.core.windows.net/$root/testing/SegmentCreationAndMetadataPush/output/testing_SegmentCreationAndMetadataPush_input_SegmentCreationAndMetadataPush_json.tar.gz
Regarding this: actually AZURE_STORAGE_ACCOUNT is getting resolved correctly, I have just replaced the name. What looks suspicious to me is that it's not resolving the container name from the fileSystemName and is using the $root container instead.
Assuming you have already done this, can you search for the following string?
-> In pinot-server ?? @Seunghyun
Yes, able to find the log in the pinot server: 2022/11/15 05:44:33.392 INFO [ADLSGen2PinotFS] [Start a Pinot [SERVER]] ADLSGen2PinotFS is initialized (accountName=AZURE_STORAGE_ACCOUNT, fileSystemName=null, dfsServiceEndpointUrl=https://AZURE_STORAGE_ACCOUNT.dfs.core.windows.net, blobServiceEndpointUrl=https://AZURE_STORAGE_ACCOUNT.blob.core.windows.net, enableChecksum=false). As suspected, fileSystemName=null. But I have passed the container name as fileSystemName in the extra configs in pinot-server.conf:
pinot.server.instance.enable.split.commit=true
pinot.server.storage.factory.class.adl2=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
pinot.server.storage.factory.adl2.accountName=AZURE_ACCOUNT_NAME
pinot.server.storage.factory.adl2.accessKey=AZURE_ACCOUNT_KEY
pinot.controller.storage.factory.adl2.fileSystemName=AZURE_CONTAINER
pinot.server.segment.fetcher.protocols=file,http,adl2
pinot.server.segment.fetcher.adl2.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
Oh my bad. I have passed the config as pinot.controller.storage.factory.adl2.fileSystemName=AZURE_CONTAINER in pinot-server.conf file. Let me fix this and retry 🙂
s
oh yes
that's a good catch 🙂 we need to add that on the controller side conf 🙂
s
Yes, have changed it to pinot.server.storage.factory.adl2.fileSystemName=AZURE_CONTAINER and will retest the ingestion.
@Seunghyun it's working fine. The segment metadata has the download.uri of the Azure storage account, and the segment is showing GOOD status. I am currently using the accessKey as a hardcoded string in all the conf files (controller, server and job-spec.yaml). Is there some other way to provide the accessKey rather than hardcoding it? My Pinot cluster is running in Kubernetes.
In the case of S3, we allow passing the credentials via an ENV variable.
I need to double check whether the Azure service client supports passing secrets via an ENV variable the way AWS does.
Added issue https://github.com/apache/pinot/issues/9805 for this. The current implementation doesn't support picking the config up from an env variable.
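Until that is supported, one common interim approach (a sketch only; the resource names and mount path are assumptions, and the key still ends up inside the rendered conf file) is to keep the server conf containing the accessKey in a Kubernetes Secret and mount it into the pod, rather than baking the key into an image or a ConfigMap:
```
apiVersion: v1
kind: Secret
metadata:
  name: pinot-server-conf          # hypothetical name
  namespace: pinot-quickstart
stringData:
  pinot-server.conf: |
    pinot.server.storage.factory.class.adl2=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
    pinot.server.storage.factory.adl2.accountName=AZURE_ACCOUNT_NAME
    pinot.server.storage.factory.adl2.accessKey=REAL_ACCESS_KEY_GOES_HERE
    pinot.server.storage.factory.adl2.fileSystemName=AZURE_CONTAINER
# Then mount the Secret wherever the server reads its config, e.g.:
# volumes:
#   - name: server-conf
#     secret:
#       secretName: pinot-server-conf
# volumeMounts:
#   - name: server-conf
#     mountPath: /var/pinot/server/config   # hypothetical path
```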
s
Thanks @Seunghyun
@Seunghyun one more question. We have different Azure storage accounts where different jobs are publishing data, and we want all of this data to be available in Pinot.
• Is it possible for Pinot (server, controller) to refer to multiple different storage accounts as deep store?
• The other approach I was thinking of is a separate ingestion job (job-spec file) per source that reads the raw data from the different storage accounts and publishes the segments to a common storage account that Pinot refers to as its deep store.
  ◦ In this case the question is: can an ingestion job read from one storage account (input dir) and publish segments to a different storage account (output dir)?
s
I think the 2nd approach is better at the moment. Do you use tar file push or metadata-based push?
The ingestion job should be able to read data from accountA and push data to a Pinot cluster whose deep store is provisioned in accountB - the ingestion job spec should configure accountA's data as the source, and the Pinot controller config should point to the storage in accountB.
s
I am using metadata-based push.
How can I configure accountA as input and accountB as output in the ingestion job spec? This is my current job spec:
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentUriPushJobRunner'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentMetadataPushJobRunner'
  extraConfigs:
    stagingDir: adl2:///testing/SegmentCreationAndMetadataPush/staging/
jobType: SegmentCreationAndMetadataPush
inputDirURI: adl2://testing/SegmentCreationAndMetadataPush/input/
includeFileNamePattern: 'glob:**/*.json'
outputDirURI: adl2:///testing/SegmentCreationAndMetadataPush/output/
overwriteOutput: true
pinotFSSpecs:
- scheme: adl2
  className: org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
  configs:
    accountName: 'AZURE_ACCOUNT_NAME'
    accessKey: 'AZURE_ACCOUNT_KEY'
    fileSystemName: 'AZURE_CONTAINER'
recordReaderSpec:
  dataFormat: 'json'
  className: 'org.apache.pinot.core.data.readers.JSONRecordReader'
tableSpec:
  tableName: 'TABLE_NAME'
  schemaURI: 'http://CONTROLLER_IP:9000/tables/TABLE_NAME/schema'
  tableConfigURI: 'http://CONTROLLER_IP:9000/tables/TABLE_NAME'
segmentNameGeneratorSpec:
  type: inputFile
  configs:
    file.path.pattern: '.+/(.+)\.gz'
    segment.name.template: '\${filePathPattern:\1}'
    segment.name.prefix: 'batch'
    exclude.sequence.id: true
pinotClusterSpecs:
- controllerURI: 'http://CONTROLLER_IP:9000'
pushJobSpec:
  pushParallelism: 2
  pushAttempts: 2
s
Hmm, I see what you need. I think this doesn't look to be supported at the moment (I will take a deeper look tomorrow). The workaround would be writing a custom Spark job:
1. read data from accountA
2. generate the Pinot segments
3. write the output segments to accountB
4. use metadata push to upload the segments to the Pinot cluster
Another workaround is to use TarFilePush. This doesn't need the destination (accountB) info, since we upload the segment directly to the controller and the controller copies the data to the deep storage in accountB.
s
In the case of TarFilePush, once the segment is uploaded by the controller to deep storage, does it delete the file from its local storage? And will the segment metadata have the download URI of the deep storage rather than the controller?
s
The local storage will be cleaned up, and the segment download URL will point to the controller, so the data will always move through the controller. Yeah, it's a bit inefficient compared to the metadata push.
s
@Seunghyun I am currently using TarFilePush. The segments are properly pushed to the servers and are queryable, and I am not able to find any segments on the controller. One interesting thing I see is that I am not able to find the segments in the deep storage (StorageB) either. I also don't see any error logs on the controller. This is my pinot-controller conf:
controller.data.dir=adl2://testing/deep-store/
pinot.set.instance.id.to.hostname=true
controller.task.scheduler.enabled=true
controller.local.temp.dir=/var/pinot/controller/data-temp
controller.enable.split.commit=true
pinot.controller.storage.factory.class.adl2=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
pinot.controller.storage.factory.adl2.accountName=AZURE_ACCOUNT_NAME
pinot.controller.storage.factory.adl2.accessKey=AZURE_ACCOUNT_KEY
pinot.controller.storage.factory.adl2.fileSystemName=AZURE_CONTAINER
pinot.controller.segment.fetcher.protocols=file,http,adl2
pinot.controller.segment.fetcher.adl2.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
s
@Sumit Khaitan so have you checked the contents in adl2://testing/deep-store/ ?
Can you open up the Pinot controller UI and open the ZK browser? From there, you can go to PROPERTYSTORE/SEGMENTS/<table_name>/<segment_name> and check what segment.download.url is shown. Also, let's check the Pinot controller logs to see the destination path.
Can you grep for "Using segment download URI" in the controller log (a kubectl sketch is below)? This line should include the final location.
https://docs.pinot.apache.org/basics/data-import/batch-ingestion#2.-segment-uri-push
Once we clear up the tar file push, let's review the above doc and see if we can optimize further to use URI push or metadata push with deepStoreCopy.
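A sketch of that grep on a Kubernetes deployment, with the pod and namespace names taken from the hostnames in this thread:
```
kubectl logs pinot-controller-0 -n pinot-quickstart | grep "Using segment download URI"
```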
s
@Seunghyun the segment metadata has the controller URL in segment.download.url:
{
  "id": "file_json",
  "simpleFields": {
    "segment.crc": "2114409700",
    "segment.creation.time": "1668658910795",
    "segment.download.url": "<http://pinot-controller-0.pinot-controller-headless.pinot-quickstart.svc.cluster.local:9000/segments/TABLE_NAME/file_json>",
    "segment.end.time": "1668657462000",
    "segment.end.time.raw": "1668657462",
    "segment.index.version": "v3",
    "segment.push.time": "1668658970305",
    "segment.size.in.bytes": "457141",
    "segment.start.time": "1668657407000",
    "segment.start.time.raw": "1668657407",
    "segment.time.unit": "MILLISECONDS",
    "segment.total.docs": "11997"
  },
  "mapFields": {
    "custom.map": {
      "input.data.file.uri": "adl2:/logs/2022-11-17-03/file.json"
    }
  },
  "listFields": {}
}
Logs from the controller after grepping for `Using segment download URI`:
2022/11/17 04:22:55.699 INFO [PinotSegmentUploadDownloadRestletResource] [jersey-server-managed-async-executor-1] Using segment download URI: http://pinot-controller-0.pinot-controller-headless.pinot-quickstart.svc.cluster.local:9000/segments/TABLE_NAME/file_json for segment: /var/pinot/controller/data-temp/fileUploadTemp/tmp-3d44c1eb-d983-4ac5-b77e-22269e931881 of table: TABLE_NAME_OFFLINE
s
oh sorry, can you grep for "Copied segment:" instead?
LOGGER.info("Copied segment: {} of table: {} to final location: {}", segmentName, tableNameWithType,
          finalSegmentLocationURI);
The above should give us the final location.
@Sumit Khaitan ^^
s
Copied segment: logs_2022-11-17-04_file_json of table: TABLE_NAME_OFFLINE to final location: file:/var/pinot/controller/data,adl2://testing/deep-store//TABLE_NAME/file_json
But I am not able to find anything at adl2://testing/deep-store//TABLE_NAME/file_json. Could this be because of the // before the table name?
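One way to check directly whether anything landed under that prefix, assuming the Azure CLI is available - the account, key, and container placeholders match the ones used in the configs above, and the exact prefix to inspect depends on how the adl2:// URI maps onto the container:
```
# list what is actually in the ADLS Gen2 filesystem (container), then narrow down to the deep-store prefix
az storage fs file list \
  --account-name AZURE_ACCOUNT_NAME \
  --account-key AZURE_ACCOUNT_KEY \
  --file-system AZURE_CONTAINER \
  --output table
```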
s
hmm interesting.. the // in deep-store//TABLE_NAME/file_json looks a bit suspicious.. but you said that the servers are downloading files correctly.
And we don't see the following log, right?
LOGGER.error("Could not move segment {} from table {} to permanent directory", segmentName, tableNameWithType,
            e);
Also, let's try changing adl2://testing/deep-store/ -> adl2://testing/deep-store (i.e. drop the trailing slash) and see if this makes it show up..
s
LOGGER.error("Could not move segment {} from table {} to permanent directory", segmentName, tableNameWithType,
            e);
No, this log is not there.
adl2://testing/deep-store/ -> adl2://testing/deep-store
Sure, will try this one.
looks a bit suspicious.. but you said that the servers are downloading files correctly
-> Yes, I am able to query the segments from the Pinot UI, hence assuming that the servers are correctly downloading the segments.