# troubleshooting
s
Hi team. I am new to Pinot and have a use case. We have per-minute files arriving in Azure Blob Storage and want to load those files into Pinot.
Can Pinot directly read and ingest those per-minute files from Azure Blob Storage, or does there have to be a Spark/ETL pipeline that ingests the data into Pinot?
l
To my knowledge, yes - you would need some ETL process that gets that data from there into Pinot, either through an offline batch job, or maybe by throwing those files into Kafka and ingesting them into Pinot.
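For reference, a minimal sketch of the Kafka route mentioned above - the streamConfigs fragment of a realtime table config. The topic, broker, and flush-threshold values are placeholders, not something used in this thread:
```
"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.topic.name": "minutely-files-topic",
  "stream.kafka.broker.list": "kafka-broker:9092",
  "stream.kafka.consumer.type": "lowlevel",
  "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
  "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
  "realtime.segment.flush.threshold.rows": "100000"
}
```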
m
@Seunghyun ^^
s
@Sumit Khaitan We do have an ADLS Gen2 connector but no ABS connector available at the moment. Is it possible for you to create the ADLS Gen2 storage and make the files available there? Once the data is available, you can set up the ingestion pipeline using a Spark/ETL flow. We provide sample jobs for the different platforms (standalone/MapReduce/Spark, etc.).
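For reference, once a batch ingestion job spec is written, the job can be kicked off with the bundled admin script - a minimal sketch assuming a standalone run and a hypothetical spec path (the Spark flow instead submits the same spec via spark-submit with the Pinot batch-ingestion plugin jars):
```
# standalone execution of the batch ingestion job described by the spec file
bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /path/to/ingestion-job-spec.yaml
```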
s
Thanks @Seunghyun. I was able to set up an ingestion pipeline that reads data from Azure Blob Storage using the ADLS Gen2 connector and loads it into an offline table.
s
@Sumit Khaitan Glad that it worked. The ADLS Gen2 connector uses both the ABS and ADLS clients for different operations, so some functionality may not fully work if you use the ADLS Gen2 connector directly against ABS. The data ingestion pipeline is probably fine since you tested it, but if you use ABS as deep storage through the ADLS connector, it may not work fully. To address these issues, we have a long-term plan to add a dedicated ABS connector. Please stay tuned 🙂
s
@Seunghyun I actually ran into an issue. I have configured an Azure Storage Account as deep storage using the ADLS Gen2 connector. I am getting this error in the server when it's trying to download the segment from the Azure Storage Account:
2022/11/15 11:11:14.605 WARN [PinotFSSegmentFetcher] [HelixTaskExecutor-message_handle_thread_5] Caught exception while fetching segment from: adl2:/testing/SegmentCreationAndMetadataPush/output/testing_SegmentCreationAndMetadataPush_input_SegmentCreationAndMetadataPush_json.tar.gz to: /var/pinot/server/data/index/table_OFFLINE/tmp/tmp-testing_SegmentCreationAndMetadataPush_input_SegmentCreationAndMetadataPush_json-f4d9175f-ac16-4354-94aa-2cc466987e0c/testing_SegmentCreationAndMetadataPush_input_SegmentCreationAndMetadataPush_json.tar.gz
com.azure.storage.blob.models.BlobStorageException: Status code 400, "<?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidUri</Code><Message>The requested URI does not represent any resource on the server.
RequestId:ff0f8bb8-101e-0008-70e2-f82663000000
Time:2022-11-15T11:11:14.6025637Z</Message><UriPath>https://AZURE_STORAGE_ACCOUNT.blob.core.windows.net/$root/testing/SegmentCreationAndMetadataPush/output/testing_SegmentCreationAndMetadataPush_input_SegmentCreationAndMetadataPush_json.tar.gz</UriPath></Error>"
	at jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:?]
	at jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:?]
	at jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:?]
	at java.lang.reflect.Constructor.newInstance(Constructor.java:490) ~[?:?]
	at com.azure.core.http.rest.RestProxy.instantiateUnexpectedException(RestProxy.java:390) ~[pinot-adls-0.12.0-SNAPSHOT-shaded.jar:0.12.0-SNAPSHOT-78504b941331681a8b5bd27e37c176a97e5bbcca]
Job spec file:
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentUriPushJobRunner'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentMetadataPushJobRunner'
  extraConfigs:
    stagingDir: adl2:///testing/SegmentCreationAndMetadataPush/staging/
jobType: SegmentCreationAndMetadataPush
inputDirURI: adl2://testing/SegmentCreationAndMetadataPush/input/
includeFileNamePattern: 'glob:**/*.json'
outputDirURI: adl2:///testing/SegmentCreationAndMetadataPush/output/
overwriteOutput: true
pinotFSSpecs:
- scheme: adl2
  className: org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
  configs:
    accountName: 'AZURE_ACCOUNT_NAME'
    accessKey: 'AZURE_ACCOUNT_KEY'
    fileSystemName: 'AZURE_CONTAINER'
recordReaderSpec:
  dataFormat: 'json'
  className: 'org.apache.pinot.core.data.readers.JSONRecordReader'
tableSpec:
  tableName: 'TABLE_NAME'
  schemaURI: 'http://CONTROLLER_IP:9000/tables/TABLE_NAME/schema'
  tableConfigURI: 'http://CONTROLLER_IP:9000/tables/TABLE_NAME'
segmentNameGeneratorSpec:
  type: inputFile
  configs:
    file.path.pattern: '.+/(.+)\.gz'
    segment.name.template: '\${filePathPattern:\1}'
    segment.name.prefix: 'batch'
    exclude.sequence.id: true
pinotClusterSpecs:
- controllerURI: 'http://CONTROLLER_IP:9000'
pushJobSpec:
  pushParallelism: 2
  pushAttempts: 2
Extra config in pinot-server.conf:
pinot.server.instance.enable.split.commit=true
pinot.server.storage.factory.class.adl2=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
pinot.server.storage.factory.adl2.accountName=AZURE_ACCOUNT_NAME
pinot.server.storage.factory.adl2.accessKey=AZURE_ACCOUNT_KEY
pinot.controller.storage.factory.adl2.fileSystemName=AZURE_CONTAINER
pinot.server.segment.fetcher.protocols=file,http,adl2
pinot.server.segment.fetcher.adl2.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
Extra config in pinot-controller.conf:
controller.data.dir=adl2://testing/deep-store/
controller.local.temp.dir=/var/pinot/controller/data-temp
controller.enable.split.commit=true
pinot.controller.storage.factory.class.adl2=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
pinot.controller.storage.factory.adl2.accountName=AZURE_ACCOUNT_NAME
pinot.controller.storage.factory.adl2.accessKey=AZURE_ACCOUNT_KEY
pinot.controller.storage.factory.adl2.fileSystemName=AZURE_CONTAINER
pinot.controller.segment.fetcher.protocols=file,http,adl2
pinot.controller.segment.fetcher.adl2.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
Also, to add on: when I use the SegmentCreationAndTarPush job type it works fine, but in that case the segments are stored on the controller and not in the Azure storage account.
s
Did you provision Azure Blob Storage and try to use the ADLS Gen2 connector to access it? That won't work well, because we currently don't have official support for ABS. For testing purposes, please provision ADLS Gen2 storage and hook that up as the deep storage. Assuming you have already done this, can you search for the following string?
ADLSGen2PinotFS is initialized
https://github.com/apache/pinot/blob/master/pinot-plugins/pinot-file-system/pinot-[…]in/java/org/apache/pinot/plugin/filesystem/ADLSGen2PinotFS.java
Also, from your exception log, this part is suspicious:
https://AZURE_STORAGE_ACCOUNT.blob.core.windows.net/$root/testing/SegmentCreationAndMetadataPush/output/testing_SegmentCreationAndMetadataPush_input_SegmentCreationAndMetadataPush_json.tar.gz
It looks like AZURE_STORAGE_ACCOUNT needs to be resolved to the environment variable value that you passed.
@Sumit Khaitan ^^
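A quick way to run that search on a Kubernetes deployment - the pod and namespace names below are assumptions based on the hostnames that appear later in this thread:
```
kubectl logs pinot-server-0 -n pinot-quickstart | grep "ADLSGen2PinotFS is initialized"
```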
s
https://AZURE_STORAGE_ACCOUNT.blob.core.windows.net/$root/testing/SegmentCreationAndMetadataPush/output/testing_SegmentCreationAndMetadataPush_input_SegmentCreationAndMetadataPush_json.tar.gz
Regarding this: actually AZURE_STORAGE_ACCOUNT is getting resolved correctly, I have just replaced the name. What looks suspicious to me is that it's not resolving the container name from the fileSystemName and is using the $root container instead.
Assuming you have already done this, can you search for the following string?
-> In pinot-server ?? @Seunghyun
Yes, able to find the log in the pinot server: 2022/11/15 05:44:33.392 INFO [ADLSGen2PinotFS] [Start a Pinot [SERVER]] ADLSGen2PinotFS is initialized (accountName=AZURE_STORAGE_ACCOUNT, fileSystemName=null, dfsServiceEndpointUrl=https://AZURE_STORAGE_ACCOUNT.dfs.core.windows.net, blobServiceEndpointUrl=https://AZURE_STORAGE_ACCOUNT.blob.core.windows.net, enableChecksum=false). As suspected, fileSystemName=null. But I have passed the container name as fileSystemName in the extra configs in pinot-server.conf:
pinot.server.instance.enable.split.commit=true
pinot.server.storage.factory.class.adl2=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
pinot.server.storage.factory.adl2.accountName=AZURE_ACCOUNT_NAME
pinot.server.storage.factory.adl2.accessKey=AZURE_ACCOUNT_KEY
pinot.controller.storage.factory.adl2.fileSystemName=AZURE_CONTAINER
pinot.server.segment.fetcher.protocols=file,http,adl2
pinot.server.segment.fetcher.adl2.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
Oh my bad. I have passed the config as pinot.controller.storage.factory.adl2.fileSystemName=AZURE_CONTAINER in pinot-server.conf file. Let me fix this and retry 🙂
s
oh yes
that's a good catch 🙂 we need to add that on the controller side conf 🙂
s
Yes, have changed it to pinot.server.storage.factory.adl2.fileSystemName=AZURE_CONTAINER and will retest the ingestion.
@Seunghyun it's working fine. The segment metadata has the download.uri of the Azure storage account, and the segment is showing GOOD status. I am currently using the accessKey as a hardcoded string in all the conf files (controller, server and job-spec.yaml). Is there some other way to provide the accessKey rather than hardcoding it? My Pinot cluster is running in Kubernetes.
In the case of S3, we allow passing the credentials via an ENV variable.
I need to double check whether the Azure service client supports passing secrets via an ENV variable the way AWS does.
Added issue https://github.com/apache/pinot/issues/9805 for this. The current implementation doesn't support picking the config up from an env variable.
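Until that is supported, one common interim approach (a sketch only; the resource names and mount path are assumptions, and the key still ends up inside the rendered conf file) is to keep the server conf containing the accessKey in a Kubernetes Secret and mount it into the pod, rather than baking the key into an image or a ConfigMap:
```
apiVersion: v1
kind: Secret
metadata:
  name: pinot-server-conf          # hypothetical name
  namespace: pinot-quickstart
stringData:
  pinot-server.conf: |
    pinot.server.storage.factory.class.adl2=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
    pinot.server.storage.factory.adl2.accountName=AZURE_ACCOUNT_NAME
    pinot.server.storage.factory.adl2.accessKey=REAL_ACCESS_KEY_GOES_HERE
    pinot.server.storage.factory.adl2.fileSystemName=AZURE_CONTAINER
# Then mount the Secret wherever the server reads its config, e.g.:
# volumes:
#   - name: server-conf
#     secret:
#       secretName: pinot-server-conf
# volumeMounts:
#   - name: server-conf
#     mountPath: /var/pinot/server/config   # hypothetical path
```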
s
Thanks @Seunghyun
@Seunghyun one more question. We have different Azure storage accounts where different jobs are publishing data, and we want all of this data to be available in Pinot.
• Is it possible for Pinot (server, controller) to refer to multiple different storage accounts as deep store?
• The other approach I was thinking of is a separate ingestion job (job-spec file) per source that reads the raw data from the different storage accounts and publishes the segments to a common storage account that Pinot refers to as its deep store.
  ◦ In this case the question is: can an ingestion job read from one storage account (input dir) and publish segments to a different storage account (output dir)?
s
I think the 2nd approach is better at the moment. Do you use tar file push or metadata-based push?
The ingestion job should be able to read data from accountA and push data to a Pinot cluster whose deep store is provisioned in accountB - the ingestion job spec should configure accountA's data as the source, and the Pinot controller config should point to the storage in accountB.
s
I am using metadata-based push.
How can I configure accountA as input and accountB as output in the ingestion job spec? This is my current job spec:
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentUriPushJobRunner'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentMetadataPushJobRunner'
  extraConfigs:
    stagingDir: adl2:///testing/SegmentCreationAndMetadataPush/staging/
jobType: SegmentCreationAndMetadataPush
inputDirURI: adl2://testing/SegmentCreationAndMetadataPush/input/
includeFileNamePattern: 'glob:**/*.json'
outputDirURI: adl2:///testing/SegmentCreationAndMetadataPush/output/
overwriteOutput: true
pinotFSSpecs:
- scheme: adl2
  className: org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
  configs:
    accountName: 'AZURE_ACCOUNT_NAME'
    accessKey: 'AZURE_ACCOUNT_KEY'
    fileSystemName: 'AZURE_CONTAINER'
recordReaderSpec:
  dataFormat: 'json'
  className: 'org.apache.pinot.core.data.readers.JSONRecordReader'
tableSpec:
  tableName: 'TABLE_NAME'
  schemaURI: 'http://CONTROLLER_IP:9000/tables/TABLE_NAME/schema'
  tableConfigURI: 'http://CONTROLLER_IP:9000/tables/TABLE_NAME'
segmentNameGeneratorSpec:
  type: inputFile
  configs:
    file.path.pattern: '.+/(.+)\.gz'
    segment.name.template: '\${filePathPattern:\1}'
    segment.name.prefix: 'batch'
    exclude.sequence.id: true
pinotClusterSpecs:
- controllerURI: 'http://CONTROLLER_IP:9000'
pushJobSpec:
  pushParallelism: 2
  pushAttempts: 2
s
Hmm, I see what you need. I think this doesn't look to be supported at the moment (I will take a deeper look tomorrow). The workaround would be writing a custom Spark job:
1. read data from accountA
2. generate the Pinot segments
3. write the output segments to accountB
4. use metadata push to upload the segments to the Pinot cluster
Another workaround is to use TarFilePush. This doesn't need the destination (accountB) info, since we upload the segment directly to the controller and the controller copies the data to the deep storage in accountB.
s
In the case of TarFilePush, once the segment is uploaded by the controller to deep storage, does it delete the file from its local storage? And will the segment metadata have the download URI of the deep storage rather than the controller?
s
The local storage will be cleaned up, and the segment download URL will point to the controller, so the data will always move through the controller. Yeah, it's a bit inefficient compared to the metadata push.
s
@Seunghyun I am currently using TarFilePush. The segments are properly pushed to the servers and are queryable, and I am not able to find any segments on the controller. One interesting thing I see is that I am not able to find the segments in the deep storage (StorageB) either. I also don't see any error logs on the controller. This is my pinot-controller conf:
controller.data.dir=adl2://testing/deep-store/
pinot.set.instance.id.to.hostname=true
controller.task.scheduler.enabled=true
controller.local.temp.dir=/var/pinot/controller/data-temp
controller.enable.split.commit=true
pinot.controller.storage.factory.class.adl2=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
pinot.controller.storage.factory.adl2.accountName=AZURE_ACCOUNT_NAME
pinot.controller.storage.factory.adl2.accessKey=AZURE_ACCOUNT_KEY
pinot.controller.storage.factory.adl2.fileSystemName=AZURE_CONTAINER
pinot.controller.segment.fetcher.protocols=file,http,adl2
pinot.controller.segment.fetcher.adl2.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
s
@Sumit Khaitan so have you checked the contents in adl2://testing/deep-store/ ?
Can you open up the Pinot controller UI and open the ZK browser? From there, you can go to PROPERTYSTORE/SEGMENTS/<table_name>/<segment_name> and check what segment.download.url is shown. Also, let's check the Pinot controller logs to see the destination path.
Can you grep for "Using segment download URI" in the controller log (a kubectl sketch is below)? This line should include the final location.
https://docs.pinot.apache.org/basics/data-import/batch-ingestion#2.-segment-uri-push
Once we clear up the tar file push, let's review the above doc and see if we can optimize further to use URI push or metadata push with deepStoreCopy.
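A sketch of that grep on a Kubernetes deployment, with the pod and namespace names taken from the hostnames in this thread:
```
kubectl logs pinot-controller-0 -n pinot-quickstart | grep "Using segment download URI"
```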
s
@Seunghyun the segment metadata has the controller URL in segment.download.url:
{
  "id": "file_json",
  "simpleFields": {
    "segment.crc": "2114409700",
    "segment.creation.time": "1668658910795",
    "segment.download.url": "<http://pinot-controller-0.pinot-controller-headless.pinot-quickstart.svc.cluster.local:9000/segments/TABLE_NAME/file_json>",
    "segment.end.time": "1668657462000",
    "segment.end.time.raw": "1668657462",
    "segment.index.version": "v3",
    "segment.push.time": "1668658970305",
    "segment.size.in.bytes": "457141",
    "segment.start.time": "1668657407000",
    "segment.start.time.raw": "1668657407",
    "segment.time.unit": "MILLISECONDS",
    "segment.total.docs": "11997"
  },
  "mapFields": {
    "custom.map": {
      "input.data.file.uri": "adl2:/logs/2022-11-17-03/file.json"
    }
  },
  "listFields": {}
}
Logs from the controller after grepping for `Using segment download URI`:
2022/11/17 04:22:55.699 INFO [PinotSegmentUploadDownloadRestletResource] [jersey-server-managed-async-executor-1] Using segment download URI: http://pinot-controller-0.pinot-controller-headless.pinot-quickstart.svc.cluster.local:9000/segments/TABLE_NAME/file_json for segment: /var/pinot/controller/data-temp/fileUploadTemp/tmp-3d44c1eb-d983-4ac5-b77e-22269e931881 of table: TABLE_NAME_OFFLINE
s
oh sorry, can you grep for "Copied segment:" instead?
LOGGER.info("Copied segment: {} of table: {} to final location: {}", segmentName, tableNameWithType,
          finalSegmentLocationURI);
The above should give us the final location.
@Sumit Khaitan ^^
s
Copied segment: logs_2022-11-17-04_file_json of table: TABLE_NAME_OFFLINE to final location: file:/var/pinot/controller/data,adl2://testing/deep-store//TABLE_NAME/file_json
But I am not able to find anything at adl2://testing/deep-store//TABLE_NAME/file_json. Could this be because of the // before the table name?
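One way to check directly whether anything landed under that prefix, assuming the Azure CLI is available - the account, key, and container placeholders match the ones used in the configs above, and the exact prefix to inspect depends on how the adl2:// URI maps onto the container:
```
# list what is actually in the ADLS Gen2 filesystem (container), then narrow down to the deep-store prefix
az storage fs file list \
  --account-name AZURE_ACCOUNT_NAME \
  --account-key AZURE_ACCOUNT_KEY \
  --file-system AZURE_CONTAINER \
  --output table
```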
s
hmm interesting.. the // in deep-store//TABLE_NAME/file_json looks a bit suspicious.. but you said that the servers are downloading files correctly.
And we don't see the following log, right?
LOGGER.error("Could not move segment {} from table {} to permanent directory", segmentName, tableNameWithType,
            e);
Also, let's try changing adl2://testing/deep-store/ -> adl2://testing/deep-store (i.e. drop the trailing slash) and see if this makes it show up..
s
LOGGER.error("Could not move segment {} from table {} to permanent directory", segmentName, tableNameWithType,
            e);
No, this log is not there.
adl2://testing/deep-store/ -> adl2://testing/deep-store
Sure, will try this one.
looks a bit suspicious.. but you said that the servers are downloading files correctly
-> Yes, I am able to query the segments from the Pinot UI, hence assuming that the servers are correctly downloading the segments.