# troubleshooting
m
Hi, I am evaluating Apache Pinot and wanted to understand the deep storage options when deploying on Azure. From the docs, it seems like Azure Blob Storage is not supported and Azure Data Lake Storage has to be used as the deep storage. Can you please confirm the same? Also, is PinotFS the abstraction used for deep storage as well? The docs mention PinotFS in the context of importing data, hence this query.
k
That's right, only the ADLS implementation is available as of now. PinotFS is used as the deep storage abstraction as well. You can look at the ADLS implementation and add/contribute an Azure Blob Storage FS implementation.
m
Thanks @User @User for the quick response
@User I got a bit confused with the comment above that says "PinotFS is used for deep storage abstraction as well". Does it imply that there are other abstractions as well, and PinotFS is the new approach (or one of the approaches) to abstract deep storage?
k
My bad, PinotFS is the abstraction used to interact with deep storage.
m
Thanks, clear now 👍
m
@User To add more context, ADLS Gen2 is built on top of ABS (if I am not wrong), and it has better abstractions in terms of what PinotFS needs, so we went with ADLS instead of ABS.
m
Yes, ADLS is an abstraction over ABS. It is just that ABS is already in use in all Azure accounts, whereas ADLS is a new service, so some internal due diligence is required.
m
Got it.
m
I was trying to follow the documentation https://docs.pinot.apache.org/basics/data-import/pinot-file-system/import-from-adls-azure to use ADLS Gen2 as a deep store.
The docs suggest using an adl:// URL in controller.data.dir, but ADLS Gen2 seems to have only https:// and abfs:// endpoints.
Also, pinot.server.segment.fetcher.protocols=file,http,adl mentions only the adl protocol, although pinot.server.storage.factory.class.adl=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS points to the Gen2 implementation. Can you please help with the right configuration to leverage ADLS Gen2?
m
@User could we document the ADLS Gen2 configs?
m
It would be a great help if you could share the configuration here as well, please @User.
s
@User ADLSGen2PinotFS actually doesn't check the prefix in the existing implementation. We extract the path from the input URI. So, you can pass
abfs://path/to/input
and
abfs://path/to/output
for your configuration. Thanks for catching this; we will update our documentation. We need to update
adl -> abfs
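In other words, the URI prefix is not validated by the ADLSGen2PinotFS implementation itself, so a data dir along the lines of the sketch below should be accepted (the filesystem, account name, and deepstore path here are only placeholders, reusing the names that appear later in this thread):
controller.data.dir=abfs://ev-pinot-adsl@evpinotadsl.dfs.core.windows.net/deepstore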
m
Thanks @User, I will try this out. Also, if I now change controller.data.dir from /local/linux/path to an abfs:// URI, would the existing segments that were written to /local/linux/path be moved by the controller to the deep store?
s
@User I don't think we have automatic migration. You would have to manually copy the existing segments to the ADLS deep storage location.
m
OK @User, and how would the ZooKeeper znodes get updated? Please share if you have it handy, or I can go through the docs as well.
s
Can you elaborate on your question,
"how would zookeeper znodes get updated?"
m
If I copy the segment files to deep storage, the metadata would need to be updated as well, I suppose?
s
Can you post the segment metadata? The old metadata from local storage vs. the new metadata after ADLS.
Anyway, we don't provide a tool for correcting the metadata.
I recommend writing a script in either Java or Python to connect to ZK directly, iterate through all the segments under
PROPERTYSTORE/segments/<table_name>/<segment_name>
and fix the URI if needed.
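As a rough illustration of that suggestion (not an official Pinot tool), a sketch in Python using the kazoo ZooKeeper client could look like the following. The ZK address, cluster name, property-store path, and old/new URL prefixes are placeholders that have to match your deployment, it assumes the segment records are stored as the JSON shown later in this thread, and the segment directories would already have been copied to ADLS (e.g. with azcopy) beforehand.
# Sketch only: rewrite segment.download.url in ZK for segments copied to ADLS.
# Requires the kazoo client (pip install kazoo). All constants below are placeholders.
import json
from kazoo.client import KazooClient

ZK_HOSTS = "localhost:2181"  # placeholder ZK address
# Placeholder path; check the actual cluster name and property-store layout in your ZK.
SEGMENTS_ROOT = "/PinotCluster/PROPERTYSTORE/SEGMENTS/bcm_REALTIME"
OLD_PREFIX = "http://pinot-controller-0.pinot-controller-headless"  # controller-hosted URLs
NEW_PREFIX = "abfs://ev-pinot-adsl@evpinotadsl.dfs.core.windows.net/deepstore/bcm"  # placeholder

zk = KazooClient(hosts=ZK_HOSTS)
zk.start()
try:
    for segment in zk.get_children(SEGMENTS_ROOT):
        path = f"{SEGMENTS_ROOT}/{segment}"
        data, _ = zk.get(path)
        record = json.loads(data)  # assumes the znode holds the JSON record shown in this thread
        url = record.get("simpleFields", {}).get("segment.download.url", "")
        if url.startswith(OLD_PREFIX):
            # Point the download URL at the segment's new deep-store location.
            record["simpleFields"]["segment.download.url"] = f"{NEW_PREFIX}/{segment}"
            zk.set(path, json.dumps(record).encode("utf-8"))
            print(f"updated {segment}")
finally:
    zk.stop()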
m
@User Here is an example from my instance
{ "id": "bcm__0__100__20220405T0229Z", "simpleFields": { "segment.crc": "467152823", "segment.creation.time": "1649125772669", "segment.download.url": "http://pinot-controller-0.pinot-controller-headless.ev-team-apache-pinot.svc.cluster.local:9000/segments/bcm/bcm__0__100__20220405T0229Z", "segment.end.time": "1647949679176", "segment.flush.threshold.size": "2500000", "segment.index.version": "v3", "segment.realtime.endOffset": "1140058256", "segment.realtime.numReplicas": "1", "segment.realtime.startOffset": "1137558256", "segment.realtime.status": "DONE", "segment.start.time": "1641627252523", "segment.time.unit": "MILLISECONDS", "segment.total.docs": "2500000" }, "mapFields": {}, "listFields": {} }
I am guessing the segment download URL should be an abfs:// URL after the deep store is enabled?
m
yes
m
OK, I will try abfs and let you know how it goes.
s
I think that it depends on your upload approach. Are you using URI push with ADLS?
if download/upload still goes through the controller, the current metadata may work
m
I am using realtime ingestion, so I guess the old segments can remain on the controller and the new ones can get written to abfs?
Data directory: abfs://ev-pinot-adsl@evpinotadsl.dfs.core.windows.net/
Failed to start a Pinot [CONTROLLER] at 5.364 since launch
java.lang.RuntimeException: Caught exception while initializing ControllerFilePathProvider
    at org.apache.pinot.controller.BaseControllerStarter.initControllerFilePathProvider(BaseControllerStarter.java:543) ~[pinot-all-0.10.0-SNAPSHOT-jar-with-dependencies.jar:0.10.0-SNAPSHOT-078c711d35769be2dc4e4b7e235e06744cf0bba7]
    at org.apache.pinot.controller.BaseControllerStarter.setUpPinotController(BaseControllerStarter.java:367) ~[pinot-all-0.10.0-SNAPSHOT-jar-with-dependencies.jar:0.10.0-SNAPSHOT-078c711d35769be2dc4e4b7e235e06744cf0bba7]
    at org.apache.pinot.controller.BaseControllerStarter.start(BaseControllerStarter.java:315) ~[pinot-all-0.10.0-SNAPSHOT-jar-with-dependencies.jar:0.10.0-SNAPSHOT-078c711d35769be2dc4e4b7e235e06744cf0bba7]
    at org.apache.pinot.tools.service.PinotServiceManager.startController(PinotServiceManager.java:118) ~[pinot-all-0.10.0-SNAPSHOT-jar-with-dependencies.jar:0.10.0-SNAPSHOT-078c711d35769be2dc4e4b7e235e06744cf0bba7]
    at org.apache.pinot.tools.service.PinotServiceManager.startRole(PinotServiceManager.java:87) ~[pinot-all-0.10.0-SNAPSHOT-jar-with-dependencies.jar:0.10.0-SNAPSHOT-078c711d35769be2dc4e4b7e235e06744cf0bba7]
    at org.apache.pinot.tools.admin.command.StartServiceManagerCommand.lambda$startBootstrapServices$0(StartServiceManagerCommand.java:248) ~[pinot-all-0.10.0-SNAPSHOT-jar-with-dependencies.jar:0.10.0-SNAPSHOT-078c711d35769be2dc4e4b7e235e06744cf0bba7]
Caused by: java.lang.IllegalStateException: PinotFS for scheme: abfs has not been initialized
    at shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:518) ~[pinot-all-0.10.0-SNAPSHOT-jar-with-dependencies.jar:0.10.0-SNAPSHOT-078c711d35769be2dc4e4b7e235e06744cf0bba7]
    at org.apache.pinot.spi.filesystem.PinotFSFactory.create(PinotFSFactory.java:78) ~[pinot-all-0.10.0-SNAPSHOT-jar-with-dependencies.jar:0.10.0-SNAPSHOT-078c711d35769be2dc4e4b7e235e06744cf0bba7]
    at org.apache.pinot.controller.api.resources.ControllerFilePathProvider.<init>(ControllerFilePathProvider.java:70) ~[pinot-all-0.10.0-SNAPSHOT-jar-with-dependencies.jar:0.10.0-SNAPSHOT-078c711d35769be2dc4e4b7
This was the controller config tried:
extra:
  configs: |-
    pinot.set.instance.id.to.hostname=true
    controller.task.scheduler.enabled=true
    controller.local.temp.dir=/tmp/pinot-tmp-data/
    pinot.controller.storage.factory.class.adl=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
    pinot.controller.storage.factory.adl.accountName=evpinotadsl
    pinot.controller.storage.factory.adl.accessKey=<accesskey>
    pinot.controller.storage.factory.adl.fileSystemName=ev-pinot-adsl
    pinot.controller.segment.fetcher.protocols=file,http,abfs
    pinot.controller.segment.fetcher.adl.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
Do the keys in the above config (pinot.controller.storage.factory.adl.*) also need to be renamed to abfs?
@User
Can you please confirm whether the keys in the above config also need to be changed from .adl to .abfs?
s
can you try to modify
pinot.controller.segment.fetcher.protocols=file,http,abfs -> file,http,adl
and retry?
m
ok will try this...will leave the other config params as is
s
It's a bit confusing, but I think we tried to make the implementation generic. For config, the convention is like:
pinot.controller.storage.factory.class.<protocol_name>=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
pinot.controller.storage.factory.<protocol_name>.accountName=evpinotadsl
pinot.controller.storage.factory.<protocol_name>.accessKey=<accesskey>
pinot.controller.storage.factory.<protocol_name>.fileSystemName=ev-pinot-adsl
pinot.controller.segment.fetcher.protocols=file,http,<protocol_name>
pinot.controller.segment.fetcher.<protocol_name>.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
As long as you keep <protocol_name> the same, it should work (e.g. replacing adl -> adl2 or adl -> abfs would also work, but you need to replace it everywhere).
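For example, applying that convention with abfs as the protocol name everywhere would look roughly like the sketch below. This is only an illustration, not the verified final config from this thread; the account name, access key, filesystem name, and data dir value reuse the placeholders that appear elsewhere in the thread.
pinot.controller.storage.factory.class.abfs=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
pinot.controller.storage.factory.abfs.accountName=evpinotadsl
pinot.controller.storage.factory.abfs.accessKey=<accesskey>
pinot.controller.storage.factory.abfs.fileSystemName=ev-pinot-adsl
pinot.controller.segment.fetcher.protocols=file,http,abfs
pinot.controller.segment.fetcher.abfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
controller.data.dir=abfs://ev-pinot-adsl@evpinotadsl.dfs.core.windows.net/deepstore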
m
Thanks @User will try it out later today and let you know how it goes
@User the controller started up fine after this setting. Will wait for a couple of hours to see if the segments are getting written to ADLS as well.
It's writing segments to the deep store as well now.
I hope it's alright to have some segments on the controller and the new ones on ADLS.
{ "id": "network__2__125__20220405T2002Z", "simpleFields": { "segment.crc": "787866105", "segment.creation.time": "1649188966339", "segment.download.url": "http://pinot-controller-0.pinot-controller-headless.ev-team-apache-pinot.svc.cluster.local:9000/segments/network/network__2__125__20220405T2002Z", "segment.end.time": "1649207336806", "segment.flush.threshold.size": "2500000", "segment.index.version": "v3", "segment.realtime.endOffset": "910857214", "segment.realtime.numReplicas": "1", "segment.realtime.startOffset": "908357214", "segment.realtime.status": "DONE", "segment.start.time": "1646715959171", "segment.time.unit": "MILLISECONDS", "segment.total.docs": "2500000" }, "mapFields": {}, "listFields": {} }.
vs. the new segment below, which has the abfs download URL:
{ "id": "network__2__126__20220406T0109Z", "simpleFields": { "segment.crc": "1547807215", "segment.creation.time": "1649207376167", "segment.download.url": "abfs://ev-pinot-adsl@evpinotadsl.dfs.core.windows.net/deepstore/network/network__2__126__20220406T0109Z", "segment.end.time": "1649223125393", "segment.flush.threshold.size": "2500000", "segment.index.version": "v3", "segment.realtime.endOffset": "913357214", "segment.realtime.numReplicas": "1", "segment.realtime.startOffset": "910857214", "segment.realtime.status": "DONE", "segment.start.time": "1641627420516", "segment.time.unit": "MILLISECONDS", "segment.total.docs": "2500000" }, "mapFields": {}, "listFields": {} }
Assuming the servers will be using the download.url from the ZK nodes to fetch the file, I am hoping it's OK to have some segment data persisted on the controller disk as well.
s
@User cool. let me know if you face other issues
m
Sure, thank you for helping out on this issue.