# getting-started
m
Hi, folks - I am fairly new to Pinot, and I am just trying to understand what needs to be packaged into the directory (along with the data elements) for the UploadSegment command to work. I am invoking the following command from the "pinot" directory (where the "git clone" command brought in all the Pinot source code):
./build/bin/pinot-admin.sh UploadSegment -controllerHost A.B.C.D -controllerPort 9000 -segmentDir ./july-13-segment
where A.B.C.D is a Linux machine (provisioned through Google Cloud) that our team has set up as our initial instance of Pinot (we are initially provisioning just one instance, until we need to scale up). The july-13-segment directory contains just 3 files. Two are data files that are meant to wind up in the same segment; they are named 2022-07-13T22_02_50.179274000Z.json and 2022-07-13T22_02_52.770718122Z.json, and each contains a single JSON string that we were able to successfully import into Pinot via Kafka (until we determined that Kafka seemed to be our performance bottleneck). The third file, named schema.json, is the definition of the schema of the table I want the segments to go into. I'll attach it to this message, in case the contents matter. When I run the above command, the output/error messages are:
...   [Lots of messages about plugins]
Uploading segment tar file: ./july-13-segment/schema.json
Sending request: http://A.B.C.D:9000/v2/segments?tableName to controller: pinot-controller-0.pinot-controller-headless.pinot-quickstart.svc.cluster.local, version: Unknown
org.apache.pinot.common.exception.HttpErrorStatusException: Got error status code: 500 (Internal Server Error) with reason: "Exception while uploading segment: Input is not in the .gz format" while sending request: http://A.B.C.D:9000/v2/segments?tableName to controller: pinot-controller-0.pinot-controller-headless.pinot-quickstart.svc.cluster.local, version: Unknown
	at org.apache.pinot.common.utils.http.HttpClient.wrapAndThrowHttpException(HttpClient.java:442)
	at org.apache.pinot.common.utils.FileUploadDownloadClient.uploadSegment(FileUploadDownloadClient.java:597)
	at org.apache.pinot.tools.admin.command.UploadSegmentCommand.execute(UploadSegmentCommand.java:176)
	at org.apache.pinot.tools.Command.call(Command.java:33)
	at org.apache.pinot.tools.Command.call(Command.java:29)
	at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
	at picocli.CommandLine.access$1300(CommandLine.java:145)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
	at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
	at picocli.CommandLine.execute(CommandLine.java:2078)
	at org.apache.pinot.tools.admin.PinotAdministrator.execute(PinotAdministrator.java:165)
	at org.apache.pinot.tools.admin.PinotAdministrator.main(PinotAdministrator.java:196)
If, instead of specifying the directory with the 2 data files and the schema.json file, I create a july-13-segment.tar.gz file (the same directory, tarred and gzipped) and specify that filename instead of the directory, namely
build/bin/pinot-admin.sh UploadSegment -controllerHost A.B.C.D -controllerPort 9000 -segmentDir ./july-13-segment.tar.gz
I get nearly the same error message (just without the "Input is not in the .gz format" part of the error):
...
Executing command: UploadSegment -controllerProtocol http -controllerHost A.B.C.D -controllerPort 9000 -segmentDir ./july-13-segment.tar.gz
java.lang.NullPointerException
	at org.apache.pinot.shaded.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:770)
	at org.apache.pinot.tools.admin.command.UploadSegmentCommand.execute(UploadSegmentCommand.java:158)
	at org.apache.pinot.tools.Command.call(Command.java:33)
	at org.apache.pinot.tools.Command.call(Command.java:29)
	at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
	at picocli.CommandLine.access$1300(CommandLine.java:145)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
	at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
	at picocli.CommandLine.execute(CommandLine.java:2078)
	at org.apache.pinot.tools.admin.PinotAdministrator.execute(PinotAdministrator.java:165)
	at org.apache.pinot.tools.admin.PinotAdministrator.main(PinotAdministrator.java:196)
My question: The "UploadSegment" documentation is rather incomplete; it really does not spell out what needs to be included in the directory (or the tarred-and-gzipped version of the directory, except that the files should have a suffix indicating their type; I am using "json" for all 3). Do I need to include additional files (if so, what is needed?), or rename any of the files? (Thanks in advance!)
m
A Pinot segment in its uncompressed form is a directory containing data + metadata. This command takes that segment (in directory form) as input. Agreed that it is a bit confusing and could have a better UX.
However, the more popular way to ingest data is: https://docs.pinot.apache.org/basics/data-import/batch-ingestion
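For reference, a rough sketch of that batch-ingestion flow with the standalone runner is below. The table name mytable, the local directories, and the controller address A.B.C.D are placeholders for your setup, and it assumes the OFFLINE table and schema have already been created on the controller:
```bash
# Sketch only: write a standalone batch-ingestion job spec, then run it with pinot-admin.
# Point inputDirURI at a directory containing only the raw data JSON files
# (keep schema.json somewhere else so it is not picked up as data).
cat > /tmp/ingestion-job-spec.yaml <<'EOF'
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush        # build segments locally, then push the tars to the controller
inputDirURI: './july-13-rawdata'          # placeholder: raw JSON records, not a segment
includeFileNamePattern: 'glob:**/*.json'
outputDirURI: './july-13-segments'        # placeholder: where the generated segment tars are written
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'json'
  className: 'org.apache.pinot.plugin.inputformat.json.JSONRecordReader'
tableSpec:
  tableName: 'mytable'                    # placeholder: existing OFFLINE table
  schemaURI: 'http://A.B.C.D:9000/tables/mytable/schema'
  tableConfigURI: 'http://A.B.C.D:9000/tables/mytable'
pinotClusterSpecs:
  - controllerURI: 'http://A.B.C.D:9000'
EOF
./build/bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /tmp/ingestion-job-spec.yaml
```
The SegmentCreationAndTarPush job type builds the segment(s) from the raw JSON and pushes the resulting tar files to the controller, so you never have to assemble the segment directory (metadata, indexes, etc.) by hand.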
m
Thanks, Mayank -- I will give this a try today.
I was able to get the /ingestFromFile curl command to work -- now it's just a matter of getting the individual files that "add up" to the segment together into one file; the documentation indicates that API is "NOT meant for ... large input files"; how large is "large"? My two test files were 10K and 7K in size, each containing 2 seconds of data. I believe we may want 1 segment to represent 15 minutes of data; is 4.5MB of data hitting that "large" threshold? Or does "large" mean gigabytes (or larger)?
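For the record, the call I got working looks roughly like this (mytable_OFFLINE and the file name are placeholders, and the batchConfigMapStr value is the URL-encoded form of {"inputFormat":"json"}):
```bash
# Rough sketch: push one JSON file straight into an existing OFFLINE table via the controller.
# mytable_OFFLINE is a placeholder for the actual table name.
curl -X POST \
  -F file=@./2022-07-13T22_02_50.179274000Z.json \
  -H "Content-Type: multipart/form-data" \
  "http://A.B.C.D:9000/ingestFromFile?tableNameWithType=mytable_OFFLINE&batchConfigMapStr=%7B%22inputFormat%22%3A%22json%22%7D"
```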
m
Large means GB range
m
Great, thank you!