# troubleshooting

Luis Fernandez

05/10/2022, 3:55 PM
hello friends!! we are encountering some issues when migrating data using the job spec. we are basically migrating a bunch of json files in gcs into pinot, where a json file looks like this:
```
{"serve_time":1623110400.00000000,"p_id":8.0476135E7,"u_id":6047599.0,"i_count":1}
{"serve_time":1623110400.00000000,"p_id":8.1923416E7,"u_id":5407252.0,"i_count":1,"c_count":1,"c":17}
```
we end up getting this exception for some of the files:
```
2022/05/10 15:48:19.314 ERROR [SegmentGenerationJobRunner] [pool-2-thread-1] Failed to generate Pinot segment for file - <gs://rb>
lau_tmp/raw_data/date=2020-07-25/part-00168-c741f867-338d-4c84-afaf-428f85c14088.c000.json
java.lang.RuntimeException: Unexpected end-of-input within/between Object entries
```
do you know why we may end up getting these errors?

Diogo Baeder

05/10/2022, 4:05 PM
Looks like a JSONL file, not a regular JSON file. If you want a JSON file with multiple rows, you need to put each row as an item of a list, instead of each dict as a line in the file. (E.g. just wrap that whole content with square brackets.)
AFAIK Pinot doesn't support JSONL.
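If you wanted to convert one of these files, a quick one-off script could look something like this (just a sketch; the file names here are made up):
```python
import json

# Read the JSONL file: one JSON object per line.
with open("part-00168.json") as src:  # hypothetical input name
    rows = [json.loads(line) for line in src if line.strip()]

# Write the same rows back out as a single JSON array.
with open("part-00168-array.json", "w") as dst:  # hypothetical output name
    json.dump(rows, dst)
```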

Luis Fernandez

05/10/2022, 4:06 PM
hey thank you, we were following this: https://dev.startree.ai/docs/pinot/recipes/ingest-json-files, and it seems like it should be ok?

Diogo Baeder

05/10/2022, 4:09 PM
Oh... then I don't know, to be honest. What I do know is that that format is not valid JSON - regular JSON parsers won't be able to read it as JSON. That format is JSONL (notice the "L" at the end), where a file has multiple lines and each line contains a valid JSON string.
In my case, in the system I'm developing with Pinot as a database, I'm ingesting from regular JSON files, which always start with a square bracket and end with one.

Luis Fernandez

05/10/2022, 4:14 PM
so what you are saying is that this
```
{"serve_time":1623110400.00000000,"p_id":8.0476135E7,"u_id":6047599.0,"i_count":1}
{"serve_time":1623110400.00000000,"p_id":8.1923416E7,"u_id":5407252.0,"i_count":1,"c_count":1,"c":17}
```
should become this
```
[{"serve_time":1623110400.00000000,"p_id":8.0476135E7,"u_id":6047599.0,"i_count":1},
{"serve_time":1623110400.00000000,"p_id":8.1923416E7,"u_id":5407252.0,"i_count":1,"c_count":1,"c":17}]
```

Diogo Baeder

05/10/2022, 4:21 PM
Yeah, maybe that makes it work. Just try it, if it works then that was the problem 🙂

Luis Fernandez

05/10/2022, 4:21 PM
the weirdest thing is that we just tried one of the files that failed, that one file in particular with the job spec, and it worked 😄

Diogo Baeder

05/10/2022, 4:29 PM
With JSONL it worked, then?

Luis Fernandez

05/10/2022, 4:29 PM
yes 😄
for one file
then we shove in a bunch of them and it doesn’t like it

Diogo Baeder

05/10/2022, 4:32 PM
Hmmm... maybe there's an issue with one of the lines then. I noticed that one of your lines has `c_count` and `c` where the other doesn't. Maybe their being missing is an issue? Did you set a null value for those columns?
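A quick way to see how the keys vary across lines could be something like this (a rough sketch, assuming a local copy of one of the part files; the file name is made up):
```python
import json
from collections import Counter

# Count how often each distinct set of keys appears across the lines.
key_sets = Counter()
with open("part-00168.json") as src:  # hypothetical local copy
    for line in src:
        if line.strip():
            key_sets[frozenset(json.loads(line))] += 1

for keys, count in key_sets.most_common():
    print(sorted(keys), count)
```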

Luis Fernandez

05/10/2022, 5:22 PM
i didn’t, and i also thought about that
but the thing is that when we try to import just that one document, everything works lol

Diogo Baeder

05/10/2022, 5:24 PM
Got it. I don't know what the problem is then. What I would do in that case is a "binary search" to find the offending line - try half of the document first; if it doesn't work, cut it in half again; if it works, bring back some lines; and so on until I find the problematic line.
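Actually, since each line is supposed to be standalone JSON, a simpler alternative to bisecting by hand would be to just try to parse every line and report the ones that fail (a minimal sketch; the file name is made up):
```python
import json

# Print every line that fails to parse as standalone JSON.
with open("part-00168.json") as src:  # hypothetical local copy
    for number, line in enumerate(src, start=1):
        if not line.strip():
            continue
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            print(f"line {number}: {exc}")
```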

Luis Fernandez

05/10/2022, 5:39 PM
we will try another data format for now

Diogo Baeder

05/10/2022, 5:39 PM
Cool

Luis Fernandez

05/10/2022, 5:39 PM
these files are generated by spark into this json format
then we have them in gcs
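(for context, spark's json writer emits one object per line - JSON Lines - so this is roughly how the part files come out; the schema and path here are just illustrative:)
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1623110400.0, 80476135.0, 6047599.0, 1)],
    ["serve_time", "p_id", "u_id", "i_count"],
)
# DataFrameWriter.json writes one JSON object per line (JSON Lines)
# into part files, which matches the format shown above.
df.write.json("gs://bucket/raw_data/date=2020-07-25/")  # illustrative path
```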

Diogo Baeder

05/10/2022, 5:39 PM
Got it

Luis Fernandez

05/10/2022, 5:40 PM
year/month/day/partfiles.json
and this is what we want to eventually put into pinot, and those are 2 years' worth of data

Diogo Baeder

05/10/2022, 5:40 PM
Got it, sounds good
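If you want to pre-check everything in the bucket before ingesting, something along these lines might help (a rough sketch using the GCS client library; the bucket name and prefix are placeholders):
```python
import json
from google.cloud import storage

client = storage.Client()
# Placeholders - point these at the real bucket and prefix.
for blob in client.list_blobs("my-bucket", prefix="raw_data/"):
    if not blob.name.endswith(".json"):
        continue
    lines = blob.download_as_bytes().decode("utf-8").splitlines()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            print(f"{blob.name}:{number}: {exc}")
```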

Luis Fernandez

05/10/2022, 5:42 PM
this job yaml has not been super straightforward to get right lol
do you know who else may have some experience with it?

Diogo Baeder

05/10/2022, 5:43 PM
I'm developing a system that was using the regular batch ingestion flow, but now I'm manually ingesting segment data - which also fills my offline table, but through a bit of a different process. The previous process used to work for me.

Luis Fernandez

05/10/2022, 5:44 PM
like ingestFromURI?

Diogo Baeder

05/10/2022, 5:46 PM
What I'm using now? Yes.
The previous flow was just the regular batch ingestion, with a job YAML config file, which I triggered via the Pinot admin CLI.
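For reference, the URI-based ingestion I mentioned is just a POST to the controller's ingestFromURI endpoint - roughly like this (a sketch using the requests library; the host, table name and batch config values are placeholders, so double-check them against your setup):
```python
import json
import requests

# Placeholders - adjust the controller host, table name and batch config.
resp = requests.post(
    "http://localhost:9000/ingestFromURI",
    params={
        "tableNameWithType": "mytable_OFFLINE",
        "batchConfigMapStr": json.dumps({"inputFormat": "json"}),
        "sourceURIStr": "gs://my-bucket/raw_data/date=2020-07-25/part-00168.json",
    },
)
print(resp.status_code, resp.text)
```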

Mark Needham

05/11/2022, 1:35 PM
Could it be that it's doing something weird with the newline character? I can look into it if you have the actual file that doesn't work (scrub any sensitive data etc)
and either way the error message isn't great, so we'd want to improve that

Luis Fernandez

05/11/2022, 2:07 PM
the thing is that when we try to do the one file, it ingests it. when we try to do several, it just dies with that error for several of them

Neha Pawar

05/12/2022, 9:48 PM
is it possible to share all the files, and the exact command you used to ingest multiple at once?