# troubleshooting
l
hello friends!! we are encountering some issues when migrating data using the job spec. we are basically migrating a bunch of json files in gcs into pinot, and a json file looks like this:
{"serve_time":1623110400.00000000,"p_id":8.0476135E7,"u_id":6047599.0,"i_count":1}
{"serve_time":1623110400.00000000,"p_id":8.1923416E7,"u_id":5407252.0,"i_count":1,"c_count":1,"c":17}
we end up having this exception for some of the files:
2022/05/10 15:48:19.314 ERROR [SegmentGenerationJobRunner] [pool-2-thread-1] Failed to generate Pinot segment for file - gs://rblau_tmp/raw_data/date=2020-07-25/part-00168-c741f867-338d-4c84-afaf-428f85c14088.c000.json
java.lang.RuntimeException: Unexpected end-of-input within/between Object entries
do you know why we may end up getting these errors?
d
Looks like a JSONL file, not a regular JSON file. If you want a JSON file with multiple rows, you need to put each row as an item of a list, instead of each dict as a line in the file. (E.g. just wrap that whole content with square brackets.)
AFAIK Pinot doesn't support JSONL.
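If it helps, a minimal sketch of that wrapping in plain Python - the file names here are just placeholders:

```python
import json

# Read a line-delimited (JSONL) file and rewrite it as a single JSON array.
# "input.json" / "output.json" are placeholder names.
with open("input.json") as src:
    records = [json.loads(line) for line in src if line.strip()]

with open("output.json", "w") as dst:
    json.dump(records, dst)
```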
l
hey thank you, we were following this: https://dev.startree.ai/docs/pinot/recipes/ingest-json-files, and it seems it should be ok?
d
Oh... then I don't know, to be honest. What I do know is that that format is not valid JSON - regular JSON parsers won't be able to read it as JSON. That format is JSONL (notice the "L" at the end), where a file has multiple lines and each line contains a valid JSON string.
In my case, in the system I'm developing with Pinot as a database, I'm ingesting from regular JSON files, which always start with a square bracket and end with one.
l
so what you are saying is that this
{"serve_time":1623110400.00000000,"p_id":8.0476135E7,"u_id":6047599.0,"i_count":1}
{"serve_time":1623110400.00000000,"p_id":8.1923416E7,"u_id":5407252.0,"i_count":1,"c_count":1,"c":17}
should become this
[{"serve_time":1623110400.00000000,"p_id":8.0476135E7,"u_id":6047599.0,"i_count":1},
{"serve_time":1623110400.00000000,"p_id":8.1923416E7,"u_id":5407252.0,"i_count":1,"c_count":1,"c":17}]
d
Yeah, maybe that makes it work. Just try it, if it works then that was the problem 🙂
l
the weirdest thing is that we just tried one of the files that failed - we tried that one file in particular with the job spec and it worked 😄
d
With JSONL it worked, then?
l
yes 😄
for one file
then we shove a bunch of them in and then it doesn’t like it
d
Hmmm... maybe there's an issue with one of the lines then. I noticed that one of your lines has c_count and c, where the other doesn't. Maybe them being missing is an issue? Did you set a null value for those columns?
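One quick way to check whether inconsistent columns are the culprit is to count how often each key appears across the lines; a throwaway sketch, with a placeholder file name:

```python
import json
from collections import Counter

# Count how often each key appears across the JSONL lines, to spot columns
# that only show up in some records (e.g. c_count / c).
key_counts = Counter()
total = 0
with open("part-00168.json") as f:  # placeholder file name
    for line in f:
        if line.strip():
            key_counts.update(json.loads(line).keys())
            total += 1

for key, count in key_counts.items():
    print(f"{key}: present in {count}/{total} lines")
```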
l
i didn’t, and i also thought about that
but the thing is that when we try to just import that one document everything works lol
d
Got it. I don't know what the problem is then. What I would do in that case is a "binary search" to find the offending line - try half of the document first; if it doesn't work, cut that half in half again; if it works, bring back some lines; and so on, until I find the problematic line.
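If the files stay line-delimited, that search can also be automated by parsing every line and flagging the ones a JSON parser rejects - a rough sketch, assuming a local copy of the file and a placeholder path:

```python
import json
import sys

# Report any line that is not valid JSON on its own, plus its line number,
# so the offending record can be inspected directly.
path = sys.argv[1] if len(sys.argv) > 1 else "part-00168.json"  # placeholder path
with open(path) as f:
    for lineno, line in enumerate(f, start=1):
        if not line.strip():
            print(f"{path}:{lineno}: blank line")
            continue
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            print(f"{path}:{lineno}: {exc}")
```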
l
we will try another data format for now
d
Cool
l
these files are generated by spark into this json format
then we have them in gcs
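For context, Spark's JSON writer produces exactly that line-delimited layout by default - one JSON object per line in each part file - which is why the data comes out as JSONL rather than a JSON array. A rough PySpark sketch with made-up values and a placeholder output path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jsonl-example").getOrCreate()

# A tiny DataFrame with columns shaped like the records above (made-up values).
df = spark.createDataFrame(
    [(1623110400.0, 80476135.0, 6047599.0, 1)],
    ["serve_time", "p_id", "u_id", "i_count"],
)

# Spark's JSON writer emits one JSON object per line in each part file
# (JSON Lines), not a single JSON array.
df.write.mode("overwrite").json("/tmp/raw_data/date=2020-07-25/")  # placeholder path
```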
d
Got it
l
year/month/day/partfiles.json
and this is what we want to eventually put into pinot, and those are 2 years worth of data
d
Got it, sounds good
l
this job yaml has not been super straightforward to get right lol
do you know who else may have some experience with it?
d
I'm developing a system that was using the regular batch ingestion flow, but now I'm manually ingesting segment data - which also fills my offline table, but through a bit of a different process. The previous process used to work for me.
l
like ingestFromURI?
d
What I'm using now? Yes.
The previous flow was just the regular batch ingestion, with a job YAML config file, which I triggered via the Pinot admin CLI.
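For reference, the manual flow mentioned above goes through the controller's ingestFromURI endpoint; roughly along these lines, though the parameter names and minimal batch config here are from memory and worth double-checking against your Pinot version, and the host, table, and URI are placeholders:

```python
import json
import requests

# Placeholder controller host, table name, and source URI; the query
# parameter names are recalled from the Pinot controller REST API and
# should be verified against its docs.
controller = "http://localhost:9000"
params = {
    "tableNameWithType": "events_OFFLINE",
    "batchConfigMapStr": json.dumps({"inputFormat": "json"}),
    "sourceURIStr": "gs://some-bucket/raw_data/date=2020-07-25/part-00000.json",
}

resp = requests.post(f"{controller}/ingestFromURI", params=params)
print(resp.status_code, resp.text)
```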
m
Could it be that it's doing something weird with the newline character? I can look into it if you have the actual file that doesn't work (scrub any sensitive data etc)
and then either way the error message isn't great so we'd wanna improve that
l
the thing is that when we try to do the one file, it ingests it; when we try to do several, it just dies with that error for several of them
n
is it possible to share all the files, and the exact command you used to ingest multiple at once?