# troubleshooting
l
hello friends!! we are encountering some issues when migrating data using the job spec. we are basically migrating a bunch of json files in gcs into pinot, and a json file looks like this:
{"serve_time":1623110400.00000000,"p_id":8.0476135E7,"u_id":6047599.0,"i_count":1}
{"serve_time":1623110400.00000000,"p_id":8.1923416E7,"u_id":5407252.0,"i_count":1,"c_count":1,"c":17}
we end up having this exception for some of the files:
2022/05/10 15:48:19.314 ERROR [SegmentGenerationJobRunner] [pool-2-thread-1] Failed to generate Pinot segment for file - gs://rblau_tmp/raw_data/date=2020-07-25/part-00168-c741f867-338d-4c84-afaf-428f85c14088.c000.json
java.lang.RuntimeException: Unexpected end-of-input within/between Object entries
do you know why we may end up getting these errors?
d
Looks like a JSONL file, not a regular JSON file. If you want a JSON file with multiple rows, you need to put each row as an item of a list, instead of each dict as a line in the file. (E.g. just wrap that whole content with square brackets.)
AFAIK Pinot doesn't support JSONL.
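If it helps, a minimal sketch of that wrapping in plain Python - the file names here are just placeholders:

```python
import json

# Read a line-delimited (JSONL) file and rewrite it as a single JSON array.
# "input.json" / "output.json" are placeholder names.
with open("input.json") as src:
    records = [json.loads(line) for line in src if line.strip()]

with open("output.json", "w") as dst:
    json.dump(records, dst)
```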
l
hey thank you, we were following this: https://dev.startree.ai/docs/pinot/recipes/ingest-json-files, and it seems it should be ok?
d
Oh... then I don't know, to be honest. What I do know is that that format is not valid JSON - regular JSON parsers won't be able to read it as JSON. That format is JSONL (notice the "L" at the end), where a file has multiple lines and each line contains a valid JSON string.
In my case, in the system I'm developing with Pinot as a database, I'm ingesting from regular JSON files, which always start with a square bracket and end with one.
l
so what you are saying is that this
{"serve_time":1623110400.00000000,"p_id":8.0476135E7,"u_id":6047599.0,"i_count":1}
{"serve_time":1623110400.00000000,"p_id":8.1923416E7,"u_id":5407252.0,"i_count":1,"c_count":1,"c":17}
should become this
[{"serve_time":1623110400.00000000,"p_id":8.0476135E7,"u_id":6047599.0,"i_count":1},
{"serve_time":1623110400.00000000,"p_id":8.1923416E7,"u_id":5407252.0,"i_count":1,"c_count":1,"c":17}]
d
Yeah, maybe that makes it work. Just try it, if it works then that was the problem 🙂
l
the weirdest thing is that we just tried one of the files that failed - we tried that one file in particular with the job spec and it worked 😄
d
With JSONL it worked, then?
l
yes 😄
for one file
then we shove a bunch of them in and then it doesn’t like it
d
Hmmm... maybe there's an issue with one of the lines then. I noticed that one of your lines has c_count and c, where the other doesn't. Maybe them being missing is an issue? Did you set a null value for those columns?
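One quick way to check whether inconsistent columns are the culprit is to count how often each key appears across the lines; a throwaway sketch, with a placeholder file name:

```python
import json
from collections import Counter

# Count how often each key appears across the JSONL lines, to spot columns
# that only show up in some records (e.g. c_count / c).
key_counts = Counter()
total = 0
with open("part-00168.json") as f:  # placeholder file name
    for line in f:
        if line.strip():
            key_counts.update(json.loads(line).keys())
            total += 1

for key, count in key_counts.items():
    print(f"{key}: present in {count}/{total} lines")
```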
l
i didn’t, and i also thought about that
but the thing is that when we try to just import that one document everything works lol
d
Got it. I don't know what the problem is then. What I would do in that case is a "binary search" to find the offending line - try half of the document first; if it doesn't work, cut that half in half again; if it works, bring back some lines; and so on, until I find the problematic line.
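If the files stay line-delimited, that search can also be automated by parsing every line and flagging the ones a JSON parser rejects - a rough sketch, assuming a local copy of the file and a placeholder path:

```python
import json
import sys

# Report any line that is not valid JSON on its own, plus its line number,
# so the offending record can be inspected directly.
path = sys.argv[1] if len(sys.argv) > 1 else "part-00168.json"  # placeholder path
with open(path) as f:
    for lineno, line in enumerate(f, start=1):
        if not line.strip():
            print(f"{path}:{lineno}: blank line")
            continue
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            print(f"{path}:{lineno}: {exc}")
```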
l
we will try another data format for now
d
Cool
l
these files are generated by spark into this json format
then we have them in gcs
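For context, Spark's JSON writer produces exactly that line-delimited layout by default - one JSON object per line in each part file - which is why the data comes out as JSONL rather than a JSON array. A rough PySpark sketch with made-up values and a placeholder output path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jsonl-example").getOrCreate()

# A tiny DataFrame with columns shaped like the records above (made-up values).
df = spark.createDataFrame(
    [(1623110400.0, 80476135.0, 6047599.0, 1)],
    ["serve_time", "p_id", "u_id", "i_count"],
)

# Spark's JSON writer emits one JSON object per line in each part file
# (JSON Lines), not a single JSON array.
df.write.mode("overwrite").json("/tmp/raw_data/date=2020-07-25/")  # placeholder path
```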
d
Got it
l
year/month/day/partfiles.json
and this is what we want to eventually put into pinot, and those are 2 years worth of data
d
Got it, sounds good
l
this job yaml has not been super straightforward to get right lol
do you know who else may have some experience with it?
d
I'm developing a system that was using the regular batch ingestion flow, but now I'm manually ingesting segment data - which also fills my offline table, but through a bit of a different process. The previous process used to work for me.
l
like ingestFromURI?
d
What I'm using now? Yes.
The previous flow was just the regular batch ingestion, with a job YAML config file, which I triggered via the Pinot admin CLI.
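For reference, the manual flow mentioned above goes through the controller's ingestFromURI endpoint; roughly along these lines, though the parameter names and minimal batch config here are from memory and worth double-checking against your Pinot version, and the host, table, and URI are placeholders:

```python
import json
import requests

# Placeholder controller host, table name, and source URI; the query
# parameter names are recalled from the Pinot controller REST API and
# should be verified against its docs.
controller = "http://localhost:9000"
params = {
    "tableNameWithType": "events_OFFLINE",
    "batchConfigMapStr": json.dumps({"inputFormat": "json"}),
    "sourceURIStr": "gs://some-bucket/raw_data/date=2020-07-25/part-00000.json",
}

resp = requests.post(f"{controller}/ingestFromURI", params=params)
print(resp.status_code, resp.text)
```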
m
Could it be that it's doing something weird with the newline character? I can look into it if you have the actual file that doesn't work (scrub any sensitive data etc)
and then either way the error message isn't great so we'd wanna improve that
l
the thing is that when we try to do the one file, it ingests it; when we try to do several, it just dies with that error for several of them
n
is it possible to share all the files, and the exact command you used to ingest multiple at once?