# troubleshooting
l
hello my friends, my team has been trying to ingest data using the job spec for some weeks now, and it has been quite challenging. we are trying to ingest around 500GB of data, which is 2 years of data for our system. we are using apache pinot 0.10.0
we ran into this issue: https://github.com/apache/pinot/pull/8337, so we had to create a script to do the imports daily. however, for some reason the pinot servers are exhausting memory (32GB), and before running the job they are mostly at half capacity. what are some of the reasons that our pinot servers would run out of memory from these ingestion jobs? also, we are using the standalone job and our script changes the input directory every time a daily run finishes. Would appreciate any help!
k
Can’t you use the pushFileNamePattern support to build a segment name that’s composed of the previous directory name and the file name? So you could create something like 2009-movies as the final name.
l
oh i have to check that out
another question i had: how do you tell the script to output its logs somewhere, just so that i can run it as a background task?
do you know?
k
Are you talking about the script that runs the admin tool? If so, then it’s the usual Linux command line thing of adding >>logfile.txt 2>&1, see https://stackoverflow.com/questions/876239/how-to-redirect-and-append-both-standard-output-and-standard-error-to-a-file-wit
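e.g. something along these lines (untested sketch, adjust the paths, the job spec location, and the log file name for your setup):
# run the standalone ingestion job in the background, appending stdout/stderr to a log file;
# nohup keeps it alive after you log out, the trailing & backgrounds it
nohup /opt/pinot/bin/pinot-admin.sh LaunchDataIngestionJob \
  -jobSpecFile /path/to/job.yaml \
  >> /path/to/ingestion-stdout.log 2>&1 &
# follow the output while it runs
tail -f /path/to/ingestion-stdout.log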
l
right but that only logs this:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/pinot/lib/pinot-all-0.10.0-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/pinot/plugins/pinot-environment/pinot-azure/pinot-azure-0.10.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/pinot/plugins/pinot-file-system/pinot-s3/pinot-s3-0.10.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/pinot/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.10.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/pinot/plugins/pinot-metrics/pinot-yammer/pinot-yammer-0.10.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/pinot/plugins/pinot-metrics/pinot-dropwizard/pinot-dropwizard-0.10.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.codehaus.groovy.reflection.CachedClass (file:/opt/pinot/lib/pinot-all-0.10.0-jar-with-dependencies.jar) to method java.lang.Object.finalize()
WARNING: Please consider reporting this to the maintainers of org.codehaus.groovy.reflection.CachedClass
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
i’m currently running it like this:
JAVA_OPTS='-Xms1G -Xmx1G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xlog:gc*:file=/opt/pinot/gc-pinot-controller.log -javaagent:/opt/pinot/etc/jmx_prometheus_javaagent/jmx_prometheus_javaagent-0.12.0.jar=7007:/opt/pinot/etc/jmx_prometheus_javaagent/configs/pinot.yml' /opt/pinot/bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /opt/pinot/migration/job.yaml
(this is for one day’s worth of data)
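the wrapper script that swaps the input directory each day is roughly this (simplified, with the real paths and bucket names changed; the job.yaml.template file with an __INPUT_DIR__ placeholder is just how i’m sketching the substitution here):
#!/usr/bin/env bash
# walk a date range one day at a time: rewrite the job spec's inputDirURI,
# run the standalone ingestion job for that day, and keep a per-day log
set -euo pipefail

day="2021-01-01"
end="2021-12-31"
while [[ ! "$day" > "$end" ]]; do
  input_dir="gs://my-bucket/exports/$(date -d "$day" +%Y/%m/%d)"   # year/month/day layout
  sed "s|__INPUT_DIR__|$input_dir|" /opt/pinot/migration/job.yaml.template \
    > /opt/pinot/migration/job.yaml
  /opt/pinot/bin/pinot-admin.sh LaunchDataIngestionJob \
    -jobSpecFile /opt/pinot/migration/job.yaml \
    >> "/opt/pinot/migration/ingest-$day.log" 2>&1
  day="$(date -d "$day + 1 day" +%F)"
done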
k
Don’t you also wind up with logs in the logs/ subdir inside of your /opt/pinot/ directory? e.g. pinot-all.log?
l
i do have those logs, but how would i tell what was logged by what?
k
The minimal stdout/stderr logging output is what I often see when slf4j finds multiple bindings. I would just focus on what’s in the logs/ subdir.
I made a run at fixing up Pinot logging so you wouldn’t get the issue of multiple bindings, but it’s a giant hairball.
l
so in the logs/ subdir i see the logs for the controller itself, and i guess i would see the job’s logs there too?
k
In a normal configuration, each process (server, broker, controller) has its own log file(s). So in that case, what gets logged when you run the admin app should just be what it’s logging as part of your request. Note that if you’re using Hadoop or Spark to run a segment generation job, then those systems will have their own logging infrastructure as well.
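If you want the admin run’s output to land in its own file, you could also point that JVM at its own log4j2 config via the standard log4j2.configurationFile system property. Rough sketch only, assuming JAVA_OPTS is passed through the way it is in your command above, and the xml path is a file you’d create yourself (e.g. by copying and tweaking one of the log4j2 configs shipped under /opt/pinot/conf):
# hypothetical separate log4j2 config just for the ingestion-job runs
JAVA_OPTS='-Xms1G -Xmx1G -Dlog4j2.configurationFile=/opt/pinot/migration/ingestion-log4j2.xml' \
  /opt/pinot/bin/pinot-admin.sh LaunchDataIngestionJob \
  -jobSpecFile /opt/pinot/migration/job.yaml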
l
I'm using the standalone mode. thank you, now we have better logging at least
it has been a little harder to get this import process in place
we have year/month/day/severalfilesperday.parquet, and because of the bug we are doing imports daily instead
and it takes us days to do these imports
k
If you do a metadata push it should be pretty fast. We load about 1100 segments from HDFS via this approach in a few hours. This assumes segments have been already built and stored in HDFS, which we do via a Hadoop job that takes about an hour or so.
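For reference, a standalone metadata-push job spec is shaped roughly like this; the runner and filesystem class names are from the batch ingestion docs, so double-check them against 0.10.0, and the URIs, table name, and controller host are placeholders:
# sketch of a job spec that pushes only segment metadata for already-built segment tars
cat > /tmp/metadata-push-job.yaml <<'EOF'
executionFrameworkSpec:
  name: 'standalone'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentMetadataPushJobRunner'
jobType: SegmentMetadataPush
inputDirURI: 'hdfs://namenode/pinot/segments/mytable/'
includeFileNamePattern: 'glob:**/*.tar.gz'
outputDirURI: 'hdfs://namenode/pinot/segments/mytable/'
pinotFSSpecs:
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
tableSpec:
  tableName: 'mytable'
pinotClusterSpecs:
  - controllerURI: 'http://controller-host:9000'
EOF
/opt/pinot/bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /tmp/metadata-push-job.yaml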
l
SegmentCreationAndMetadataPush, this one right?
we interface with GCS
and we are just doing standalone
k
Just SegmentMetadataPush for us, since we create the segments using a scalable Hadoop map-reduce job.
l
we have a spark process that grabs the data from bigquery and puts it in gcs
and then we use the standalone job to look at the gcs buckets and create segments and do metadata push
k
So you can use a Spark job to also create the segments from the text files you extract from BigQuery.
l
k
That is scalable and can be much, much faster than trying to do it in a single process via a standalone job
Yes, that’s the guide. And yes, you can use this to ingest text, parquet, or avro files.
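The launch ends up being a spark-submit of the same admin command; roughly like this per the batch ingestion docs, though the master/deploy-mode, plugins dir, extra classpath entries, and spec path all need adjusting for your cluster, so treat it as a sketch:
# run segment generation on Spark instead of in a single standalone process
export PINOT_VERSION=0.10.0
export PINOT_DISTRIBUTION_DIR=/opt/pinot
spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --master yarn --deploy-mode cluster \
  --conf "spark.driver.extraJavaOptions=-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins" \
  --conf "spark.driver.extraClassPath=${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar" \
  "${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar" \
  -jobSpecFile /path/to/spark-ingestion-job.yaml
# the job spec's executionFrameworkSpec then points at the Spark runner,
# e.g. org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner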
l
wouldn’t i run into the same version problem that we have with pinot 0.10.0?
k
Are you talking about "pinot servers are exhausting memory (32gbs) and before running the job they are mostly at half capacity what are some of the reasons that our pinot servers would ran out of memory from these ingestion jobs"?
l
oh nonono, i’m talking about running this with spark instead of the standalone job, which is what we are doing. i also don’t know why that happened ^
we gave the machines more memory, but i feel like something else is the root cause
k
In your tableIndexConfig, make sure you set "createInvertedIndexDuringSegmentGeneration": true
This is in the table spec (JSON file)
l
let me check what it’s set at
oofff what happens if it’s false?
k
As per https://docs.pinot.apache.org/configuration-reference/table, if it’s false (which is the default), then indexes are created on the servers when segments are loaded, which can be both a CPU and memory hog.
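You can check what an existing table currently has via the controller REST API; something like this, with the controller host and table name as placeholders (offline table assumed):
# fetch the table config and pull out the flag
curl -s "http://controller-host:9000/tables/mytable" \
  | jq '.OFFLINE.tableIndexConfig.createInvertedIndexDuringSegmentGeneration'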
l
is it safe to change on an existing table?
k
I believe so, yes - it should only impact the segment generation job, not any segments that have been already deployed
Generating the segment with the inverted index makes the segment bigger, but if you’re deploying using metadata push that shouldn’t matter much. Note though that currently metadata push requires each segment be downloaded to the machine running the standalone job, so it can be untarred to extract metadata. So you want a fast connection from that server and your deep store.