# troubleshooting
l
hello my friends, my team has been trying to ingest data using the job spec for some weeks now, and it has been quite challenging. we are trying to ingest around 500GB of data, which is 2 years of data for our system. we are using apache pinot 0.10.0
we ran into this issue: https://github.com/apache/pinot/pull/8337, so we had to create a script to do the imports daily. however, for some reason the pinot servers are exhausting memory (32GB), and before running the job they are mostly at half capacity. what are some of the reasons that our pinot servers would run out of memory from these ingestion jobs? also, we are using the standalone job and our script changes the input directory every time a daily run finishes. Would appreciate any help!
k
Can’t you use the pushFileNamePattern support to build a segment name that’s composed of the previous directory name and the file name? So you could create something like 2009-movies as the final name.
l
oh i have to check that out
another question i had: how do you tell the script to output its logs somewhere, just so that i can run it as a background task?
do you know?
k
Are you talking about the script that runs the admin tool? If so, then it’s the usual Linux command line thing of adding >>logfile.txt 2>&1, see https://stackoverflow.com/questions/876239/how-to-redirect-and-append-both-standard-output-and-standard-error-to-a-file-wit
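e.g. something along these lines (untested sketch, adjust the paths, the job spec location, and the log file name for your setup):
# run the standalone ingestion job in the background, appending stdout/stderr to a log file;
# nohup keeps it alive after you log out, the trailing & backgrounds it
nohup /opt/pinot/bin/pinot-admin.sh LaunchDataIngestionJob \
  -jobSpecFile /path/to/job.yaml \
  >> /path/to/ingestion-stdout.log 2>&1 &
# follow the output while it runs
tail -f /path/to/ingestion-stdout.log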
l
right but that only logs this:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/pinot/lib/pinot-all-0.10.0-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/pinot/plugins/pinot-environment/pinot-azure/pinot-azure-0.10.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/pinot/plugins/pinot-file-system/pinot-s3/pinot-s3-0.10.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/pinot/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.10.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/pinot/plugins/pinot-metrics/pinot-yammer/pinot-yammer-0.10.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/pinot/plugins/pinot-metrics/pinot-dropwizard/pinot-dropwizard-0.10.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.codehaus.groovy.reflection.CachedClass (file:/opt/pinot/lib/pinot-all-0.10.0-jar-with-dependencies.jar) to method java.lang.Object.finalize()
WARNING: Please consider reporting this to the maintainers of org.codehaus.groovy.reflection.CachedClass
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
i’m currently running it like this:
JAVA_OPTS='-Xms1G -Xmx1G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xlog:gc*:file=/opt/pinot/gc-pinot-controller.log -javaagent:/opt/pinot/etc/jmx_prometheus_javaagent/jmx_prometheus_javaagent-0.12.0.jar=7007:/opt/pinot/etc/jmx_prometheus_javaagent/configs/pinot.yml' /opt/pinot/bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /opt/pinot/migration/job.yaml
(this is for one day’s worth of data)
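the wrapper script that swaps the input directory each day is roughly this (simplified, with the real paths and bucket names changed; the job.yaml.template file with an __INPUT_DIR__ placeholder is just how i’m sketching the substitution here):
#!/usr/bin/env bash
# walk a date range one day at a time: rewrite the job spec's inputDirURI,
# run the standalone ingestion job for that day, and keep a per-day log
set -euo pipefail

day="2021-01-01"
end="2021-12-31"
while [[ ! "$day" > "$end" ]]; do
  input_dir="gs://my-bucket/exports/$(date -d "$day" +%Y/%m/%d)"   # year/month/day layout
  sed "s|__INPUT_DIR__|$input_dir|" /opt/pinot/migration/job.yaml.template \
    > /opt/pinot/migration/job.yaml
  /opt/pinot/bin/pinot-admin.sh LaunchDataIngestionJob \
    -jobSpecFile /opt/pinot/migration/job.yaml \
    >> "/opt/pinot/migration/ingest-$day.log" 2>&1
  day="$(date -d "$day + 1 day" +%F)"
done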
k
Don’t you also wind up with logs in the logs/ subdir inside of your /opt/pinot/ directory? e.g. pinot-all.log?
l
i do have those logs, but how would i tell what was logged by what?
k
The minimal stdout/stderr logging output is what I often see when slf4j finds multiple bindings. I would just focus on what’s in the logs/ subdir.
I made a run at fixing up Pinot logging so you wouldn’t get the issue of multiple bindings, but it’s a giant hairball.
l
so in the logs/ subdir i see the logs for the controller itself, and i guess i would see the job’s logs there too?
k
In a normal configuration, each process (server, broker, controller) has its own log file(s). So in that case, what gets logged when you run the admin app should just be what it’s logging as part of your request. Note that if you’re using Hadoop or Spark to run a segment generation job, then those systems will have their own logging infrastructure as well.
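If you want the admin run’s output to land in its own file, you could also point that JVM at its own log4j2 config via the standard log4j2.configurationFile system property. Rough sketch only, assuming JAVA_OPTS is passed through the way it is in your command above, and the xml path is a file you’d create yourself (e.g. by copying and tweaking one of the log4j2 configs shipped under /opt/pinot/conf):
# hypothetical separate log4j2 config just for the ingestion-job runs
JAVA_OPTS='-Xms1G -Xmx1G -Dlog4j2.configurationFile=/opt/pinot/migration/ingestion-log4j2.xml' \
  /opt/pinot/bin/pinot-admin.sh LaunchDataIngestionJob \
  -jobSpecFile /opt/pinot/migration/job.yaml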
l
I'm using the standalone mode. thank you, now we have better logging at least
it has been a little harder to get this import process in place
we have year/month/day/severalfilesperday.parquet, and because of the bug we are doing imports daily instead
and it takes us days to do these imports
k
If you do a metadata push it should be pretty fast. We load about 1100 segments from HDFS via this approach in a few hours. This assumes segments have been already built and stored in HDFS, which we do via a Hadoop job that takes about an hour or so.
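For reference, a standalone metadata-push job spec is shaped roughly like this; the runner and filesystem class names are from the batch ingestion docs, so double-check them against 0.10.0, and the URIs, table name, and controller host are placeholders:
# sketch of a job spec that pushes only segment metadata for already-built segment tars
cat > /tmp/metadata-push-job.yaml <<'EOF'
executionFrameworkSpec:
  name: 'standalone'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentMetadataPushJobRunner'
jobType: SegmentMetadataPush
inputDirURI: 'hdfs://namenode/pinot/segments/mytable/'
includeFileNamePattern: 'glob:**/*.tar.gz'
outputDirURI: 'hdfs://namenode/pinot/segments/mytable/'
pinotFSSpecs:
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
tableSpec:
  tableName: 'mytable'
pinotClusterSpecs:
  - controllerURI: 'http://controller-host:9000'
EOF
/opt/pinot/bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /tmp/metadata-push-job.yaml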
l
SegmentCreationAndMetadataPush, this one right?
we interface with GCS
and we are just doing standalone
k
Just SegmentMetadataPush for us, since we create the segments using a scalable Hadoop map-reduce job.
l
we have a spark process that grabs the data from bigquery and puts it in gcs
and then we use the standalone job to look at the gcs buckets and create segments and do metadata push
k
So you can use a Spark job to also create the segments from the text files you extract from BigQuery.
l
k
That is scalable and can be much, much faster than trying to do it in a single process via a standalone job
Yes, that’s the guide. And yes, you can use this to ingest text, parquet, or avro files.
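The launch ends up being a spark-submit of the same admin command; roughly like this per the batch ingestion docs, though the master/deploy-mode, plugins dir, extra classpath entries, and spec path all need adjusting for your cluster, so treat it as a sketch:
# run segment generation on Spark instead of in a single standalone process
export PINOT_VERSION=0.10.0
export PINOT_DISTRIBUTION_DIR=/opt/pinot
spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --master yarn --deploy-mode cluster \
  --conf "spark.driver.extraJavaOptions=-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins" \
  --conf "spark.driver.extraClassPath=${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar" \
  "${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar" \
  -jobSpecFile /path/to/spark-ingestion-job.yaml
# the job spec's executionFrameworkSpec then points at the Spark runner,
# e.g. org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner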
l
wouldn’t i run into the same version problem that we have with pinot 0.10.0?
k
Are you talking about "pinot servers are exhausting memory (32gbs) and before running the job they are mostly at half capacity what are some of the reasons that our pinot servers would ran out of memory from these ingestion jobs"?
l
oh nonono, i’m talking about running this with spark instead of the standalone job, which is what we are doing. i also don’t know why that happened ^
we gave the machines more memory, but i feel like something else is the root cause
k
In your tableIndexConfig, make sure you set "createInvertedIndexDuringSegmentGeneration": true
This is in the table spec (JSON file)
l
let me check what it’s set at
oofff what happens if it’s false?
k
As per https://docs.pinot.apache.org/configuration-reference/table, if it’s false (which is the default), then indexes are created on the servers when segments are loaded, which can be both a CPU and memory hog.
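You can check what an existing table currently has via the controller REST API; something like this, with the controller host and table name as placeholders (offline table assumed):
# fetch the table config and pull out the flag
curl -s "http://controller-host:9000/tables/mytable" \
  | jq '.OFFLINE.tableIndexConfig.createInvertedIndexDuringSegmentGeneration'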
l
is it safe to change on an existing table?
k
I believe so, yes - it should only impact the segment generation job, not any segments that have been already deployed
Generating the segment with the inverted index makes the segment bigger, but if you’re deploying using metadata push that shouldn’t matter much. Note though that currently metadata push requires each segment be downloaded to the machine running the standalone job, so it can be untarred to extract metadata. So you want a fast connection from that server and your deep store.