Agree with Jackie on trying the Vector API on the tight loops in this PR to see whether they can be vectorized further.
Other than that, please take a look at SumAggregationFunction and the Add transform function to begin with. Both have loops, one computing the running sum of an array and one summing two arrays into a third result array. In my previous experience with C++, I saw huge improvements from using SIMD instructions directly on AVX-512, so it may be worth exploring whether the Vector API can vectorize these loops more than the JVM already does (if it does at all).
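
For reference, the two loop shapes I mean look roughly like this in scalar form. This is only a sketch and not the actual code in those classes: the class name `ScalarLoopBaselines`, the method names, and the `length` parameter are hypothetical. Something like this could serve as the baseline in a benchmark against a Vector API (`jdk.incubator.vector`) variant.

```java
// Hypothetical scalar baselines for the two loop shapes; not the real Pinot code.
final class ScalarLoopBaselines {
    private ScalarLoopBaselines() {}

    // Reduction: each iteration reads and writes the same accumulator,
    // so there is a dependency from one iteration to the next.
    static double sum(double[] values, int length) {
        double sum = 0.0;
        for (int i = 0; i < length; i++) {
            sum += values[i];
        }
        return sum;
    }

    // Element-wise add: iterations are independent of each other,
    // which is usually the easier shape for a JIT to auto-vectorize.
    static void add(double[] left, double[] right, double[] result, int length) {
        for (int i = 0; i < length; i++) {
            result[i] = left[i] + right[i];
        }
    }
}
```

Note that a vectorized floating-point reduction adds the elements in a different order than the scalar loop, so the two results may differ slightly in the last bits. That is worth checking before comparing numbers.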