nixtlacommunity #statsforecast

Channels

random

linen

statsforecast

Clarisse Chia

03/07/2024, 2:55 AM

hi all, i’m trying to get the

statsforecast

up and running on pyspark, but have running into the

ModuleNotFoundError: No module named 'fugue'

error, despite having installed

fugue

. i was wondering someone would be willing to help me troubleshoot/chat through what i might not be thinking about. for context, i’m running on python 3.8, pyspark 3.2.1, and scala2.12

Tung Nguyen

03/07/2024, 10:29 AM

Hi Nixtla team, may I know what happens when I don't specify the season_length in the models? Do the models still try to capture seasonality if there are any? I look at one of the series and there's no clear seasonal pattern. If there is, probably monthly or quarterly at best. # Initialize the models models = [ AutoARIMA(), AutoETS(damped=True), DynamicOptimizedTheta() ]. I have weekly data. I've tried m = 52 but got error x must have 2 complete cycles requires 104 observations. x only has 96 observation(s). I think m = 26 also got similar error as well. I'm not sure about m = 13 yet. There are almost 20,000 series ranging from 5 data points to 157 data points.

Thiago Vidigal

03/09/2024, 1:07 PM

Hi everybody! I'm have a troble whem I try to obtaim the in-sample forecasts from my model. The code works fine whithout the arg "fitted" in forecast module, but with arg the code broke and rose this exeption "*NotImplementedError*: return fitted", can anyone help me with this? This is my notebook and my data.

test_data.csv Fitted Forecasts.ipynb

Makarand Batchu

03/14/2024, 2:19 PM

Hi all. I'm trying to fit a model using statsforecast and when I run the below code I get an error - "ImportError: Numba needs NumPy 1.24 or less". Is this as expected or do I need to have a lower version of numpy to get it working?

Copy code

from statsforecast.models import (
    MSTL
)

# Create a list of models and instantiation parameters 
models = [
    MSTL(season_length = [7,31])
]

Makarand Batchu

03/14/2024, 4:10 PM

Hi team. I have a quick question on the 'horizon' parameter of statsforecast. By default, based on the number passed for 'horizon', the model returns predicted values from the next interval from the last interval in training data. Is there a way to modify this? As in assuming that the freq param is days and my model was trained on data till 13/03. For h = 31, by default model.predict() returns predictions from 14/03. Is there a way that model.predict() returns predictions from a custom date other than 14/03? Thanks in advance!

Brian Head

03/14/2024, 5:04 PM

Does statsforecast distributed work with both pyspark.sql.dataframe.DataFrame and pyspark.pandas.frame.DataFrame? I see documentation for the former. I've been trying to get the latter to work and haven't had success and I'm wondering if it is what I'm doing or it's not built to work with that.

Valeriy

03/15/2024, 7:22 PM

I am getting repeated error when trying to use ‘fitted=True’ for further extraction of forecasts in sample. Pretty sure the same code worked on another dataset before. The fit is done as in ’# initialise and train the model sf = StatsForecast(models=models, freq=‘M’, n_jobs=-1, fallback_model = HistoricAverage()) sf.fit(train_df)'

Clarisse Chia

03/18/2024, 3:02 PM

hi team, problem context: i’m trying to forecast for many (200k to 1M+) series with known weekly and holidays/special dates seasonality patterns in databricks. what i’ve tried: for holidays that fall on the same day-of-week, i’ve been able to cut “n=5 pieces of weeks” from each year and join them together to create a artificial timeline, specifying the

n-week

7 days

seasonality to try to capture the holiday/special date seasonality effect. below is how i’ve set up the modeling problem (will add example of modeling code setup in thread) would love advice on the following pieces: 1. how to speed up

.forecast()

, or more specifically, writing the

.forecast()

output? a. context: it’s currently taking anywhere from 5 to 17 hours, depending on what exogenous features i pass in for the ~200k series, despite the shortened “artificial timeline” (vs. full year’s timelines for each past year) 2. how do i reframe the problem such that i’m able to capture holiday/special date effect without having to create this “artificial timeline”? a. context: it’s working when i’m setting up the timeline to capture holiday/special dates that fall on the same “day of week” every year, but i worry about that same ability for holiday/special dates that do not fall on the same “day of week” thanks in advance!!

Makarand Batchu

03/19/2024, 11:15 AM

Hi team. I want to understand a bit more about the

prediction_intervals

parameter in models of statsforecast. I understand that I have to pass

ConformalIntervals

which takes

horizon

and

n_windows

but can someone explain what this all means and how it can be used help me improve forecasts? And how it is different from when nothing is passed for

prediction_intervals

? Thank you in advance.

Brian Head

03/21/2024, 1:41 PM

Hi, I'm working through a conversion of using Statsforecast from the local version to using distributed processing with Spark/Fugue. I've gotten the fill_gaps, mstl_decomposition (with Jose's help), and cross-validation working. However, when I get to

forecast

I get the error in the attached screen shot. Pertinent details: • I've been starting with samples of data (with seeds for consistency) and then will remove that when ready to scale up to the full dataframe.

forecast

actually does work with samples under 5% (less than ~75 series with 48 monthly observations for training and 3 for forecasting). But, when I increase the frac to 0.05 I get this error. • Given the error message, I thought it might be an issue with some of the data pulled in after the increase. However, I have done a couple of things I think rule that out ◦ Displayed the data and looked through it. Everything looked fine. ◦ Pulled it back down to regular pandas dataframe and ran everything that way. It works fine then with no errors--even when increasing the sample to 50%. Before going to our data engineers, I wanted to check if there's any other thoughts or suggestions. They are helpful with many things, but they aren't familiar with Statsforecast, so wanted to rule ou any other things before pulling them in. Thanks for any help you can provide.

Brian Head

03/21/2024, 1:48 PM

image.png

Brian Head

03/21/2024, 2:13 PM

BTW, this is after the forecast function runs for ~23 minutes. Something it does locally in 1.9 seconds.

Jeff Tackes

03/24/2024, 2:54 AM

Hi All, Any insight into why i get flat forecast for ETS (and AutoETS, and near flat with AutoARIMA). I am working with 30min frequency data, and have 2 years of training data. I am loading my season_length =48*7.

Copy code

sf = StatsForecast(
    models = [ETS(season_length=48*7)],
    freq = "30min"
)

sf.fit(ts_train,        
        id_col = 'LCLid',
        time_col = 'timestamp',
        target_col = 'energy_consumption', )
sf.predict(h=48)

My data has enough fluctuation where i would have thought there would be better "movement". When i run ETS using DARTS, i do not get a flat forecast and get cyclic patterns showing in my forecast. Additionally, when i run ETS in NIXTLA, it takes several minutes whereas in DARTS it took 26 seconds.

Makarand Batchu

03/25/2024, 2:43 PM

Hi team. I am trying out the cross validation functionality in statsforecast. Can you please explain

n_windows

and

step_size

with an example? As it is unclear as to how to chose these parameter values. Thanks in advance!

Brian Head

03/26/2024, 6:55 PM

I have two questions about "local" processing vs distributed processing with spark. 1. Can anyone offer any guidance on optimizing distributed processing, in part repartitions? I've got mine ordered and then repartition trying between 50 and 150 (by unique_id) across 8-12 cores in Databrick. When the system isn't loaded with other work (from co-workers) it runs successfully. However, a. Oddly the 5 fold CV I'm using runs much faster than both the

forecast

function and

forecast_fitted_values function

. For example, on my local laptop the CV and forecast functions run for approximately the same amount of time and the extraction of fitted values takes only a few seconds. However, when using spark in Databricks, the forecast and forecast_fitted values functions take about 3-4 times as long as the CV. Is that normal behavior? I'm wondering if it might have anything to do with the partitioning. b. I've read some sources that say there should be 3-4 partitions per core. However, that's not realistic at all for my situation given the resources my team and I have. Is there any other guideline for the number of partitions? 2. I can understand that for non-statistical models I might get slightly different results. However, assuming I've got the exact same data, I should get the same results when training and forecasting with a statistical model no matter the processing type (e.g., local or distributed) and environment (e.g., laptop vs something like Databricks), right?

Clarisse Chia

03/27/2024, 2:53 PM

Hi team, I have a question regarding sudden unexpected forecasts. I have been running the forecast with the same parameters with no issues, but I have recently been getting all either

null

forecasts and was wondering what might be going wrong. the dataset im working with is sensitive, but if helpful, below is the simple model setup

Copy code

from statsforecast import StatsForecast
from statsforecast.models import SeasonalNaive, AutoARIMA

# configure model
models = [AutoARIMA(season_length=7, nmodels=5, trace=True)]
statsforecast = StatsForecast(models=models, freq="D", fallback_model=SeasonalNaive(season_length=7), n_jobs=-1)

# forecast
horizon = test_x.select('ds').dropDuplicates().count()
forecast_results = statsforecast.forecast(df=train_set, h=horizon, X_df=test_x)

the model has been working quite well until recently, when i changed how one exogenous variable would look in the future forecast (within `test_x`; based on business assumptions)

Valeriy

04/02/2024, 3:50 PM

I am producing prediction intervals with specified levels array([0.95, 0.9 , 0.8 , 0.7 , 0.6 , 0.5 , 0.4 , 0.3 , 0.2 , 0.1 ]) in the results columns come with imprecise numbers for some reason.

Valeriy

04/04/2024, 2:19 PM

Is there anyway to get rid of this warning when importing statsforecast?

Jeff Tackes

04/04/2024, 8:44 PM

Do others have the same experience that THETA methods in statsforecast are very slow? I am working with 30min data, and in DARTS theta takes <1 sec. In Nixtla, it takes 2 minutes for a single time series. It is a large time series, with 35,000 records.

👍 1

Clarisse Chia

04/04/2024, 10:29 PM

hi team, a forecast model that i’ve been building has been forecasting extreme values (e.g., forecasts negative/positive quadrillion when it should be forecasting ~10 millions) and i was wondering if there is something i should understand about how

AutoARIMA(season_length=7)

uses the exogenous variables we feed it. context on model setup: 1. ~4 years of complete daily sales history 2. exogenous variables a. covid indicators b. day of week indicators c. day of week * holiday indicators i. idea here is to capture sales peak for each holiday, especially when a holiday falls on a different day of week each year ii. [problem] this is where i notice, that when i have multiple holidays fall really close to each other (e.g., superbowl/st. patricks/easter), the forecasts can output some pretty extreme and unreasonable values 1. i wonder if the exogenous variables may be multiplicative (rather than additive, causing these extremely values to populate when these indicators fall on the same dates?) would really appreciate it if folks have any suggestions of what I might be missing!

Valeriy

04/05/2024, 1:59 PM

I have an issue with AutoARIMA and even ARIMA crashing due to memory issues on one time series, is this a known issue/any workarounds? Notebook crashed both on laptop and also Colab with high memory. Setup with external variables.

Valeriy

04/06/2024, 1:04 PM

Is there AutoSARIMAX in statsforecast?

Vítor Barbosa

04/18/2024, 4:32 PM

Hi team, are there any parameters or tips to speedup AutoARIMA or AutoETS?

Nils de Korte

04/19/2024, 11:07 AM

Hi team, I am using AutoTheta as a trend_forecaster for MSTL. It chooses the best Theta model automatically, but how do I know which one it chooses? And what the scores of the others are? Thanks!

Abishek

04/21/2024, 6:13 AM

hi channel, I know it is not right place. A simple

transform(generate_data(20), forecast, partition={"num":500, "by":"unique_id"}).show()

throws error. can anyone help. error

Copy code

8 ERROR ArrowPythonRunner: Python worker exited unexpectedly (crashed)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/anaconda3/envs/bigdata/lib/python3.8/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 1225, in main
    eval_type = read_int(infile)

Abishek

04/21/2024, 7:06 PM

solved the error. Java and pandas version mismatch with pyspark. Java should be less than or equal to 17.

Bharath Vishal G

04/22/2024, 6:42 PM

I used SHAP for explainability with MLForecast modes Is there any library integrated with

StatsForecast

that could help me show explainability for StatsForecast models, , Appreciate any resources/direction?

Dimitris Floros

04/23/2024, 2:02 AM

Hello all! Is there any way to predict at a different cutoff than the one at the end of my training set? For instance, MLForecast allows the

df_new

in predict, is there something similar in statsforecast?

Jeff Tackes

04/25/2024, 12:20 AM

hello - is there a rolling moving average method in statsforecast?

Yan Liu

04/26/2024, 4:40 AM

hello all ! For AutoARIMA training, would reduce

nmodels

from 5 to 4 significantly impact training time? We're training AutoARIMA for 8,000 - 12,000 time series models using the AutoARIMA instance specified as below, but it takes very long time and for some instances we saw this refitting (while ARIMA is almost instantaneous)

auto_arima_model = [AutoARIMA(season_length=7, nmodels=5, trace=True)]