# Profiling Code Runtime
Follow these steps from `pytest-profiling` to profile the code and visualize the result:
- Install the plugin: `pip install pytest-profiling`
- Register the plugin via `pytest_plugins = ['pytest_profiling']` in your `conftest.py` (not needed for NeuralProphet, as we use setuptools entry points)
- Generate a test file that runs a sample model with a configuration of your choice, e.g. `bottlenecks.py`
- Run the file in your terminal using `pytest path/to/file/bottlenecks.py --profile --profile-svg` to profile the code (i.e. measure the runtimes of the different code sections) and create a visualization (helpful for identifying bottlenecks)
- `pstats` files (one per test item) are retained for later analysis in the `prof` directory, along with the `combined.prof` and `combined.svg` files (see the `pstats` sketch below)
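The retained profiles can also be inspected programmatically with the standard-library `pstats` module, for example (a minimal sketch; adjust the path if your `prof` directory differs):

```python
import pstats

# Load the combined profile that pytest-profiling writes to the prof directory
stats = pstats.Stats("prof/combined.prof")

# Show the 20 functions with the highest cumulative runtime
stats.sort_stats("cumulative").print_stats(20)
```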
One possible model configuration (single or panel time series) to be profiled:
```python
import numpy as np
import pandas as pd
from neuralprophet import NeuralProphet

# Two years of hourly data with random values
start_date = "2019-01-01"
end_date = "2021-01-01"
date_range = pd.date_range(start=start_date, end=end_date, freq="H")
y = np.random.randint(0, 1000, size=(len(date_range),))
df = pd.DataFrame({"ds": date_range, "y": y, "ID": "df1"})
df_c = df.copy()
# Uncomment to turn the single time series into a panel of 100 series
# for i in range(2, 101):
#     df_c["ID"] = f"df{i}"
#     df = pd.concat((df, df_c))
m = NeuralProphet(
    n_forecasts=24,
    n_lags=7 * 24,
    weekly_seasonality=True,
    daily_seasonality=True,
    yearly_seasonality=True,
    num_hidden_layers=4,
    d_hidden=64,
    epochs=10,
    batch_size=448,
    learning_rate=0.001,
)
# Rolling means of y serve as lagged regressors
df["A"] = df["y"].rolling(7, min_periods=1).mean()
df["B"] = df["y"].rolling(30, min_periods=1).mean()
df["C"] = df["y"].rolling(24, min_periods=1).mean()
m = m.add_lagged_regressor(names="A", n_lags=24)
m = m.add_lagged_regressor(names="B")
m = m.add_lagged_regressor(names="C", n_lags=24, num_hidden_layers=2, d_hidden=48)
metrics_df = m.fit(df, freq="H", num_workers=64, minimal=True)
forecast = m.predict(df)
```
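Note that `pytest-profiling` profiles per collected test item, so the configuration above would sit inside a test function in `bottlenecks.py` to be profiled. A minimal sketch (the function name and reduced settings are illustrative):

```python
import numpy as np
import pandas as pd
from neuralprophet import NeuralProphet

def test_bottlenecks():
    # Build the sample dataset and model as above, then fit and predict
    date_range = pd.date_range(start="2019-01-01", end="2021-01-01", freq="H")
    y = np.random.randint(0, 1000, size=(len(date_range),))
    df = pd.DataFrame({"ds": date_range, "y": y, "ID": "df1"})
    m = NeuralProphet(n_forecasts=24, n_lags=7 * 24, epochs=10)
    m.fit(df, freq="H")
    m.predict(df)
```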
Observations from profiling this configuration:

- Runtime increases linearly with the number of time series added to the dataset (as of April 16, 2023)
- `TimeDataset` and its subfunctions are slow
- `drop_nan_after_init` is always called, even without specifying `drop_missing=True`
- Consider vectorization (or multiprocessing?) wherever there is a `for df_name, df in df.groupby('ID')` loop in `m.fit()` and `m.predict()` (see the groupby sketch after this list)
- Consider a `FastTensorDataLoader` instead of the PyTorch `DataLoader` (sketched below): https://towardsdatascience.com/better-data-loading-20x-pytorch-speed-up-for-tabular-data-e264b9e34352
- Regardless of the normalization type selected (`'global'` or `'local'`), we always compute both sets of normalization params
- `pd.concat()` in `m.predict()` is very slow for large datasets (see the concat sketch after this list)
- TimeNet covariates are more time-consuming than the AR net, and they are called one after another. Vectorize?
- Understand the `num_workers` parameter, as the maximum number of CPU cores does not seem to be the best choice in all cases (see the timing sketch below)
- Preprocessing for large datasets takes a long time; the reason is unknown
- Implement Ray Lightning? https://speakerdeck.com/anyscale/faster-time-series-forecasting-using-ray-and-anyscale
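On the vectorization point above: per-ID work done in a Python loop can often be replaced by a single grouped, vectorized operation over the whole frame. A minimal illustrative sketch of the pattern (per-series standardization as a stand-in for the real per-ID work, not NeuralProphet internals):

```python
import numpy as np
import pandas as pd

# A small panel: 100 series of 1,000 points each
df = pd.DataFrame({
    "ID": np.repeat([f"df{i}" for i in range(1, 101)], 1000),
    "y": np.random.rand(100_000),
})

# Slow pattern: an explicit Python loop over groups
parts = []
for df_name, part in df.groupby("ID"):
    part = part.copy()
    part["y_norm"] = (part["y"] - part["y"].mean()) / part["y"].std()
    parts.append(part)
result_loop = pd.concat(parts)

# Vectorized pattern: grouped transforms in a single pass
grp = df.groupby("ID")["y"]
df["y_norm"] = (df["y"] - grp.transform("mean")) / grp.transform("std")
```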
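On the `FastTensorDataLoader` item: the linked article's core idea is a thin iterator that slices pre-assembled tensors in batches, skipping the generic `DataLoader`'s per-sample indexing and collation. A condensed sketch of that idea (adapted, not the article's exact code):

```python
import torch

class FastTensorDataLoader:
    """Batch iterator over equally sized tensors via slicing,
    avoiding per-sample __getitem__ and collate overhead."""

    def __init__(self, *tensors, batch_size=32, shuffle=False):
        assert all(t.shape[0] == tensors[0].shape[0] for t in tensors)
        self.tensors = tensors
        self.n_samples = tensors[0].shape[0]
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __iter__(self):
        order = torch.randperm(self.n_samples) if self.shuffle else torch.arange(self.n_samples)
        for start in range(0, self.n_samples, self.batch_size):
            idx = order[start : start + self.batch_size]
            yield tuple(t[idx] for t in self.tensors)

    def __len__(self):
        return (self.n_samples + self.batch_size - 1) // self.batch_size
```

Usage would be e.g. `for x_batch, y_batch in FastTensorDataLoader(X, y, batch_size=448, shuffle=True): ...`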
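On the `pd.concat()` point: concatenating inside a loop copies the accumulated frame on every iteration, which is quadratic overall; collecting the pieces in a list and concatenating once is linear. A minimal sketch of the pattern:

```python
import numpy as np
import pandas as pd

parts = [pd.DataFrame({"y": np.random.rand(1000)}) for _ in range(100)]

# Quadratic: each concat copies everything accumulated so far
out = pd.DataFrame()
for part in parts:
    out = pd.concat([out, part])

# Linear: collect first, concatenate once
out = pd.concat(parts, ignore_index=True)
```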
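On the `num_workers` question: one empirical approach is to time a short fit across candidate values instead of defaulting to the CPU count. A hypothetical sketch (assuming `df` from the configuration above; settings reduced for a quick comparison):

```python
import time

from neuralprophet import NeuralProphet

# Compare wall-clock fit time across candidate worker counts
for workers in (0, 4, 16, 64):
    m = NeuralProphet(n_forecasts=24, n_lags=7 * 24, epochs=2)
    start = time.perf_counter()
    m.fit(df, freq="H", num_workers=workers)
    print(f"num_workers={workers}: {time.perf_counter() - start:.1f}s")
```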