API#

Configuration#

class endogen.config.Config(_variable_dict: dict, global_config: ~endogen.config.GlobalSimConfig, variables: ~typing.List[~endogen.config.InputModel | ~endogen.config.ExogenModel] = <factory>)[source]#

Global configuration schema for the endogenous simulation. Can be used separately, or in conjunction with .yaml files and hydra.initialize, hydra.compose, and hydra.utils.instantiate.

Parameters:

global_config (GlobalSimConfig) – Global simulation configuration options
variables (list[InputModel | ExogenModel]) – List of configuration schemas for the models to include in the endogenous simulation. Note the restrictions on circular definitions described under InputModel.subset.

class endogen.config.Differences(type: Literal['growth'], input_vars: List[str])[source]#

A schema for describing growth variables and a factory for variables.VariableDifference variables. Can be used separately, or in conjunction with .yaml files and hydra.initialize, hydra.compose, and hydra.utils.instantiate.

Will return output variables named the same as the input variables, only with “_gr” suffixed.

Parameters:

type (Literal["growth"]) – Growth is currently the only difference function available. Note that division by zero is possible.
input_vars (list[str]) – List of input variables to transform.

class endogen.config.ExogenModel(output_var: str, exogen_data: str, subset: int = 1)[source]#

Configuration schema for statistical model of any variable, to be used in endogenous simulation. Can be used separately, or in conjunction with .yaml files and hydra.initialize, hydra.compose, and hydra.utils.instantiate.

An ExogenModel variable is forecast data coming from somewhere else. Currently, only deterministic exogenous variables are supported. The data must be complete for all units in the simulation system, from the start date of the simulation to the end date.

Parameters:

output_var (str) – The name of the output variable in question.
exogen_data (str) – String path to .csv or .parquet file only including time_var, unit_var and output_var.
subset (int) – This should just always be 1. Might remove this as an option.

property node: Tuple[str, Mapping[str, Any]]#

A node representation that interfaces well with NetworkX graphs.

Returns: A tuple where the first element is the output variable name (“node”), and the second element is a dictionary of node data.
Return type: Tuple[str, Mapping[str, Any]]

class endogen.config.GlobalSimConfig(input_data: str, time_var: str, unit_var: str, nsim: int, end: int, include_past_n: int, start: int | None = None, vars: ~typing.List[str] | None = <factory>)[source]#

Configuration schema for global simulation options. Can be used separately, or in conjunction with .yaml files and hydra.initialize, hydra.compose, and hydra.utils.instantiate.

Parameters:

input_data (str) – Path to input data in either .csv or .parquet file format. Used both for training and as initial values in simulation.
time_var (str) – Name of the variable in the input_data indicating the time dimension. The variable must be integer type.
unit_var (str) – Name of the variable in the input_data indicating the unit/spatial dimension. The variable must be integer type.
nsim (int) – Number of independent simulations to run.
end (int) – The time-unit to end simulation. Since these are fully described endogenous simulations, they can go indefinitely.
include_past_n (int) – How much of the past to include when fitting statistical models.
start (int) – The time-unit to start simulation. Must be an integer value found in the time_var series in the input_data.
vars (list[str]) – The subset of variables in the input_data to include.

class endogen.config.InputModel(stage: ~typing.Literal['writing', 'evaluating', 'production'], output_var: str, input_vars: ~typing.List[str], model: ~typing.Any, lags: ~typing.List[~endogen.config.Lags] | None = <factory>, rolling: ~typing.List[~endogen.config.Rolling] | None = <factory>, differences: ~typing.List[~endogen.config.Differences] | None = <factory>, transforms: ~typing.List[~endogen.config.Transform] | None = <factory>, subset: int = 1)[source]#

Endogenous simulation requires not only knowledge of the statistical model, but also of any other variable input in the model. Some of these might be statistical models on their own (e.g., an InputModel), whilst other variables might be variable transforms of various types (see the variables module). These models must be fully specified here. Note the naming conventions for variable transforms in the various config schemas. E.g., for referencing a 1-year lagged variable as input_var, you can put the non-lagged variable “var1” in InputModel.lags, and “var1_l1” in InputModel.input_vars. The endogen.ModelController will make sure variables are calculated in the correct sequence.

Parameters:

stage (Literal["writing", "evaluating", "production"]) – The development stage the InputModel is currently in. Can be useful in larger production settings.
output_var (str) – The name of the output variable in question.
input_vars (list[str]) – List of input variables the output variable needs in its model.
model (Any) – Any supported statistical (or otherwise) model class that can produce numerical output (forecasts) based on input data. Currently, that means any sklearn.base.BaseEstimator subclass or mlforecast.forecast.MLForecast.
lags (list[Lags]) – List of config.Lags necessary to build the input_vars.
rolling (list[Rolling]) – List of config.Rolling necessary to build the input_vars.
differences (list[Differences]) – List of config.Differences necessary to build the input_vars.
transforms (list[Transform]) – List of config.Transform necessary to build the input_vars.
subset (int) – Endogenous simulation requires that all variables are fully specified in a circular fashion. At the same time, there cannot be any circular definitions in the transformation step, nor in the forecast step. If subset == 0, the variable is estimated/calculated in the transformation step; if subset == 1, it is estimated/calculated in the forecast step.

property edges: List[Tuple[str, str]]#

The edges between input variables and the output variable that interface well with NetworkX graphs.

Returns: A list of graph edges describing the links between the input_vars and the output_var.
Return type: List[Tuple[str, str]]

property node: Tuple[str, Mapping[str, Any]]#

A node representation that interfaces well with NetworkX graphs.

Returns: A tuple where the first element is the output variable name (“node”), and the second element is a dictionary of node data.
Return type: Tuple[str, Mapping[str, Any]]
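The edges property above yields (input, output) pairs so that the calculation order can be derived from a graph. As a minimal sketch of that idea, the following uses the standard-library graphlib module instead of NetworkX or endogen itself. The variable names and the helper calculation_order are hypothetical, and this is not the library's own scheduling code.

```python
from graphlib import TopologicalSorter

# Hypothetical edges in (input_var, output_var) form, mirroring the
# List[Tuple[str, str]] shape documented for the edges property.
edges = [
    ("gdp", "gdp_l1"),       # a lag transform of gdp
    ("gdp_l1", "conflict"),  # a model that uses the lagged variable
    ("pop", "conflict"),     # a second input to the same model
]


def calculation_order(edges):
    """Return variable names so that every input precedes its outputs."""
    sorter = TopologicalSorter()
    for input_var, output_var in edges:
        # add(node, *predecessors): output_var depends on input_var.
        sorter.add(output_var, input_var)
    return list(sorter.static_order())


order = calculation_order(edges)
print(order)
```

In this sketch, a circular definition within one step (for example ("a", "b") together with ("b", "a")) makes static_order raise graphlib.CycleError, which illustrates why circular definitions inside the transformation step or the forecast step cannot be scheduled.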

class endogen.config.Lags(num_lag: int, input_vars: List[str])[source]#

A schema for describing lagged variables and a factory for variables.VariableLag variables. Can be used separately, or in conjunction with .yaml files and hydra.initialize, hydra.compose, and hydra.utils.instantiate.

Will return output variables named the same as the input variables, only with “_l{num_lag}” suffixed.

Parameters:

num_lag (int) – How many time-units to offset. E.g., 1 would lag a time-series 1 time-unit compared to the input_var.
input_vars (list[str]) – List of input variables to transform.
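The naming rule and the offset described above can be illustrated with plain Python lists. This is a sketch only: lag_name and lag_series are hypothetical helpers, and padding with None stands in for the missing values that a real shift would produce.

```python
def lag_name(input_var: str, num_lag: int) -> str:
    """Build the documented output name, e.g. var1 -> var1_l1."""
    return f"{input_var}_l{num_lag}"


def lag_series(values: list, num_lag: int) -> list:
    """Offset a time-ordered series by num_lag steps, padding with None."""
    if num_lag < 0:
        raise ValueError("num_lag must be non-negative in this sketch")
    padded = [None] * num_lag + list(values)
    # Keep the original length so the lagged series aligns with the input.
    return padded[: len(values)]


print(lag_name("var1", 1))
print(lag_series([10, 20, 30], 1))
```

The value observed at time t-1 ends up at position t, which is what allows “var1_l1” to be listed in InputModel.input_vars while “var1” is listed in InputModel.lags.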

class endogen.config.Rolling(window: int, funs: List[Literal['mean', 'sum']], input_vars: List[str], window_type: Literal['normal', 'span', 'com', 'halflife', 'alpha'] = 'normal')[source]#

A schema for describing variables with “rolling” transformations and a factory for variables.VariableRolling variables. Can be used separately, or in conjunction with .yaml files and hydra.initialize, hydra.compose, and hydra.utils.instantiate.

Will return output variables named the same as the input variables, only with a suffix according to this scheme:

{input_var}_{window_type_suffix}{fun_suffix}{window} where:

window_type	suffix
normal	_r
span	_rsp
com	_rc
halflife	_hl
alpha	_ral

fun	suffix
mean	m
sum	s

Parameters:

window (int) – The window size in time-units.
funs (list[Literal["mean", "sum"]]) – List of aggregation functions (rolling mean or rolling sum).
input_vars (list[str]) – List of input variables to transform.
window_type (Literal["normal", "span", "com", "halflife", "alpha"]) – “normal” is equally weighted. See pandas.DataFrame.ewm for details on the rest.
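To make the suffix scheme concrete, here is a small sketch that builds output names from the two tables above. It assumes that the leading underscore in each window_type suffix is the same separator shown in the template, so that only one underscore appears in the result. That reading is an assumption, and rolling_name is a hypothetical helper rather than part of endogen.

```python
# Suffix tables copied from the documentation above.
WINDOW_TYPE_SUFFIX = {
    "normal": "_r",
    "span": "_rsp",
    "com": "_rc",
    "halflife": "_hl",
    "alpha": "_ral",
}
FUN_SUFFIX = {"mean": "m", "sum": "s"}


def rolling_name(input_var: str, window_type: str, fun: str, window: int) -> str:
    """Build one output name (assumes a single underscore separator)."""
    return f"{input_var}{WINDOW_TYPE_SUFFIX[window_type]}{FUN_SUFFIX[fun]}{window}"


def rolling_names(input_vars, funs, window, window_type="normal"):
    """One name per (input variable, aggregation function) combination."""
    return [
        rolling_name(var, window_type, fun, window)
        for var in input_vars
        for fun in funs
    ]


print(rolling_names(["var1", "var2"], ["mean", "sum"], window=3))
```

Under this assumption, a 3-period equally weighted rolling mean of “var1” would be referenced as “var1_rm3” in InputModel.input_vars.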

class endogen.config.Transform(output_var: str, input_vars: List[str], formula: str, after_forecast: bool = False)[source]#

A schema for describing a variable transform and a factory for variables.VariableTransform variables. Can be used separately, or in conjunction with .yaml files and hydra.initialize, hydra.compose, and hydra.utils.instantiate.

Parameters:

output_var (str) – Name of the output variable
input_vars (list[str]) – List of input variables needed to create the output variable
formula (str) – A Wilkinson formula supported by formulae. See https://bambinos.github.io/formulae/notebooks/getting_started.html#User-guide.
after_forecast (bool) – Endogenous simulation requires that all variables are fully specified in a circular fashion. At the same time, there cannot be any circular definitions in the transformation step, nor in the forecast step. If after_forecast is True, the variable is estimated/calculated in the forecast step.

get_variables() → VariableTransform[source]#

Helper function to create a VariableTransform.

Return type: VariableTransform

Endogen simulation#

class endogen.endogen.EndogenousSystem(input_data: str | ~os.PathLike | ~pandas.core.frame.DataFrame, time_var: str, unit_var: str, nsim: int, end: int, vars: ~typing.Sequence[str] | None = <factory>, start: int | None = None, include_past_n: int | None = None)[source]#

An endogenous panel-data system of models/nodes with associated methods for correct scheduling of model forecasts.

Parameters:

input_data (str or pandas.DataFrame) – Panel data (or path to data) that includes all variables required by the forecasting system (and possibly fitting of models).
time_var (str) – The variable name indicating time in input_data.
unit_var (str) – The variable name indicating units in input_data.
nsim (int) – The number of independent simulations of the endogenous system.
start (int) – The number on the same scale as time_var when forecasting should start.
end (int) – The number on the same scale as time_var when forecasting should end.
vars (Optional[Sequence[str]]) – A subset of variables in input_data. Defaults to all variables in input_data.
include_past_n (Optional[int]) – How much of the past to include when fitting statistical models.

plot(var: str, unit: Sequence[int] | None = None, start: int | None = None, *args, **kwargs)[source]#

Plot method for historical and forecasted data.

Parameters:

var (str) – Name of the variable you want to plot
unit (Optional[Sequence[int]]) – List of subset of units you want to plot in facets. Plot will otherwise show global statistics.
start (Optional[int]) – Alternative start time. Plot will otherwise start as early as possible with the data given.
args – Other arguments to pass to seaborn.relplot
kwargs (key, value pairings) – Dictionary of keyword arguments to pass to seaborn.relplot

Returns:

An object managing one or more subplots that correspond to conditional data subsets with convenient methods for batch-setting of axes attributes.

Return type:

seaborn.FacetGrid

class endogen.endogen.ModelController[source]#

A controller for organizing and scheduling models.

add_models(models: InputModel | ExogenModel | Sequence[InputModel | ExogenModel]) → None[source]#

Adds one or more models to the system.

Parameters: models (InputModel | ExogenModel | Sequence[InputModel | ExogenModel]) – One model, or a sequence of models, of any type supported by the VariableModel class.

class endogen.endogen.ModelSchedule(delta_t: int, schedule: Iterable[str | Iterable[str]])[source]#

class endogen.variables.Variable(output_var: str, subset: int)[source]#

A variable class that holds the information necessary to represent a variable model or transformation in the simulation system. Not used for statistical models (see config.InputModel).

Parameters:

output_var (str) – Name of the output variable
subset (int) – Endogenous simulation requires that all variables are fully specified in a circular fashion. At the same time, there cannot be any circular definitions in the transformation step, nor in the forecast step. If subset == 0, the variable is estimated/calculated in the transformation step; if subset == 1, it is estimated/calculated in the forecast step.

Raises:

NotImplementedError – This is just the base class that other Variable classes inherit from. This class is not meant to be instantiated.

calc(xd: Dataset) → DataArray[source]#

A function that takes input data that includes input-variables required to calculate the output_var.

Parameters: xd (xarray.Dataset) – An xarray.Dataset, often EndogenousSystem._xa (simulation data), or the EndogenousSystem._past.
Returns: A DataArray with the output_var, properly indexed.
Return type: xarray.DataArray
Raises: NotImplementedError – This is just the base class that other Variable classes inherit from. This class is not meant to be instantiated.

property node: Tuple[str, Mapping[str, Any]]#

A node representation that interfaces well with NetworkX graphs.

Returns: A tuple where the first element is the output variable name (“node”), and the second element is a dictionary of node data.
Return type: Tuple[str, Mapping[str, Any]]

class endogen.variables.VariableDifference(output_var: str, subset: int, input_var: str, type: Literal['growth'])[source]#

A Variable class for the 1-year growth transformation function. There is room for generalization here.

Parameters:

output_var (str) – Name of the output variable
subset (int) – Endogenous simulation requires that all variables are fully specified in a circular fashion. At the same time, there cannot be any circular definitions in the transformation step, nor in the forecast step. If subset == 0, the variable is estimated/calculated in the transformation step; if subset == 1, it is estimated/calculated in the forecast step.
input_var (str) – Name of the input variable
type (Literal["growth"]) – Currently, only “growth” is supported. Any temporal difference function could in principle be supported here.

calc(xd: Dataset) → DataArray[source]#

A growth function that takes a single input variable and outputs the growth from t-1 to t. This calculates the exact growth, and not growth based on log difference. Uses xarray internal functions.

Warning! Divide by zero is a possibility here.

Parameters: xd (xarray.Dataset) – An xarray.Dataset, often EndogenousSystem._xa (simulation data), or the EndogenousSystem._past.
Returns: A DataArray with the output_var, properly indexed.
Return type: xarray.DataArray
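The difference between exact growth and log-difference growth, and the divide-by-zero case, can be seen in a short sketch on plain lists. The helpers exact_growth and log_growth are hypothetical and do not use xarray. Returning NaN when the previous value is zero is a choice made for this sketch, not documented behaviour of VariableDifference.calc.

```python
import math


def exact_growth(values):
    """Growth from t-1 to t as (x_t - x_{t-1}) / x_{t-1}."""
    if not values:
        return []
    out = [math.nan]  # no previous value exists for the first period
    for prev, curr in zip(values, values[1:]):
        if prev == 0:
            out.append(math.nan)  # sketch's choice for division by zero
        else:
            out.append((curr - prev) / prev)
    return out


def log_growth(values):
    """Log-difference approximation, shown only for contrast."""
    if not values:
        return []
    out = [math.nan]
    for prev, curr in zip(values, values[1:]):
        if prev <= 0 or curr <= 0:
            out.append(math.nan)  # the logarithm is undefined here
        else:
            out.append(math.log(curr / prev))
    return out


series = [100.0, 110.0, 0.0, 5.0]
print(exact_growth(series))
print(log_growth(series))
```

Going from 100 to 110 is an exact growth of 0.10, while the log difference is roughly 0.095. The two measures diverge further as changes get larger.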

class endogen.variables.VariableLag(output_var: str, subset: int, input_var: str, num_lag: int)[source]#

A Variable class for lagged variables (temporal offset). Uses xarray.DataArray.shift.

Parameters:

output_var (str) – Name of the output variable
subset (int) – Endogenous simulation requires that all variables are fully specified in a circular fashion. At the same time, there cannot be any circular definitions in the transformation step, nor in the forecast step. If subset == 0, the variable is estimated/calculated in the transformation step; if subset == 1, it is estimated/calculated in the forecast step.
input_var (str) – Name of the input variable
num_lag (int) – How many time-units to offset. E.g., 1 would lag a time-series 1 time-unit compared to the input_var.

calc(xd: Dataset) → DataArray[source]#

Calculates the lagged output variable given an xarray.Dataset that includes the input variable.

Parameters: xd (xarray.Dataset) – An xarray.Dataset, often EndogenousSystem._xa (simulation data), or the EndogenousSystem._past.
Returns: A DataArray with the output_var, properly indexed.
Return type: xarray.DataArray

class endogen.variables.VariableRolling(output_var: str, subset: int, input_var: str, window: int, fun: Literal['mean', 'sum'], window_type: Literal['normal', 'span', 'com', 'halflife', 'alpha'])[source]#

A Variable class for rolling time-series functions. Rolling means and sums, as well as exponentially weighted moving windows are supported. See xarray.Dataset.rolling and xarray.Dataset.rolling_exp for details.

Parameters:

output_var (str) – Name of the output variable
subset (int) – Endogenous simulation requires that all variables are fully specified in a circular fashion. At the same time, there cannot be any circular definitions in the transformation step, nor in the forecast step. If subset == 0, the variable is estimated/calculated in the transformation step; if subset == 1, it is estimated/calculated in the forecast step.
input_var (str) – Name of the input variable
window (int) – The window size in time-units.
fun (Literal["mean", "sum"]) – The aggregation function (rolling mean or rolling sum).
window_type (Literal["normal", "span", "com", "halflife", "alpha"]) – “normal” is equally weighted. See pandas.DataFrame.ewm for details on the rest.

calc(xd: Dataset) → DataArray[source]#

Calculates the temporally rolling transformations given an xarray.Dataset that includes the input_var. Uses xarray internal functions.

Parameters:

xd (xarray.Dataset) – An xarray.Dataset, often EndogenousSystem._xa (simulation data), or the EndogenousSystem._past.

Returns:

Returns a DataArray with the output_var, properly indexed.

Return type:

xarray.DataArray

Raises:

NotImplementedError – fun must be “mean” or “sum”. Any other aggregation function is not supported.
NotImplementedError – window_type must be “normal”, “span”, “com”, “halflife”, or “alpha”. Other types are not supported.

class endogen.variables.VariableSingleEdge(output_var: str, subset: int, input_var: str)[source]#

A variable class that holds the information necessary to represent a variable model or transformation in the simulation system. Not used for statistical models (see config.InputModel). Helper class that I will probably regret. Used for all Variables that only take one input variable.

Parameters:

output_var (str) – Name of the output variable
subset (int) – Endogenous simulation requires that all variables are fully specified in a circular fashion. At the same time, there cannot be any circular definitions in the transformation step, nor in the forecast step. If subset == 0, the variable is estimated/calculated in the transformation step; if subset == 1, it is estimated/calculated in the forecast step.
input_var (str) – Name of the input variable

property edges: List[Tuple[str, str]]#

An edge representation (list of tuples) that interfaces well with NetworkX graphs.

Returns: A graph edge describing the link between the input_var and the output_var.
Return type: List[Tuple[str, str]]

class endogen.variables.VariableTransform(output_var: str, subset: int, formula: str, input_vars: ~typing.List[str] = <factory>)[source]#

A class holding the information necessary to do a mathematical transformation of input-variables to calculate a new output variable. This class only supports transformations that are static in time, e.g., log-transformations, multiplications, etc. It is using formulae.design_matrices() to do the transformation.

Caution! The Wilkinson formula language has some gotchas. For instance, “var1 + var2” will give you a matrix with two columns, not a matrix with one column being the sum of var1 and var2. To achieve the latter, you need to write “I(var1 + var2)”. Any numpy (“np.”) and scipy (“scipy.”) transformative function is in principle also supported. E.g., “np.add(var1, var2)” would achieve the same. It is also important to note the difference between “var1:var2” (the interaction term only) and “var1*var2” (both main effects plus the interaction).

Warning! Currently, there is nothing stopping you from making a design matrix with multiple columns. This is not supported.

Parameters:

output_var (str) – Name of the output variable
subset (int) – Endogenous simulation requires that all variables are fully specified in a circular fashion. At the same time, there cannot be any circular definitions in the transformation step, nor in the forecast step. If subset == 0, the variable is estimated/calculated in the transformation step; if subset == 1, it is estimated/calculated in the forecast step.
formula (str) – A Wilkinson formula supported by formulae. See https://bambinos.github.io/formulae/notebooks/getting_started.html#User-guide.
input_vars (list[str]) – List with names of the input variables.

calc(xd: Dataset) → DataArray[source]#

A variable transformation function that takes input data that includes input-variables required to calculate the output_var.

Parameters: xd (xarray.Dataset) – An xarray.Dataset, often EndogenousSystem._xa (simulation data), or the EndogenousSystem._past.
Returns: A DataArray with the output_var, properly indexed.
Return type: xarray.DataArray

property edges: List[Tuple[str, str]]#

The edges between input variables and the transformed output variable that interface well with NetworkX graphs.

Returns: A list of graph edges describing the links between the input_vars and the output_var.
Return type: List[Tuple[str, str]]

endogen.adapter_mlforecast.forecast_mlforecast(t: int, s: int, model: MLForecast, xdata: Dataset, pnames: PanelUnits, output_var: str, input_vars: list[str], levels=list[float]) → DataFrame[source]#

A prediction function adapter for MLForecast fitted with PredictionIntervals drawing predictions from the (approx./stepwise) full predictive distribution.

Parameters:

t (int) – Time index to forecast
s (int) – Simulation index to forecast
model (MLForecast) – The MLForecast object
xdata (xarray.Dataset) – The input data used in forecasting
pnames (PanelUnits) – The internal index naming convention used.
output_var (str) – The output variable
input_vars (list[str]) – The list of input variables
levels (list[float]) – The list of prediction intervals that the MLForecast model has been fitted with.

Returns:

A properly indexed pandas.DataFrame with a single draw from the full predictive distribution for all units at time t.

Return type:

pd.DataFrame

endogen.adapter_mlforecast.percentile_hi_lo(interval: float, type: Literal['lo', 'hi']) → float[source]#

Calculate lo/hi percentile from predictive interval

Parameters:

interval (float) – The predictive interval in percent.
type (Literal["lo", "hi"]) – Whether to return the low or high percentile from the interval.

Returns:

A percentile expressed as a fraction between 0 and 1.

Return type:

float
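As a rough illustration of the conversion, the sketch below assumes that the interval is central (equal-tailed), so that a 90% interval runs from the 5th to the 95th percentile. The function percentile_from_interval is a hypothetical stand-in written for this example and may differ from the actual implementation of percentile_hi_lo.

```python
from typing import Literal


def percentile_from_interval(interval: float, kind: Literal["lo", "hi"]) -> float:
    """Map a central prediction interval (in percent) to a 0-1 percentile."""
    if not 0 <= interval <= 100:
        raise ValueError("interval must be given in percent, between 0 and 100")
    # Probability mass left outside the interval on each side.
    tail = (100.0 - interval) / 2.0 / 100.0
    if kind == "lo":
        return tail
    if kind == "hi":
        return 1.0 - tail
    raise ValueError("kind must be 'lo' or 'hi'")


print(percentile_from_interval(90, "lo"), percentile_from_interval(90, "hi"))
```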

endogen.adapter_mlforecast.setup_mlforecast_bins(model: MLForecast, levels=list[float]) → tuple[dict[str, str], list[float]][source]#

A helper function to rename columns from MLForecast models with PredictionIntervals. These are named in terms of the prediction interval and a lo/hi indicator. This function converts this to percentiles and returns a dictionary for easy renaming in Pandas.

The idea here is to create an equally binned histogram over the percentiles in a prediction distribution, finding the prediction at the middle of each bin.

Parameters:

model (MLForecast) – A mlforecast.forecast.MLForecast model object
levels (list[float]) – A list of prediction intervals. E.g., [50, 90] gives the 50% and 90% prediction interval

Returns:

A tuple with a dictionary with the renaming scheme for use in Pandas and the list of percentiles.

Return type:

tuple[dict[str, str], list[float]]
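The renaming idea can be sketched without MLForecast. The example assumes that interval columns are labelled "{model}-lo-{level}" and "{model}-hi-{level}", and that each interval is central. Both are assumptions of this sketch, and interval_column_renames is a hypothetical helper, not the function documented above.

```python
def interval_column_renames(
    model_name: str, levels: list[float]
) -> tuple[dict[str, str], list[float]]:
    """Map lo/hi interval column names to percentile labels."""
    renames: dict[str, str] = {}
    percentiles: list[float] = []
    for level in levels:
        tail = (100.0 - level) / 200.0  # mass outside the interval, per side
        for kind, pct in (("lo", tail), ("hi", 1.0 - tail)):
            pct = round(pct, 6)  # avoid floating-point noise in the labels
            renames[f"{model_name}-{kind}-{level}"] = str(pct)
            percentiles.append(pct)
    return renames, sorted(percentiles)


renames, percentiles = interval_column_renames("lgbm", [50, 90])
print(renames)
print(percentiles)
```

The returned dictionary has the shape expected by pandas.DataFrame.rename(columns=...), and the sorted percentiles could then serve as the bin midpoints of the stepwise predictive distribution described above.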

Misc utilities#

endogen.data_utilities.compare_to_most_recent(df: DataFrame, time_var: str, unit_var: str, alternative_time_comparison: int | None = None) → tuple[list[tuple[int, Set[int]]], list[tuple[int, Set[int]]]][source]#

Compares all temporal cross-sections with the most recent one, and finds the superfluous and missing units. Superfluous units are units found in another cross-section but not in the most recent one. Missing units are units found in the most recent cross-section but not in another cross-section.

Parameters:

df (pd.DataFrame) – A dataframe with panel data
time_var (str) – A time-index variable of integer type (e.g., year).
unit_var (str) – A unit-index variable of integer type (e.g., gwcode)
alternative_time_comparison (int) – An alternative comparison period to use instead of the most recent. Must be an integer that is found in the time_var column.

Returns:

Two lists of (time, set of units) pairs, one for the superfluous units and one for the missing units.

Return type:

tuple[list[tuple[int, Set[int]]], list[tuple[int, Set[int]]]]
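The comparison logic can be sketched with plain sets. The helper compare_to_reference is hypothetical and works on (time, unit) pairs rather than a DataFrame. The sketch only reports time periods with a non-empty difference, and it returns superfluous units first and missing units second; both choices are assumptions of this example.

```python
from collections import defaultdict


def compare_to_reference(rows, reference_time=None):
    """Compare each cross-section of (time, unit) rows with a reference one."""
    units_by_time = defaultdict(set)
    for time, unit in rows:
        units_by_time[time].add(unit)
    if reference_time is None:
        reference_time = max(units_by_time)  # the most recent cross-section
    elif reference_time not in units_by_time:
        raise ValueError("reference_time must be present in the data")
    reference = units_by_time[reference_time]

    superfluous, missing = [], []
    for time in sorted(units_by_time):
        if time == reference_time:
            continue
        extra = units_by_time[time] - reference    # here, but not in reference
        absent = reference - units_by_time[time]   # in reference, but not here
        if extra:
            superfluous.append((time, extra))
        if absent:
            missing.append((time, absent))
    return superfluous, missing


rows = [(2000, 1), (2000, 2), (2000, 9), (2001, 1), (2002, 1), (2002, 2)]
superfluous, missing = compare_to_reference(rows)
print(superfluous, missing)
```

Here unit 9 only exists in 2000, so it is superfluous relative to 2002, while unit 2 is absent in 2001 and therefore counted as missing.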

endogen.data_utilities.drop_missing_units(df: DataFrame, time_var: str, unit_var: str, alternative_time_comparison: int | None = None)[source]#

Drop missing units from the dataset.

Parameters:

df (pd.DataFrame) – A dataframe with panel data
time_var (str) – A time-index variable of integer type (e.g., year).
unit_var (str) – A unit-index variable of integer type (e.g., gwcode)
alternative_time_comparison (int) – An alternative comparison period to use instead of the most recent. Must be an integer that is found in the time_var column.

Returns:

A dataframe where years with missing units of observation, compared to the latest time-period, are dropped.

Return type:

pd.DataFrame

endogen.data_utilities.drop_superfluous(df: DataFrame, time_var: str, unit_var: str, alternative_time_comparison: int | None = None) → DataFrame[source]#

Drop superfluous units from the dataset.

Parameters:

df (pd.DataFrame) – A dataframe with panel data
time_var (str) – A time-index variable of integer type (e.g., year).
unit_var (str) – A unit-index variable of integer type (e.g., gwcode)
alternative_time_comparison (int) – An alternative comparison period to use instead of the most recent. Must be an integer that is found in the time_var column.

Returns:

A dataframe where units of observation that are not in the latest time-period are dropped.

Return type:

pd.DataFrame

endogen.data_utilities.generate_comparison_report(df: DataFrame, time_var: str, unit_var: str, alternative_time_comparison: int | None = None) → DataFrame[source]#

A report that is useful for understanding how to build a complete and balanced dataset from input panel data.

Parameters:

df (pd.DataFrame) – A dataframe with panel data
time_var (str) – A time-index variable of integer type (e.g., year).
unit_var (str) – A unit-index variable of integer type (e.g., gwcode)
alternative_time_comparison (int) – An alternative comparison period to use instead of the most recent. Must be an integer that is found in the time_var column.

Returns:

A dataframe with one observation per time-unit, indicating the superfluous and missing units across time.

Return type:

pd.DataFrame

endogen.data_utilities.read_input_data(input_data: str | PathLike | DataFrame) → DataFrame[source]#

Reads input data to be used in estimation of statistical models and as initial values for simulation.

Parameters:

input_data (str | os.PathLike | pd.DataFrame) – Path to input data or a pandas.DataFrame object. Supports .csv and .parquet files.

Return type:

pandas.DataFrame

Raises:

NotImplementedError – Currently only supports .csv and .parquet files.
ValueError – Input data must be pandas.DataFrame if not path to .csv or .parquet file.

class endogen.utilities.PanelUnits(time_var: str, unit_var: str)[source]#

Utility class for describing panel-unit names.

The user might want to use different naming conventions than we do under-the-hood. Since we use nixtla for parts of our tasks, we use their naming conventions (“ds” for time and “unique_id” for units).

Parameters:

time_var (str) – The name of the time variable as given by the user
unit_var (str) – The name of the unit variable as given by the user

property external_index: Sequence[str]#: Return the index sequence. Useful to get a consistent use across code.

property internal_index: Sequence[str]#: Return the index sequence. Useful to get a consistent use across code.

to_dict(inv: bool = False) → Mapping[str, str][source]#

Return a mapping of user-provided and internal ids

Parameters: inv (bool) – Invert the mapping. {internal: user} instead of {user: internal}
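A minimal sketch of the mapping, using the internal names “ds” and “unique_id” mentioned above. The class PanelNames is a hypothetical stand-in written for this example and is not the PanelUnits implementation.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class PanelNames:
    """Hypothetical stand-in that pairs user-facing and internal index names."""

    time_var: str
    unit_var: str

    def to_dict(self, inv: bool = False) -> Mapping[str, str]:
        # {user: internal}, following the naming convention described above.
        mapping = {self.time_var: "ds", self.unit_var: "unique_id"}
        if inv:
            # {internal: user}
            return {internal: user for user, internal in mapping.items()}
        return mapping


names = PanelNames(time_var="year", unit_var="gwcode")
print(names.to_dict())
print(names.to_dict(inv=True))
```

A mapping of this shape can be passed to pandas.DataFrame.rename(columns=...) before handing data to the forecasting backend, and the inverted mapping can be used to restore the user's names afterwards.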

endogen.tools.flatten_recursive_generator(lst: List[Any]) → Iterable[Any][source]#: Flatten a list using recursion.
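For illustration, a recursive generator that flattens nested lists can be written as below. This is a sketch of the general technique under the name flatten, not the source of flatten_recursive_generator. It only descends into list instances, so strings and tuples are yielded as-is.

```python
from typing import Any, Iterable, List


def flatten(lst: List[Any]) -> Iterable[Any]:
    """Yield the leaves of an arbitrarily nested list, depth first."""
    for item in lst:
        if isinstance(item, list):
            # Recurse into the nested list and re-yield its leaves.
            yield from flatten(item)
        else:
            yield item


nested = ["a", ["b", ["c", "d"]], "ef"]
print(list(flatten(nested)))
```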