A common problem in building statistical models is deciding which features to include. The literature offers several approaches, but there is no consensus on which is best. Examples include the lasso and exhaustively fitting every possible combination of predictors. Another option is stepwise search.
The more parameters a model has, the better it fits the training data. But a model that is too complex performs worse on unseen data. The Akaike information criterion (AIC) strikes a balance between fitting the training data well and keeping the model simple; lower values are better.
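For reference, the text does not spell out the formula, so the standard definition of AIC is given here. The symbols \(k\) (the number of estimated parameters) and \(\hat{L}\) (the maximized value of the model's likelihood) are notation introduced for this note, not taken from the original. \[\mathrm{AIC} = 2k - 2\ln(\hat{L})\] The first term penalizes complexity and the second rewards fit, so adding a feature lowers AIC only if the gain in log-likelihood exceeds 1.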
A forward stepwise search using AIC starts with an intercept-only model that has no features. \[g(Y) = \beta_0\] Then each feature is considered on its own. If there are 10 features, there are 10 one-feature models under consideration. AIC is calculated for each model, and the model with the lowest AIC is selected. Suppose X1 is selected. \[g(Y) = \beta_1X_1 + \beta_0\]
After the first feature is selected, each of the 9 remaining features is considered as an addition to the current model. The one that yields the lowest AIC is selected, creating a two-feature model. Suppose X3 is selected in this round. \[g(Y) = \beta_3X_3 + \beta_1X_1 + \beta_0\]
The procedure repeats, adding one feature per round, and stops when no remaining feature lowers the AIC.
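The search described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the results here: the function name `forward_stepwise` and the `aic` callback are assumptions introduced for this example. The callback is expected to fit a model on the given features (an empty tuple meaning intercept only) and return its AIC.

```python
from typing import Callable, List, Sequence, Tuple


def forward_stepwise(
    candidates: Sequence[str],
    aic: Callable[[Tuple[str, ...]], float],
) -> Tuple[List[str], float]:
    """Greedy forward selection: add one feature per round while AIC drops.

    `aic` is a hypothetical callback that fits a model on the given
    features (an empty tuple means intercept only) and returns its AIC.
    """
    selected: List[str] = []
    remaining = list(candidates)
    best_aic = aic(())  # start from the intercept-only model

    while remaining:
        # Score every model that adds exactly one remaining feature.
        scored = [(aic(tuple(selected + [f])), f) for f in remaining]
        round_aic, round_feature = min(scored)

        # Stop when even the best addition fails to lower the AIC.
        if round_aic >= best_aic:
            break

        selected.append(round_feature)
        remaining.remove(round_feature)
        best_aic = round_aic

    return selected, best_aic
```

With 10 candidate features, the first round scores 10 models and the second scores 9, matching the walk-through above. Keeping the model fit behind a callback means the same loop works with whatever fitting routine reports an AIC.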
How well does stepwise search work when some of the variables are unrelated to the response? Is a large amount of data needed to find the correct variables? The tests below run stepwise search in a variety of settings to answer these questions.
Stepwise search provides a computationally fast way to select features. When half the features were unrelated, the search found the correct model for both small and large n. When the majority of features were unrelated, the search found all related features but also erroneously selected a few unrelated ones.