
Statsmodels In Python Package, How Exactly Are Duplicated Features Handled?

I am a heavy R user and have recently started learning Python. I have a question about how statsmodels.api handles duplicated features. In my understanding, this function is a Python version

Solution 1:

The short answer:

GLM uses the Moore-Penrose generalized inverse, pinv, in this case. This corresponds to a principal component regression in which components with zero eigenvalues are dropped. A component counts as zero when its singular value falls below the default cutoff (rcond, taken relative to the largest singular value) in numpy.linalg.pinv.
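A minimal NumPy sketch of this mechanism, assuming a synthetic design matrix with one exactly duplicated column (the data, seed, and variable names are illustrative and not taken from the question). It suggests that the pseudoinverse returns the minimum-norm least squares solution, which splits the coefficient evenly across the two identical columns:

```python
import numpy as np

# Synthetic data (assumption): intercept, one feature, and an exact copy of it.
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x, x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=n)

# The duplicate makes the design rank deficient: 3 columns, numerical rank 2.
rank = np.linalg.matrix_rank(X)

# Minimum-norm least squares solution via the pseudoinverse (default rcond).
beta = np.linalg.pinv(X) @ y

# Reference fit on the full-rank design without the duplicated column.
beta_ref, *_ = np.linalg.lstsq(X[:, :2], y, rcond=None)

print("rank:", rank)
print("pinv coefficients:", beta)
print("reference coefficients:", beta_ref)
```

The two duplicated columns should receive equal coefficients whose sum matches the slope of the reference fit, so the fitted values are the same; only the individual coefficients are not identified.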

statsmodels does not have a systematic policy towards collinearity. Some nonlinear optimization routines raise an exception when the matrix inverse fails. However, the linear regression models, OLS and WLS, use the generalized inverse by default, which produces the behavior described above.

The default optimization algorithm in GLM.fit is iteratively reweighted least squares (IRLS), which uses WLS and therefore inherits the default behavior of WLS for singular design matrices. The version in statsmodels master also has the option of using the standard scipy optimizers, in which case the behavior with respect to singular or near-singular design matrices depends on the details of the optimization algorithm.

