paste3.glmpca.glmpca
- paste3.glmpca.glmpca(Y, L, fam='poi', ctl=None, penalty=1, init=None, nb_theta=100, X=None, Z=None, sz=None)[source]
GLM-PCA

This function implements the GLM-PCA dimensionality reduction method for high-dimensional count data. The basic model is R = AX'+ZG'+VU', where E[Y] = M = linkinv(R). Regression coefficients are A and G, latent factors are U, and loadings are V. The objective function being optimized is the deviance between Y and M, plus an L2 (ridge) penalty on U and V. Note that glmpca uses a random initialization, so for fully reproducible results one should set the random seed.

- Parameters:
  - Y (array_like) -- count data with features as rows and observations as columns.
  - L (int) -- the desired number of latent dimensions.
  - fam (str) -- string describing the likelihood to use for the data. Possible values include:
    - poi: Poisson
    - nb: negative binomial
    - mult: binomial approximation to multinomial
    - bern: Bernoulli
  - ctl (Optional[dict]) -- a dictionary of control parameters for optimization. Valid keys:
    - maxIter: an integer, maximum number of iterations
    - eps: a float, maximum relative change in deviance tolerated for convergence
    - optimizeTheta: a bool, indicating whether the overdispersion parameter of the NB distribution is optimized (default) or fixed to the value provided in nb_theta
  - penalty -- the L2 penalty for the latent factors (default = 1). Regression coefficients are not penalized.
  - init (Optional[dict]) -- a dictionary containing initial estimates for the factors (U) and loadings (V) matrices.
  - nb_theta -- the negative binomial shape parameter; as nb_theta goes to infinity, this is equivalent to Poisson. Note that the alpha in the statsmodels package is 1/nb_theta. If ctl["optimizeTheta"] is True, this is used as the initial value for optimization.
  - X (array_like) -- column (observation) covariates. Any column with all of the same values (e.g. 1 for intercept) will be removed. This is because we force the intercept and want to avoid collinearity.
  - Z (array_like) -- row (feature) covariates, usually not needed.
  - sz -- numeric vector of size factors to use in place of total counts.
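The objective described above (for fam='poi') is the Poisson deviance between the counts Y and the fitted means M, plus the ridge penalty on U and V. The following is a minimal numpy sketch of that deviance formula for reference; it is not the package's internal implementation:

```python
import numpy as np

def poisson_deviance(Y, M):
    """Poisson deviance between counts Y and fitted means M = linkinv(R).

    D = 2 * sum( y*log(y/mu) - (y - mu) ), with the y*log(y/mu) term
    taken as 0 where y == 0. This is the quantity GLM-PCA minimizes
    (plus an L2 penalty on U and V) in the Poisson case.
    """
    Y = np.asarray(Y, dtype=float)
    M = np.asarray(M, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        # np.where evaluates both branches; the invalid 0*log(0/mu)
        # entries are discarded in favor of 0.0.
        term = np.where(Y > 0, Y * np.log(Y / M), 0.0)
    return 2.0 * np.sum(term - (Y - M))
```

The deviance is zero exactly when M equals Y and strictly positive otherwise, which is why a well-behaved optimizer should drive it downward over iterations.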
- Returns:
A dictionary with the following elements
- factors (an array U whose rows match the columns (observations) of Y. It is analogous to the principal components in PCA. Each column of the factors array is a different latent dimension.)
- loadings (an array V whose rows match the rows (features/dimensions) of Y. It is analogous to loadings in PCA. Each column of the loadings array is a different latent dimension.)
- coefX (an array A of coefficients for the observation-specific covariates array X. Each row of coefX corresponds to a row of Y and each column corresponds to a column of X. The first column of coefX contains feature-specific intercepts which are included by default.)
- coefZ (a array G of coefficients for the feature-specific covariates array Z. Each row of coefZ corresponds to a column of Y and each column corresponds to a column of Z. By default no such covariates are included and this is returned as None.)
- dev (a vector of deviance values. The length of the vector is the number of iterations it took for GLM-PCA's optimizer to converge. The deviance should generally decrease over time. If it fluctuates wildly, this often indicates numerical instability, which can be improved by increasing the penalty parameter.)
- glmpca_family (an object of class GlmpcaFamily. This is a minor wrapper to the family object used by the statsmodels package for fitting standard GLMs. It contains various internal functions and parameters needed to optimize the GLM-PCA objective function. For the negative binomial case, it also contains the final estimated value of the dispersion parameter nb_theta.)
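The note on the dev element above suggests a simple post-fit sanity check. This sketch is an illustrative heuristic (not part of the package): it flags a deviance trace that rises by more than a relative tolerance at any step, which per the docs may indicate numerical instability best addressed by increasing penalty. A real trace would come from res["dev"]; the arrays below are made up for illustration:

```python
import numpy as np

def deviance_looks_stable(dev, tol=0.05):
    """Heuristic check of a GLM-PCA deviance trace.

    Returns False if any iteration increased the deviance by more than
    `tol` (relative to the previous value), a possible sign of
    numerical instability (consider a larger `penalty`).
    """
    dev = np.asarray(dev, dtype=float)
    rel_change = np.diff(dev) / np.abs(dev[:-1])
    return bool(np.all(rel_change < tol))

# Made-up traces for illustration only:
smooth = [100.0, 60.0, 45.0, 44.0, 43.9]   # steadily decreasing
wild = [100.0, 60.0, 95.0, 40.0, 80.0]     # fluctuating wildly
```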
Examples
1) Create a simple dataset with two clusters and visualize the latent structure

>>> from numpy import array, exp, random, repeat
>>> from matplotlib.pyplot import scatter
>>> from glmpca import glmpca
>>> mu = exp(random.randn(20, 100))
>>> mu[range(10), :] *= exp(random.randn(100))
>>> clust = repeat(["red", "black"], 10)
>>> Y = random.poisson(mu)
>>> res = glmpca(Y.T, 2)
>>> factors = res["factors"]
>>> scatter(factors[:, 0], factors[:, 1], c=clust)