Analyze global clustering¶

The global clustering performed along the genome and jointly across samples is a crucial feature of HATCHet and the quality of the final results is strongly affected by the quality of the clustering. This global clustering is performed by HATCHet’s component cluster-bins, whose default values are suitable for many datasets. However, for ideal results on specific datasets these parameters may need to be modified.

The module cluster-bins incorporates genomic position to improve clustering using a Gaussian hidden Markov model (GHMM), as opposed to the position-agnostic Gaussian mixture model (GMM) used in cluster-bins-gmm and described in the original HATCHet publication. This page describes how to tune the parameters of cluster-bins – for recommendations on cluster-bins-gmm, see this page instead.

The user should validate the results of the clustering, especially in noisy or suspicious cases, through the cluster figures produced by plot-bins and plot-bins-1d2d. More specifically, we suggest the following criteria to evaluate the clustering:

Every pair of clusters should be clearly distinct in terms of RDR or BAF in at least one sample, and
Each cluster should contain regions with similar values of RDR and BAF in all samples

cluster-bins offers several parameters that can be used to tune the clustering.

Number of clusters¶

By default, cluster-bins tries several possible values for the number K of clusters and selects the one that maximizes the silhouette score. In practice, this tends to underestimate the number of clusters that are visually apparent. This can be modified by

Setting the parameters --minK and --maxK which specify the minimum and maximum number of clusters to consider, or
Setting the parameter --exactK to fix the number of clusters to a given value.

Model parameters¶

Some parameters of the model can be tuned to change the clustering. The most useful one is the value --tau, which corresponds to the probability of transitioning between different copy-number states (i.e., the initial value for off-diagonal entries in the transition matrix). In practice, tau controls the balance between global information (RDR and BAf across samples) and local information (assigning adjacent bins to the same cluster): smaller values of tau put more weight on local information, and larger values of tau put more weight on global information. It may be appropriate to reduce tau by several orders of magnitude for noisier or lower-coverage datasets, as in this case the global RDR/BAF values are less reliable.

Other parameters (below) are available to change the structure of the model, although in practice I have not found them particularly helpful in tuning the clustering.

Name	Description	Usage	Default
`-t`, `--transmat`	Type of transition matrix to infer	`fixed` (to off-diagonal = tau), `diag` (all diagonal elements are equal, all off-diagonal elements are equal) or `full` (freely varying)	`diag`
`-c`, `--covar`	Type of covariance matrix to infer	options described in hmmlearn documentation	`diag`
`-x`, `--decoding`	Decoding algorithm to use to infer final estimates of states	`map` for MAP inference, `viterbi` for Viterbi algorithm	`map`