Analyze global clustering
The global clustering performed along the genome and jointly across samples is a crucial feature of HATCHet, and the quality of the final results is strongly affected by the quality of the clustering. This global clustering is performed by HATCHet’s component cluster-bins, whose default values are suitable for many datasets. However, for ideal results on specific datasets these parameters may need to be modified.
The module cluster-bins incorporates genomic position to improve clustering using a Gaussian hidden Markov model (GHMM), as opposed to the position-agnostic Gaussian mixture model (GMM) used in cluster-bins-gmm and described in the original HATCHet publication. This page describes how to tune the parameters of cluster-bins; for recommendations on cluster-bins-gmm, see this page instead.
The user should validate the results of the clustering, especially in noisy or suspicious cases, through the cluster figures produced by plot-bins and plot-bins-1d2d. More specifically, we suggest the following criteria to evaluate the clustering:
Every pair of clusters should be clearly distinct in terms of RDR or BAF in at least one sample, and
Each cluster should contain regions with similar values of RDR and BAF in all samples.
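The two criteria above can also be checked numerically as a complement to the plots. The following is a minimal sketch, not part of HATCHet: the data layout (a dictionary mapping each cluster to its bins, each bin holding one (RDR, BAF) pair per sample), the function names, and the tolerances rdr_tol and baf_tol are all hypothetical and would need to be adapted to the dataset.

```python
from itertools import combinations
from statistics import mean, pstdev

def cluster_means(bins_by_cluster):
    """Per-cluster list of mean (RDR, BAF) pairs, one pair per sample."""
    means = {}
    for cid, bins in bins_by_cluster.items():
        n_samples = len(bins[0])
        means[cid] = [
            (mean(b[s][0] for b in bins), mean(b[s][1] for b in bins))
            for s in range(n_samples)
        ]
    return means

def indistinct_pairs(means, rdr_tol=0.1, baf_tol=0.05):
    """Cluster pairs whose means differ by less than the (hypothetical)
    tolerances in RDR and BAF in every sample (criterion 1 violated)."""
    pairs = []
    for a, b in combinations(sorted(means), 2):
        distinct = any(
            abs(ra - rb) >= rdr_tol or abs(ba - bb) >= baf_tol
            for (ra, ba), (rb, bb) in zip(means[a], means[b])
        )
        if not distinct:
            pairs.append((a, b))
    return pairs

def max_spread(bins_by_cluster):
    """Largest per-sample standard deviation of RDR and of BAF within
    each cluster (large values suggest criterion 2 is violated)."""
    spread = {}
    for cid, bins in bins_by_cluster.items():
        n_samples = len(bins[0])
        spread[cid] = (
            max(pstdev([b[s][0] for b in bins]) for s in range(n_samples)),
            max(pstdev([b[s][1] for b in bins]) for s in range(n_samples)),
        )
    return spread

# Toy data: 3 clusters, 2 bins each, 2 samples per bin.
example = {
    0: [[(1.00, 0.50), (1.00, 0.50)], [(1.02, 0.49), (0.98, 0.50)]],
    1: [[(1.01, 0.50), (1.50, 0.33)], [(0.99, 0.49), (1.52, 0.34)]],
    2: [[(1.00, 0.49), (1.00, 0.50)], [(1.02, 0.50), (1.00, 0.49)]],
}
suspect = indistinct_pairs(cluster_means(example))
spreads = max_spread(example)
print("pairs that may need merging:", suspect)
print("within-cluster spread (RDR, BAF):", spreads)
```

In this toy input, clusters 0 and 2 have nearly identical means in both samples and are flagged, while cluster 1 is kept apart because it differs in RDR in the second sample only.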
cluster-bins offers several parameters that can be used to tune the clustering.
Number of clusters
By default, cluster-bins tries several possible values for the number K of clusters and selects the one that maximizes the silhouette score. In practice, this tends to underestimate the number of clusters that are visually apparent. This behavior can be changed by either:
Setting the parameters --minK and --maxK, which specify the minimum and maximum number of clusters to consider, or
Setting the parameter --exactK to fix the number of clusters to a given value.
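To make the selection rule concrete, the sketch below computes the mean silhouette score for two candidate labelings of a toy set of bins and picks the K with the highest score. It is an illustration under stated assumptions, not HATCHet's code: the silhouette is reimplemented here with Euclidean distance, and the candidate labelings are written by hand, whereas cluster-bins obtains them by fitting its model for each K.

```python
from math import dist
from statistics import mean

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so it only affects the count)
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            mean(dist(point, q) for q in members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(scores)

# Toy (RDR, BAF) bins forming three visible groups, two of them close together.
bins = [(1.0, 0.5), (1.1, 0.5), (1.3, 0.5), (1.4, 0.5), (3.0, 0.5), (3.1, 0.5)]
candidates = {
    2: [0, 0, 0, 0, 1, 1],  # the two nearby groups merged
    3: [0, 0, 1, 1, 2, 2],  # each visible group on its own
}
scores = {k: silhouette_score(bins, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores, "selected K:", best_k)
```

On this toy input the merged labeling scores higher (roughly 0.90 for K = 2 against 0.75 for K = 3), which is one way the silhouette criterion can favor fewer clusters than are visually apparent. In such a case, raising --minK or setting --exactK forces the finer clustering.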
Model parameters
Some parameters of the model can be tuned to change the clustering. The most useful one is --tau, which corresponds to the probability of transitioning between different copy-number states (i.e., the initial value for off-diagonal entries in the transition matrix). In practice, tau controls the balance between global information (RDR and BAF across samples) and local information (assigning adjacent bins to the same cluster): smaller values of tau put more weight on local information, and larger values of tau put more weight on global information.
It may be appropriate to reduce tau by several orders of magnitude for noisier or lower-coverage datasets, as in this case the global RDR/BAF values are less reliable.
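The effect of tau on local smoothing can be illustrated with a small calculation. The sketch below builds a K x K transition matrix with tau in every off-diagonal entry and reports the expected number of consecutive bins spent in one state. The function names are hypothetical, the diagonal value 1 - (K - 1) * tau is an assumption made here so that each row sums to 1, and the tau values are illustrative only; HATCHet's internal construction may differ.

```python
def fixed_transition_matrix(n_states, tau):
    """K x K matrix with tau off the diagonal.

    Assumption: the diagonal is 1 - (K - 1) * tau so that rows sum to 1.
    """
    stay = 1.0 - (n_states - 1) * tau
    if not 0.0 < stay <= 1.0:
        raise ValueError("tau is too large for this number of states")
    return [
        [stay if i == j else tau for j in range(n_states)]
        for i in range(n_states)
    ]

def expected_run_length(n_states, tau):
    """Expected number of consecutive bins in one state.

    The run length is geometric with leaving probability (K - 1) * tau.
    """
    return 1.0 / ((n_states - 1) * tau)

K = 5  # illustrative number of clusters
for tau in (1e-3, 1e-6, 1e-9):  # illustrative values only
    matrix = fixed_transition_matrix(K, tau)
    print(
        f"tau={tau:g}: stay probability={matrix[0][0]:.9f}, "
        f"expected run length={expected_run_length(K, tau):.3g} bins"
    )
```

Under these assumptions, each reduction of tau by a factor of 1000 lengthens the expected run by the same factor, so the model increasingly prefers to keep adjacent bins in the same cluster even when their RDR/BAF values are noisy.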
Other parameters (below) are available to change the structure of the model, although in practice I have not found them particularly helpful in tuning the clustering.
Name | Description | Usage | Default |
---|---|---|---|
-t, --transmat | Type of transition matrix to infer | fixed (to off-diagonal = tau), diag (all diagonal elements are equal, all off-diagonal elements are equal) or full (freely varying) | diag |
-c, --covar | Type of covariance matrix to infer | options described in hmmlearn documentation | diag |
-x, --decoding | Decoding algorithm to use to infer final estimates of states | map for MAP inference, viterbi for Viterbi algorithm | map |