cluster-bins
This step globally clusters genomic bins along the entire genome and jointly across tumor samples.
cluster-bins clusters bins while also taking into account their locations on the genome to preferentially form clusters that correspond to contiguous genomic segments on chromosome arms.
The input/output files for cluster-bins are exactly the same as those for cluster-bins-gmm.
Input
cluster-bins takes as input a tab-separated file with the following fields.
| Field | Description |
|---|---|
| CHR | Name of a chromosome |
| START | Starting genomic position of a genomic bin in CHR |
| END | Ending genomic position of a genomic bin in CHR |
| SAMPLE | Name of a tumor sample |
| RD | RDR of the bin in SAMPLE |
| #SNPS | Number of SNPs present in the bin in SAMPLE |
| COV | Average coverage in the bin in SAMPLE |
| ALPHA | Alpha parameter related to the binomial model of BAF for the bin in SAMPLE, typically total number of reads from A allele |
| BETA | Beta parameter related to the binomial model of BAF for the bin in SAMPLE, typically total number of reads from B allele |
| BAF | BAF of the bin in SAMPLE |
The fields #SNPS, COV, ALPHA, and BETA are currently deprecated and their values are ignored.
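As an illustration of the input layout, the following minimal Python sketch parses such a file and keeps only the fields that are not deprecated. It assumes the file starts with a header row whose column names match the table above (a leading `#` on the first column name is tolerated); the function name `read_bins` and the two example rows are made up for this sketch and are not part of HATCHet.

```python
import csv
import io


def read_bins(handle):
    """Yield one dict per bin, keeping only CHR, START, END, SAMPLE, RD and BAF."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # Assumption: the first header cell may be written as "#CHR".
    header[0] = header[0].lstrip("#")
    for values in reader:
        row = dict(zip(header, values))
        yield {
            "CHR": row["CHR"],
            "START": int(row["START"]),
            "END": int(row["END"]),
            "SAMPLE": row["SAMPLE"],
            "RD": float(row["RD"]),
            "BAF": float(row["BAF"]),
        }


# Made-up rows for illustration only; real input comes from the preceding step.
example = (
    "CHR\tSTART\tEND\tSAMPLE\tRD\t#SNPS\tCOV\tALPHA\tBETA\tBAF\n"
    "chr1\t0\t50000\ttumor1\t1.02\t12\t30.5\t180\t200\t0.47\n"
    "chr1\t50000\t100000\ttumor1\t0.98\t9\t29.8\t150\t170\t0.46\n"
)
bins = list(read_bins(io.StringIO(example)))
```

To read a real file, pass an open file handle instead of the `io.StringIO` object.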
Output
cluster-bins produces two tab-separated files:
1. A file of clustered genomic bins, specified by the flag -O, --outbins. This tab-separated file has the same fields as the input plus a final field CLUSTER, which specifies the name of the corresponding cluster.
2. A file of clusters, specified by the flag -o, --outsegments. This tab-separated file has the following fields.
| Field | Description |
|---|---|
| ID | The name of a cluster |
| SAMPLE | The name of a sample |
| #BINS | The number of bins included in ID |
| RD | The RDR of the cluster ID in SAMPLE |
| #SNPS | The total number of SNPs in the cluster ID |
| COV | The average coverage in the cluster ID |
| ALPHA | The alpha parameter of the binomial model for the BAF of the cluster ID |
| BETA | The beta parameter of the binomial model for the BAF of the cluster ID |
| BAF | The BAF of the cluster ID in SAMPLE |
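The per-cluster BAF values in this file are useful for judging how far each cluster lies from 0.5. The Python sketch below computes that shift for every (ID, SAMPLE) row and lists the rows that fall within a chosen threshold. It is only an inspection aid under stated assumptions: the file is assumed to have a header row matching the table above, the function names and example rows are hypothetical, and the comparison is not claimed to reproduce how cluster-bins itself applies its threshold.

```python
import csv
import io


def baf_shifts(handle):
    """Return {(cluster ID, sample): |0.5 - BAF|} for each row of a segments table."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # Assumption: the first header cell may be written as "#ID".
    header[0] = header[0].lstrip("#")
    shifts = {}
    for values in reader:
        row = dict(zip(header, values))
        shifts[(row["ID"], row["SAMPLE"])] = abs(0.5 - float(row["BAF"]))
    return shifts


def within_threshold(shifts, threshold):
    """Sorted (cluster ID, sample) keys whose BAF shift does not exceed threshold."""
    return sorted(key for key, shift in shifts.items() if shift <= threshold)


# Made-up rows for illustration: one cluster near 0.47 and one near 0.42.
example = (
    "ID\tSAMPLE\t#BINS\tRD\t#SNPS\tCOV\tALPHA\tBETA\tBAF\n"
    "1\ttumor1\t400\t1.00\t5000\t30.0\t70500\t79500\t0.47\n"
    "2\ttumor1\t150\t0.80\t1800\t24.0\t21000\t29000\t0.42\n"
)
shifts = baf_shifts(io.StringIO(example))
near_half = within_threshold(shifts, 0.04)
```

With these rows, a threshold of 0.04 keeps cluster 1 (shift of about 0.03) and excludes cluster 2 (shift of about 0.08).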
Main parameters
cluster-bins has a parameter -d, --diploidbaf that specifies the maximum expected shift from 0.5 for the BAF of a balanced cluster (i.e., diploid with copy-number state (1, 1) or tetraploid with copy-number state (2, 2)). This threshold is used to correct bias in the BAF of these balanced clusters. The default value of this parameter (0.1) is often sufficient, but the most appropriate value will vary depending on noise and coverage. In general, this value should be set to include only those clusters that are closest to 0.5. For example, if some clusters have centroids near 0.47 and others have centroids near 0.42, this parameter should be set to 0.035 or 0.04. To determine the best setting for this value, please check the plots produced by plot-bins and the centroid values described in bbc/bulk.seg (output from this command).

By default, cluster-bins takes as input a minimum number of clusters (--minK, default 2) and a maximum number of clusters (--maxK, default 30), and chooses the number K of clusters in this closed interval that maximizes the silhouette score. Users can also specify an exact number of clusters to infer (--exactK), which skips the model selection step.

Other options are available to change aspects of the Gaussian Hidden Markov model (GHMM) that is used by cluster-bins:
| Name | Description | Usage | Default |
|---|---|---|---|
| --tau | Off-diagonal value for initializing transition matrix | must be <= 1/(K-1) | 1e-6 |
| -t, --transmat | Type of transition matrix to infer | fixed (to off-diagonal = tau), diag (all diagonal elements are equal, all off-diagonal elements are equal) or full (freely varying) | diag |
| -c, --covar | Type of covariance matrix to infer | options described in hmmlearn documentation | diag |
| -x, --decoding | Decoding algorithm to use to infer final estimates of states | map for MAP inference, viterbi for Viterbi algorithm | map |
In particular, tau controls the balance between global information (RDR and BAF across samples) and local information (assigning adjacent bins to the same cluster): smaller values of tau put more weight on local information, and larger values of tau put more weight on global information. It may be appropriate to reduce tau by several orders of magnitude for noisier or lower-coverage datasets.
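To make the role of tau and the bound tau <= 1/(K-1) concrete, the sketch below builds a K-by-K transition matrix in which every off-diagonal entry equals tau. The diagonal value 1 - (K-1) * tau is not stated in this document; it follows from requiring each row to sum to 1, and the bound on tau is what keeps that diagonal non-negative. The function name `fixed_transition_matrix` is hypothetical, and this is an illustration of the constraint rather than the code cluster-bins uses.

```python
def fixed_transition_matrix(k, tau=1e-6):
    """Return a k-by-k row-stochastic matrix with every off-diagonal entry equal to tau."""
    if k < 2:
        raise ValueError("need at least 2 states")
    if not 0.0 <= tau <= 1.0 / (k - 1):
        raise ValueError("tau must satisfy 0 <= tau <= 1/(K-1)")
    # Probability of staying in the same state, so that each row sums to 1.
    stay = 1.0 - (k - 1) * tau
    return [[stay if i == j else tau for j in range(k)] for i in range(k)]


# Example with 4 states and the default tau from the table above.
matrix = fixed_transition_matrix(4)
```

A smaller tau pushes the diagonal closer to 1, so adjacent bins are more strongly encouraged to share a state, which matches the description of smaller tau favoring local information.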
Optional parameters
| Name | Description | Usage | Default |
|---|---|---|---|
| -R, --restarts | Number of restarts (initializations) | For each K, the HMM is initialized randomly this many times and the solution with the highest log-likelihood is kept | 10 |