Analyze HATCHet’s inference

The main component of HATCHet (the compute-cn step) performs three major tasks: (1) explicit estimation of fractional copy numbers, (2) inference of allele and clone-specific copy numbers, and (3) joint prediction of the number of clones and whole-genome duplication (WGD). This guide explains how to interpret HATCHet’s inference, how to perform quality control, and how to tune the parameters that control each task to obtain the best-fitting results. Assessing the quality of the results from each of these tasks is important, especially for datasets with high noise or special features.

Estimation of fractional copy numbers
Inference of allele and clone-specific copy numbers
Joint selection of number of clones and whole-genome duplication
Quality control and suspicious cases

1. Estimation of fractional copy numbers

HATCHet estimates the fractional copy numbers for all segments by identifying 1 or 2 tumor clonal clusters (i.e., clusters that have the same CNA in all tumor clones). First, HATCHet selects only some of the clusters as potential clonal clusters; next, it searches for consistent combinations of clonal clusters.

Selecting potential clonal clusters

HATCHet selects sufficiently large clusters as potential clonal clusters; the list of selected clusters is reported in the LOG of the HATCHet step, starting at the line # Selected clusters. For each cluster, the list reports its size and the corresponding pair of values (RDR, BAF) in each sample (samples are sorted alphabetically), so the clusters can easily be mapped to the bb_clustered.pdf figure generated by the command CBB of plot-bins or by the plot-bins-1d2d command. For example:

# Selected clusters: 11, 10, 12, 15, 21, 22, 23
## Features of selected clusters:
## 11: SIZE= 169350000.0	#CHRS= 6	(RDR, BAF)= (0.98300155842, 0.0717476641159)-(0.70901452517, 0.0218274427111)-(0.793836783391, 0.0201025034921)
## 10: SIZE= 31250000.0	#CHRS= 2	(RDR, BAF)= (1.26230102783, 0.5)-(0.993581127953, 0.39042851007)-(1.08735153846, 0.411521706019)
## 12: SIZE= 187400000.0	#CHRS= 8	(RDR, BAF)= (1.54296926478, 0.399215013858)-(1.28029494725, 0.494771052557)-(1.38353348068, 0.415539785049)
## 15: SIZE= 57450000.0	#CHRS= 2	(RDR, BAF)= (0.70122709154, 0.406572560257)-(0.96296886062, 0.419918229626)-(0.869811018537, 0.415889001548)
## 21: SIZE= 73000000.0	#CHRS= 1	(RDR, BAF)= (0.982893169988, 0.350795164506)-(0.88867098782, 0.410185588799)-(0.917182445956, 0.415596024012)
## 22: SIZE= 71600000.0	#CHRS= 1	(RDR, BAF)= (0.982359985531, 0.350709236531)-(1.2496122809, 0.497653731887)-(1.16543827292, 0.496215824202)
## 23: SIZE= 106150000.0	#CHRS= 10	(RDR, BAF)= (0.701788866503, 0.100467817792)-(0.602518459603, 0.0254102909342)-(0.621971921092, 0.0253841073015)
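
To compare these values against the bb_clustered.pdf figure, it can help to load them into a small data structure. The following is a minimal sketch that assumes the line layout shown above (tab-separated fields and one (RDR, BAF) pair per sample); the function name parse_selected_clusters and the regular expressions are illustrative and are not part of HATCHet.

```python
import re

# Two lines copied from the example LOG above (fields are tab-separated).
LOG = (
    "## Features of selected clusters:\n"
    "## 11: SIZE= 169350000.0\t#CHRS= 6\t(RDR, BAF)= (0.98300155842, 0.0717476641159)"
    "-(0.70901452517, 0.0218274427111)-(0.793836783391, 0.0201025034921)\n"
    "## 10: SIZE= 31250000.0\t#CHRS= 2\t(RDR, BAF)= (1.26230102783, 0.5)"
    "-(0.993581127953, 0.39042851007)-(1.08735153846, 0.411521706019)\n"
)

# Assumed layout: "## <id>: SIZE= <size> #CHRS= <n> (RDR, BAF)= (r, b)-(r, b)-..."
LINE_RE = re.compile(r"^## (\d+): SIZE= (\S+)\s+#CHRS= (\d+)\s+\(RDR, BAF\)= (.+)$")
PAIR_RE = re.compile(r"\(([^,()]+), ([^()]+)\)")


def parse_selected_clusters(log_text):
    """Map each cluster ID to its size, chromosome count, and per-sample (RDR, BAF)."""
    clusters = {}
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip header lines and anything with a different layout
        pairs = [(float(r), float(b)) for r, b in PAIR_RE.findall(match.group(4))]
        clusters[match.group(1)] = {
            "size": float(match.group(2)),
            "chrs": int(match.group(3)),
            "rdr_baf": pairs,  # one pair per sample, in the order printed in the LOG
        }
    return clusters


clusters = parse_selected_clusters(LOG)
for cid, info in clusters.items():
    print(cid, info["size"], info["rdr_baf"])
```

Because the samples are listed alphabetically, the i-th pair of every cluster refers to the same sample.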

The minimum size of the clusters that are considered is set by the parameter -ts, which defines the minimum fraction of the genome that a potential clonal cluster must cover. To deal with special or noisy datasets, the user can either decrease this threshold (e.g. -ts 0.001, i.e. 0.1% of the genome) to include clusters that have been erroneously excluded, or increase it (e.g. -ts 0.03, i.e. 3% of the genome) to exclude noisy clusters. Including all the most significant clusters is crucial for correctly identifying the clonal clusters and their copy numbers. For example, assume we have the following clusters:

## 0:   SIZE= 498    #CHRS= 22    (RDR, BAF)= (1.0, 0.5)
## 1:   SIZE= 497    #CHRS= 22    (RDR, BAF)= (0.7, 0.25)
## 2:   SIZE= 5      #CHRS= 10    (RDR, BAF)= (1.0, 0.00)

With the default value -ts 0.008, cluster 2 is excluded because it covers a fraction of the genome smaller than the threshold. As a consequence, cluster 1 appears to be the cluster with the maximum shift in BAF and is therefore assigned, for example, the copy-number state (2, 0) (i.e. a loss of heterozygosity, LOH). However, if cluster 2 is not due to noise but reflects true signal, cluster 2 should be the one with the maximum shift in BAF and the LOH state (2, 0). The user should therefore notice the absence of cluster 2 by inspecting the bb_clustered.pdf figure generated by the command CBB of plot-bins, and use a lower threshold, -ts 0.005, to include it.
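
The effect of the threshold on this example can be sketched in a few lines. This is only an illustration of the size filter described above: it assumes that the genome size is the sum of the listed cluster sizes and that a cluster is kept when its fraction is greater than or equal to -ts. The helper select_clusters is hypothetical and is not HATCHet's own code.

```python
def select_clusters(sizes, ts):
    """Return the IDs of clusters covering at least a fraction `ts` of the genome.

    Assumption: the genome size is approximated by the sum of all cluster sizes.
    """
    total = sum(sizes.values())
    return [cid for cid, size in sizes.items() if size / total >= ts]


# Cluster sizes from the example above.
sizes = {0: 498, 1: 497, 2: 5}

default_selection = select_clusters(sizes, ts=0.008)  # cluster 2 covers 0.5% < 0.8%
lowered_selection = select_clusters(sizes, ts=0.005)  # cluster 2 is now included

print(default_selection)
print(lowered_selection)
```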

Identifying combinations of clonal clusters

HATCHet chooses the largest combination of consistent clonal clusters to obtain the needed clonal clusters and to estimate the fractional copy numbers when assuming the occurrence of a WGD. The combinations considered by HATCHet are reported in the LOG of the HATCHet step, after the inference that assumes the absence of a WGD (# Running diploid), starting at the line # Finding clonal clusters and their copy numbers. Each combination (also called a pattern) is reported by specifying first the total size of the involved clusters and then the copy-number state of each cluster. For example:

# Finding clonal clusters and their copy numbers
## Found pattern of size 1080892751.0: {'8': (2, 1), '28': (3, 2), '50': (2, 2), '29': (4, 2), '23': (2, 0)}
## Found pattern of size 906528933.0: {'60': (2, 1), '12': (3, 2), '50': (2, 2)}
## Found pattern of size 712147541.0: {'11': (2, 0), '50': (2, 2)}
## Found pattern of size 835497541.0: {'23': (2, 0), '28': (3, 2), '50': (2, 2), '29': (4, 2), '21': (2, 1)}
## Found pattern of size 886592751.0: {'8': (2, 0), '28': (4, 2), '50': (2, 2)}
## Chosen pattern of size 1080892751.0: {'8': (2, 1), '28': (3, 2), '50': (2, 2), '29': (4, 2), '23': (2, 0)}
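
To verify that the chosen pattern is indeed the largest one, the patterns can be extracted from the LOG. The sketch below assumes the exact line format shown above; parse_patterns and its regular expression are illustrative helpers and are not provided by HATCHet.

```python
import ast
import re

# Lines copied from the example LOG above.
LOG = """\
# Finding clonal clusters and their copy numbers
## Found pattern of size 1080892751.0: {'8': (2, 1), '28': (3, 2), '50': (2, 2), '29': (4, 2), '23': (2, 0)}
## Found pattern of size 906528933.0: {'60': (2, 1), '12': (3, 2), '50': (2, 2)}
## Found pattern of size 712147541.0: {'11': (2, 0), '50': (2, 2)}
## Found pattern of size 835497541.0: {'23': (2, 0), '28': (3, 2), '50': (2, 2), '29': (4, 2), '21': (2, 1)}
## Found pattern of size 886592751.0: {'8': (2, 0), '28': (4, 2), '50': (2, 2)}
## Chosen pattern of size 1080892751.0: {'8': (2, 1), '28': (3, 2), '50': (2, 2), '29': (4, 2), '23': (2, 0)}
"""

PATTERN_RE = re.compile(r"^## (Found|Chosen) pattern of size ([0-9.]+): (\{.*\})$")


def parse_patterns(log_text):
    """Return (found, chosen): found is a list of (size, states), chosen a single pair."""
    found, chosen = [], None
    for line in log_text.splitlines():
        match = PATTERN_RE.match(line.strip())
        if not match:
            continue
        size = float(match.group(2))
        # The states are printed as a Python dict literal, so literal_eval can read them.
        states = ast.literal_eval(match.group(3))
        if match.group(1) == "Found":
            found.append((size, states))
        else:
            chosen = (size, states)
    return found, chosen


found, chosen = parse_patterns(LOG)
largest = max(found, key=lambda pattern: pattern[0])
print("chosen pattern is the largest:", largest == chosen)
```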

There are two parameters that control the identification of these patterns: -tR, the RDR threshold, and -tB, the BAF threshold. These thresholds are the maximum allowed errors when estimating the clonal clusters. The user can either decrease the thresholds (e.g. -tR 0.03 and -tB 0.02) to apply more stringent constraints, especially in low-noise datasets where subclonality can be accurately identified, or increase them (e.g. -tR 0.15 and -tB 0.05) to accommodate the higher noise in certain datasets. The user can use the bb_clustered.pdf figure produced by the command CBB of plot-bins to assess the identified clonal clusters. For example, in the following case

## 0:   SIZE= 498    #CHRS= 22    (RDR, BAF)= (1.0, 0.5)
## 1:   SIZE= 497    #CHRS= 22    (RDR, BAF)= (0.7, 0.25)
## 2:   SIZE= 5      #CHRS= 10    (RDR, BAF)= (1.0, 0.00)

The user can easily see that the following pattern is suspicious and wrong:

## Chosen pattern of size 995: {'0': (2, 2), '1': (2, 0)}

because the bb_clustered.pdf figure shows another clear cluster, 2, which has a significantly higher shift in BAF; therefore, cluster 1 cannot have the state (2, 0), which would correspond to the highest shift in BAF.
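
The same reasoning can be expressed as a simple consistency check. The sketch below is a single-sample illustration that measures the BAF shift as the distance from the balanced value 0.5; the function name and the dictionaries are hypothetical, and with multiple samples the comparison would need to be made per sample.

```python
def max_baf_shift_cluster(baf):
    """Return the ID of the cluster whose BAF is furthest from the balanced value 0.5."""
    return max(baf, key=lambda cid: abs(0.5 - baf[cid]))


# BAF of every cluster visible in the figure (values from the example above).
baf = {"0": 0.5, "1": 0.25, "2": 0.0}

# Chosen pattern from the example above: cluster 2 is missing from it.
pattern = {"0": (2, 2), "1": (2, 0)}

# Clusters assigned an LOH state, i.e. with one allele lost.
loh_clusters = [cid for cid, (a, b) in pattern.items() if min(a, b) == 0]

# An LOH cluster is suspicious if another cluster shows a larger BAF shift.
expected = max_baf_shift_cluster(baf)
suspicious = [cid for cid in loh_clusters if cid != expected]

print("cluster with maximum BAF shift:", expected)
print("suspicious LOH assignments:", suspicious)
```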

2. Inference of allele and clone-specific copy numbers

HATCHet infers allele and clone-specific copy numbers by first assuming the absence of a WGD (# Running diploid) and then assuming the presence of a WGD (# Running tetraploid). In each of the two cases, HATCHet considers an increasing number of clones (including the normal diploid clone). Typically, the interval starts at 2 (i.e. 1 tumor clone) and goes up to 6-8 clones, which is the typical maximum number of clones in terms of CNAs. However, the user can consider larger numbers, especially when suspecting the presence of more clones, through the parameter -n; typical intervals are -n 2,6, -n 2,8, -n 2,10, -n 2,12, …

Maximum copy numbers and minimum clone proportion

For each assumption and number of clones, HATCHet infers allele and clone-specific copy numbers by using two main parameters: the maximum copy numbers (-eD and -eT) and the minimum clone proportion (-u). These parameters constrain the solution space according to the needs of the user. The minimum clone proportion -u is particularly important when dealing with noisy datasets, and we recommend that the user follow the criteria described below:

Maximum copy numbers (-eD and -eT): these values define the maximum total copy number when assuming the absence of a WGD (-eD) and the presence of a WGD (-eT). By default, both parameters are 0, meaning that the maximum values are inferred from the estimated fractional copy numbers. However, we recommend that the user try common values, especially in the first attempts (e.g. 5, 6, 8 for -eD and 8, 10, 12 for -eT), to constrain the search space and prevent noisy clusters from introducing outlying values.
Minimum clone proportion -u: this value defines the minimum clone proportion for the inferred tumor clones. Default values are fairly small, e.g. 0.03 or 0.05, and allow tumor clones present in small proportions to be inferred. However, the power of the inference and the ability to infer tumor clones with small proportions strongly depend on the noise of the data. As such, especially for noisy and special datasets, it is very important that the user considers higher values when possible, e.g. 0.1-0.15. We recommend the following criterion to avoid overfitting and to find the best-fitting value for -u: start from a very small minimum clone proportion and increase the value whenever the inferred solutions contain clones with proportions equal to this threshold.
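
The criterion for -u can be checked mechanically once the inferred clone proportions are available. The sketch below is only an illustration: the sample names, clone names, proportions, and the tolerance are made up for the example, and the helper clones_at_threshold is not part of HATCHet.

```python
def clones_at_threshold(proportions, u, tol=1e-6):
    """List the (sample, clone) pairs whose tumor clone proportion equals `u`.

    `proportions` maps each sample to a dict {clone: proportion}; `tol` is a
    hypothetical tolerance for comparing floating-point proportions.
    """
    hits = []
    for sample, by_clone in proportions.items():
        for clone, proportion in by_clone.items():
            if abs(proportion - u) <= tol:
                hits.append((sample, clone))
    return hits


# Hypothetical tumor clone proportions inferred with -u 0.05.
proportions = {
    "sampleA": {"clone1": 0.62, "clone2": 0.05},
    "sampleB": {"clone1": 0.40, "clone2": 0.31},
}

hits = clones_at_threshold(proportions, u=0.05)
if hits:
    print("clones sitting at the minimum proportion, consider a higher -u:", hits)
```

A clone that sits exactly at the minimum proportion is a hint that the solver is using it to absorb noise, which is the overfitting case the criterion above is meant to catch.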

3. Joint selection of number of clones and whole-genome duplication

For each number of clones N, the related copy numbers are computed by HATCHet and placed in two output files:

A BBC-UCN file which adds to the input BBC file the copy-number state and proportion inferred for each clone (the normal clone is always the first). The file is named results.diploid.nN.bbc.ucn.tsv or results.tetraploid.nN.bbc.ucn.tsv for the solutions obtained when considering the absence or presence of a WGD, respectively.
A SEG-UCN file which combines neighboring bins with the same copy-number states into segments and for each segment it reports the copy-number state and proportion of each clone. The file is named results.diploid.nN.seg.ucn.tsv or results.tetraploid.nN.seg.ucn.tsv for the solutions obtained when considering the absence or presence of a WGD, respectively.

HATCHet selects the best solution under each of the two assumptions through a model-selection criterion; it copies the best diploid solution to the files chosen.diploid.bbc.ucn and chosen.diploid.seg.ucn, and the best tetraploid solution to the files chosen.tetraploid.bbc.ucn and chosen.tetraploid.seg.ucn. Each choice is based on the value of an elbow function (called score) computed for each number of clones. This value approximates the second derivative of the factorization objective function (called OBJ), and the number of clones with the maximum score is chosen; that is, the best number of clones is the one that significantly improves the objective function with respect to lower numbers of clones, while no significant further improvement is observed with higher numbers of clones. All these values are summarized in the LOG of the HATCHet step, first for the diploid solutions and then for the tetraploid solutions, e.g.

## Scores approximating second derivative for diploid results
## Diploid with 2 clones - OBJ: 66.804147 - score: -0.176997273837
## Diploid with 3 clones - OBJ: 34.938751 - score: -0.020688175766
## Diploid with 4 clones - OBJ: 17.550243 - score: 0.0543991087813
## Diploid with 5 clones - OBJ: 9.77046 - score: 0.203750740655
## Diploid with 6 clones - OBJ: 7.430087 - score: 0.0541415395045
## Diploid with 7 clones - OBJ: 6.052593 - score: 0.0380373822927
## Diploid with 8 clones - OBJ: 5.160703 - score: -0.152643321631
## Scores approximating second derivative for tetraploid results
## Tetraploid with 2 clones - OBJ: 81.064423 - score: -0.414934454045
## Tetraploid with 3 clones - OBJ: 23.108674 - score: 0.172765083795
## Tetraploid with 4 clones - OBJ: 15.546709 - score: 0.103011062784
## Tetraploid with 5 clones - OBJ: 12.060766 - score: 0.023008690139
## Tetraploid with 6 clones - OBJ: 9.633957 - score: 0.0339979959232
## Tetraploid with 7 clones - OBJ: 8.022994 - score: 0.104757566866
## Tetraploid with 8 clones - OBJ: 7.521881 - score: -0.237540399507

The user should analyze the scores, since HATCHet provides the following two parameters (controlling different hypotheses) to investigate alternative solutions that also have high scores:

Sensitivity to small CNAs -l: this parameter controls the sensitivity of HATCHet to small CNAs; typical values are 0.5-0.6. The user should decrease this value, e.g. to 0.2-0.3, to investigate solutions with more clones and smaller CNAs, and increase it, e.g. to 1.0-1.5, to give more confidence to large CNAs and less to small CNAs.
Confidence in a single tumor clone -g: this parameter controls the confidence in the presence of a single tumor clone; typical values are 0.2-0.3. Higher values, e.g. 0.4-0.5, increase the confidence in the presence of a single tumor clone, while lower values, e.g. 0.0-0.1, decrease the confidence and favor the presence of more clones. The value should be increased especially in 2 cases: (1) when the score of 2 clones is particularly high with or without a WGD (e.g. a value close to 0.0), and especially (2) when the score of 2 clones is significantly higher with a WGD than without a WGD; the latter may indeed indicate the presence of a single tetraploid clone.

The final choice between the presence and the absence of a WGD is based on a trade-off between the number of clones and WGD: the diploid solution is chosen when it has the same number of clones as the tetraploid solution or fewer; otherwise, the tetraploid solution is chosen.
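
The two-level selection described in this section can be reproduced from the LOG as a sanity check. The sketch below only reads the scores that HATCHet has already printed (it does not recompute them) and applies the maximum-score rule and the diploid/tetraploid trade-off as stated above; parse_scores and best_n are illustrative helpers that assume the line format of the example LOG.

```python
import re

# Score lines copied from the example LOG above.
LOG = """\
## Scores approximating second derivative for diploid results
## Diploid with 2 clones - OBJ: 66.804147 - score: -0.176997273837
## Diploid with 3 clones - OBJ: 34.938751 - score: -0.020688175766
## Diploid with 4 clones - OBJ: 17.550243 - score: 0.0543991087813
## Diploid with 5 clones - OBJ: 9.77046 - score: 0.203750740655
## Diploid with 6 clones - OBJ: 7.430087 - score: 0.0541415395045
## Diploid with 7 clones - OBJ: 6.052593 - score: 0.0380373822927
## Diploid with 8 clones - OBJ: 5.160703 - score: -0.152643321631
## Scores approximating second derivative for tetraploid results
## Tetraploid with 2 clones - OBJ: 81.064423 - score: -0.414934454045
## Tetraploid with 3 clones - OBJ: 23.108674 - score: 0.172765083795
## Tetraploid with 4 clones - OBJ: 15.546709 - score: 0.103011062784
## Tetraploid with 5 clones - OBJ: 12.060766 - score: 0.023008690139
## Tetraploid with 6 clones - OBJ: 9.633957 - score: 0.0339979959232
## Tetraploid with 7 clones - OBJ: 8.022994 - score: 0.104757566866
## Tetraploid with 8 clones - OBJ: 7.521881 - score: -0.237540399507
"""

SCORE_RE = re.compile(
    r"^## (Diploid|Tetraploid) with (\d+) clones - OBJ: (\S+) - score: (\S+)$"
)


def parse_scores(log_text):
    """Return {"Diploid": {n: (obj, score)}, "Tetraploid": {n: (obj, score)}}."""
    scores = {"Diploid": {}, "Tetraploid": {}}
    for line in log_text.splitlines():
        match = SCORE_RE.match(line.strip())
        if match:
            n_clones = int(match.group(2))
            scores[match.group(1)][n_clones] = (float(match.group(3)), float(match.group(4)))
    return scores


def best_n(by_n):
    """Number of clones with the maximum score."""
    return max(by_n, key=lambda n: by_n[n][1])


scores = parse_scores(LOG)
n_diploid = best_n(scores["Diploid"])
n_tetraploid = best_n(scores["Tetraploid"])

# Trade-off stated above: diploid wins only with the same number of clones or fewer.
final = "diploid" if n_diploid <= n_tetraploid else "tetraploid"
print(n_diploid, n_tetraploid, final)
```

For this LOG the best diploid solution has 5 clones and the best tetraploid solution has 3, so the rule selects the tetraploid solution.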

4. Quality control and suspicious cases

It is very important that the user verifies the results in different steps to guarantee the best-quality results, more specifically:

The user can assess the validity of the chosen clonal clusters by using the bb_clustered.pdf figure produced by the command CBB of plot-bins and comparing it with the list of selected potential clonal clusters.
The user can assess the inferred copy numbers by analyzing the inferred maximum values and the inferred clone proportions, which define the tumor clonal composition.
The user can assess the joint inference of the number of clones and WGD by analyzing the values of the objective function and the related scores.

There are some typical suspicious and warning cases that the user can identify from the analysis of the LOG of the HATCHet step:

Many diploid solutions have high scores.
The objective function of tetraploid solutions (i.e. with a WGD) barely decreases or varies when the number of clones increases. Although this can occur because a single tetraploid tumor clone is present, it typically occurs when the heuristic of HATCHet has failed to identify a correct clonal cluster. This case is even more suspicious when the chosen number of diploid clones is much higher or when the objective function of tetraploid solutions is much higher than that of diploid solutions.
Inferred clone proportions are identical to the minimum clone proportion -u, and tumor clones are present in all samples with very small proportions. Also, higher maximum copy numbers are needed when they result in much lower values of the objective function.
There is a huge difference between the number of clones inferred with and without a WGD, especially when the chosen diploid solution has a much lower number of clones than the chosen tetraploid solution.
The objective function keeps decreasing significantly, or the objective function has very small values.
The score of 2 clones with a WGD is much higher than the score of 2 clones without a WGD. This typically requires increasing the single-clone confidence -g to investigate the presence of a single tumor clone.
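
Two of the warning cases above can be screened automatically from the OBJ and score values in the LOG. The following is a rough sketch only: the cutoffs FLAT_DROP and SCORE_GAP are arbitrary values chosen for illustration and are not HATCHet parameters, so any flag raised this way should be confirmed by inspecting the LOG and the plots.

```python
# OBJ values copied from the tetraploid lines of the example LOG in section 3.
tetraploid_obj = {
    2: 81.064423, 3: 23.108674, 4: 15.546709, 5: 12.060766,
    6: 9.633957, 7: 8.022994, 8: 7.521881,
}
# Scores of the 2-clone solutions from the same LOG.
score2_diploid, score2_tetraploid = -0.176997273837, -0.414934454045

FLAT_DROP = 0.05  # hypothetical cutoff: relative OBJ decrease considered negligible
SCORE_GAP = 0.1   # hypothetical cutoff: score difference considered "much higher"


def relative_drops(obj_by_n):
    """Relative decrease of OBJ when moving from each number of clones to the next."""
    ns = sorted(obj_by_n)
    return {n: (obj_by_n[prev] - obj_by_n[n]) / obj_by_n[prev] for prev, n in zip(ns, ns[1:])}


drops = relative_drops(tetraploid_obj)

# Warning: the tetraploid objective barely changes with more clones.
flat_tetraploid = all(drop < FLAT_DROP for drop in drops.values())

# Warning: 2 clones score much higher with a WGD than without (consider increasing -g).
single_clone_hint = (score2_tetraploid - score2_diploid) > SCORE_GAP

print("flat tetraploid objective:", flat_tetraploid)
print("possible single tetraploid clone:", single_clone_hint)
```

Neither warning is raised for the example LOG: the tetraploid objective drops sharply from 2 to 3 clones, and the 2-clone score is lower with a WGD than without.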

The user can consider the following parameters to investigate alternative solutions and better fitting:

Sensitivity -l controls how much the variance of the data and clusters influences the choice of the number of clones. The user should increase the sensitivity (by decreasing the value of -l, e.g. to 0.5, 0.45, 0.4, or 0.3) for high-purity or low-variance samples to better investigate multiple solutions, especially when there are multiple solutions with high scores (and especially when there are many diploid solutions with high scores) and when considering small CNAs. Conversely, the user should decrease the sensitivity (by increasing -l, e.g. to 0.6, 0.8, or 1.0) for low-purity or high-variance samples.
Minimum clone proportion -u: we suggest increasing this value whenever the solutions contain clone proportions identical to -u or clones present in all samples with very small proportions.
Maximum copy numbers: both standard values and those inferred from the fractional copy numbers (by providing -eD 0 -eT 0) should be investigated.
-g controls the confidence in the presence of a single tumor clone.
-tR and -tB set the error thresholds for the inferred clonal clusters: higher values accommodate high noise in the data, while lower values reduce the error in low-noise data.
-ts sets the size threshold for the clusters considered as potential clonal clusters: higher values improve accuracy, while lower values include more information, especially when small CNAs are present.