Analyze different types of data

The default parameter values in the complete HATCHet pipeline are intended for whole-genome sequencing (WGS) data. When analyzing other types of data, such as whole-exome sequencing (WES) data, users should adjust some of the parameters to account for the different characteristics of the data. More specifically, there are five main points to consider when analyzing WES data:

Bin sizes. One can use the plots from plot-bins to test different parameters (--mtr and --msr for variable-width bins, bin size for fixed-width bins) and inspect the amount of variance and/or the separation between apparent clusters.
- Variable-width. A sufficient number of germline SNPs is needed to obtain low-variance estimates of the read-depth ratio (RDR) and, especially, of the B-allele frequency (BAF) of each bin. Variable-width binning attempts to account for this by adjusting bin widths to ensure enough total and SNP-covering reads in each bin. You can tune the average bin width using the --msr (min. SNP-covering reads, default 5000) and --mtr (min. total reads, default 5000) parameters of combine-counts. Generally, --msr is more important, because a bin with enough SNP-covering reads to get a good BAF estimate will almost certainly have enough total reads to get a good RDR estimate. Increasing these parameters produces larger bins (on average) with lower variance, while decreasing them produces smaller bins (on average) with higher variance.
- Fixed-width (legacy). While a bin size of 50kb is standard for CNA analysis of WGS data, WES data generally require larger bin sizes to guarantee that each bin contains a sufficient number of heterozygous germline SNPs. As such, more appropriate bin sizes for WES data may be 200kb or 250kb; even larger bin sizes, e.g. 500kb, may be needed for noisy WES data.
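To build intuition for how --msr and --mtr affect bin width, the following is a minimal sketch of greedy variable-width binning. It is not HATCHet's actual implementation: the function name `merge_windows`, the input layout (small contiguous windows with pre-computed read counts), and the rule of folding a leftover tail into the previous bin are all assumptions made for illustration.

```python
def merge_windows(windows, msr=5000, mtr=5000):
    """Greedily merge adjacent windows into variable-width bins.

    windows: list of (start, end, snp_reads, total_reads) tuples, sorted and
             contiguous along one chromosome arm (hypothetical input layout).
    msr:     minimum number of SNP-covering reads required per bin.
    mtr:     minimum number of total reads required per bin.
    """
    bins = []
    cur_start, cur_end = None, None
    snp = tot = 0
    for start, end, snp_reads, total_reads in windows:
        if cur_start is None:
            cur_start = start
        cur_end = end
        snp += snp_reads
        tot += total_reads
        # Close the bin as soon as both thresholds are satisfied.
        if snp >= msr and tot >= mtr:
            bins.append((cur_start, cur_end, snp, tot))
            cur_start = None
            snp = tot = 0
    # A leftover tail that never reached the thresholds is folded into the
    # previous bin (or kept as a single bin if no bin was closed).
    if cur_start is not None:
        if bins:
            prev_start, _, prev_snp, prev_tot = bins.pop()
            bins.append((prev_start, cur_end, prev_snp + snp, prev_tot + tot))
        else:
            bins.append((cur_start, cur_end, snp, tot))
    return bins


# Ten 10kb windows, each with 1000 SNP-covering reads and 20000 total reads.
windows = [(i * 10_000, (i + 1) * 10_000, 1000, 20_000) for i in range(10)]
wide = merge_windows(windows, msr=5000, mtr=5000)    # fewer, wider bins
narrow = merge_windows(windows, msr=2500, mtr=5000)  # more, narrower bins
```

In this toy input the SNP-covering read count is the limiting factor, which mirrors the observation above that --msr usually matters more than --mtr.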
Read-count thresholds. As suggested in the GATK best practices, count-alleles requires two parameters, -c (the minimum coverage for SNPs) and -C (the maximum coverage for SNPs), to reliably call SNPs and exclude those in regions with artifacts. GATK suggests a value of -C that is at least twice the average coverage, while -c should be large enough to exclude non-sequenced regions. For example, -c 6 and -C 300 are values previously used for WGS data, whose coverage is typically between 30x and 90x. However, WES data are generally characterized by a much higher average coverage and thus require larger values, e.g. -c 20 and -C 600. These values are also very useful for discarding off-target regions. In any case, the user should ideally pick values according to the data under consideration.
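The rule of thumb above can be turned into a quick sanity check before running count-alleles. The sketch below is a hypothetical helper, not part of HATCHet: the function name `check_snp_coverage_bounds` is invented here, the requirement that -c stay below the average coverage is an added assumption, and the 200x coverage in the example is illustrative only.

```python
def check_snp_coverage_bounds(mean_coverage, min_cov, max_cov):
    """Return a list of warnings for candidate -c (min_cov) and -C (max_cov).

    Only the 'at least twice the average coverage' rule comes from the text;
    the other checks are assumptions added for illustration.
    """
    problems = []
    if mean_coverage <= 0:
        raise ValueError("mean_coverage must be positive")
    if max_cov < 2 * mean_coverage:
        problems.append("-C is below twice the average coverage")
    if min_cov < 1:
        problems.append("-c does not exclude non-sequenced regions")
    if min_cov >= mean_coverage:
        # Assumption: a minimum at or above the average would discard most SNPs.
        problems.append("-c is not below the average coverage")
    return problems


wgs_ok = check_snp_coverage_bounds(60, min_cov=6, max_cov=300)
wes_bad = check_snp_coverage_bounds(200, min_cov=6, max_cov=300)
wes_ok = check_snp_coverage_bounds(200, min_cov=20, max_cov=600)
```

With an assumed average WES coverage of 200x, the WGS-style value -C 300 is flagged, while -C 600 passes.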
Bootstrapping for clustering. (legacy cluster-bins-gmm only) Occasionally, WES data may yield far fewer data points than WGS data. Only in these special cases with very few data points, the global clustering of cluster-bins-gmm may benefit from the integrated bootstrapping approach. This approach generates a certain number of synthetic bins from the real ones to increase the power of the clustering. For example, the cluster-bins-gmm parameters -u 20 -dR 0.002 -dB 0.002 activate the bootstrapping, which introduces 20 synthetic bins with low variances for each real bin.
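The idea behind the bootstrapping can be sketched as follows. This is an illustration under stated assumptions, not the code used by cluster-bins-gmm: it assumes that each synthetic bin is drawn from independent Gaussians centered on the real bin's RDR and BAF, that the -dR and -dB values act as standard deviations, that BAF is mirrored into [0, 0.5], and that the real bins are kept alongside the synthetic ones. The function name `bootstrap_bins` is hypothetical.

```python
import random


def bootstrap_bins(bins, n_synthetic=20, rdr_sd=0.002, baf_sd=0.002, seed=0):
    """Augment (rdr, baf) points with synthetic neighbours.

    bins:        list of (rdr, baf) tuples for the real bins.
    n_synthetic: synthetic points per real bin (analogous to -u).
    rdr_sd:      spread of synthetic RDR values (analogous to -dR; assumed
                 to be a standard deviation).
    baf_sd:      spread of synthetic BAF values (analogous to -dB; assumed
                 to be a standard deviation).
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    augmented = []
    for rdr, baf in bins:
        augmented.append((rdr, baf))  # keep the real bin
        for _ in range(n_synthetic):
            synthetic_rdr = max(0.0, rng.gauss(rdr, rdr_sd))
            # Clamp to [0, 0.5], assuming mirrored BAF values.
            synthetic_baf = min(0.5, max(0.0, rng.gauss(baf, baf_sd)))
            augmented.append((synthetic_rdr, synthetic_baf))
    return augmented


real_bins = [(1.0, 0.5), (1.5, 0.33)]
points = bootstrap_bins(real_bins, n_synthetic=20, rdr_sd=0.002, baf_sd=0.002)
```

Because the spread is small, the synthetic points stay close to their real bin and mainly add weight to sparse clusters rather than creating new ones.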
Bias in balanced BAF. Depending on the data type (WGS vs. WES) and the sequencing coverage, the argument -d, --diploidbaf in cluster-bins/cluster-bins-gmm may need to be adjusted to match the observed BAF bias. This threshold should be set so that the centroids of cluster(s) with BAF closest to 0.5 are included in the range [0.5 - d, 0.5] and other clusters with apparently different BAF are excluded.
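As a small worked example of the rule above, the sketch below checks which cluster centroids fall inside [0.5 - d, 0.5] for a candidate threshold d. The helper name `balanced_clusters` and the centroid values are made up for illustration; in practice the centroids would be read off the clustering plots.

```python
def balanced_clusters(centroid_bafs, d):
    """Return the names of clusters whose BAF centroid lies in [0.5 - d, 0.5].

    centroid_bafs: dict mapping a cluster name to its BAF centroid
                   (hypothetical values, mirrored into [0, 0.5]).
    d:             candidate value for -d / --diploidbaf.
    """
    lower = 0.5 - d
    return [name for name, baf in centroid_bafs.items() if lower <= baf <= 0.5]


# Clusters A and B look balanced but are biased below 0.5; C is clearly imbalanced.
centroids = {"A": 0.47, "B": 0.46, "C": 0.38}
too_strict = balanced_clusters(centroids, d=0.02)  # misses the biased balanced clusters
suitable = balanced_clusters(centroids, d=0.05)    # includes A and B, excludes C
```

A value of d that is too small leaves the apparently balanced clusters outside the range, while a value that is too large would start to include clusters such as C.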
Genomic regions. Provide a BED file defining the sequenced genomic regions when this is available. This helps to speed up the process.