combine-counts

This step constructs variable-length bins that ensure that each bin has at least some number (--msr) of SNP-covering reads and at least some number (--mtr) of total reads per bin. Then, it combines the read counts and the allele counts for the identified germline SNPs to compute the read-depth ratio (RDR) and B-allele frequency (BAF) of every genomic bin.

Input

combine-counts takes in input the output from count-reads (i.e., two gzipped files ch.total.gz and ch.thresholds.gz for each chromosome ch, ). Use the -A, --array argument to specify a directory containing these input files.

It also requires (specified by the flag -b, --baffile) a tab-separated file specifying the allele counts for heterzygous germline SNPs from all tumor samples. The tab separated file would typically be produced by the count-alleles command and has the following fields:

Field Description
CHR Name of a chromosome
POS Genomic position corresponding to a heterozygous germline in CHR
SAMPLE Name of a tumor sample
REF_COUNT Count of reads covering POS with reference allele
ALT_COUNT Count of reads covering POS with alternate allele

Finally, combine-counts requires a TSV file (-t, --totalcounts) specifying the total number of reads aligned in each sample (also typically produced by count-reads).

In summary, the following arguments are required to specify input:

Name Description Usage Default
-A, --array Directory containing intermediate files Typically populated by count-reads. For each chromosome ch, this directory should contain files ch.total.gz and ch.thresholds.gz (as well as samples.txt indicating sample names)
-b, --baffile Tab-separated file with allele counts Typically produced by count-alleles. See description above.
-t, --totalcounts Tab-separated file with total aligned reads for each sample Typically produced by count-reads.
-V, --refversion Reference genome version Either "hg19" or "hg38". This argument is used to select which centromere locations to use.

Output

combine-counts produces a tab-separated file (-o, --outfile) with the following fields.

Field Description
CHR Name of a chromosome
START Starting genomic position of a genomic bin in CHR
END Ending genomic position of a genomic bin in CHR
SAMPLE Name of a tumor sample
RD RDR of the bin in SAMPLE (corrected by the total reads in SAMPLE vs. the total reads in the matched normal sample)
#SNPS Number of SNPs present in the bin in SAMPLE
COV Average coverage in the bin in SAMPLE
ALPHA Alpha parameter related to the binomial model of BAF for the bin in SAMPLE, typically total number of reads from A allele
BETA Beta parameter related to the binomial model of BAF for the bin in SAMPLE, typically total number of reads from B allele
BAF BAF of the bin in SAMPLE
TOTAL_READS Total number of reads in the bin in SAMPLE
NORMAL_READS Total number of reads in the bin in the matched normal sample
CORRECTED_READS Total number of reads in the bin in SAMPLE, corrected by the total reads in SAMPLE vs. the total reads in matched normal.

Currently, it produces one such file that excludes sex chromosomes (for use in HATCHet), and one that includes sex chromosomes (for future use).

Main parameters

combine-counts has some main parameters; the main values of these parameters allow to deal with most of datasets, but their values can be changed or tuned to accommodate the features of special datasets.

Name Description Usage Default
--msr Minimum SNP-covering reads for each bin Each bin constructed by this command must have at least this many reads covering heterozygous SNPs in each sample 5000
--mtr Minimum total reads for each bin Each bin constructed by this command must have at least this many total reads in each sample 5000

Phasing parameters

A phased VCF file must be given via argument -p, --phase to apply reference-based phasing. The remaining parameters control the degree to which the phasing information is used.

Name Description Usage Default
-p, --phase vcf.gz with phasing for all het. SNPs File containing phasing data for germline SNPs, typically phased.vcf.gz if using the HATCHet pipeline. None (no phasing is performed)
-s, --blocksize Maximum phasing block size Maximum distance (in bp) between a pair of SNPs included in the same phasing block (ignored if -p, --phase is not used) 25000
-m, --max_spb Maximum number of SNPs per phased block No more than this many SNPs can be included in the same phasing block (included to minimize phasing errors in high-LD regions) 10
-a, --alpha Significance threshold to allow adjacent SNPs to be merged If adjacent SNPs have significantly different BAFs (at this significance level) after taking the phasing into account, they are not merged a priori. Higher means less trust in phasing. 0.1

Other optional parameters

Name Description Usage Default
-j, --processes Number of parallel processes to use (default 1) 1
-z, --not_compressed Indicates that intermediate files are not compressed For compatibility with legacy versions of previous step -- set this flag if your .total and .thresholds files are plaintext rather than gzipped.

Example usage

hatchet combine-counts -b baf/bulk.1bed -o abin/bulk.bb -j 24 -V hg19 -A array -t array/total.tsv -V hg19