combine-counts¶

This step constructs variable-length bins such that each bin contains at least a minimum number of SNP-covering reads (--msr) and at least a minimum number of total reads (--mtr). It then combines the read counts and the allele counts for the identified germline SNPs to compute the read-depth ratio (RDR) and B-allele frequency (BAF) of every genomic bin.
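The binning idea can be illustrated with a minimal sketch. This is not HATCHet's actual implementation: it assumes a single sample, hypothetical per-window counts (`snp_reads`, `total_reads`), and a simple greedy left-to-right merge. The function name `greedy_bins` is invented for illustration. With several samples, the thresholds would have to hold in every sample.

```python
from typing import List, Tuple


def greedy_bins(snp_reads: List[int], total_reads: List[int],
                msr: int, mtr: int) -> List[Tuple[int, int]]:
    """Merge consecutive windows into variable-length bins.

    snp_reads[i] and total_reads[i] are hypothetical counts of SNP-covering
    reads and total reads in window i (single sample, for simplicity).
    Returns half-open (start, end) ranges of window indices.
    """
    bins: List[Tuple[int, int]] = []
    start = 0
    snp_sum = 0
    total_sum = 0
    for i, (s, t) in enumerate(zip(snp_reads, total_reads)):
        snp_sum += s
        total_sum += t
        # Close the bin as soon as both thresholds are satisfied.
        if snp_sum >= msr and total_sum >= mtr:
            bins.append((start, i + 1))
            start = i + 1
            snp_sum = 0
            total_sum = 0
    # Trailing windows that never reach the thresholds are folded into the
    # previous bin (an assumption of this sketch), or kept as one bin if
    # no bin was closed at all.
    if start < len(snp_reads):
        if bins:
            last_start, _ = bins.pop()
            bins.append((last_start, len(snp_reads)))
        else:
            bins.append((start, len(snp_reads)))
    return bins


example_bins = greedy_bins([2, 3, 1, 4, 1], [5, 5, 5, 5, 5], msr=5, mtr=10)
```

Larger values of `msr` and `mtr` yield fewer, longer bins in this sketch, which mirrors the trade-off between resolution and noise that the real parameters control.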

Input¶

combine-counts takes as input the output of count-reads (i.e., two gzipped files, ch.total.gz and ch.thresholds.gz, for each chromosome ch). Use the -A, --array argument to specify the directory containing these input files.

It also requires a tab-separated file (specified with -b, --baffile) containing the allele counts for heterozygous germline SNPs in all tumor samples. This file is typically produced by the count-alleles command and has the following fields:

Field	Description
`CHR`	Name of a chromosome
`POS`	Genomic position corresponding to a heterozygous germline SNP in `CHR`
`SAMPLE`	Name of a tumor sample
`REF_COUNT`	Count of reads covering `POS` with reference allele
`ALT_COUNT`	Count of reads covering `POS` with alternate allele
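As a rough illustration of this layout, the sketch below reads such a file with Python's standard `csv` module and groups the counts by sample and chromosome. It assumes the file has no header row and that the columns appear in the order listed above; the function name `read_allele_counts` and the sample name `tumor1` are made up for the example.

```python
import csv
import io
from collections import defaultdict


def read_allele_counts(handle):
    """Group (POS, REF_COUNT, ALT_COUNT) tuples by (SAMPLE, CHR).

    Assumes a headerless tab-separated file with columns in the order
    CHR, POS, SAMPLE, REF_COUNT, ALT_COUNT. Lines starting with '#'
    are skipped in case a comment or header line is present.
    """
    counts = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        chrom, pos, sample, ref, alt = row[:5]
        counts[(sample, chrom)].append((int(pos), int(ref), int(alt)))
    return counts


# Two hypothetical SNP records for one sample on chr1.
example = io.StringIO("chr1\t1000\ttumor1\t12\t9\nchr1\t2500\ttumor1\t7\t14\n")
allele_counts = read_allele_counts(example)
```

For a real file, the same function could be called on `open(path)` instead of the in-memory example.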

Finally, combine-counts requires a TSV file (-t, --totalcounts) specifying the total number of reads aligned in each sample (also typically produced by count-reads).

In summary, the following arguments are required to specify input:

Name	Description	Usage
`-A`, `--array`	Directory containing intermediate files	Typically populated by `count-reads`. For each chromosome `ch`, this directory should contain files `ch.total.gz` and `ch.thresholds.gz` (as well as `samples.txt` indicating sample names)
`-b, --baffile`	Tab-separated file with allele counts	Typically produced by `count-alleles`. See description above.
`-t, --totalcounts`	Tab-separated file with total aligned reads for each sample	Typically produced by `count-reads`.
`-V, --refversion`	Reference genome version	Either "hg19" or "hg38". This argument is used to select which centromere locations to use.

Output¶

combine-counts produces a tab-separated file (-o, --outfile) with the following fields.

Field	Description
`CHR`	Name of a chromosome
`START`	Starting genomic position of a genomic bin in `CHR`
`END`	Ending genomic position of a genomic bin in `CHR`
`SAMPLE`	Name of a tumor sample
`RD`	RDR of the bin in `SAMPLE` (corrected by the total reads in `SAMPLE` vs. the total reads in the matched normal sample)
`#SNPS`	Number of SNPs present in the bin in `SAMPLE`
`COV`	Average coverage in the bin in `SAMPLE`
`ALPHA`	Alpha parameter related to the binomial model of BAF for the bin in `SAMPLE`, typically total number of reads from A allele
`BETA`	Beta parameter related to the binomial model of BAF for the bin in `SAMPLE`, typically total number of reads from B allele
`BAF`	BAF of the bin in `SAMPLE`
`TOTAL_READS`	Total number of reads in the bin in `SAMPLE`
`NORMAL_READS`	Total number of reads in the bin in the matched normal sample
`CORRECTED_READS`	Total number of reads in the bin in `SAMPLE`, corrected by the total reads in `SAMPLE` vs. the total reads in matched normal.
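The relationships between several of these fields can be sketched as follows. This is an interpretation of the field descriptions above, not HATCHet's actual code: it assumes that the correction rescales tumor reads by the ratio of total reads in the normal sample to total reads in the tumor sample, and that BAF is summarized as the minor fraction of `ALPHA` and `BETA`. The real tool may estimate BAF differently, especially when phasing is used. All function names and numbers are hypothetical.

```python
def corrected_reads(total_reads, sample_total, normal_total):
    """Rescale tumor bin reads so that tumor and normal library sizes match.

    total_reads:  reads in the bin in the tumor sample (TOTAL_READS)
    sample_total: total aligned reads in the tumor sample
    normal_total: total aligned reads in the matched normal sample
    """
    return total_reads * normal_total / sample_total


def read_depth_ratio(total_reads, normal_reads, sample_total, normal_total):
    """RDR as corrected tumor reads divided by normal reads in the bin."""
    return corrected_reads(total_reads, sample_total, normal_total) / normal_reads


def baf_from_counts(alpha, beta):
    """Minor-allele fraction from the two allele read totals (assumed convention)."""
    return min(alpha, beta) / (alpha + beta)


# Hypothetical bin: the tumor library is twice as deep as the normal one.
rdr = read_depth_ratio(total_reads=6000, normal_reads=5000,
                       sample_total=2_000_000, normal_total=1_000_000)
baf = baf_from_counts(alpha=30, beta=70)
```

Under these assumptions, a bin with no copy-number change and balanced alleles would have an RDR near 1 and a BAF near 0.5.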

Currently, it produces one such file that excludes sex chromosomes (for use in HATCHet), and one that includes sex chromosomes (for future use).

Main parameters¶

combine-counts has two main parameters. Their default values are suitable for most datasets, but they can be tuned to accommodate the features of special datasets.

Name	Description	Usage	Default
`--msr`	Minimum SNP-covering reads for each bin	Each bin constructed by this command must have at least this many reads covering heterozygous SNPs in each sample	5000
`--mtr`	Minimum total reads for each bin	Each bin constructed by this command must have at least this many total reads in each sample	5000

Phasing parameters¶

A phased VCF file must be given via argument -p, --phase to apply reference-based phasing. The remaining parameters control the degree to which the phasing information is used.

Name	Description	Usage	Default
`-p`, `--phase`	vcf.gz with phasing for all het. SNPs	File containing phasing data for germline SNPs, typically `phased.vcf.gz` if using the HATCHet pipeline.	None (no phasing is performed)
`-s`, `--blocksize`	Maximum phasing block size	Maximum distance (in bp) between a pair of SNPs included in the same phasing block (ignored if `-p, --phase` is not used)	25000
`-m`, `--max_spb`	Maximum number of SNPs per phased block	No more than this many SNPs can be included in the same phasing block (included to minimize phasing errors in high-LD regions)	10
`-a`, `--alpha`	Significance threshold to allow adjacent SNPs to be merged	If adjacent SNPs have significantly different BAFs (at this significance level) after taking the phasing into account, they are not merged a priori. Higher means less trust in phasing.	0.1
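The effect of the block size and SNPs-per-block limits can be illustrated with a small sketch. It is only an approximation of the idea, not the algorithm used by combine-counts: it greedily groups sorted SNP positions, starting a new block when the span from the block's first SNP would exceed `blocksize` or when the block already holds `max_spb` SNPs. The significance test controlled by `--alpha` is omitted, and the function name `phase_blocks` is hypothetical.

```python
def phase_blocks(positions, blocksize=25000, max_spb=10):
    """Greedily group sorted SNP positions into candidate phasing blocks.

    A new block is started when adding the next SNP would make the span
    from the block's first SNP exceed `blocksize` bp, or when the block
    already contains `max_spb` SNPs.
    """
    blocks = []
    current = []
    for pos in positions:
        too_far = bool(current) and pos - current[0] > blocksize
        too_many = len(current) >= max_spb
        if current and (too_far or too_many):
            blocks.append(current)
            current = []
        current.append(pos)
    if current:
        blocks.append(current)
    return blocks


snp_positions = [100, 5000, 20000, 30000, 31000]
default_blocks = phase_blocks(snp_positions)
small_blocks = phase_blocks(snp_positions, max_spb=2)
```

In this sketch, lowering `max_spb` or `blocksize` produces more, smaller blocks, so less of the phasing information is relied upon.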

Other optional parameters¶

Name	Description	Usage	Default
`-j`, `--processes`	Number of parallel processes to use		1
`-z, --not_compressed`	Indicates that intermediate files are not compressed	For compatibility with legacy versions of the previous step (`count-reads`): set this flag if your `.total` and `.thresholds` files are plaintext rather than gzipped.

Example usage¶

hatchet combine-counts -b baf/bulk.1bed -o abin/bulk.bb -j 24 -V hg19 -A array -t array/total.tsv