# count-reads

This step of HATCHet uses the locations of heterozygous SNPs (called by `count-alleles`) to identify candidate bin thresholds between SNPs. It then counts, for each sample, the total number of reads between each pair of adjacent candidate thresholds; these counts are later used to construct variable-length bins.
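
The idea can be illustrated with a small, self-contained sketch. It is not HATCHet's implementation: placing each candidate threshold at the midpoint between adjacent SNPs, counting a read by its start position, and the function names `candidate_thresholds` and `count_between` are all assumptions made for illustration.

```python
from bisect import bisect_left


def candidate_thresholds(snp_positions):
    """Return one candidate threshold between each pair of adjacent SNPs.

    Assumption: the threshold is the integer midpoint of the two SNPs.
    """
    snps = sorted(snp_positions)
    return [(a + b) // 2 for a, b in zip(snps, snps[1:])]


def count_between(read_starts, thresholds, chrom_end):
    """Count reads whose start lies in each half-open interval [lo, hi).

    Assumption: a read is assigned to an interval by its start position.
    """
    edges = [0] + list(thresholds) + [chrom_end]
    starts = sorted(read_starts)
    return [
        bisect_left(starts, hi) - bisect_left(starts, lo)
        for lo, hi in zip(edges, edges[1:])
    ]


# Toy data for a single chromosome and a single sample (hypothetical values).
snps = [100, 300, 700]
thresholds = candidate_thresholds(snps)
reads = [10, 150, 220, 480, 490, 510]
counts = count_between(reads, thresholds, chrom_end=1000)
print(thresholds, counts)
```

With these toy values the thresholds are `[200, 500]` and the per-interval counts are `[2, 3, 1]`. Storing counts per candidate interval, rather than per fixed-width window, is what allows adjacent intervals to be merged into variable-length bins later.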

## Input

`count-reads` takes as input sorted and indexed BAM files for multiple tumor samples from the same patient, a sorted and indexed BAM file from a matched-normal sample, an indexed human reference genome, and a 1bed file containing SNP information for this individual (the output of the `count-alleles` command, normally `baf/bulk.1bed`).

| Name | Description | Usage |
|------|-------------|-------|
| `-T`, `--tumors` | A white-space separated list of sorted and indexed BAM files for tumor samples | The tumor samples from the same patient that are jointly analyzed by HATCHet |
| `-N`, `--normal` | A sorted and indexed BAM file for the matched-normal sample | The matched-normal sample for the same patient |
| `-b`, `--baffile` | A 1bed file containing locations of heterozygous germline SNPs | Typically, a user would run `count-alleles` to obtain this file |
| `-V`, `--refversion` | Reference genome version (`hg19` or `hg38` supported) | |
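
Before launching a long run, it can help to verify these inputs up front. The following is a minimal sketch, not part of HATCHet: the helper names `find_bam_index` and `check_inputs` are hypothetical, and the index file names it looks for (`x.bam.bai`, `x.bai`, `x.bam.csi`) are an assumption based on common SAMtools conventions.

```python
from pathlib import Path

SUPPORTED_REFVERSIONS = ("hg19", "hg38")  # values listed for -V above


def find_bam_index(bam):
    """Return the first index file found next to a BAM file, or None.

    Assumption: the index follows one of the usual naming conventions.
    """
    bam = Path(bam)
    candidates = [
        Path(str(bam) + ".bai"),
        bam.with_suffix(".bai"),
        Path(str(bam) + ".csi"),
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


def check_inputs(tumors, normal, baffile, refversion):
    """Collect human-readable problems with the inputs; empty list means OK."""
    problems = []
    if refversion not in SUPPORTED_REFVERSIONS:
        problems.append(f"unsupported reference version: {refversion}")
    for bam in [normal, *tumors]:
        if not Path(bam).is_file():
            problems.append(f"BAM file not found: {bam}")
        elif find_bam_index(bam) is None:
            problems.append(f"no index found for: {bam}")
    if not Path(baffile).is_file():
        problems.append(f"1bed file not found: {baffile}")
    return problems


for problem in check_inputs(
    tumors=["first_sample.bam", "second_sample.bam"],
    normal="normal_sample.bam",
    baffile="baf/bulk.1bed",
    refversion="hg19",
):
    print("input problem:", problem)
```

The check only confirms that files exist; it does not verify that a BAM file is actually coordinate-sorted or that the index is up to date.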

## Output

`count-reads` writes all output files to a given output directory. For each chromosome `chr`, `count-reads` produces two gzipped files needed to construct adaptive bins: `chr.thresholds.gz` and `chr.total.gz`. `count-reads` also produces a tab-separated file `total.tsv` containing the total number of reads in each sample, and a text file `samples.txt` containing the list of sample names.
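
A quick way to confirm that a run completed is to check that every expected file is present. The sketch below assumes the file naming described above; the helper names `expected_outputs` and `missing_outputs` are hypothetical, and the chromosome labels (`chr1` to `chr22`) are an assumption that should be adjusted to match the naming used in your BAM files.

```python
from pathlib import Path


def expected_outputs(outdir, chromosomes):
    """List the files count-reads is described as writing to outdir."""
    outdir = Path(outdir)
    files = [outdir / "total.tsv", outdir / "samples.txt"]
    for chrom in chromosomes:
        files.append(outdir / f"{chrom}.thresholds.gz")
        files.append(outdir / f"{chrom}.total.gz")
    return files


def missing_outputs(outdir, chromosomes):
    """Return the expected output files that do not exist yet."""
    return [p for p in expected_outputs(outdir, chromosomes) if not p.is_file()]


# Assumption: autosomes labelled chr1..chr22; "array" is the output directory.
chromosomes = [f"chr{i}" for i in range(1, 23)]
for path in missing_outputs("array", chromosomes):
    print("missing:", path)
```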

| Name | Description |
|------|-------------|
| `-O`, `--outdir` | Output directory |

## Main parameters

| Name | Description | Usage |
|------|-------------|-------|
| `-V`, `--refversion` | Reference genome version (`hg19` or `hg38` supported) | |

## Optional parameters

| Name | Description | Usage | Default |
|------|-------------|-------|---------|
| `-S`, `--samples` | White-space separated list of sample names | The first name is used for the matched-normal sample; the remaining names are used for the tumor samples, in the same order as the corresponding BAM files | File names are used |
| `-j`, `--processes` | Number of parallel jobs | Parallel jobs are used to process the chromosomes of the different samples in parallel. A higher number shortens the running time (up to 22 * n_samples) | 22 |
| `-st`, `--samtools` | Path to the `bin` directory of SAMtools | The path to this directory needs to be specified when it is not included in `$PATH` | Path is expected in the environment variable `$PATH` |
| `-md`, `--mosdepth` | Path to the `mosdepth` executable | The path to this executable needs to be specified when it is not included in `$PATH` | None (expected on `PATH`) |
| `-tx`, `--tabix` | Path to the `tabix` executable | The path to this executable needs to be specified when it is not included in `$PATH` | None (expected on `PATH`) |
| `-q`, `--readquality` | Minimum mapping quality (MAPQ) for a read to be counted | Values range from 0 to 41; see the aligner's documentation for details | 11 |
| `-i`, `--intermediates` | Keep intermediate files | Retain the intermediate files (read starts and per-position coverage) that are used to compute arrays for binning | False (these files are deleted; the subsequent arrays are retained instead) |
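
Two of these options follow simple rules that can be sketched in a few lines: `-S` pairs names with BAM files in order (normal first), and `-j` stops being useful beyond 22 * n_samples. The sketch below is illustrative only. The helper names are hypothetical, using the file stem as the default sample name is an assumption (the table only says that file names are used), and counting the normal sample in `n_samples` is also an assumption.

```python
from pathlib import Path


def resolve_sample_names(normal_bam, tumor_bams, names=None):
    """Pair each sample name with its BAM file, normal sample first."""
    bams = [normal_bam, *tumor_bams]
    if names is None:
        # Assumption: fall back to the file name without its extension.
        names = [Path(bam).stem for bam in bams]
    if len(names) != len(bams):
        raise ValueError(
            f"expected {len(bams)} names (1 normal + {len(tumor_bams)} tumors), "
            f"got {len(names)}"
        )
    return list(zip(names, bams))


def useful_processes(requested, n_samples, n_chromosomes=22):
    """Cap the job count at n_chromosomes * n_samples, with a floor of 1."""
    return max(1, min(requested, n_chromosomes * n_samples))


pairs = resolve_sample_names(
    "normal_sample.bam",
    ["first_sample.bam", "second_sample.bam"],
    names=["normal", "tumor1", "tumor2"],
)
print(pairs)
print(useful_processes(requested=100, n_samples=len(pairs)))
```

Under these assumptions, three samples allow at most 66 useful jobs, so a request for 100 is capped at 66 while a request for 24 is left unchanged.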

## Example usage

Given that `samtools`, `mosdepth`, and `tabix` are on the `PATH`, the referenced files are in the current directory, and the intended output directory `array` already exists:

    hatchet count-reads -T first_sample.bam second_sample.bam -N normal_sample.bam -S normal tumor1 tumor2 -V hg19 -j 24 -O array -b baf/bulk.1bed
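
When the command is launched from a script, the preconditions stated above can be checked first. The following Python sketch is one possible wrapper, not part of HATCHet: `missing_tools`, `build_command`, and `run_count_reads` are hypothetical helper names, and only flags documented on this page are used.

```python
import shutil
import subprocess
from pathlib import Path

REQUIRED_TOOLS = ("samtools", "mosdepth", "tabix")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the required executables that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def build_command(tumors, normal, names, baffile, outdir,
                  refversion="hg19", processes=22):
    """Assemble the count-reads command line as an argument list."""
    return [
        "hatchet", "count-reads",
        "-T", *tumors,
        "-N", normal,
        "-S", *names,
        "-V", refversion,
        "-j", str(processes),
        "-O", outdir,
        "-b", baffile,
    ]


def run_count_reads(cmd, outdir):
    """Check the stated preconditions, then run the command."""
    absent = missing_tools()
    if absent:
        raise RuntimeError(f"not found on PATH: {', '.join(absent)}")
    if not Path(outdir).is_dir():
        raise RuntimeError(f"output directory does not exist: {outdir}")
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure


cmd = build_command(
    tumors=["first_sample.bam", "second_sample.bam"],
    normal="normal_sample.bam",
    names=["normal", "tumor1", "tumor2"],
    baffile="baf/bulk.1bed",
    outdir="array",
    processes=24,
)
print(" ".join(cmd))
# To execute the command: run_count_reads(cmd, "array")
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces. The call to `run_count_reads` is left commented out so that the sketch only prints the command.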