# count-reads
This step of HATCHet uses the locations of heterozygous SNPs (called by count-alleles) to identify candidate bin thresholds between SNPs. It then counts, for each sample, the total number of reads between each pair of adjacent candidate thresholds; these counts are later used to construct variable-length bins.
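The idea above can be illustrated with a minimal Python sketch. It is not HATCHet's implementation: placing a candidate threshold at the midpoint between consecutive SNPs and counting reads by their start position are both assumptions made for illustration, and the function names `candidate_thresholds` and `count_between` are hypothetical.

```python
from bisect import bisect_left


def candidate_thresholds(snp_positions):
    """Return one candidate threshold between each pair of consecutive SNPs.

    Assumption for illustration only: the threshold is the integer midpoint.
    """
    snps = sorted(snp_positions)
    return [(a + b) // 2 for a, b in zip(snps, snps[1:])]


def count_between(read_starts, thresholds, start, end):
    """Count reads whose start falls in each half-open interval [lo, hi)."""
    edges = [start] + list(thresholds) + [end]
    starts = sorted(read_starts)
    counts = []
    for lo, hi in zip(edges, edges[1:]):
        # Binary search gives the number of read starts below each edge.
        counts.append(bisect_left(starts, hi) - bisect_left(starts, lo))
    return counts


# Toy data: three SNPs and nine read start positions on one chromosome.
snps = [100, 200, 400]
thresholds = candidate_thresholds(snps)
reads = [10, 120, 150, 250, 299, 300, 350, 450, 480]
counts = count_between(reads, thresholds, start=0, end=500)
print(thresholds, counts)
```

Each interval between thresholds contains exactly one SNP in this toy setup, which is why adjacent intervals can later be merged into variable-length bins without splitting a SNP across bins.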
## Input
count-reads takes as input sorted and indexed BAM files for multiple tumor samples from the same patient, a sorted and indexed BAM file from a matched-normal sample, an indexed human reference genome, and a 1bed file containing SNP information for this individual (the output of the count-alleles command, normally baf/bulk.1bed).
| Name | Description | Usage | 
|---|---|---|
| -T,--tumors | A white-space separated list of sorted-indexed BAM files for tumor samples | The tumor samples from the same patient that are jointly analyzed by HATCHet | 
| -N,--normal | A sorted-indexed BAM file for matched-normal sample | The matched normal sample for the same patient | 
| -b,--baffile | A 1bed file containing locations of heterozygous germline SNPs | Typically, a user would run count-alleles to obtain this file. | 
| -V,--refversion | Reference genome version (hg19 or hg38 supported) | | 
## Output
count-reads writes all output files to a given output directory. For each chromosome chr, count-reads produces two gzipped files needed to construct adaptive bins: chr.thresholds.gz and chr.total.gz. count-reads also produces a tab-separated file total.tsv containing the total number of reads in each sample, and a text file samples.txt containing the list of sample names.
| Name | Description | 
|---|---|
| -O,--outdir | Output directory | 
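As a rough sketch of the output layout described in this section, the following Python snippet lists the files one would expect in the output directory and reports any that are missing. The chromosome names (`chr1` to `chr22`) are an assumption, since the actual prefixes depend on the naming used in the BAM files; the helper names `expected_outputs` and `missing_outputs` are hypothetical and not part of HATCHet.

```python
from pathlib import Path


def expected_outputs(outdir, chromosomes):
    """List the files count-reads is documented to write to outdir."""
    out = Path(outdir)
    files = [out / "total.tsv", out / "samples.txt"]
    for chrom in chromosomes:
        files.append(out / f"{chrom}.thresholds.gz")
        files.append(out / f"{chrom}.total.gz")
    return files


def missing_outputs(outdir, chromosomes):
    """Return the expected files that do not exist on disk."""
    return [p for p in expected_outputs(outdir, chromosomes) if not p.exists()]


# Assumed chromosome naming; adjust to match the BAM headers (e.g. "1" vs "chr1").
chromosomes = [f"chr{i}" for i in range(1, 23)]
for path in missing_outputs("array", chromosomes):
    print("missing:", path)
```

A check like this can be useful before running the downstream binning step, since that step needs both per-chromosome files for every chromosome.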
## Main parameters
| Name | Description | Usage | 
|---|---|---|
| -V,--refversion | Reference genome version ("hg19" or "hg38" supported) | | 
## Optional parameters
| Name | Description | Usage | Default | 
|---|---|---|---|
| -S,--samples | White-space separated list of sample names | The first name is used for the matched-normal sample, while the others are used for the tumor samples, in the same order as the corresponding BAM files | File names are used | 
| -j,--processes | Number of parallel jobs | Parallel jobs are used to process the chromosomes of the different samples in parallel. The higher the number, the better the running time (up to 22 * n_samples) | 22 | 
| -st,--samtools | Path to the bin directory of SAMtools | The path to this directory needs to be specified when it is not included in $PATH | Path is expected in the environment variable $PATH | 
| -md,--mosdepth | Path to the mosdepth executable | The path to this executable needs to be specified when it is not included in $PATH | None (expected on PATH) | 
| -tx,--tabix | Path to the tabix executable | The path to this executable needs to be specified when it is not included in $PATH | None (expected on PATH) | 
| -q,--readquality | Minimum mapping quality (MAPQ) for a read to be counted | Values range from 0 to 41; see the alignment documentation for details | 11 | 
| -i,--intermediates | Keep intermediate files | Retain intermediate files (read starts and per-position coverage) that are used to compute arrays for binning | False (these files are deleted, subsequent arrays are retained instead) | 

## Example usage
Assuming samtools, mosdepth, and tabix are on the PATH, the referenced files are in the current directory, and the intended output directory array already exists:
hatchet count-reads -T first_sample.bam second_sample.bam -N normal_sample.bam -S normal tumor1 tumor2 -V hg19 -j 24 -O array -b baf/bulk.1bed
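
The same invocation can be assembled from a script. The sketch below is only an illustration built from the options documented on this page: the helper names `build_count_reads_command` and `missing_tools` are hypothetical, and the checks (one name per BAM file with the normal first, supported reference versions) simply mirror the parameter descriptions above rather than HATCHet's own validation.

```python
import shutil


def build_count_reads_command(tumors, normal, baffile, outdir, refversion,
                              names=None, processes=None):
    """Assemble the argument list for `hatchet count-reads`."""
    if refversion not in ("hg19", "hg38"):
        raise ValueError("refversion must be 'hg19' or 'hg38'")
    cmd = ["hatchet", "count-reads", "-T", *tumors, "-N", normal,
           "-b", baffile, "-V", refversion, "-O", outdir]
    if names is not None:
        # First name is the matched normal, then one name per tumor BAM.
        if len(names) != len(tumors) + 1:
            raise ValueError("expected one name for the normal plus one per tumor")
        cmd += ["-S", *names]
    if processes is not None:
        cmd += ["-j", str(processes)]
    return cmd


def missing_tools(tools=("samtools", "mosdepth", "tabix")):
    """Return the external tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


cmd = build_count_reads_command(
    tumors=["first_sample.bam", "second_sample.bam"],
    normal="normal_sample.bam",
    baffile="baf/bulk.1bed",
    outdir="array",
    refversion="hg19",
    names=["normal", "tumor1", "tumor2"],
    processes=24,
)
print(" ".join(cmd))
print("not on PATH:", missing_tools())
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.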