# count-reads This step of HATCHet uses the locations of heterozygous SNPs (called by `count-alleles`) to identify candidate bin thresholds between SNPs. Then, it counts the total number of reads in each sample between each set of candidate thresholds for use in constructing variable-length bins. ## Input `count-reads` takes in input sorted and indexed BAM files for multiple tumor samples from the same patient, a sorted and index BAM file from a matched-normal sample, an indexed human reference genome, and a 1bed file containing SNP information for this individual (output from `count-alleles` command, normally `baf/bulk.1bed`). | Name | Description | Usage | |------|-------------|-------| | `-T`, `--tumors` | A white-space separated list of sorted-indexed BAM files for tumor samples | The tumor samples from the same patient that are jointly analyzed by HATCHet | | `-N`, `--normal` | A sorted-indexed BAM file for matched-normal sample | The matched normal sample for the same patient | | `-b`, `--baffile` | A 1bed file containing locations of heterozygous germline SNPs | Typically, a user would run `count-alleles` to obtain this file. | | `-V`, `--refversion` | Reference genome version (hg19 or hg38 supported) | | ## Output `count-reads` writes all output files to a given output directory. For each chromosome `chr`, count-reads produces two gzipped files needed to construct adaptive bins: `chr.threhsolds.gz` and `chr.total.gz`. `count-reads` also produces a tab-separated file `total.tsv` containing the total number of reads in each sample, and a text file `samples.txt` containing the list of sample names. | Name | Description | |------|-------------| | `-O`, `--outdir` | Output directory | Directory in which output will be written to (must already exist before running `count-reads`) ## Main parameters | Name | Description | Usage | |------|-------------|-------| | `-V`, `--refversion` | Reference genome version ("hg19" or "hg38" supported) | | ## Optional parameters | Name | Description | Usage | Default | |------|-------------|-------|---------| | `-S`, `--samples` | White-space separater list of a names | The first name is used for the matched-normal sample, while the others are for the tumor samples and they match the same order of the corresponding BAM files | File names are used | | `-j`, `--processes` | Number of parallel jobs | Parallel jobs are used to consider the chromosomed in different samples in parallel. The higher the number the better the running time (up to 22 * n_samples) | 22 | | `-st`, `--samtools` | Path to `bin` directory of SAMtools | The path to this direcoty needs to be specified when it is not included in `$PATH` | Path is expected in the enviroment variable `$PATH` | | `-md`, `--mosdepth` | Path to the `mosdepth` executable | The path to this executable needs to be specified when it is not included in `$PATH` | None (expected on `PATH`) | | `-tx`, `--tabix` | Path to the `tabix` executable | The path to this executable needs to be specified when it is not included in `$PATH` | None (expected on `PATH`) | | `-q`, `--readquality` | Minimum mapping quality (MAPQ) for a read to be counted | Values range from 0 to 41, see alignment documentation for details | 11 | | `-i`, `--intermediates` | Keep intermediate files | Retain intermediate files (read starts and per-position coverage) that are used to compute arrays for binning | False (these files are deleted, subsequent arrays are retained instead) | ## Example usage Given samtools, mosdepth, and tabix are on the PATH, the referenced files are in the current directory, and the intended output directory `array` is present: `hatchet count-reads -T first_sample.bam second_sample.bam -N normal_sample.bam -S normal tumor1 tumor2 -V hg19 -j 24 -O array -b baf/bulk.1bed`