# count-reads
This step of HATCHet uses the locations of heterozygous SNPs (called by `count-alleles`) to identify candidate bin thresholds between SNPs. It then counts the total number of reads in each sample between each pair of adjacent candidate thresholds, for use in constructing variable-length bins.
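The idea described above can be illustrated with a minimal sketch. It is not HATCHet's implementation: the choice of midpoints between consecutive SNPs as candidate thresholds, the use of read start positions, and the function names `candidate_thresholds` and `count_reads_between` are all assumptions made for illustration.

```python
from bisect import bisect_left


def candidate_thresholds(snp_positions):
    """Return one candidate threshold between each pair of consecutive SNPs.

    Assumption: the midpoint is used as the threshold; the real tool may
    choose thresholds differently.
    """
    snps = sorted(snp_positions)
    return [(a + b) // 2 for a, b in zip(snps, snps[1:])]


def count_reads_between(read_starts, thresholds):
    """Count reads whose start falls in each interval delimited by thresholds.

    With k sorted thresholds there are k + 1 intervals:
    (-inf, t0], (t0, t1], ..., (t_{k-1}, +inf).
    """
    counts = [0] * (len(thresholds) + 1)
    for start in read_starts:
        # bisect_left gives the index of the first threshold >= start,
        # which is the index of the interval containing the read.
        counts[bisect_left(thresholds, start)] += 1
    return counts


# Hypothetical data for a single chromosome in a single sample.
snps = [100, 200, 400]
reads = [90, 120, 150, 160, 250, 310, 500]
thresholds = candidate_thresholds(snps)
per_interval = count_reads_between(reads, thresholds)
```

Adjacent intervals can later be merged into larger bins by summing their counts, which is why per-interval totals are enough for constructing variable-length bins.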
## Input
`count-reads` takes as input sorted and indexed BAM files for multiple tumor samples from the same patient, a sorted and indexed BAM file from a matched-normal sample, an indexed human reference genome, and a 1bed file containing SNP information for this individual (the output of the `count-alleles` command, normally `baf/bulk.1bed`).
Name | Description | Usage |
---|---|---|
`-T`, `--tumors` | A whitespace-separated list of sorted and indexed BAM files for tumor samples | The tumor samples from the same patient that are jointly analyzed by HATCHet |
`-N`, `--normal` | A sorted and indexed BAM file for the matched-normal sample | The matched-normal sample for the same patient |
`-b`, `--baffile` | A 1bed file containing locations of heterozygous germline SNPs | Typically, a user would run `count-alleles` to obtain this file. |
`-V`, `--refversion` | Reference genome version (hg19 or hg38 supported) | |
## Output
`count-reads` writes all output files to a given output directory. For each chromosome `chr`, `count-reads` produces two gzipped files needed to construct adaptive bins: `chr.thresholds.gz` and `chr.total.gz`. `count-reads` also produces a tab-separated file `total.tsv` containing the total number of reads in each sample, and a text file `samples.txt` containing the list of sample names.
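A quick way to confirm that a run completed is to check that the files listed above exist. The following is a minimal sketch, assuming the per-chromosome files are named `<chr>.thresholds.gz` and `<chr>.total.gz`. The helper names `expected_outputs` and `missing_outputs` are hypothetical, and the chromosome labels (for example `chr1` versus `1`) depend on your reference and must be supplied by the caller.

```python
from pathlib import Path


def expected_outputs(outdir, chromosomes):
    """List the files count-reads is documented to write into outdir."""
    out = Path(outdir)
    files = [out / "total.tsv", out / "samples.txt"]
    for ch in chromosomes:
        # Assumed naming: the chromosome label is used as the file prefix.
        files.append(out / f"{ch}.thresholds.gz")
        files.append(out / f"{ch}.total.gz")
    return files


def missing_outputs(outdir, chromosomes):
    """Return the expected files that are not present on disk."""
    return [p for p in expected_outputs(outdir, chromosomes) if not p.exists()]
```

For example, `missing_outputs("array", ["chr1", "chr2"])` returns an empty list only when all six expected files are present in `array`.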
Name | Description |
---|---|
`-O`, `--outdir` | Output directory |
## Main parameters
Name | Description | Usage |
---|---|---|
`-V`, `--refversion` | Reference genome version ("hg19" or "hg38" supported) | |
## Optional parameters
Name | Description | Usage | Default |
---|---|---|---|
`-S`, `--samples` | Whitespace-separated list of sample names | The first name is used for the matched-normal sample, and the remaining names are used for the tumor samples, in the same order as the corresponding BAM files | File names are used |
`-j`, `--processes` | Number of parallel jobs | Parallel jobs process the chromosomes of the different samples concurrently. More jobs reduce the running time (up to 22 * n_samples) | 22 |
`-st`, `--samtools` | Path to the `bin` directory of SAMtools | The path to this directory needs to be specified when it is not included in `$PATH` | Path is expected in the environment variable `$PATH` |
`-md`, `--mosdepth` | Path to the mosdepth executable | The path to this executable needs to be specified when it is not included in `$PATH` | None (expected on `PATH`) |
`-tx`, `--tabix` | Path to the tabix executable | The path to this executable needs to be specified when it is not included in `$PATH` | None (expected on `PATH`) |
`-q`, `--readquality` | Minimum mapping quality (MAPQ) for a read to be counted | Values range from 0 to 41; see the alignment documentation for details | 11 |
`-i`, `--intermediates` | Keep intermediate files | Retain intermediate files (read starts and per-position coverage) that are used to compute arrays for binning | False (these files are deleted, subsequent arrays are retained instead) |
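The bound on useful parallel jobs quoted for `-j` is simple arithmetic, sketched below. The helper name `useful_processes` is hypothetical, and whether `n_samples` includes the matched-normal sample is not stated in the table, so the caller decides what to pass.

```python
# The factor 22 is taken from the bound "22 * n_samples" given for -j.
CHROMOSOMES_PER_SAMPLE = 22


def useful_processes(requested, n_samples, n_chromosomes=CHROMOSOMES_PER_SAMPLE):
    """Cap a requested job count at the documented upper bound."""
    if requested < 1 or n_samples < 1:
        raise ValueError("requested and n_samples must be positive")
    return min(requested, n_chromosomes * n_samples)
```

With three samples the bound is 66, so a request for 24 jobs is kept as is, while a request for 100 jobs would not be expected to help beyond 66.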
## Example usage
Given that `samtools`, `mosdepth`, and `tabix` are on the `PATH`, the referenced files are in the current directory, and the intended output directory `array` is present:
hatchet count-reads -T first_sample.bam second_sample.bam -N normal_sample.bam -S normal tumor1 tumor2 -V hg19 -j 24 -O array -b baf/bulk.1bed
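When the command is launched from a script, it can help to assemble the arguments programmatically and catch obvious mistakes first. The sketch below is one possible wrapper, not part of HATCHet: the function name `build_count_reads_command` is hypothetical, and it uses only the flags documented on this page. It checks that `-S` supplies one name for the normal sample plus one per tumor BAM file, as the `--samples` description requires.

```python
import shlex


def build_count_reads_command(tumors, normal, baffile, refversion, outdir,
                              samples=None, processes=None):
    """Build the argument list for `hatchet count-reads` from documented flags."""
    if not tumors:
        raise ValueError("at least one tumor BAM file is required")
    if refversion not in ("hg19", "hg38"):
        raise ValueError("refversion must be 'hg19' or 'hg38'")
    # -S expects the normal name first, then one name per tumor BAM file.
    if samples is not None and len(samples) != len(tumors) + 1:
        raise ValueError("expected one sample name for the normal plus one per tumor")

    cmd = ["hatchet", "count-reads", "-T", *tumors, "-N", normal,
           "-b", baffile, "-V", refversion, "-O", outdir]
    if samples is not None:
        cmd += ["-S", *samples]
    if processes is not None:
        cmd += ["-j", str(processes)]
    return cmd


# Reproduces the example above as a list, then as a shell-quoted string.
command = build_count_reads_command(
    tumors=["first_sample.bam", "second_sample.bam"],
    normal="normal_sample.bam",
    baffile="baf/bulk.1bed",
    refversion="hg19",
    outdir="array",
    samples=["normal", "tumor1", "tumor2"],
    processes=24,
)
command_line = shlex.join(command)
```

The resulting list can be passed to `subprocess.run(command, check=True)` to execute it. Passing a list rather than a single string avoids shell quoting problems with file names that contain spaces.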