count-alleles

Given one or more BAM files and lists of heterozygous SNP positions, this step of HATCHet counts the number of reads covering each of the two alleles of every identified heterozygous SNP in every tumor sample.

Input

count-alleles takes as input sorted and indexed BAM files for multiple tumor samples from the same patient, a sorted and indexed BAM file from a matched-normal sample, and an indexed human reference genome.

| Name | Description | Usage |
|------|-------------|-------|
| -T, --tumors | A white-space separated list of sorted-indexed BAM files | The tumor samples from the same patient that are jointly analyzed by HATCHet |
| -N, --normal | A sorted-indexed BAM file | The matched normal sample for the same patient |
| -L, --snps | VCF files | One or more files listing heterozygous SNP positions |
| -r, --reference | A FASTA file | The human reference genome used for germline variant calling |
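
As a minimal sketch of how the input flags above fit together, the following Python helper assembles an argument list from them. The helper name and the file names are hypothetical, and how the list is handed to HATCHet depends on the installed version, so the launch step is deliberately not shown.

```python
def build_input_args(normal, tumors, reference, snps=()):
    """Assemble the input flags documented above into an argument list.

    Only the flags -N, -T, -r and -L come from the documentation; the
    function itself is an illustrative helper, not part of HATCHet.
    """
    args = ["-N", normal, "-T", *tumors, "-r", reference]
    if snps:
        # One or more files listing heterozygous SNP positions
        args += ["-L", *snps]
    return args


# Hypothetical file names, used only for illustration
args = build_input_args(
    normal="normal.bam",
    tumors=["tumor1.bam", "tumor2.bam"],
    reference="reference.fa",
    snps=["snps.vcf"],
)
print(" ".join(args))
```

The tumor BAM files are passed as separate arguments after -T, which is the usual reading of "white-space separated list" on a command line.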

Output

count-alleles produces three tab-separated files: the first contains the allele read counts for every heterozygous SNP in every tumor sample, the second contains the allele read counts for every SNP in the matched-normal sample, and the third lists the genomic positions that have been identified as germline heterozygous SNPs in the matched-normal sample.

| Name | Description | Format |
|------|-------------|--------|
| -O, --outputnormal | The output file for the read counts from the matched-normal sample | #SAMPLE CHR POS REF_COUNT ALT_COUNT |
| -o, --outputtumors | The output file for the read counts from the tumor samples | #SAMPLE CHR POS REF_COUNT ALT_COUNT |
| -l, --outputsnps | The output directory for the list of identified heterozygous germline SNPs | #CHR POS |

The fields of these formats are described below.

| Field | Description |
|-------|-------------|
| SAMPLE | Name of a sample |
| CHR | Name of the chromosome |
| POS | Genomic position in CHR |
| REF_COUNT | Number of reads harboring the reference allele in POS |
| ALT_COUNT | Number of reads harboring the alternate allele in POS |
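
The five-column format can be read with a few lines of Python. The sketch below assumes the file is tab-separated with an optional header line starting with `#`, as suggested by the format column above; the function names and the example record are made up for illustration, and the alternate-allele fraction is just one convenient summary, not something count-alleles writes out.

```python
import csv
import io


def read_allele_counts(handle):
    """Yield (sample, chrom, pos, ref_count, alt_count) tuples.

    Assumes tab-separated columns in the order
    SAMPLE, CHR, POS, REF_COUNT, ALT_COUNT.
    """
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and the '#'-prefixed header, if present
        if not row or row[0].startswith("#"):
            continue
        sample, chrom, pos, ref, alt = row[:5]
        yield sample, chrom, int(pos), int(ref), int(alt)


def alt_fraction(ref_count, alt_count):
    """Fraction of reads carrying the alternate allele; None without reads."""
    total = ref_count + alt_count
    return alt_count / total if total else None


# Made-up record, used only to exercise the parser
example = "#SAMPLE\tCHR\tPOS\tREF_COUNT\tALT_COUNT\ntumor1\tchr1\t12345\t30\t10\n"
records = list(read_allele_counts(io.StringIO(example)))
for sample, chrom, pos, ref, alt in records:
    print(sample, chrom, pos, alt_fraction(ref, alt))
```

For a real output file, replace the `io.StringIO` object with an open file handle.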

Main parameters

count-alleles has a few main parameters; their default values work for most datasets, but they can be tuned to accommodate the features of special datasets.

| Name | Description | Usage | Default |
|------|-------------|-------|---------|
| -S, --samples | White-space separated list of names | The first name is used for the matched-normal sample, while the others are for the tumor samples, in the same order as the corresponding BAM files | File names are used |
| -st, --samtools | Path to the bin directory of SAMtools | The path to this directory needs to be specified when it is not included in $PATH | Path is expected in the environment variable $PATH |
| -bt, --bcftools | Path to the bin directory of BCFtools | The path to this directory needs to be specified when it is not included in $PATH | Path is expected in the environment variable $PATH |
| -c, --mincov | Minimum coverage | Minimum number of reads that have to cover a variant for it to be called; the value can be increased when considering a dataset with high depth (>60x) | 8 |
| -C, --maxcov | Maximum coverage | Maximum number of reads that may cover a variant for it to be called; the suggested value is twice the expected coverage, to avoid sequencing and mapping artifacts | 300 |
| -j, --processes | Number of parallel jobs | Parallel jobs are used to process the chromosomes of the different samples in parallel. More jobs generally reduce the running time | 22 |
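
The effect of the two coverage thresholds can be illustrated with a small Python sketch. It assumes that coverage at a position is the sum of the reference and alternate read counts and that both bounds are inclusive; the actual filtering is carried out during variant calling and may differ in these details. The function names are hypothetical.

```python
def within_coverage(ref_count, alt_count, mincov=8, maxcov=300):
    """Return True if the total read count lies in [mincov, maxcov].

    Assumption: coverage is ref_count + alt_count and both bounds are
    inclusive; the defaults mirror the documented values 8 and 300.
    """
    total = ref_count + alt_count
    return mincov <= total <= maxcov


def suggested_maxcov(expected_coverage):
    """Twice the expected coverage, following the suggestion for -C."""
    return 2 * expected_coverage


print(within_coverage(4, 3))        # total 7 is below the default minimum
print(within_coverage(30, 10))      # total 40 is inside the default range
print(within_coverage(200, 150))    # total 350 exceeds the default maximum
print(suggested_maxcov(60))
```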

Optional parameters

count-alleles has some optional parameters; changes in the default values of these parameters are not expected to have a significant impact but they can be tuned to better fit the given data.

| Name | Description | Usage | Default |
|------|-------------|-------|---------|
| -v, --verbose | Verbose logging flag | When enabled, count-alleles outputs a verbose log of the execution | Not used |
| -g, --gamma | Level of confidence for selecting germline heterozygous SNPs | This value is the level of confidence of the binomial model used to assess whether a called SNP is in fact germline heterozygous | 0.05 |
| -q, --readquality | Threshold for phred-score quality of sequencing reads | The value can be either decreased (e.g. 10) or increased (e.g. 30) to adjust the filtering of sequencing reads | 20 |
| -Q, --basequality | Threshold for phred-score quality of sequenced nucleotide bases | The value can be either decreased (e.g. 10) or increased (e.g. 30) to adjust the filtering of sequenced nucleotide bases | 20 |
| -U, --snpquality | Threshold for phred-score quality of called variants | The value can be either decreased (e.g. 10) or increased (e.g. 30) to adjust the filtering of called variants | 20 |
| -L, --snps | Path to a file of SNPs in the format #CHR POS | When provided, only the included genomic positions are considered for calling germline SNPs. Using well-known lists (e.g. dbSNP) helps to significantly speed up this step | Not used, SNPs are called across the whole genome |
| -E, --newbaq | Flag to enable the newbaq feature of SAMtools | When selected, SAMtools recomputes the alignment of reads on the fly during SNP calling | Not used |
| -b, --maxshift | Maximum BAF difference from 0.5 | When used, only SNPs with an absolute difference between the BAF and 0.5 below the maximum are selected | Not used |
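
To make the roles of -g and -b more concrete, the sketch below shows one possible reading of the two filters in plain Python. It is not HATCHet's implementation: the heterozygosity check here is an exact two-sided binomial test against an allele fraction of 0.5, the BAF is taken to be the alternate-allele fraction, and the maxshift bound is treated as inclusive. All function names are hypothetical.

```python
from math import comb


def symmetric_binomial_pvalue(k, n):
    """Two-sided p-value of k successes in n fair (p = 0.5) trials."""
    deviation = abs(2 * k - n)
    # Count all outcomes at least as far from n/2 as the observed one
    extreme = sum(comb(n, i) for i in range(n + 1) if abs(2 * i - n) >= deviation)
    return extreme / 2 ** n


def looks_heterozygous(ref_count, alt_count, gamma=0.05):
    """Keep a SNP unless a balanced 0.5 allele fraction is rejected at level gamma."""
    n = ref_count + alt_count
    if n == 0:
        return False
    return symmetric_binomial_pvalue(alt_count, n) >= gamma


def within_maxshift(ref_count, alt_count, maxshift):
    """Keep a SNP whose alternate-allele fraction is within maxshift of 0.5."""
    n = ref_count + alt_count
    if n == 0:
        return False
    return abs(alt_count / n - 0.5) <= maxshift


print(looks_heterozygous(20, 20))      # balanced counts are kept
print(looks_heterozygous(40, 0))       # all reads on one allele are rejected
print(within_maxshift(30, 10, 0.3))    # |0.25 - 0.5| = 0.25 is within 0.3
```

Under this reading, a smaller gamma keeps more unbalanced SNPs, while a smaller maxshift keeps only SNPs whose counts are close to balanced.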