Question

Set the max coverage wanted with Hisat2 ?

0

Entering edit mode

5.0 years ago

Chvatil ▴ 130

Hello everyone, I'm actually using Hisat in order to map reads against one genome.

Here is the job file :

FILE=Genome.fa
READS1a=SRR9110374_1.fastq.gz
READS2a=SRR9110374_2.fastq.gz


###############################################################
#       SOFTWARES                                             #
##############################################################

# MAPPING HISAT2
HISAT2=/TOOLS/hisat2-2.1.0/


#bedtools
BEDTOOLS=/usr/bin/bedtools

###############################################################
#        MAP SHORT READS TO INFER COVERAGE DEPTH              
##############################################################

# copy on cluster

cd $OUT

# build index
$HISAT2/hisat2-build $FILE mapping_index
# mapping
$HISAT2/hisat2 -k 1 -q -x mapping_index -1 $READS1a -2 $READS2a -S mapping.sam > stats_mapping.txt

But the issue is here :

The R1 and R2 files are relatively huge and when I generate the .sam file it takes a lot of memory space ! I wondered then if Hisat2 was able to specify the threshold coverage to reach ? In other word can we define the number of read we want to map against the genome even if we do not take all the reads present in fastQ files ?

Tanks for your help

Hisat2 mapping • 1.5k views

ADD COMMENT • link 5.0 years ago by Chvatil ▴ 130

score 2 · Answer 1 · 2019-12-16

2

Entering edit mode

5.0 years ago

Devon Ryan 105k

The -u option can be used to specify how many reads/pairs in total to align. Note however that this will NOT notably decrease memory usage, since the fastq files and the output aren't held in memory. Usually hisat2 requires < 30GB RAM, which is quite reasonable.

ADD COMMENT • link 5.0 years ago by Devon Ryan 105k

0

Entering edit mode

I was talking about the hard disk space sorry. So if I desire a coverage mapping of 10X I have to put -u 10 right ?

ADD REPLY • link 5.0 years ago by Chvatil ▴ 130

0

Entering edit mode

No, see my answer. I removed the accepted tag as the question is not solved yet, even though Devon Ryan 's answer is of course perfectly fine from the technical side towards how to limit hisat2 to process only a certain number of reads.

ADD REPLY • link 5.0 years ago by ATpoint 86k

score 1 · Answer 2 · 2019-12-16

It is not that simple. You cannot assume that 10mio reads will give you 10x coverage. It depends on genome size and read length.

Lets calculate theoretical coverage with Lander/Waterman equation: (https://www.illumina.com/documents/products/technotes/technote_coverage_calculation.pdf)

C = LN / G where C stands for coverage, G is the haploid genome length, L is the read length and N is the number of reads.

the Apis mellifera genome has a length of about 200Mb, so G=200.000.000
the read length is 2x100bp (see here) so L = 200
the dataset has 147.505.611 reads so N= 147.505.611

Therfore assuming a perfect world with all reads fully usable and no duplicates: 200 * 147505611 / 200000000 = 147x coverage

So you have 14.7fold more reads than necessary, which is of course the theoretical value. Considering duplicates, low quality reads etc. you might downsample the total read number to like 20%.

Also, use Unix pipes and do not save alignments in SAM but in the much smaller BAM format. BAM is a binary version of SAM and there is no reason to use the much larger SAM format. 99.9% of downstream tools can read BAM. For this do:

$HISAT2/hisat2 -k 1 -q -x mapping_index -1 $READS1a -2 $READS2a | samtools view -o mapping.bam

The | sends the output of hisat2 to so-called stdout which is then processed by the tool behind | and in this case directly saved in the compressed BAM format. Given that BAM is much smaller than BAM you might even try to simply save the entire dataset instead of a subset.

I am also reasonably sure that hisat2 does not send alignment stats to stdout so you would need to capture stderr for stats_mapping.txt using 2>.