Question

HISAT2 output direct to bam

1

Entering edit mode

6.3 years ago

Adrian Pelin ★ 2.7k

Hello,

HISAT2 can only output sam files, which can be quite large. I found an option to output only reads that mapped to the reference but still the files can be 100s of GB.

Is there any issue in directly converting to bam as such:

${hisat2}/hisat2 -p 4 --rg-id=${4} -x $l --dta ${strand} --no-unal -1 $2 -2 $3 -U $4 | samtools view -Sbh > hisat2_output.bam

I am specifically asking about piping stdout to samtools view to convert to bam.

Thanks.

RNA-Seq samtools • 14k views

ADD COMMENT • link updated 4.0 years ago by ATpoint 89k • written 6.3 years ago by Adrian Pelin ★ 2.7k

1

Entering edit mode

6.3 years ago

swbarnes2 15k

That should work. Better, pipe straight to samtools sort.

ADD COMMENT • link 6.3 years ago by swbarnes2 15k

ATpoint · Accepted Answer · 2019-05-04

10

Entering edit mode

6.3 years ago

ATpoint 89k

That will work fine (and is even the recommended workflow as SAM files are uncompressed and only take up space). You can even have it more efficient by doing:

hisat2 (options)... | \
  tee >(samtools flagstat - > hisat2_output.flagstat) | \
  samtools sort -O BAM | \
  tee hisat2_output.bam | \
  samtools index - hisat2_output.bam.bai

That will give you the sorted BAM, the flagstat summary statistics and the BAM index all from hisat2 stdout. tee duplicates the stream from stdin and here in combination with process substitution can be exploited to run multiple commands on essentially the same file.

Edit (08/2021): The above answer is probably a bit "over-the-top". In any case, the recent samtools versions have an option --write-index for many subcommands such as sort, so the indexing is done on the fly during the sort command. See the manual for details.

ADD COMMENT • link 4.0 years ago by ATpoint 89k

0

Entering edit mode

Ok, it makes sense to pipe to sort like swbarnes2 and yourself suggested. What is the impact on RAM usage? Piping to unsorted bam can be done on the go. Do you need to load the whole output into memory in order for you to sort?

ADD REPLY • link updated 6.3 years ago by ATpoint 89k • written 6.3 years ago by Adrian Pelin ★ 2.7k

1

Entering edit mode

sort has a -m option that specifies the amount of memory to be used before it spills data as intermediate/tmp files to disk. Default is 768Mb I think per thread.

ADD REPLY • link 6.3 years ago by ATpoint 89k

0

Entering edit mode

Hi, I have a problem about this command. It's can work when I use a bash script to execute it but not in command line. When I use command line, I got the error message: -sh: syntax error near unexpected token `(' , I can't figure out what happened. Could anybody know what wrong with this?

hisat2 -q --time --novel-splicesite-outfile wt_2.ns.tsv --summary-file wt_2.sf.txt --met-file wt_2.mf.txt --threads 8 -x genome_snp_tran -1 wt_2_1P.fastq.gz -2 wt_2_2P.fastq.gz | tee >(samto
ols flagstat - > wt_2.flagstat) | samtools sort -O BAM | tee wt_2.bam | samtools index - wt_2.bam.bai

ADD REPLY • link updated 4.9 years ago by ATpoint 89k • written 4.9 years ago by afukada • 0

0

Entering edit mode

Let me answer this question myself. I find that because of the "sh" issue. If you want to do this, you should be sure to use one of bash/ksh/zsh. Using for instance "sh" won't work.

ADD REPLY • link 4.9 years ago by afukada • 0