Question

How can I extract and count unmapped reads using htseq count?

1

Entering edit mode

9.5 years ago

farah.dahalan ▴ 10

I have few questions regarding htseq count.

I run the htseq in virtual box using this command:

samtools view SMA1.bam | htseq-count -s no -i ID -t gene - Anopheles-gambiae-PEST_BASEFEATURES_AgamP4.2.gff3 > htseq2/SMA1.text
biouser@VirtualBox:~/Desktop/sf_farahdata$ samtools view SMA2.bam | htseq-count -s no -i ID -t gene - Anopheles-gambiae-PEST_BASEFEATURES_AgamP4.2.gff3 > htseq2/SMA2.text

100000 GFF lines processed.
175804 GFF lines processed.
Warning: Read HS31_15535:2:2109:18695:7195#14 claims to have an aligned mate which could not be found in an adjacent line.
Warning: Malformed SAM line: RNAME != '*' although flag bit &0x0004 set
Warning: Malformed SAM line: MRNM == '=' although read is not aligned.
Warning: Malformed SAM line: MRNM != '*' although flag bit &0x0008 set
100000 SAM alignment record pairs processed.
200000 SAM alignment record pairs processed.
300000 SAM alignment record pairs processed.

Q1: What does the warning means?

I use this command to pull out unmapped reads:

samtools view -f 4 -c SMA2.bam

I attached the results from the reads data that I obtained. It only shows the read information rather than read counts.

Q2: How can I extracting counts from RNAseq data that does not align to a feature.

Thank you.

ht-seq RNA-Seq count • 3.6k views

ADD COMMENT • link updated 22 months ago by Ram 44k • written 9.5 years ago by farah.dahalan ▴ 10

Ram · Answer 1 · 2015-05-20

0

Entering edit mode

9.5 years ago

CraigM ▴ 90

For your second question:

Remember that in this situation 'counts' refers to the number of reads aligned to a particular gene feature.

If you wish to know the total amount of reads which did not align, this can be found at the bottom of the htseq output using the tail command.

If you mean you wish to extract the actual 'reads' which did not align to a feature there are better methods for doing so such as using samtools view with the '-f 4' flag to show only unaligned reads (However this will not include reads which did align to the genome, but not to gene features unless you used an annotation file during alignment and restricted this happening).

This will give you the sam/bam records of the unmapped reads (if you want the fastq sequence, use the bedtools bamtofastq tool).

Another method of finding the unmapped reads is to use tophat for your alignments. The output of this tool by default includes the unmapped reads.

This thread may be useful for more information: How To Filter Mapped Reads With Samtools

ADD COMMENT • link updated 22 months ago by Ram 44k • written 9.5 years ago by CraigM ▴ 90

0

Entering edit mode

Hi Craig,

Thank you for your reply.

Maybe my question is not very clear. I use htseq count using annotated reference GFF file and I got around 3million of 'no_feature' reads. I am interested to get the counts on these unannotated regions. I wonder if I could do this using htseq count or any other tools?

Thank you in advance. :)

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by farah.dahalan ▴ 10

0

Entering edit mode

Use the -o option, and then grep the resulting SAM file for no_feature. You'll likely find that these are in introns, around UTRs, and in repeat regions.

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by Devon Ryan 104k