how to export RNA-seq reads which mapped to genes?
3
0
Entering edit mode
10.1 years ago
trakhtenberg ▴ 160

Hello, I have mapped reads in bam format from tophat, and I would like to filter it in a way that only those paired reads which mapped to genes (in GTF) would remain. Would appreciate advice on how I can accomplish this. thanks

RNA-Seq • 3.0k views
ADD COMMENT
1
Entering edit mode
10.1 years ago

Use htseq-count with the -o option and filter the resulting SAM file.

Edit: Removed mention of featureCounts, apparently it doesn't produce a SAM file with -R.

ADD COMMENT
0
Entering edit mode

thank you, I actually tried that but the output sam did not have reads, only lines like this:

XF:Z:XLOC_040566
XF:Z:XLOC_040566
XF:Z:XLOC_015449
XF:Z:XLOC_015449
XF:Z:__no_feature
XF:Z:__no_feature
XF:Z:XLOC_028412
XF:Z:XLOC_028412
ETC.....
ADD REPLY
0
Entering edit mode

Then something got broken in the release you're using. Try a different version.

ADD REPLY
0
Entering edit mode

which version did you use? I used HTSeq-0.6.1 which had -f options that allows accepting bam (my input was bam). The counts though made sense. thank you

ADD REPLY
0
Entering edit mode

I don't have handy a version known to work correctly, sorry.

ADD REPLY
0
Entering edit mode

instead of using the new -f option for bam input, I just tried the old way, using sam input. it did generate sam output with reads. but when I tried converting to bam

samtools view -bS in.sam > out.bam

I got error:

[samopen] no @SQ lines in the header.
[sam_read1] missing header? Abort!

do you know what the problem might be?

thank you

ADD REPLY
0
Entering edit mode

which command would you recommend for filtering the resulting SAM file? I tried converting to BAM without filtering first using this command

samtools view -bS in.sam > out.bam

but got this error:

[samopen] no @SQ lines in the header.
[sam_read1] missing header? Abort!

I then tried samtools view -bT genome.fasta in.sam > out.bam but it just restored all the reads that were reduced by the htseq-count. So I guess it is important to filter out the reduced reads from SAM prior to converting to BAM. Based on the description on htseq website it seems that the reads in outputted SAM are marked as unmapped to GTF but are not actually deleted, so I would appreciate if you could recommend command for actually deleting them. thank you.

ADD REPLY
1
Entering edit mode
10.1 years ago
Manvendra Singh ★ 2.2k

I think that it should be following

  1. convert your gtf to bed (gtf2bed)
  2. intersectBed -abam <your bam> -b <gtf to bed file>

have your samheader too for downstream analysis

ADD COMMENT
0
Entering edit mode

Manu you got lucky this time :-)

ADD REPLY
0
Entering edit mode

:) :) , Now I know, we have synchronized answering time.

ADD REPLY
0
Entering edit mode

I did steps 1 and 2, and now have bam output. could you please clarify what you mean by "have your samheader too for downstream analysis"? thank you

ADD REPLY
0
Entering edit mode

Try samtools view -H on your new bam and if it prints some lines then you are fine.

ADD REPLY
0
Entering edit mode

samtools view -H printed this:

@HD     VN:1.0  SO:queryname
@SQ     SN:chr1 LN:195471971
@SQ     SN:chr10        LN:130694993
@SQ     SN:chr11        LN:122082543
@SQ     SN:chr12        LN:120129022
@SQ     SN:chr13        LN:120421639
@SQ     SN:chr14        LN:124902244
@SQ     SN:chr15        LN:104043685
@SQ     SN:chr16        LN:98207768
@SQ     SN:chr17        LN:94987271
@SQ     SN:chr18        LN:90702639
@SQ     SN:chr19        LN:61431566
@SQ     SN:chr2 LN:182113224
@SQ     SN:chr3 LN:160039680
@SQ     SN:chr4 LN:156508116
@SQ     SN:chr5 LN:151834684
@SQ     SN:chr6 LN:149736546
@SQ     SN:chr7 LN:145441459
@SQ     SN:chr8 LN:129401213
@SQ     SN:chr9 LN:124595110
@SQ     SN:chrM LN:16299
@SQ     SN:chrX LN:171031299
@SQ     SN:chrY LN:91744698
@PG     ID:TopHat  [original command]
ADD REPLY
0
Entering edit mode

I did steps 1 and 2 but when I tried extracting the reads, I got 0.5% fewer properly paired reads than were counted by samtools flagstat in the resulting bam. Picard SamToFastq, for some reason, considered 0.5% of the reads as unpaired mates and so they had to be excluded, although they were for sure paired before the filtering and even after the filtering in the resulting bam (based on flagstat). Losing 0.5% is not too bad, but if you have a solution for rescuing them, please let me know. thank you.

ADD REPLY
1
Entering edit mode
10.1 years ago

Assuming your sam file is coordinate sorted, you can use tabix.

  1. bgzip Input_sorted.sam
  2. tabix -p sam Input_sorted.sam.gz
  3. tabix Input_sorted.sam.gz chrA:start-end chrB:start-end chrC:start-end > Input_gtf.sam (provide the string containing GTF coordinates in a sorted order. Unfortunately tabix doesn't take bed file containing region of interest as an input)
ADD COMMENT
0
Entering edit mode

thank you for responding!

ADD REPLY

Login before adding your answer.

Traffic: 1953 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6