Hello, I have mapped reads in bam format from tophat, and I would like to filter it in a way that only those paired reads which mapped to genes (in GTF) would remain. Would appreciate advice on how I can accomplish this. thanks
Hello, I have mapped reads in bam format from tophat, and I would like to filter it in a way that only those paired reads which mapped to genes (in GTF) would remain. Would appreciate advice on how I can accomplish this. thanks
Use htseq-count with the -o option and filter the resulting SAM file.
Edit: Removed mention of featureCounts, apparently it doesn't produce a SAM file with -R
.
I think that it should be following
gtf2bed
)intersectBed -abam <your bam> -b <gtf to bed file>
have your samheader too for downstream analysis
samtools view -H
printed this:
@HD VN:1.0 SO:queryname
@SQ SN:chr1 LN:195471971
@SQ SN:chr10 LN:130694993
@SQ SN:chr11 LN:122082543
@SQ SN:chr12 LN:120129022
@SQ SN:chr13 LN:120421639
@SQ SN:chr14 LN:124902244
@SQ SN:chr15 LN:104043685
@SQ SN:chr16 LN:98207768
@SQ SN:chr17 LN:94987271
@SQ SN:chr18 LN:90702639
@SQ SN:chr19 LN:61431566
@SQ SN:chr2 LN:182113224
@SQ SN:chr3 LN:160039680
@SQ SN:chr4 LN:156508116
@SQ SN:chr5 LN:151834684
@SQ SN:chr6 LN:149736546
@SQ SN:chr7 LN:145441459
@SQ SN:chr8 LN:129401213
@SQ SN:chr9 LN:124595110
@SQ SN:chrM LN:16299
@SQ SN:chrX LN:171031299
@SQ SN:chrY LN:91744698
@PG ID:TopHat [original command]
I did steps 1 and 2 but when I tried extracting the reads, I got 0.5% fewer properly paired reads than were counted by samtools flagstat in the resulting bam. Picard SamToFastq, for some reason, considered 0.5% of the reads as unpaired mates and so they had to be excluded, although they were for sure paired before the filtering and even after the filtering in the resulting bam (based on flagstat). Losing 0.5% is not too bad, but if you have a solution for rescuing them, please let me know. thank you.
Assuming your sam file is coordinate sorted, you can use tabix.
bgzip Input_sorted.sam
tabix -p sam Input_sorted.sam.gz
tabix Input_sorted.sam.gz chrA:start-end chrB:start-end chrC:start-end > Input_gtf.sam
(provide the string containing GTF coordinates in a sorted order. Unfortunately tabix doesn't take bed file containing region of interest as an input)Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
thank you, I actually tried that but the output sam did not have reads, only lines like this:
Then something got broken in the release you're using. Try a different version.
which version did you use? I used HTSeq-0.6.1 which had
-f
options that allows accepting bam (my input was bam). The counts though made sense. thank youI don't have handy a version known to work correctly, sorry.
instead of using the new
-f
option for bam input, I just tried the old way, using sam input. it did generate sam output with reads. but when I tried converting to bamI got error:
do you know what the problem might be?
thank you
which command would you recommend for filtering the resulting SAM file? I tried converting to BAM without filtering first using this command
but got this error:
I then tried
samtools view -bT genome.fasta in.sam > out.bam
but it just restored all the reads that were reduced by the htseq-count. So I guess it is important to filter out the reduced reads from SAM prior to converting to BAM. Based on the description on htseq website it seems that the reads in outputted SAM are marked as unmapped to GTF but are not actually deleted, so I would appreciate if you could recommend command for actually deleting them. thank you.