Hi,
I want to extract read-pairs that aligned concordantly exactly 1 time to the reference genome. Is there any way to parse the output SAM file? I would really appreciate your suggestions.
Best regards,
Rahul
Hi,
I want to extract read-pairs that aligned concordantly exactly 1 time to the reference genome. Is there any way to parse the output SAM file? I would really appreciate your suggestions.
Best regards,
Rahul
The simplest way is likely something along the lines of:
samtools view -hf 0x2 alignments.bam | grep -v "XS:i:" > filtered.alignments.sam
As I mentioned in the comment above, the -f 0x2
part will get only "properly paired" alignments, which will effectively be the concordant alignments. For the option #2 definition of "map only once", you can take advantage of the fact that bowtie2 will add an XS
auxiliary tag to reads that have another "valid" mapping. So, a quick inverse grep (grep -v
) can get rid of those. One possible problem with this is if only one read of a pair can map to multiple places (e.g., the original fragment partly overlapped a simple tandem repeat) then you'd end up with orphans. The easiest fix to that (if it's a problem) would be to just check if the read names of pairs of alignments are the same. I'm sure someone will come up with a way to do that in awk, but the following python script is likely simpler:
#!/usr/bin/env python
import csv
import sys
f = csv.reader(sys.stdin, dialect="excel-tab")
of = csv.writer(sys.stdout, dialect="excel-tab")
last_read = None
for line in f :
#take care of the header
if(line[0][0] == "@") :
of.writerow(line)
continue
if(last_read == None) :
last_read = line
else :
if(last_read[0] == line[0]) :
of.writerow(last_read)
of.writerow(line)
last_read = None
else :
last_read = line
If you saved that as foo.py
and made it executable and in your PATH, then the following would solve the aforementioned issue:
samtools view -hf 0x2 alignments.bam | grep -v "XS:i:" | foo.py > filtered.alignments.sam
Note that this won't work with coordinate-sorted alignments, but bowtie2 doesn't produce those.
This didn't work for me. The number of extracted reads does not match the number in the summary made by Bowtie.
An example:
Time loading reference: 00:00:20
Time loading forward index: 00:00:49
Time loading mirror index: 00:00:26
Multiseed full-index search: 00:02:43
1020534 reads; of these:
1020534 (100.00%) were paired; of these:
98406 (9.64%) aligned concordantly 0 times
766336 (75.09%) aligned concordantly exactly 1 time
155792 (15.27%) aligned concordantly >1 times
----
98406 pairs aligned concordantly 0 times; of these:
52167 (53.01%) aligned discordantly 1 time
----
46239 pairs aligned 0 times concordantly or discordantly; of these:
92478 mates make up the pairs; of these:
51106 (55.26%) aligned 0 times
14846 (16.05%) aligned exactly 1 time
26526 (28.68%) aligned >1 times
97.50% overall alignment rate
Time searching: 00:11:38
Overall time: 00:12:29
So I want to extract those 766336 (75.09%) that are aligned concordantly exactly 1 time.
I can get those that are concordantly aligned exactly 1 time or > 1 times (766336 + 155792 = 922128) by:
samtools view alignments.bam | grep "YT:Z:CP" | wc -l
or
samtools view -f 0x2 alignments.bam | wc -l
Which gives 1844256 = 922128*2
Your solution:
samtools view -f 0x2 alignments.bam | grep -v "XS:i:" | foo.py | wc -l
gives 1459732 = 2*729866, which is not the desired result.
A little bite late but I use the following awk scripts on concordantly mapped paired-end reads:
awk '$0!~/^@/ {key=$0; getline; print key "delimiter you want" $0;next}{print $0}' mysample.sam > collapsed.sam
It actually collapse pairs of rows to put paired-end reads on the same line (ignoring the header). use a delimiter that is not present in your sam file ( '{' or '}' for illumina phred33 for example).
awk '$0!~/XS:i:/' collapsed.sam > filtered_collapsed.sam
Then filter line if the pattern "XS:i:" is found
awk '{n=split($0,a,"delimiter you want"); for (i = 1; i <= n; ++i) print a[i]}' filtered_collapsed.sam > filtered_unique.sam
and finally split remaining lines according to the delimiter to rebuild a correct sam file. Be careful to sort your sam file by read name so reads from the same pair are consecutive.
hope it helps
After further investigation I noticed that the read1 and read2 in the flagstat output where not equal. There was a subset of reads (in my data anyway) that were designated SAM flag 256 "not primary alignment". I removed these using -F 256
alongside -f 2
. My original bowtie parameters were --sensitive
and -k2
.
Hi,Ryan: The problem of filtering reads after bowtie2 mapping ,i conclude that situtation :first ,filter the mapped unmapped reads by using the command -F 4;second ,(when pair-end data ) ,by using the -f 2 ,-q 40 to filter ;third ,by using the sam flag XS ,to filter the reads map to multiple places.But here you say " One possible problem with this is if only one read of a pair can map to multiple places (e.g., the original fragment partly overlapped a simple tandem repeat) then you'd end up with orphans. " I could not understand this possible meaning ,why we need to filter this possible probllem reads .And i could not understand the python script function ?Could you explain this possible problem clearly . Best !
The original goal was to keep pairs that only align once (given the bowtie2 settings) in the genome. That's easy enough if both mates in the pair align only once, but it's problematic if, e.g., read #1 aligns once and read #2 aligns twice (such as to a simple tandem repeat). Since grep -v "XS:i:"
would get rid of read #2, read #1 would be left orphaned (i.e., its mate is missing). My script then filters out such orphans.
Perhaps this works.
GitHub - Extracting read pairs that have been concordantly aligned exactly 1 time #58
Hi, I ran hisat2 -p 8 -x ${2%.*} -1 $r1 -2 $r2 | grep -v "XS:i:" | samtools view -@ 8 -h -F 2316 -f 0x2 - | samtools fixmate - - | samtools sort -@ 8 - -o ${3}/${output}.sorted.bam
. As result I got:
12681015 reads; of these:
12681015 (100.00%) were paired; of these:
1600731 (12.62%) aligned concordantly 0 times
10687314 (84.28%) aligned concordantly exactly 1 time
392970 (3.10%) aligned concordantly >1 times
----
1600731 pairs aligned concordantly 0 times; of these:
48691 (3.04%) aligned discordantly 1 time
----
1552040 pairs aligned 0 times concordantly or discordantly; of these:
3104080 mates make up the pairs; of these:
2379492 (76.66%) aligned 0 times
685374 (22.08%) aligned exactly 1 time
39214 (1.26%) aligned >1 times
90.62% overall alignment rate
However, samtools stats shows that I did not reads which only got aligned exactly 1 time
SN raw total sequences: 22160568
SN filtered sequences: 0
SN sequences: 22160568
SN is sorted: 1
SN 1st fragments: 11080284
SN last fragments: 11080284
SN reads mapped: 22160568
SN reads mapped and paired: 22160568 # paired-end technology bit set + both mates mapped
SN reads unmapped: 0
SN reads properly paired: 22160568 # proper-pair bit set
SN reads paired: 22160568 # paired-end technology bit set
SN reads duplicated: 0 # PCR or optical duplicate bit set
SN reads MQ0: 60310 # mapped and MQ=0
SN reads QC failed: 0
SN non-primary alignments: 0
SN total length: 2791389998 # ignores clipping
SN total first fragment length: 1395694999 # ignores clipping
SN total last fragment length: 1395694999 # ignores clipping
SN bases mapped: 2791389998 # ignores clipping
SN bases mapped (cigar): 2784413487 # more accurate
SN bases trimmed: 0
SN bases duplicated: 0
SN mismatches: 14140177 # from NM fields
SN error rate: 5.078332e-03 # mismatches / bases mapped (cigar)
SN average length: 125
SN average first fragment length: 126
SN average last fragment length: 126
SN maximum length: 126
SN maximum first fragment length: 126
SN maximum last fragment length: 126
SN average quality: 37.0
SN insert size average: 741.9
SN insert size standard deviation: 1310.4
SN inward oriented pairs: 11072258
SN outward oriented pairs: 8026
SN pairs with other orientation: 0
SN pairs on different chromosomes: 0
SN percentage of properly paired reads (%): 100.0
What did I miss?
Thank you in advance.
Dear Ryan,
I have the same problem and I would like to extract reads aligned concordantly exactly 1 time* from hisat2 output (.sam/bam file). Could you please tell me if it is correct that I should use
samtools view -hf 0x2 -q 3 alignments.bam
?
But why exactly 3?
*I mean: ... reads; of these: ... (...%) were paired; of these: ... (...%) aligned concordantly 0 times ... (...%) aligned concordantly exactly 1 time - these reads! ... (...%) aligned concordantly >1 times
Edit:
I think I found the answer to the question about -q 3
(HISAT2 MAPQ scoring scheme):
60 - uniquely mapped read, regardless of number of mismatches / indels
1 - multiply mapped, perfect match or few mismatches / indels
0 - unmapped, or multiply mapped and with lots of mismatches / indels
So, -q 3 allows to skip alignments with MAPQ smaller than 3. It turns out that I could specify any number less than 60?
And yet, is my code correct? Thanks a lot!
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
By "map only one time", do you mean (1) don't have any equivalently good mapping or (2) don't have any other "valid" mappings (where "valid" is defined by the
--score-min
option in bowtie2)?Edit: Since you updated the title to specify concordant-only alignments, the only difference there is the requirement of the 0x2 bit being set in the flag field (i.e.,
samtools view -f 0x2 alignments.bam
would produce only concordant alignments).Thanks for your reply, actually I want to fetch out only those reads that aligned concordantly exactly 1 time using the generated SAM files. In this case I will have to go for option 2 you specified.
Why is this (
samtools view -f 0x2 alignments.bam)
command different from the one below (samtools view -hf 0x2 alignments.bam)
? Why is the h doing? Could that be reason for what I'm seeing here (other post: Concordant pairs bam file larger than accepted_hits bam file?)The -h simply tells samtools to print out the header as well as the reads. Thus if you were to pipe this into another bam file (by also using the -b option) you have a fully formed bam file.
This has helped me too. How did you run bowtie2? I have been using the
-k2
setting then removing any with theXS:i
flag.