How To Extract Reads-Pairs Aligned Concordantly Exactly 1 Time?
3
9
Entering edit mode
10.8 years ago
Rahul Sharma ▴ 660

Hi,

I want to extract read-pairs that aligned concordantly exactly 1 time to the reference genome. Is there any way to parse the output SAM file? I would really appreciate your suggestions.

Best regards,

Rahul

bowtie2 mapping • 26k views
ADD COMMENT
2
Entering edit mode

By "map only one time", do you mean (1) don't have any equivalently good mapping or (2) don't have any other "valid" mappings (where "valid" is defined by the --score-min option in bowtie2)?

Edit: Since you updated the title to specify concordant-only alignments, the only difference there is the requirement of the 0x2 bit being set in the flag field (i.e., samtools view -f 0x2 alignments.bam would produce only concordant alignments).

ADD REPLY
0
Entering edit mode

Thanks for your reply, actually I want to fetch out only those reads that aligned concordantly exactly 1 time using the generated SAM files. In this case I will have to go for option 2 you specified.

ADD REPLY
0
Entering edit mode

Why is this (samtools view -f 0x2 alignments.bam)command different from the one below (samtools view -hf 0x2 alignments.bam)? Why is the h doing? Could that be reason for what I'm seeing here (other post: Concordant pairs bam file larger than accepted_hits bam file?)

ADD REPLY
0
Entering edit mode

The -h simply tells samtools to print out the header as well as the reads. Thus if you were to pipe this into another bam file (by also using the -b option) you have a fully formed bam file.

ADD REPLY
0
Entering edit mode

This has helped me too. How did you run bowtie2? I have been using the -k2 setting then removing any with the XS:i flag.

ADD REPLY
18
Entering edit mode
10.8 years ago

The simplest way is likely something along the lines of:

samtools view -hf 0x2 alignments.bam | grep -v "XS:i:" > filtered.alignments.sam

As I mentioned in the comment above, the -f 0x2 part will get only "properly paired" alignments, which will effectively be the concordant alignments. For the option #2 definition of "map only once", you can take advantage of the fact that bowtie2 will add an XS auxiliary tag to reads that have another "valid" mapping. So, a quick inverse grep (grep -v) can get rid of those. One possible problem with this is if only one read of a pair can map to multiple places (e.g., the original fragment partly overlapped a simple tandem repeat) then you'd end up with orphans. The easiest fix to that (if it's a problem) would be to just check if the read names of pairs of alignments are the same. I'm sure someone will come up with a way to do that in awk, but the following python script is likely simpler:

#!/usr/bin/env python
import csv
import sys

f = csv.reader(sys.stdin, dialect="excel-tab")
of = csv.writer(sys.stdout, dialect="excel-tab")
last_read = None
for line in f :
    #take care of the header
    if(line[0][0] == "@") :
        of.writerow(line)
        continue

    if(last_read == None) : 
        last_read = line
    else :
        if(last_read[0] == line[0]) :
            of.writerow(last_read)
            of.writerow(line)
            last_read = None
        else :
            last_read = line

If you saved that as foo.py and made it executable and in your PATH, then the following would solve the aforementioned issue:

samtools view -hf 0x2 alignments.bam | grep -v "XS:i:" | foo.py > filtered.alignments.sam

Note that this won't work with coordinate-sorted alignments, but bowtie2 doesn't produce those.

ADD COMMENT
2
Entering edit mode

This didn't work for me. The number of extracted reads does not match the number in the summary made by Bowtie.

An example:

Time loading reference: 00:00:20
Time loading forward index: 00:00:49
Time loading mirror index: 00:00:26
Multiseed full-index search: 00:02:43
1020534 reads; of these:
  1020534 (100.00%) were paired; of these:
    98406 (9.64%) aligned concordantly 0 times
    766336 (75.09%) aligned concordantly exactly 1 time
    155792 (15.27%) aligned concordantly >1 times
    ----
    98406 pairs aligned concordantly 0 times; of these:
      52167 (53.01%) aligned discordantly 1 time
    ----
    46239 pairs aligned 0 times concordantly or discordantly; of these:
      92478 mates make up the pairs; of these:
        51106 (55.26%) aligned 0 times
        14846 (16.05%) aligned exactly 1 time
        26526 (28.68%) aligned >1 times
97.50% overall alignment rate
Time searching: 00:11:38
Overall time: 00:12:29

So I want to extract those 766336 (75.09%) that are aligned concordantly exactly 1 time.

I can get those that are concordantly aligned exactly 1 time or > 1 times (766336 + 155792 = 922128) by:

samtools view alignments.bam | grep "YT:Z:CP" | wc -l

or

samtools view -f 0x2 alignments.bam | wc -l

Which gives 1844256 = 922128*2

Your solution:

samtools view -f 0x2 alignments.bam | grep -v "XS:i:" | foo.py | wc -l

gives 1459732 = 2*729866, which is not the desired result.

ADD REPLY
1
Entering edit mode

A little bite late but I use the following awk scripts on concordantly mapped paired-end reads:

 awk '$0!~/^@/ {key=$0; getline; print key "delimiter you want" $0;next}{print $0}' mysample.sam > collapsed.sam

It actually collapse pairs of rows to put paired-end reads on the same line (ignoring the header). use a delimiter that is not present in your sam file ( '{' or '}' for illumina phred33 for example).

awk '$0!~/XS:i:/' collapsed.sam > filtered_collapsed.sam

Then filter line if the pattern "XS:i:" is found

awk '{n=split($0,a,"delimiter you want"); for (i = 1; i <= n; ++i) print a[i]}'  filtered_collapsed.sam > filtered_unique.sam

and finally split remaining lines according to the delimiter to rebuild a correct sam file. Be careful to sort your sam file by read name so reads from the same pair are consecutive.

hope it helps

ADD REPLY
1
Entering edit mode

This script does not work on Bowtie version 2.3.4.2. I tried both options (also Devon Ryan's solution) but I can not produce the number of reads (aligned concordantly exactly 1 time). Does anyone know the difference in the versions?

ADD REPLY
0
Entering edit mode

I am using version 2.3.4.1, and it does not work for me either. Really need help.

ADD REPLY
0
Entering edit mode

Thank you so much it worked :)

ADD REPLY
0
Entering edit mode

Great reply, really useful!

ADD REPLY
0
Entering edit mode

After further investigation I noticed that the read1 and read2 in the flagstat output where not equal. There was a subset of reads (in my data anyway) that were designated SAM flag 256 "not primary alignment". I removed these using -F 256 alongside -f 2. My original bowtie parameters were --sensitive and -k2.

ADD REPLY
0
Entering edit mode

thanks a lot, I had exactly the same problem. :)

ADD REPLY
0
Entering edit mode

Hi,Ryan: The problem of filtering reads after bowtie2 mapping ,i conclude that situtation :first ,filter the mapped unmapped reads by using the command -F 4;second ,(when pair-end data ) ,by using the -f 2 ,-q 40 to filter ;third ,by using the sam flag XS ,to filter the reads map to multiple places.But here you say " One possible problem with this is if only one read of a pair can map to multiple places (e.g., the original fragment partly overlapped a simple tandem repeat) then you'd end up with orphans. " I could not understand this possible meaning ,why we need to filter this possible probllem reads .And i could not understand the python script function ?Could you explain this possible problem clearly . Best !

ADD REPLY
0
Entering edit mode

The original goal was to keep pairs that only align once (given the bowtie2 settings) in the genome. That's easy enough if both mates in the pair align only once, but it's problematic if, e.g., read #1 aligns once and read #2 aligns twice (such as to a simple tandem repeat). Since grep -v "XS:i:" would get rid of read #2, read #1 would be left orphaned (i.e., its mate is missing). My script then filters out such orphans.

ADD REPLY
0
Entering edit mode
ADD COMMENT
0
Entering edit mode
5.5 years ago
m.t.lorenc • 0

Hi, I ran hisat2 -p 8 -x ${2%.*} -1 $r1 -2 $r2 | grep -v "XS:i:" | samtools view -@ 8 -h -F 2316 -f 0x2 - | samtools fixmate - - | samtools sort -@ 8 - -o ${3}/${output}.sorted.bam. As result I got:

12681015 reads; of these:
  12681015 (100.00%) were paired; of these:
    1600731 (12.62%) aligned concordantly 0 times
    10687314 (84.28%) aligned concordantly exactly 1 time
    392970 (3.10%) aligned concordantly >1 times
    ----
    1600731 pairs aligned concordantly 0 times; of these:
      48691 (3.04%) aligned discordantly 1 time
    ----
    1552040 pairs aligned 0 times concordantly or discordantly; of these:
      3104080 mates make up the pairs; of these:
        2379492 (76.66%) aligned 0 times
        685374 (22.08%) aligned exactly 1 time
        39214 (1.26%) aligned >1 times
90.62% overall alignment rate

However, samtools stats shows that I did not reads which only got aligned exactly 1 time

SN      raw total sequences:    22160568
SN      filtered sequences:     0
SN      sequences:      22160568
SN      is sorted:      1
SN      1st fragments:  11080284
SN      last fragments: 11080284
SN      reads mapped:   22160568
SN      reads mapped and paired:        22160568        # paired-end technology bit set + both mates mapped
SN      reads unmapped: 0
SN      reads properly paired:  22160568        # proper-pair bit set
SN      reads paired:   22160568        # paired-end technology bit set
SN      reads duplicated:       0       # PCR or optical duplicate bit set
SN      reads MQ0:      60310   # mapped and MQ=0
SN      reads QC failed:        0
SN      non-primary alignments: 0
SN      total length:   2791389998      # ignores clipping
SN      total first fragment length:    1395694999      # ignores clipping
SN      total last fragment length:     1395694999      # ignores clipping
SN      bases mapped:   2791389998      # ignores clipping
SN      bases mapped (cigar):   2784413487      # more accurate
SN      bases trimmed:  0
SN      bases duplicated:       0
SN      mismatches:     14140177        # from NM fields
SN      error rate:     5.078332e-03    # mismatches / bases mapped (cigar)
SN      average length: 125
SN      average first fragment length:  126
SN      average last fragment length:   126
SN      maximum length: 126
SN      maximum first fragment length:  126
SN      maximum last fragment length:   126
SN      average quality:        37.0
SN      insert size average:    741.9
SN      insert size standard deviation: 1310.4
SN      inward oriented pairs:  11072258
SN      outward oriented pairs: 8026
SN      pairs with other orientation:   0
SN      pairs on different chromosomes: 0
SN      percentage of properly paired reads (%):        100.0

What did I miss?

Thank you in advance.

ADD COMMENT
0
Entering edit mode

hisat2 may not set an XS flag, for it you can filter for MAPQ > 3.

ADD REPLY
0
Entering edit mode

Could you please provide an example?

ADD REPLY
0
Entering edit mode

See the -q option in samtools view.

ADD REPLY
0
Entering edit mode

Dear Ryan,

I have the same problem and I would like to extract reads aligned concordantly exactly 1 time* from hisat2 output (.sam/bam file). Could you please tell me if it is correct that I should use samtools view -hf 0x2 -q 3 alignments.bam? But why exactly 3?

*I mean: ... reads; of these: ... (...%) were paired; of these: ... (...%) aligned concordantly 0 times ... (...%) aligned concordantly exactly 1 time - these reads! ... (...%) aligned concordantly >1 times

Edit: I think I found the answer to the question about -q 3 (HISAT2 MAPQ scoring scheme):

60 - uniquely mapped read, regardless of number of mismatches / indels
 1 - multiply mapped, perfect match or few mismatches / indels
 0 - unmapped, or multiply mapped and with lots of mismatches / indels

So, -q 3 allows to skip alignments with MAPQ smaller than 3. It turns out that I could specify any number less than 60?

And yet, is my code correct? Thanks a lot!

ADD REPLY
0
Entering edit mode

It seems to be working correctly (checked using samtools flagstat). Thank you very much!

ADD REPLY

Login before adding your answer.

Traffic: 2109 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6