mismatch setting in bowtie2
7.6 years ago
Bioinfonext ▴ 470

I am using Bowtie 2 version 2.1.0 to map paired-end RNA-seq reads to CDS (protein-coding gene sequences). I do not understand the default mismatch settings in Bowtie 2.

I can see there are two options related to mismatches:

--mp : max penalty for mismatch;lower qual = lower penalty (6)

-N : mismatches in seed alignment; can be 0 or 1 (0)

Please explain the difference between these two options and how I can adjust the mismatch settings during mapping.

My aim is to map paired-end reads to reference CDS (protein-coding gene sequences) and obtain raw read counts.

RNA-Seq • 7.5k views
7.6 years ago

So, let's break down the key concepts:

  • max penalty for mismatch (--mp) refers to the penalty applied for each mismatch. When you align two sequences, every matching position gets a score and so does every mismatching position. The score of a mismatching position is called a penalty because the base differs from the reference, so it lowers the overall score of the alignment. A mismatch can be a naturally occurring event, so we don't want to throw the read away; we just apply a penalty, whose maximum is the value defined there (6). At positions with lower base quality the penalty is smaller. To understand more, read: http://www.gsic.titech.ac.jp/supercon/supercon2004-e/alignmentE.html

  • mismatches in seed alignment (-N) refers to how many mismatches you allow in the seed of your alignment (0 or 1). Seeds are short substrings taken from the read that are aligned to the reference first, to find candidate locations for the full read. They are 22 nucleotides long by default in bowtie2 end-to-end mode, but this can be changed with -L. Shorter seeds = more sensitive but slower alignment, longer seeds = faster but less sensitive alignment; allowing one seed mismatch (-N 1) likewise increases sensitivity at the cost of speed. Which to use depends on your analysis. To understand more, read: https://www.nature.com/scitable/content/examples-of-how-alignment-seeds-work-55845
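To make the "lower qual = lower penalty" part concrete, here is a minimal Python sketch of the per-mismatch penalty as it is described in the current Bowtie 2 manual (maximum MX = 6, minimum MN = 2, scaled by the Phred quality of the read base). That version 2.1.0 behaves identically is an assumption, and the function names are made up for illustration; this is not Bowtie 2's own code.

```python
import math


def mismatch_penalty(phred_quality, mx=6, mn=2, ignore_quals=False):
    """Penalty for one mismatch at a read base with the given Phred quality.

    Follows the formula given in the Bowtie 2 manual for --mp MX,MN:
    MN + floor((MX - MN) * (min(Q, 40) / 40)).
    With --ignore-quals every mismatch costs MX.
    """
    if ignore_quals:
        return mx
    q = min(phred_quality, 40)
    return mn + math.floor((mx - mn) * (q / 40.0))


def end_to_end_score(mismatch_qualities, mx=6, mn=2):
    """Alignment score of a gap-free end-to-end alignment.

    In end-to-end mode matches add nothing, so the score is just the
    negated sum of the mismatch penalties (gaps and Ns are left out here).
    """
    return -sum(mismatch_penalty(q, mx, mn) for q in mismatch_qualities)


# Three mismatches at bases with Phred quality 40, 20 and 10.
print(end_to_end_score([40, 20, 10]))
```

A mismatch at a confident base call (Q40) costs the full 6, while one at a poor base call costs as little as 2, so low-quality mismatches hurt the alignment score less.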

If you want a different approach, you could try mapping with different software, for example BLAT, which lets you work with a certain range of sequence identity. https://www.ncbi.nlm.nih.gov/pubmed/11932250


Thanks a lot. Can you please tell me what the overall default mismatch limit for a read is in bowtie2, apart from the seed alignment? How can we change it?


Is there anything in your experiment that suggests you should change the default parameters? They tend to perform well in most situations. Moreover, regarding your alignment to CDS: are you deliberately not taking spliced reads into account? Bowtie is not a splice-aware aligner.
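For reference, Bowtie 2 has no fixed "maximum mismatches per read" setting. A read is reported if its alignment score stays above a minimum that depends on read length (--score-min). The sketch below assumes the end-to-end default documented in the current manual, L,-0.6,-0.6, and a worst-case penalty of 6 per mismatch; whether 2.1.0 uses exactly these values is an assumption, and the function names are hypothetical. It only gives a rough upper bound, because gaps, Ns and low-quality mismatches change the arithmetic.

```python
def min_score_end_to_end(read_length, intercept=-0.6, slope=-0.6):
    """Minimum valid alignment score for --score-min L,<intercept>,<slope>."""
    return intercept + slope * read_length


def max_high_quality_mismatches(read_length, mx=6):
    """Rough upper bound on mismatches when each one costs the full MX.

    Assumes no gaps and no N characters; mismatches at low-quality
    bases cost less, so more of them could fit in practice.
    """
    budget = -min_score_end_to_end(read_length)
    return int(budget // mx)


for length in (50, 100, 150):
    print(length, max_high_quality_mismatches(length))
```

So to tolerate more or fewer mismatches overall you would adjust --score-min (the threshold) or --mp (the cost per mismatch), rather than -N, which only affects the seed.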


Radek, I have seen in some previous publications that a splice-aware aligner is generally used for mapping against a genome, but that it is not needed for mapping against a transcriptome or CDS. If you are recommending one, please share a link to a recent publication where a splice-aware aligner is used for mapping against CDS.


You're right not to use a splice-aware aligner on a transcriptome. However, you should consider moving to HISAT2, since they're curating that one instead of bowtie2 and tophat2.

https://ccb.jhu.edu/software/hisat2/manual.shtml


Thanks. They write that HISAT2 was developed based on the HISAT and Bowtie2 implementations. You are correct, it would be good to use HISAT2 instead of Bowtie2.

But there is one thing in HISAT2 that I have not been able to understand for a long time. I have a strand-specific RNA-seq library, so should I map reads using the default settings or should I give a strand-specific option?

There are two options for strand specificity in HISAT2:

1) --rna-strandness: For paired-end reads, use either FR or RF. With this option being used, every read alignment will have an XS attribute tag: '+' means a read belongs to a transcript on the '+' strand of the genome, '-' means a read belongs to a transcript on the '-' strand of the genome.

2) --fr/--rf/--ff: The upstream/downstream mate orientations for a valid paired-end alignment against the forward reference strand.

Do you think --rna-strandness should be used only when reads are mapped to a genome, and that for transcriptome mapping I should use only --fr?


Put it this way:

If you don't use RNA strandness you will most likely get the best result for each read anyway. However, if there is a gene duplication + inversion and you know your read comes from the forward strand, you could map it with the same score on the reverse strand (where the duplicated + inverted gene is). Therefore, background noise.

The --fr, --rf and --ff options depend on the architecture of the sequencing construct, which for Illumina is (correct me if I'm wrong) always --fr.
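As a rough illustration of how the two kinds of options combine, here is a small Python sketch that assembles a HISAT2 command line. The flags themselves (-x, -1, -2, -S, --fr, --rna-strandness, --no-spliced-alignment) are taken from the HISAT2 manual; the helper function and all file names are hypothetical, and whether FR or RF is right depends on your library prep, so treat the values as placeholders.

```python
import shlex


def build_hisat2_command(index, reads_1, reads_2, out_sam,
                         strandness=None, spliced=True):
    """Return a hisat2 argument list for paired-end reads.

    strandness: None for an unstranded library, or "FR"/"RF" for
    --rna-strandness. spliced=False adds --no-spliced-alignment,
    e.g. when the reference is a transcriptome or CDS set.
    """
    cmd = ["hisat2", "-x", index, "-1", reads_1, "-2", reads_2,
           "-S", out_sam, "--fr"]  # --fr is the mate orientation, not strandness
    if strandness is not None:
        if strandness not in ("FR", "RF"):
            raise ValueError("strandness must be 'FR', 'RF' or None")
        cmd += ["--rna-strandness", strandness]
    if not spliced:
        cmd.append("--no-spliced-alignment")
    return cmd


# Hypothetical index and file names, for illustration only.
cmd = build_hisat2_command("cds_index", "sample_R1.fastq", "sample_R2.fastq",
                           "sample.sam", strandness="RF", spliced=False)
print(shlex.join(cmd))
```

The point is that --fr describes how the two mates face each other, while --rna-strandness describes which strand of the transcript the reads come from, so the two are set independently.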

EDIT:

WOW, Biostars supporting video embed from youtube, awesome.


If this answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.
