Pattern Matching - Biological Compatibility Between Query Reads And Reference Sequence?
2
2
Entering edit mode
12.9 years ago
Agatha ▴ 350

Hi,

I was trying to manually align RNA-seq reads to a RNA sequence in R using Biostrings.

Considering that my sequences are collapsed

 A DNAStringSet instance of length 1400027
      width seq                                                               names               

  [1]    22 GTCTGTGATGAATTGCTTTGAC                                            1-1694429
  [2]    22 AGTCTGTGATGAATTGCTTTGA                                            2-1546669
  [3]    22 GCATTGGTGGTTCAGTGGTAGA                                            3-928598

And I want to map them using my own algorithm that uses the matchPattern function to a pre-miRNA seq downloaded from mirBASE.

72-letter "RNAString" instance
 seq: UGUCGGGUAGCUUAUCAGACUGAUGUUGACUGUUGAAUCUCAUGGCAACACCAGUCGAUGGGCUGUCUGACA

How should I actually convert my reads in order to be biologically compatible? Is it relevant to compare match(DNAString, DNAString(reverseComplement(RNAString_instance)) -considering how the reads are generated using Illumina?

Otherwise, how should I convert either the source or the destination string for a correct mapping?

illumina rna r mirna cdna • 3.4k views
ADD COMMENT
1
Entering edit mode

I would recommend to also align the reads against the reference genome unless the protocol somehow selected specifically (pre)miRNA, because otherwise you might be seeing repeated sequences from normal transcripts and that might go unnoticed. Then check the regions of interest in the genome for high coverage, (possibly after removing duplicate hits).

ADD REPLY
2
Entering edit mode
12.9 years ago

As much as I like R I am ambivalent to recommend it for first round of mapping -> Bowtie/BWA/Novoalign/BFast/etc.. should be able to properly map reads to hairpins. Once they are mapped, realigning using dynamic programming is not too costly.

Here is some code I use to do that (sorry it's a bit messy):

pwa<-pairwiseAlignment(targetseq,queryseq,type="local-global")

 #used for target positions
  target_poss_sub<-unlist(pwa@pattern@mismatch)
  #used for identifying the edits
  seq_poss_sub<-unlist(pwa@subject@mismatch)

  #pattern indels is given relative to the subject
  #pattern: [15] TCCCTGAGACCCTT-TAACCTGT 
  #subject:  [1] TCCCTGAGACCCTTTTAACCTGT 
  #"30i"
  target_poss_indel_ins<-concat(as.character(start(unlist(pwa@pattern@indel))+pwa@pattern@range@start),'i',sep='')
  seq_poss_indel_ins<-start(unlist(pwa@pattern@indel))

  #we won't see deletions on the target side
  #1234567890123456789012345678901234567
  #TGCCAGTCTCTAGGTCCCTGAGACCCTTTAACCTGTGAGGACATCCAGGGTCACAGGTGA
  #              TCCCTGAGACCCTT-AACCTGTG
  #pattern: [15] TCCCTGAGACCCTTTAACCTGTG 
  #subject:  [1] TCCCTGAGACCCTT-AACCTGTG 
  target_poss_indel_del<-concat(as.character(start(unlist(pwa@subject@indel))+pwa@pattern@range@start-1),'d',sep='')
  seq_poss_indel_del<-start(unlist(pwa@subject@indel))

  target_poss<-c(target_poss_sub,target_poss_indel_ins,target_poss_indel_del)
  seq_poss<-c(seq_poss_sub,seq_poss_indel_del,seq_poss_indel_ins)

  nucs<-strsplit(queryseq,'')[[1]][seq_poss]
  edits<-data.frame(nucs=nucs,poss=target_poss)
  editList<-list()
  if(N_is_reference){
    editList<-as.list(edits[edits$nucs!='N',])
  }else{
    editList<-as.list(edits)
  }
  #remove N's
  maskedSeq<-maskedEdits(as.character(subject(pwa)),as.character(pattern(pwa)))
  spacedSeq<-str_pad(maskedSeq,pwa@pattern@range@start+pwa@pattern@range@width-1)

  #formatEdits(nucs,target_poss)

  list(editList=editList,spacedSeq=spacedSeq)

maskedEdits<-function(subject,pattern){
  concat(apply(X=rbind(strsplit(subject,'')[[1]],strsplit(pattern,'')[[1]]),MARGIN=2,FUN=function(x){if(x[1]=='N'){x[2]}else{if(x[1]=='-'){''}else{x[1]}}}),collapse="")
}
ADD COMMENT
0
Entering edit mode

@Jeremy Leipzig-thank you, I had a look at the code, I will try it asap to see what I am getting. As I have already mentioned, I did align them using bowtie, and none of the reads mapped to the a certain pre-miRNA (true positive)..Even though it sounds trivial, i think it has something to do with what Michael said, whether the sequencing platform is single or strand specific..and according to that I will have to transform my ref sequences into a format that is compatible with my query reads. Do I need to follow any other preprocessing steps of the data before mapping?

ADD REPLY
0
Entering edit mode

@Jeremy Leipzig I meant: should my reads be as they are (dna reads) mapped to the hairpins? Maybe this may sound trivial but this is my first contact with sequencing data and I don't know exactly how to approach it, when I cannot find what I am looking for, knowing that it should be there.

ADD REPLY
0
Entering edit mode

well it could be one of a dozen or so things - but all aligners know about reverse complements, so that is not one of them. using a simple grep (i.e. search in your favorite text editor) find a read that you think should map to a hairpin and post it (and the hairpin) as a new Biostar question entitled 'Why didn't this read map to this sequence"

ADD REPLY
1
Entering edit mode
12.9 years ago
Michael 55k

Whether or not you have to consider matches in both strands depends on whether or not your protocol is strand specific. Afaik, there are both strand specific and non-strand specific sequencing protocols for RNA-seq with Illumina around, but you should ask the sequencing lab about this. Note that the name of the function is matchPattern not match, and that pattern must be a single string or DNAString object not a DNAStringSet.

If you want to use a mapping of reads to a database of multiple RNA sequences (and want to do that in R of course) then matchPDict with a preprocessed dictionary might be much faster. In addition, it might be reasonable to check if there is any chance of mismatches and if the tools allow mismatches and or indels. If that is required (matchPDict supports mismatches but no indels, matchPattern does but might be a bit slow) and you have a lot of sequences, an external tool might be a better choice.

Edit: if you don't know and cannot find out if the protocol was strand-specific, then I would simply assume it was not and test for both strands, better than to miss out on some hits.

ADD COMMENT
0
Entering edit mode

@Michael Dondrup- yes I am aware of that, I tried to write as pseudocode..I want to loop through the reads, and for each one of them, test where it aligns to the rna seq with different mismathces..I got my data from NCBI so I am not sure if I can get that info from somewhere..thank you

ADD REPLY
0
Entering edit mode

Well, then I would simply test both strands. How many sequences do you have and how large is your reference database? If they are of the size of a real illumina run, the R functions might be too slow anyway.

ADD REPLY
0
Entering edit mode

Around 1400000 reads and 1527 ref sequences..I have also tried Bowtie, but for some reason, the reads did not map to a miRNA that is very present in the tissues from which the rna was extracted and sequenced and I am not sure what I did wrong so I am trying to do it manually. How do I know, when I test both strands which one is correct or not?

ADD REPLY
0
Entering edit mode

I think that will take forever with the R functions, especially if you loop over the 1.4e7 reads. Try bwa or bfast then if you cannot get a match of a true positive example. With the two strands, actually there is no way of telling directly, but you can look at the surrounding sequence of the alignments and check if they fulfill the biological properties set for your pre-miRNA.

ADD REPLY
0
Entering edit mode

thank you, Michael. One more question, if possible. What could be the explanation of the miRNA missing in the alignment file? Is it related to the preprocessing steps? Is it possible for the respective miRNA not to be present in a certain sample? Some of the miRNAs reported by the already published paper as miRNAs with max number of hits have the highest number of hits in my alignment...however, that miRNA it's completely absent (that miRNA is of interest for my analysis)

ADD REPLY

Login before adding your answer.

Traffic: 1684 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6