Question

microRNA Illumina Sequencing - Very low alignment rate

6

Entering edit mode

10.3 years ago

wynstep ▴ 90

Dear colleagues,

I want to share my strange experience, to ask your opinion and help.

I'm working on microRNA sequencing of a never-studied plant. I received raw data from our service company, as FASTQ file. 30 million read, 50bp length.

I already did RNAseq analysis and I was quite familiar with several tools such as fastqc, trimmomatic, bowtie2, bowtie etc...

I removed the 3' adapter provided to me by the service company. They used Illumina technology. Quality control confirmed that the adaptor sequence is right.

After adaptor trimming, I have great peaks between 19 and 39 bp (first strange thing for me...) and also some minor peaks between 39 and 50...

I downloaded the "hairpin.fa" file from MirBase, without filtering for a specific organism, changing all U in T and removing items with uncommon chars (Y,X,K etc...).

The alignment rate at this step is really low... about 3%

So I did the alignment again, this time versus the A.thaliana genome. The alignment rate increased to 65% (reads aligned only 1 time about 15%). If I launch htseq-count in order to count alignments in genome regions coding for microrna, I found 0 values for all!

I really tried everything and I don't know how to solve this problem:

adaptor trimming with: trimmomatic, cutadapt, fastx-clipper, novoalign
mapping with: bowtie, bowtie2 (using local parameter...), mirdeep2, novoalign

Waiting for your help and suggestions!

Thank you in advance

sequencing Assembly RNA-Seq • 6.9k views

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.3 years ago by wynstep ▴ 90

2

Entering edit mode

If you got 65% alignment to the genome, you should probably take a look at the alignments with regard to location. It sounds suspiciously like you got DNA sequences and not miRNA, but a quick look at the alignments will give you a good sense of whether that is the problem.

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.3 years ago by Sean Davis 27k

1

Entering edit mode

You can also check alignment against non-coding RNA http://www.ensembl.org/info/genome/genebuild/ncrna.html. A collaborator got better with subsequent preparations from IIRC 50% NCRNA (a lot or rRNA, tRNA) down to 10%. In 2 preparations at the beginning I could also detect a CT rich sequence that was neither the used adaptor nor any known blast result instead of the adaptor. This lead to too long sequences - similar to your case. If you really have a lot longer than 23nt, I would check also check these for common 3' sequences.

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.3 years ago by Ido Tamir 5.2k

0

Entering edit mode

Ido thank you for your suggestion! I used miRDeep2 to align again my reads and I've found few miRNAs expressed...

I will try to filter sequences longer to 23 nts, search for common 3' sequences (I have to trim them???) and then try alignments with ncrna fasta files.

Thank for your help!

ADD REPLY • link 10.3 years ago by wynstep ▴ 90

0

Entering edit mode

Sorry I took long to answer, but BioStars does not allow me to answer more then 5 times in less then 6 hours. Restrictions for potential spammers :)

Update: using miRDeep2 with hairpin and mature sequences for all plant species, in order to find levels of known miRNAs, gave to me some results. Few miRNAs, expressed in lot of copies!

I have read a paper where this strange alignment seems to be due to a ligation bias of illumina adapter. Do you have any idea on how to try to solve or reduce this problem?

as Ido Tamir suggested I will also try to align all discarded reads, against the RFam fasta, in order to find possible matches with other ncRNAs... what do you think?

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.3 years ago by wynstep ▴ 90

0

Entering edit mode

We have adaptors with 4 random nucleotides at the 3'end. i.e. the ligation product is ADAPTORNNNNSAMPLEINSERT . Of course then you have to adjust the pipeline appropriately. The idea is that the random 4 nt should overcome ligation bias and ligate with everything. like http://www.biooscientific.com/Portals/0/White%20Papers/Randomized-Adapters-for-Reducing-Bias-in-Small-RNA-Seq-Libraries.pdf

I forgot now if both adaptors have the random sequence or just the 5' one.

ADD REPLY • link 10.3 years ago by Ido Tamir 5.2k

0

Entering edit mode

I filtered all the reads with more than 23 nucleotides. Launched versus RFam and found that about 97% are ncRNA but not miRNAs! Is it normal?

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.3 years ago by wynstep ▴ 90

0

Entering edit mode

how many percent are these? maybe size selection was not good?

ADD REPLY • link 10.3 years ago by Ido Tamir 5.2k

0

Entering edit mode

In what sense? I used all the discarded reads from the alignment (using mirdeep) versus precursor and mature mirnas. is the 97% of all the discarded reads (about the 97% of all the initial aligned reads :P).

I don't know if I answered to your question.

I also found a strange thing inside the discarded reads. I show you an example for the most "populated" reads inside my file:

>seq_0_x1357010 TCCGTTGTCGTCCAGCGGTTAGGATATCTGGCT # similar-seq-1
>seq_1357010_x942324 GCGCCTGTAGCTCAGTGGA 
>seq_2299334_x259229 TCCGTTGTCGTCCAGCGGTTAGGATATCTGGC # same as similar-seq-1
>seq_2558563_x250361 CCCAGTCCCGAACCCGTCGGC # similar-seq-2
>seq_2808924_x194922 GTCGTTGTAGTATAGTGGTGAGTATTCCCGCCT
>seq_3003846_x174056 GGTGGCTGTAGTTTAGTGGTAAGAATTCCACGTTGT
>seq_3177902_x165695 TCCGTTGTAGTCTAGTCGGTTAGGATATTCGGCT # same as similar-seq-1
>seq_3343597_x130364 CCAGTCCCGAACCCGTCGGC # almost the same as similar-seq-2 (less one C at the beginning)
>seq_3473961_x104610 TCCGTTGTCGTCCAGCGGTTAGGATATCTGGCTTTC # almost the same as similar-seq-2 (with an extra TTC at the end)

I tried to highlight common regions. There's no common sequences on 3'. It seems that one is a sub-sequence of another... These reads, for example, align with tRNAs into RFam...

Thanks a lot!!!

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.3 years ago by wynstep ▴ 90

0

Entering edit mode

If you used htseq-count, then what did you use as the annotation file? Wouldn't it just make more sense to count how many non-multimappers align to each contig? Better yet, have you tried using tools like mirDeep, which are designed exactly with microRNAs in mind?

ADD REPLY • link 10.3 years ago by Devon Ryan 104k

0

Entering edit mode

When I aligned against the A. thaliana genome, I used the annotation file provided by MirBase (gff3 file).
Against the precursors file, I desisted to continue with htseq-count, because of a too much low alignment rate.

I also tried with mirDeep2, without success! :( the alignment rate is always too much low... I was thinking that maybe is a problem of adaptor trimming... what do you think?

ADD REPLY • link 10.3 years ago by wynstep ▴ 90

0

Entering edit mode

This is the results of alignment without trimming the adapter:

bowtie2 --local -L 20 -N 0 -x hairpin -U Original_Sequence.fastq -S alignment.sam

31290964 reads; of these:
  31290964 (100.00%) were unpaired; of these:
    30973063 (98.98%) aligned 0 times
   94741 (0.30%) aligned exactly 1 time
    223160 (0.71%) aligned >1 times
1.02% overall alignment rate

If I provide the trimmed and collapsed fast, the percentage is around 4%

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.3 years ago by wynstep ▴ 90