Dear all,
I have 150bp paired end sequences R1 and R2, after I clean the sequences I used sortmerna
to sort out all the reads matching rna databases. so I have two fasta files containing reads like this:
@NB551184:334:HKHYYAFX2:1:11101:8313:12479 2:N:0:0
AAGGTGATATGAACCGTTATAGCCGGCGATTTCCGAATGGGGAAACCCAGTGTGTTTCGACACACTATCATTACGTGAATACATAGCGTAATGAAGCGAACCGGGGGAACTGAAACATCTAAGTACCCCGAGGAAAAGAAATCAACCGAG
+
AAAAAEEEEEEEEEEEAAEEEEAEEEA<EAEEEEEEEEAEEEEEEEAAEE<E<AEAEEA/EEAEEAEEEEE6EEAEAEEEAEEEEEE/EEEE<EEEAEA<EEE/EE/E//<AAEAAEEE/EEAA<<EAAAE<EE6A<//EEEEEA<EEA/
@NB551184:334:HKHYYAFX2:1:11101:26630:12486 2:N:0:0
TAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTCACCGCGTGCACTGGTCCGGCCGGGCCTTTCCCTCTGGGGAACCGCATGCCCTTCACTGGGTGTGTCGGGGAACCAGGACTTTTACTTTGAACAAATT
+
AAAAAEEEEEEEEEEEEEAEEEEEEEEEEEEEAEEEAEEEE<EEEEEAEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEA/EAEEEEEEEEEAEEEEEAA<EEEEEAAEEA/EEEEEEEEAAEEEEAAAEEEEAEE<
And I have four sequences(each is 25bp, with minor differences) that I am interested in and I want to know from all the rna 150bp reads, what's the ratio between these four sequences.
How could I do this?
The original idea here is that: my bacterium has 6 16S, and each one is slightly different from each other, so we want to know under different conditions, how do those different 16S expressed. And we found the differences of these 16S are mainly at a short sequence (25bp), like below:
- CATATCTGACCGGCATCGGTTGGAT
- TATATCTGACCGGCATCGGTTGGAT
- TATATCTGACCGGCATCGGTTAGCT
- CATATCTGACCGGCATCGGTTAGCT
If we make contigs using all these reads then it could mess up with the combination, of T or C at the beginning, or G.A or A.C at the very end. So I decided to used these four 25bp sequences as probe to fish out those reads matching them, and count the number of reads.
Can anyone help with the code or software I can use?
Any idea and help will be highly appreciated!
Thanks
Yanfang
What have you tried? Surely you started with grep?