Hi everyone,
I have paired-end RNA-sequencing samples where the mates in the two paired files are of unequal lengths.
For example:
Original reads
R1:
@NB501069:25:HY3KCBGXX:1:11101:16049:1105 1:N:0:NGACCA
CTCGTGGGGGGGCCGGGCCACCCCTCCCACGGCGCGACCGCTNNCCN
+
AAAAAEEEEEAEEEEEEEEAEEEEEEEEEEEAAEEE/EEAA/##6E#
R2:
@NB501069:25:HY3KCBGXX:1:11101:16049:1105 2:N:0:NGACCA
CTCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCNNNNCCGCGCGGCACCCCCCCGTCGCCGGGGCGGGGG
+
AAA############################################################/####/E/E<EA<E/EEEEEEEE/A/E<AEEAEEEE//
Using trim_galore hasn't made any difference:
R1:
@NB501069:25:HY3KCBGXX:1:11101:16049:1105 1:N:0:NGACCA
CTCGTGGGGGGGCCGGGCCACCCCTCCCACGGCGCGACCGC
+
AAAAAEEEEEAEEEEEEEEAEEEEEEEEEEEAAEEE/EEAA
R2:
@NB501069:25:HY3KCBGXX:1:11101:16049:1105 2:N:0:NGACCA
CTCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCNNNNCCGCGCGGCACCCCCCCGTCGCCGGGGCGGG
+
AAA############################################################/####/E/E<EA<E/EEEEEEEE/A/E<AEEAEEEE
This is another sample:
Original file R1:
@NB501069:23:HYV7KBGXX:1:11101:19650:1064 1:N:0:CTTGTA
CCCAGNCTGGAGTGCAGTGGCATTGTCATAGCTCACTATAACCTCAAATTCCTCAACTCAAATGATCCTCCCACCTCAGCCTCCCAAGTAGCTAGGACTAC
+
AAA6A#EEAEEEAAAAEEEAEEEE/A/E/EE<EEE/EE/EEEEEEEEEEEEAEEEEEEEEEEEE/EEEEE</E<EEEEAEEEE/AEEE<EE/AEEEEEEE<
@NB501069:23:HYV7KBGXX:1:11101:1659:1064 1:N:0:GTTGTA
CAGGGTTGGAAGAGCTGGCCTCGCCTTTCGGCTCCTTTCTCGTCTTGGCCGCGCCGCGGCGTAGGTCCAGCTTGAGCTGCTGGTTCTGCTGGAGCAGGGTG
+
AAAAAEEEEEEAEEEEEEEEEEE<EAEEEEEAE/EEEEAEEAEEEE/EEEEEA//EE<EAEA//EEEAEEE/E<//</A6E<EEE<EE6AAEAE6<AEEE/
@NB501069:23:HYV7KBGXX:1:11101:3487:1064 1:N:0:CTTGTA
AAGAATCAGCAGCCAATCCTCAAAGTTTAAATCATTTAAGGAAATGGGGAAACAAAATTCCAGGTAAATAACAAGACTGAAAAACTAGATTTAAAATAGTG
+
AAAAA6EEEAEEEEEEEEEEEEEEE6EAEEEEEEEEEEEAEEEEAEAA<EEEEEEEEEEEEEEEEEAEEE/EEEEEEEEEAAEEE<AEAEE/EEEEEEEA/
@NB501069:23:HYV7KBGXX:1:11101:12495:1064 1:N:0:CTTGTA
CATTATTTGGAATTCCTGCGACTGTTTCCCTATCAGTATCCTCTGCTGGCCTCTTTACAGTTTTGCATTCTGCTGTGCCATTTGTAGACCGAACGTC
+
AAAAAAEEEAAEEEEEEEEE<EEAEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEAEEAEE<AAEEEAEE
First few reads after trimming R1:
@NB501069:23:HYV7KBGXX:1:11101:19650:1064 1:N:0:CTTGTA
CCCAGNCTGGAGTGCAGTGGCATTGTCATAGCTCACTATAACCTCAAATTCCTCAACTCAAATGATCCTCCCACCTCAGCCTCCCAAGTAGCTAGGACTAC
+
AAA6A#EEAEEEAAAAEEEAEEEE/A/E/EE<EEE/EE/EEEEEEEEEEEEAEEEEEEEEEEEE/EEEEE</E<EEEEAEEEE/AEEE<EE/AEEEEEEE<
@NB501069:23:HYV7KBGXX:1:11101:3487:1064 1:N:0:CTTGTA
AAGAATCAGCAGCCAATCCTCAAAGTTTAAATCATTTAAGGAAATGGGGAAACAAAATTCCAGGTAAATAACAAGACTGAAAAACTAGATTTAAAATAGT
+
AAAAA6EEEAEEEEEEEEEEEEEEE6EAEEEEEEEEEEEAEEEEAEAA<EEEEEEEEEEEEEEEEEAEEE/EEEEEEEEEAAEEE<AEAEE/EEEEEEEA
@NB501069:23:HYV7KBGXX:1:11101:12495:1064 1:N:0:CTTGTA
CATTATTTGGAATTCCTGCGACTGTTTCCCTATCAGTATCCTCTGCTGGCCTCTTTACAGTTTTGCATTCTGCTGTGCCATTTGTAGACCGAACGTC
Looking at the distribution of read lengths, the majority of them are 100 bp long.
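For anyone who wants to reproduce that length check, here is a minimal sketch of one way to tabulate read lengths. It assumes the usual 4-line-per-record FASTQ layout; the function name `read_length_counts` and the example records are made up for illustration and are not from the data above.

```python
from collections import Counter
from itertools import islice


def read_length_counts(fastq_lines):
    """Count read lengths from an iterable of FASTQ lines.

    Assumes the common 4-line-per-record layout (no wrapped sequences).
    """
    # The sequence is the 2nd line of each 4-line record: indices 1, 5, 9, ...
    sequences = islice(fastq_lines, 1, None, 4)
    return Counter(len(seq.strip()) for seq in sequences)


if __name__ == "__main__":
    # Tiny in-memory example; for a real file, pass an open file handle
    # (or gzip.open(path, "rt") for .gz input) instead of this list.
    example = [
        "@read1 1:N:0:CTTGTA", "ACGTACGT", "+", "AAAAEEEE",
        "@read2 1:N:0:CTTGTA", "ACGTA", "+", "AAAAE",
    ]
    for length, count in sorted(read_length_counts(example).items()):
        print(length, count)
```

Running it separately on R1 and R2 would show whether the unequal lengths affect a few pairs or most of them.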
My goal is to retrieve fusions from the RNA-Seq data. I am able to run STAR-Fusion
on this despite the unequal mate lengths, but I am unable to run chimeraScan
for this exact reason.
Is it possible to trim the reads in such a way as to create mates of equal lengths using a trimming tool? More importantly, would that approach be recommended?
Thanks!
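To make the idea in the question concrete, here is a rough sketch of what "trimming to equal mate lengths" could look like: each pair is clipped at the 3' end to the length of its shorter mate. The function name and the toy records are hypothetical, and whether chimeraScan would accept the result (or whether the approach is advisable) is not something this sketch settles.

```python
def trim_pair_to_equal_length(mate1, mate2):
    """Trim the 3' end of the longer mate so both mates have the same length.

    Each mate is a (header, sequence, quality) tuple from one FASTQ record.
    """
    target = min(len(mate1[1]), len(mate2[1]))

    def clip(mate):
        header, seq, qual = mate
        # Sequence and quality must be clipped together to stay in sync.
        return header, seq[:target], qual[:target]

    return clip(mate1), clip(mate2)


if __name__ == "__main__":
    # Toy pair loosely modelled on the first example (not the real reads).
    r1 = ("@read1 1:N:0:NGACCA", "CTCGTGGG", "AAAAAEEE")
    r2 = ("@read1 2:N:0:NGACCA", "CTCNNNNNNNCC", "AAA#######/E")
    t1, t2 = trim_pair_to_equal_length(r1, r2)
    print(t1[1], t2[1])
```

One caveat visible in the first example: the called bases in R2 sit at the 3' end, so clipping R2 down to the length of R1 would keep almost nothing but Ns for that pair.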
Interesting question. What if, instead of trimming, you add Ns?
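A minimal sketch of that padding idea, assuming Phred+33 qualities so that `#` (Q2, the same character the N calls carry in the reads above) can be used for the padded positions. The function name and defaults are hypothetical, and how a fusion caller treats artificially padded mates is untested here.

```python
def pad_pair_to_equal_length(mate1, mate2, pad_base="N", pad_qual="#"):
    """Pad the 3' end of the shorter mate so both mates have the same length.

    Each mate is a (header, sequence, quality) tuple from one FASTQ record.
    """
    target = max(len(mate1[1]), len(mate2[1]))

    def pad(mate):
        header, seq, qual = mate
        missing = target - len(seq)
        # Extend sequence and quality by the same amount to keep them in sync.
        return header, seq + pad_base * missing, qual + pad_qual * missing

    return pad(mate1), pad(mate2)


if __name__ == "__main__":
    r1 = ("@read1 1:N:0:NGACCA", "CTCGTGGG", "AAAAAEEE")
    r2 = ("@read1 2:N:0:NGACCA", "CTCNNNNNNNCC", "AAA#######/E")
    p1, p2 = pad_pair_to_equal_length(r1, r2)
    print(p1[1], p1[2])
```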
komal.rathi : Hopefully not all of your R2 read data looks like that (I assume these are just the first few reads). How did the data get into this state, and what was done to it (on-sequencer trimming?)? Are those Ns a result of masking the adapter? If the R1 reads are indeed trimmed, then you may have short inserts in this data.
I have edited my question to reflect that I had trimmed the reads using trim_galore.
I have a suspicion that this data is pre-trimmed (on the sequencer/BaseSpace), which is why you have unequal-length reads. If the majority or all of your R2 reads are mostly Ns (>50% of the read) like that, then this appears to be pretty bad data (unless the bases have been deliberately masked). I am not sure it can be used or trusted to find fusions.
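One quick way to check the ">50% Ns" point across a whole R2 file is sketched below. The function names and the 0.5 default are illustrative choices, and the input is assumed to be an iterable of sequence strings already pulled out of the FASTQ records.

```python
def n_fraction(seq):
    """Return the fraction of bases in a read that are N (0.0 for an empty read)."""
    return seq.upper().count("N") / len(seq) if seq else 0.0


def fraction_of_reads_mostly_n(sequences, threshold=0.5):
    """Return the fraction of reads whose N content exceeds the threshold."""
    sequences = list(sequences)
    if not sequences:
        return 0.0
    mostly_n = sum(n_fraction(seq) > threshold for seq in sequences)
    return mostly_n / len(sequences)


if __name__ == "__main__":
    # Toy sequences, not the real reads.
    reads = ["CTCNNNNNNNNNCC", "ACGTACGTACGT", "NNNNNNNNNNAC"]
    print(fraction_of_reads_mostly_n(reads))
```

A value near 1.0 for R2 would support the concern that the whole file is affected, while a small value would suggest only the first reads of the run look like this.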
Yeah, I guess this question needs more information than I have provided. I need to talk to the biologists who generated this data. I will clarify some things and add the details to the question.