Hi! My last RNA-seq run was ruined by a huge number of strange sequences that contain adapters and poly-A. The company that did the sequencing told me these are definitely not adapter dimers, because there is no typical peak at 120 bp (and the samples were cleaned with beads twice!). I'm pretty sure the real problem is in the sample preparation step, but I don't understand the mechanism behind this error.
So, here's the protocol. It's a bacterial RNA-seq (Rhodobacter sphaeroides), log phase, nothing special. RNA was extracted by the phenol-chloroform method and checked on an agarose gel. Then we prepared RNA-seq libraries with the NEBNext Ultra™ II Directional RNA Library Prep Kit (NEB, E7760S) following the standard protocol: fragmentation, RT with hexamers, blunting and adapter ligation (in our case we used Illumina TruSeq Single Indexes), followed by PCR amplification. The DNA is cleaned with beads (AMPure XP) at three steps: after cDNA synthesis, after adapter ligation, and after amplification. I should mention that we have already used this protocol with good results. Sequencing was done on a NovaSeq 6000 (paired-end, 2x100 bp).
Here are the first few dozen lines from the FASTQ files for read 1 and read 2 of one sample:
@A00835:195:HW53LDRXX:1:2101:5882:1000 1:N:0:CGATGTAT
GNTCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTGAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
F#:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF:,FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF
@A00835:195:HW53LDRXX:1:2101:10384:1000 1:N:0:CGATGTAT
GNTCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTGAAGCACACGTCTGAACTCCAGTCACCGATGTATCGCG
+
F#FFFFFFFFF:FFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,,FFF:FF,FFFFF:::F,::F,:F,:F,,F::F:F:F
@A00835:195:HW53LDRXX:1:2101:14778:1000 1:N:0:CGATGTAT
GNTCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTGAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
:#:FFFFFFFFFF:FFFFF,FFFFFFFFFFFFFFFFFFFFFF,FF:FFFFFFFFFFFFFFFFFFF:FF:FFFFFFF:FFF,FFF::FFFFFF:FFFFFFF
@A00835:195:HW53LDRXX:1:2101:5882:1000 2:N:0:CGATGTAT
GGGGGGGGGGGGGGGGGGGGGGGGGGGATTGTTTATAAAAAAAAGATAAAAAAAAAAAAAAAAAAAGAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
FFFFFF,FF,FFF::FF,,FFFF,F,,,,,,:,,,,:,F:,F:FF:,F,,,,,:F,FF:F,,:,F:F,::F,FFFFFFF,F:FFF:FFFF,,FF,FFF::
@A00835:195:HW53LDRXX:1:2101:10384:1000 2:N:0:CGATGTAT
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTGGGGGGGGGGGAGGGGAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
FFFFFFFFFFFFFFF:F:FF,FFFFFF,,,,::,,,,,,,F,,FFF,,,,,F:,:,:FF,FFFF:F,FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF
@A00835:195:HW53LDRXX:1:2101:14778:1000 2:N:0:CGATGTAT
GGGGGGGGGGGGGGGGGGGGGGGGGGGAATTTATTTAAAAAAAAAGAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAG
+
FF,FFFF,F,:FF:FFF:::FFF:F,,,F:,,,,,,F,F,:,:,,,:,:,F:,FF,F,,::,,,,,F:F:FF,,FFFF,:,F:::FF,FF,FF,:F:,,F
@A00835:195:HW53LDRXX:1:2101:18683:1000 2:N:0:CGATGTAT
GGGGGGGGGGGGGGGGGGGGGGGTGAGTGCGTGTTGTTAATAAAACTAAACAAAAGAAAAAAAAAAGAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
F:,,F:F:F,,,,:,,:F,,,,FF,,:F,,,,,F:,::,,,,,,,:,,,,:F::,:,,,F::FF,,:,,,,,:,F,:F:F:F,,::F,,::F,FFF,F,:
And FastQC analysis for these files:
As you can see, almost the whole of each R1 read consists of adapter sequence (TruSeq index adapter 2 in this case), followed by a short poly-A stretch and then poly-G (which, I assume, means the instrument could not call the nucleotide). At first I thought these were adapter dimers, but in the R2 file there is a huge amount of A and G nucleotides (T is also present in some reads) and no adapter sequence. So now I think this is some very strange construct: the poly-A at the end of the adapter in the R1 reads might be a clue to what happened, but I can't figure it out yet.
I would be very grateful if someone could explain why this happened (and what I actually ended up with). I really want to eliminate the problem so as not to lose more money and samples.
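For anyone who wants to check their own reads for the same layout, here is a minimal Python sketch of the decomposition described above (adapter, then poly-A, then poly-G). The `ADAPTER_CORE` string is an assumption copied from the R1 reads in this post, and `trailing_run` and `describe_read` are hypothetical helper names, not part of any library.

```python
# Assumption: constant part of the adapter as it appears in the R1 reads above,
# without the first two bases ("GN"), so the ambiguous N does not block the match.
ADAPTER_CORE = "TCGGAAGAGCACACGTCTGAACTCCAGTCAC"


def trailing_run(seq, base):
    """Return the length of the run of `base` at the 3' end of `seq`."""
    return len(seq) - len(seq.rstrip(base))


def describe_read(seq, adapter=ADAPTER_CORE):
    """Report where the adapter starts and how long the poly-A / poly-G tails are."""
    poly_g = trailing_run(seq, "G")
    body = seq[:len(seq) - poly_g]  # the read without its poly-G tail
    return {
        "adapter_start": body.find(adapter),  # -1 if the adapter is absent
        "poly_a": trailing_run(body, "A"),    # A run just before the poly-G tail
        "poly_g": poly_g,
    }


# Synthetic read with the same layout as the R1 reads above;
# the lengths of the A and G runs are illustrative, not measured.
example = (
    "GN" + ADAPTER_CORE + "CGATGT" + "ATCTCGTATGCCGTCTTCTGCTTG"
    + "A" * 6 + "G" * 31
)
print(describe_read(example))
```

An `adapter_start` close to 0 means the read begins with adapter, i.e. there is essentially no insert in front of it.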
Please use these directions to post images: How to add images to a Biostars post
You can copy/paste a few example reads directly inside the post. Use the 10101 icon in the edit window to highlight and format the data as `code`.
This almost sounds like a library issue or a sequencing issue. Have you confirmed with your sequencing provider that there were no problems with the run your samples were on? If sequencing was fine then you have to go back and check on the libraries.
Yes, I have samples in this run which do not have this problem and are perfectly sequenced.
The parts of R1 that are real sequence hit the same organism when BLASTed at NCBI. So if that is the organism this data is supposed to come from, these samples may have been over-fragmented, leading to very short inserts, perhaps?
There are almost no real sequences here. I can see only adapter (and I don't understand why it is there in this orientation). All of the yellow part in the picture is actually adapter.
OK. So if nothing is coming through after scanning and trimming for adapters, that leaves bad libraries. Since this is a kit-specific problem, the best course of action is to follow up with the vendor's tech support.
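To put a number on "nothing is coming through", a rough Python sketch like the one below counts how many reads still have a usable insert once everything from the adapter onward and any 3' poly-G tail are removed. It is only an illustration: `ADAPTER_CORE` is an assumption copied from the R1 reads in this thread, the function names and the `min_len=20` cutoff are hypothetical, and dedicated trimmers such as cutadapt or fastp are the usual choice for real data.

```python
import io

# Assumption: constant part of the adapter as seen in the R1 reads in this thread.
ADAPTER_CORE = "TCGGAAGAGCACACGTCTGAACTCCAGTCAC"


def fastq_sequences(handle):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def usable_length(seq, adapter=ADAPTER_CORE):
    """Bases left after cutting at the adapter and removing a 3' poly-G tail."""
    pos = seq.find(adapter)
    if pos >= 0:
        seq = seq[:pos]
    return len(seq.rstrip("G"))


def fraction_usable(handle, min_len=20):
    """Fraction of reads with at least `min_len` usable bases (0.0 if no reads)."""
    total = kept = 0
    for seq in fastq_sequences(handle):
        total += 1
        if usable_length(seq) >= min_len:
            kept += 1
    return kept / total if total else 0.0


# Toy input: one adapter-only read and one read without adapter (both made up).
records = [
    ("adapter_only", "GN" + ADAPTER_CORE + "GGGG"),
    ("good_read", "ACGT" * 15),
]
toy = io.StringIO(
    "".join(f"@{name}\n{seq}\n+\n{'F' * len(seq)}\n" for name, seq in records)
)
print(fraction_usable(toy))
```

Running the same count on a sample that sequenced well and on a failed one from the same run would show how large the difference in usable reads actually is.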
Some are bad, some are good. But what is really strange is that we used the same sample preparation for all of them.
See what the vendor says. They may have specific recommendations on sample clean-up etc. They will likely want to see sample/library QC. Stuff happens with biology, and it is possible that if the libraries are redone they will work fine this time around. I assume there was no obvious batch effect (e.g. different technicians, kit lots, etc.).
OK, I will write to them. Thank you!