11.0 years ago
xiaojuhu13
I have several microRNA datasets (Illumina sequencing) that I need to trim before further analysis. After running the command cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCA -e 0.1 -O 5 -m 15 -o sheep_48_1_trim.fastq sheep_48_1_extract.fastq, many reads are still much longer than the expected 20-23 nt. Should I go through the Overrepresented sequences section of the FastQC report, compare those sequences against miRBase to set aside genuine microRNA sequences, and then remove whatever real adapter is left? That would be a huge job. Or is this the wrong way to trim microRNA data? These are the overrepresented sequences reported by FastQC:
>>Overrepresented sequences fail
#Sequence Count Percentage Possible Source
TACCCTGTAGAACCGAATTTGTTGGAATTCTCGGGTGCCAAGGAACTCCA 1415391 4.200865697255984 RNA PCR Primer, Index 1 (
TTCAAGTAATCCAGGATAGGCTTGGAATTCTCGGGTGCCAAGGAACTCCA 1052074 3.1225446378950354 RNA PCR Primer, Index 1 (
GTTTCCGTAGTGTAGTGGTTATCACGTTCGCCTTGGAATTCTCGGGTGCC 827120 2.4548835166497236 No Hit
AACATTCAACGCTGTCGGTGAGTTGGAATTCTCGGGTGCCAAGGAACTCC 796866 2.365089960802059 RNA PCR Primer, Index 1 (
TGCCTATGCTGAAACCCAGAGGCTGTTTCTGAGCTGGAATTCTCGGGTGC 499804 1.4834130490806638 No Hit
AACATTCAACGCTGTCGGTGAGTGGAATTCTCGGGTGCCAAGGAACTCCA 460851 1.3678009521369836 RNA PCR Primer, Index 1 (
TGAGATGAAGCACTGTAGCTTGGAATTCTCGGGTGCCAAGGAACTCCAGT 423765 1.2577300916832748 RNA PCR Primer, Index 1 (
TATTGCACTTGTCCCGGCCTGTTGGAATTCTCGGGTGCCAAGGAACTCCA 419685 1.2456206943190098 RNA PCR Primer, Index 1 (
TGAGGTAGTAGGTTGTATAGTTTGGAATTCTCGGGTGCCAAGGAACTCCA 414839 1.2312378169593952 RNA PCR Primer, Index 1 (
TACCCTGTAGAACCGAATTTGTGTGGAATTCTCGGGTGCCAAGGAACTCC 378938 1.1246840241225131 RNA PCR Primer, Index 1 (
TGAGATGAAGCACTGTAGCTCTGGAATTCTCGGGTGCCAAGGAACTCCAG 341312 1.013010449311769 RNA PCR Primer, Index 1 (
TGTCTGAGCGTCGCTTGGAATTCTCGGGTGCCAAGGAACTCCAGTCACTT 298504 0.8859567526525888 RNA PCR Primer, Index 1 (
TGAGGTAGTAGATTGTATAGTTTGGAATTCTCGGGTGCCAAGGAACTCCA 294079 0.8728233988935513 RNA PCR Primer, Index 1 (
If I understand you correctly, you just want to remove the adaptor sequences and keep the miRNAs. AFAIK miRNAs are approximately 19-24 nt long. Assuming you did 100 bp single-end Illumina sequencing, roughly 76-81 bp of every read should be adaptor or nonsense sequence. What you can do is eyeball some of your reads and look for the position where the actual miRNA starts (nucleotide diversity should increase). Trim all the reads at that position. That way you should get rid of most of your adaptors.
Thanks. I checked the overrepresented sequences and the Illumina adapter (TGGAATTCTCGGGTGCCAAGGAACTCCA). After removing these sequences, too many reads are still longer than 19-24 nt. I also searched miRBase, and the adaptor sequence above appears in some mature miRNAs such as bfl-miR-182b-3p. What should I do next?
Have you checked both ends?
What do you mean by "both ends"?
I am removing the adapters I find in the Overrepresented sequences list, but too many reads are still long, so I run FastQC again to find more adapters. To check whether they are real adapters, I copy the sequences I find into a miRBase search. I have already done five rounds of FastQC trying to collect all the adapters.
The problem with FastQC is that it only checks the first 200k sequences, so you will end up with I don't know how many iterations of FastQC. What I mean is: actually look at your sequences in a text editor (on a *NIX system, e.g. use 'less myseq.fastq') and look for adapters manually. There will be a point in the sequences (if you write them line by line, this is one specific column) where nucleotide diversity goes up. That is where your actual miRNA starts.
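The "look at one specific column" idea can also be done numerically instead of by eye. The following is only a minimal sketch, not something from the thread: it computes the Shannon entropy of the base composition at each read position, so columns dominated by a constant sequence score near 0 and diverse columns score near 2 bits. The function names, the 10,000-read limit and the demo reads are made up for illustration, and it assumes a plain four-line-per-record FASTQ file.

```python
import math
from collections import Counter

def column_entropy(reads, width):
    """Shannon entropy (bits) of the base composition at each column."""
    entropies = []
    for i in range(width):
        # Only reads long enough to have a base at column i contribute.
        counts = Counter(r[i] for r in reads if len(r) > i)
        total = sum(counts.values())
        if total == 0:
            entropies.append(0.0)
            continue
        entropies.append(
            sum((c / total) * math.log2(total / c) for c in counts.values())
        )
    return entropies

def read_sequences(fastq_path, limit=10000):
    """Return up to `limit` sequence lines (2nd line of each 4-line record)."""
    seqs = []
    with open(fastq_path) as fh:
        for n, line in enumerate(fh):
            if n % 4 == 1:
                seqs.append(line.strip())
                if len(seqs) >= limit:
                    break
    return seqs

# Toy reads (hypothetical): 4 variable bases followed by a constant stretch.
demo = ["ACGTTGGAATTC", "TGCATGGAATTC", "GATCTGGAATTC", "CTAGTGGAATTC"]
for pos, h in enumerate(column_entropy(demo, 12), start=1):
    print(pos, round(h, 2))
```

On real data one would call something like column_entropy(read_sequences("myseq.fastq"), 50). Because the inserts differ in length, the adapter does not start in the same column in every read, so the change in diversity will be gradual rather than a sharp step.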
I did the analysis your way and looked at the sequences. They look like this:
GGCTTTGGTGACTCTAGATAACCTCGGGCCGATCGCACGCTGGAATTCTCGGGGTCCAAGGAACGCCAGTCACTTAGGCATATCGTATGCCGT
CATTGCACTTGTCTCGGTCTGATGGAATTCTCGGGTGCCAAGGAGCTCCATTCAGTTAGGCATCGCGTATGCCGTCTTCTGCTTGAAAAAAAA
ACAGTAGTCTGCACATTGGTTAATGGAATTCTCGGGTGCCAAGGCACTCCAGTCGCTTAGGCATTTCGTATGCCGTCTTCTGCTTGAAAAAAA
CTTGCGGCCCCGGGTTCCTCCCGGGGCTACGCCTGTCTGAGCGTCGCTTTAGAAATCTCGGGTGACAAGGAACTCCAGTCACTTAGGCATCTC
CTTGCGGCACCGGGTTCCTCCCGGGGCTACGCCTGTCTGAGCGTGGCTTGGAATCCTCGGGTGCCAAGGAACTCCAGTCACTTAGGCAACTCG
CACCACGTTCCCGTGGTGGAATTCTCGGGTGCCAAGGAACTCCAGTCACTAAGGCATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAA
ACCCTGTAGAACCGAATTTGTTGGAATTCTCGGGTGCCAAGGAAGTCCAGTCACGTAGGCATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAA
TTCAAGTAATCCAGGATAGGCTTGGAATTCTCGGGTGCCAAGGAGCTCCATTCAGTTAGGCATCGCGTATGCCGTCTTCTGCTTGAAAAAAAA
The part these reads have in common is GGAATTCTCGGGTGCCAAGGAGCTCCATTCAGTTAGGCATCGCGTATGCCGTCTTCTGCTTGAAAAAAAA. That sequence is very long; if I use cutadapt, should I set a comparatively high mismatch tolerance (the -e value)?
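To get a feel for what the -e value means here: as far as I understand the cutadapt documentation, the number of tolerated errors is the error rate multiplied by the length of the matched adapter part, rounded down. The sketch below only does that arithmetic for the two adapter strings used in this thread; max_errors is a made-up helper, not a cutadapt function, so treat the numbers as an illustration of the rule rather than a statement of what cutadapt does in every version.

```python
import math

def max_errors(error_rate, matched_length):
    """Errors tolerated for a match covering `matched_length` adapter bases
    (assumed rule: floor(error_rate * matched_length))."""
    return math.floor(error_rate * matched_length)

# Adapter strings copied from the commands in this thread.
short_adapter = "TGGAATTCTCGGGTGCCAAGGAACTCCA"
long_adapter = ("TGGAATTCTCGGGTGCCAAGGAGCTCCATTCAGTTAGGCATCGCG"
                "TATGCCGTCTTCTGCTTGAAAAAAAA")

for name, adapter, rate in [("short, -e 0.1", short_adapter, 0.1),
                            ("long,  -e 0.2", long_adapter, 0.2)]:
    print(name, "length:", len(adapter),
          "errors allowed on a full match:", max_errors(rate, len(adapter)))

# A 5-base partial match at the end of a read (the -O 5 minimum overlap):
print("5-base overlap:", max_errors(0.1, 5), "vs", max_errors(0.2, 5))
```

Note that with -a, cutadapt removes the adapter and everything after it, so a short adapter prefix is normally enough and the long read-through sequence does not have to be spelled out.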
yep, give it a try
Yeah, I used the command cutadapt -a TGGAATTCTCGGGTGCCAAGGAGCTCCATTCAGTTAGGCATCGCGTATGCCGTCTTCTGCTTGAAAAAAAA -e 0.2 -O 5 -m 15 -o sheep_48_1_re1.fastq sheep_48_1.fastq and then looked at the result (the grep pattern also matched a few quality lines starting with C, which are left out here):
[huxj@LoginNode raw]$ grep ^[ACTGN] sheep_48_1_re1.fastq| head -n 100
GGCTTTGGTGACTCTAGATAACCTCGGGCCGATCGCACGC
CATTGCACTTGTCTCGGTCTGA
ACAGTAGTCTGCACATTGGTTAA
CTTGCGGCCCCGGGTTCCTCCCGGGGCTACGCCTGTCTGAGCGTCGCTT
CTTGCGGCACCGGGTTCCTCCCGGGGCTACGCCTGTCTGAGCGTGGCT
CACCACGTTCCCGTGG
ACCCTGTAGAACCGAATTTGT
TTCAAGTAATCCAGGATAGGCT
AACATTCAACGCTGTCGGTGAGTTT
ATCCCGGACGAGCCCCCA
GTTTCCGTAGTGTAGTGGTTATCACGTTCGCCT
GCCTATGCTGAAACCCAGAGGCTGTTTCTGAGC
TACCCTGTAGAACCGAATTTGT
CACGCGCACCAACCTCACGGGGCTCATTCTCAGCACGGCTG
Some reads are still long, concentrated around 32-34 nt.
On those reads (which are much shorter now), try using jellyfish to identify overrepresented k-mers. After that you should end up with your desired sequences.
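For a quick look before setting up jellyfish, the same idea can be sketched in a few lines. This is not jellyfish and will not scale to a full run; it is a plain in-memory k-mer counter, with k = 8 and the function name chosen arbitrarily for illustration. The example reads are taken from the trimmed output posted above.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every overlapping k-mer across all reads (no canonicalisation)."""
    counts = Counter()
    for read in reads:
        # Reads shorter than k contribute nothing (empty range).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# A few trimmed reads from the thread, just to have something to count.
trimmed = [
    "CTTGCGGCCCCGGGTTCCTCCCGGGGCTACGCCTGTCTGAGCGTCGCTT",
    "CTTGCGGCACCGGGTTCCTCCCGGGGCTACGCCTGTCTGAGCGTGGCT",
    "ACCCTGTAGAACCGAATTTGT",
    "TACCCTGTAGAACCGAATTTGT",
]

for kmer, n in count_kmers(trimmed, 8).most_common(5):
    print(kmer, n)
```

K-mers that keep showing up at the 3' end of long reads are candidates for leftover adapter; k-mers shared between reads from the same miRNA family will also rank high, so the list still needs checking by hand.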
I think that in some miRNA experiments you might REALLY have a huge peak around 32 nt, i.e. it's biology, not an artifact. However, I am not 100% sure. Have a look at published papers with miRNA NGS data and see whether peaks around 32 nt were observed...
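Before comparing with published length profiles it helps to have the actual length distribution of the trimmed reads. A minimal sketch, assuming a standard four-line-per-record FASTQ file; the function names are made up, and the file name in the usage note is the one from the cutadapt command earlier in the thread. Taking every fourth line also avoids picking up quality strings that happen to start with A, C, G, T or N, which is what happens with grep ^[ACTGN].

```python
from collections import Counter

def fastq_sequences(path):
    """Yield the sequence line (2nd of every 4 lines) of a FASTQ file."""
    with open(path) as fh:
        for n, line in enumerate(fh):
            if n % 4 == 1:
                yield line.strip()

def length_histogram(seqs):
    """Map read length -> number of reads with that length."""
    return Counter(len(s) for s in seqs)

def print_histogram(hist):
    """Print lengths in ascending order with their read counts."""
    for length in sorted(hist):
        print(length, hist[length])

# Small in-memory example using reads from the trimmed output in the thread.
example = [
    "CATTGCACTTGTCTCGGTCTGA",
    "TTCAAGTAATCCAGGATAGGCT",
    "GTTTCCGTAGTGTAGTGGTTATCACGTTCGCCT",
    "GCCTATGCTGAAACCCAGAGGCTGTTTCTGAGC",
]
print_histogram(length_histogram(example))
```

For the real file this would be print_histogram(length_histogram(fastq_sequences("sheep_48_1_re1.fastq"))).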
Yeah, there is a peak at 32-33 nt. Should I remove those reads or not?