7.0 years ago
Lila M
Hi everybody, I've just received some RNA-seq data (single end) that I have to analyze. As always, I first did the QC with fastqc, but I was very surprised because "Per base sequence content" and "Per sequence GC content" fail. I also have a warning for sequence duplication levels (65 overrepresented sequences):
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN 46114 0.378499669139054 No Hit
CGAGAGACCACGGGCGAGGCCGGGGCGACGGGGAAGGCGCGAGAAAGGCG 43691 0.3586118975659108 No Hit
GTCGATTTGGCGAGGGCGCTCCCGACGACGCACCGGGAGGAGGCCCTTCC 34300 0.2815313928843638 No Hit
GTCGGGGGGACGGGTCCGAGGACGCGGCGGCGGAGCCGCCCCGCCCCGAC 31342 0.2572523882152108 No Hit
GGGGCCTCGGAGGAGGGGCGGCGGGGAGGAGGAGGGGCGCGGGAGCGGCG 30073 0.24683654746972222 No Hit
GGGACGGGTCCGAGGACGCGGCGGCGGAGCCGCCCCGCCCCGACGCGGAA 28075 0.23043713863639984 No Hit
GGGCGGGCTCCCGGCCCCGGCCGACGCGCCGCGAGGCGAGCCGGGCGGGCGGGCGCGCGCGCGTACGCGCGGGG 27746 0.22773673548016204 No Hit
CCCCCACCGAGAACCGCCTCGCGAGCCCCGGGGCCCCGCCACCGGGGGCC 27333 0.2243468676882891 No Hit
GCCGGGGAGAGCGAGCGGGGCCGTGCCCGGCGGCGCGGAGCGGCGCGGCG 27197 0.22323059161154643 No Hit
GGGAAGGCCGGGGAGAGCGAGCGGGGCCGTGCCCGGCGGCGCGGAGCGGC 25003 0.20522243196174195 No Hit
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG 24424 0.2004700507232566 No Hit
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGC 23847 0.1957340853094293 TruSeq Adapter, Index 2 (100% over 50bp)
CGGGAGAAGACGAGAGACCACGGGCGAGGCCGGGGCGACGGGGAAGGCGC 23551 0.19330454326004817 No Hit
GGCGGGCTCCCGGCCCCGGCCGACGCGCCGCGAGGCGAGCCGGGCGGGCG 23251 0.1908421695613511 No Hit
CCTCGGAGGAGGGGCGGCGGGGAGGAGGAGGGGCGCGGGAGCGGCGGTCG 22969 0.18852753828457586 No Hit
GTTCGGAAGAGCGGGCCGGGAGAAGACGAGAGACCACGGGCGAGGCCGGG 22640 0.1858271351283381 No Hit
GGGGAAGGCGCGAGAAAGGCGGCCGGCGGGGAAGGGGACGCCACGGGGAC 22538 0.1849899280707811 No Hit
GCCGGGGCGACGGGGAAGGCGCGAGAAAGGCGGCCGGCGGGGAAGGGGAC 20570 0.1688367566073284 No Hit
GGGGCGGGCTCCCGGCCCCGGCCGACGCGCCGCGAGGCGAGCCGGGCGGG 20185 0.16567671036066717 No Hit
GGCCCGGCGAGGGGGAGAGGCGACGGGAGAGAGAGCGCGCGGCCGACGGC 19917 0.1634769898564978 No Hit
CCCACCGAGAACCGCCTCGCGAGCCCCGGGGCCCCGCCACCGGGGGCCCC 19905 0.16337849490854991 No Hit
GCGGGAAGGCCGGGGAGAGCGAGCGGGGCCGTGCCCGGCGGCGCGGAGCG 19537 0.1603579831714815 No Hit
GGGCGGGCTCCCGGCCCCGGCCGACGCGCCGCGAGGCGAGCCGGGCGGGC 19466 0.1597752213961232 No Hit
CCGACACGCCACCACCACCGTCGCTCGTGATTCTCGTCCATCCTCCGACC 19406 0.15928274665638378 No Hit
CCGGGAGAAGACGAGAGACCACGGGCGAGGCCGGGGCGACGGGGAAGGCG 19210 0.15767399583990171 No Hit
GCCGGGAGAAGACGAGAGACCACGGGCGAGGCCGGGGCGACGGGGAAGGC 18647 0.15305294119868024 No Hit
GCGACGGGGAAGGCGCGAGAAAGGCGGCCGGCGGGGAAGGGGACGCCACG 18494 0.15179713061234473 No Hit
GACGGGTCCGAGGACGCGGCGGCGGAGCCGCCCCGCCCCGACGCGGAAGC 18409 0.15109945806438058 No Hit
CGTCGCTCGTGATTCTCGTCCATCCTCCGACCCGGTCCCGCTCCGGGAGA 17813 0.14620754231630242 No Hit
GGGGTCTTTAAACCTCCGCGCCGGAACGCGCTAGGTACCTGGACGGCGGG 17374 0.1426042688038757 No Hit
CCACACACGACCGGTCGGAGGCAGAACGGCAGCCCCTCGGCGGCCGGCCG 17071 0.1401172713681917 No Hit
GTCCGGCCCCCGACCCTCGAGACGCCCTAGCGGGAAGGCCGGGGAGAGCGAGCGGGGCCGTGCCCGGCGGCGCGG 17056 0.13999415268325682 No Hit
CCGAGAACCGCCTCGCGAGCCCCGGGGCCCCGCCACCGGGGGCCCCGGAG 16957 0.1391815693626868 No Hit
CCGGGGAGAGCGAGCGGGGCCGTGCCCGGCGGCGCGGAGCGGCGCGGCGG 16841 0.13822945153252394 No Hit
GGGGCGGGCTCCCGGCCCCGGCCGACGCGCCGCGAGGCGAGCCGGGCGGGCGGGCGCGCGCGCGTACGCGCGGGG 16730 0.13731837326400603 No Hit
GGGAGAGCGAGCGGGGCCGTGCCCGGCGGCGCGGAGCGGCGCGGCGGAGGCGACGGGAATCCGGCCGGCCCCGA 16593 0.13619388927493437 No Hit
GCCGGGCGGGCGGGCGCGCGCGCGTACGCGCGGGGAGGGCGAGGAGGACG 16590 0.1361692655379474 No Hit
CCACGGGCGAGGCCGGGGCGACGGGGAAGGCGCGAGAAAGGCGGCCGGCG 16295 0.13374793140089528 No Hit
CGCTAGAGAAGGCTTTTCTCACCGAGGGTGGGTCACACTCCCCCCACCCGCCAGCCGCTCCTCCTCGGGCCCGC 16237 0.13327187248581385 No Hit
GTCCATCCTCCGACCCGGTCCCGCTCCGGGAGACCGGCGCGCCCCCACCG 16006 0.1313758447378171 No Hit
GACGAGAGACCACGGGCGAGGCCGGGGCGACGGGGAAGGCGCGAGAAAGG 15993 0.13126914187754024 No Hit
GGAAGGCCGGGGAGAGCGAGCGGGGCCGTGCCCGGCGGCGCGGAGCGGCG 15510 0.12730472022263797 No Hit
GGGAGGGCGAGGAGGACGGGCGGGGCCTCGGAGGAGGGGCGGCGGGGAGG 15451 0.12682045339522757 No Hit
GCGAGGAGGACGGGCGGGGCCTCGGAGGAGGGGCGGCGGGGAGGAGGAGG 15247 0.12514603928011356 No Hit
CTCGGAGGAGGGGCGGCGGGGAGGAGGAGGGGCGCGGGAGCGGCGGTCGG 15180 0.12459610915407121 No Hit
GAGAAGACGAGAGACCACGGGCGAGGCCGGGGCGACGGGGAAGGCGCGAG 15015 0.12324180361978783 No Hit
GGACGGGTCCGAGGACGCGGCGGCGGAGCCGCCCCGCCCCGACGCGGAAG 14892 0.12223223040332204 No Hit
CACCGCTAAGAGTCGTACGAGGTCGATTTGGCGAGGGCGCTCCCGACGAC 14860 0.12196957720879435 No Hit
GGGGAGAGCGAGCGGGGCCGTGCCCGGCGGCGCGGAGCGGCGCGGCGGAGGCGACGGGAATCCGGCCGGCCCCGA 14839 0.12179721104988556 No Hit
GGCACGGGCCGGGGGCGGGACGGGCGCCGCACGCCCCGACCCGTCTCCCCCGCGGAGGTCGGGGGGACGGGTCCG 14524 0.11921171866625364 No Hit
CCCGACACGCCACCACCACCGTCGCTCGTGATTCTCGTCCATCCTCCGAC 14416 0.11832526413472269 No Hit
CCGGCGAGGGGGAGAGGCGACGGGAGAGAGAGCGCGCGGCCGACGGCGCC 14402 0.11821035336211684 No Hit
GCCACCACCACCGTCGCTCGTGATTCTCGTCCATCCTCCGACCCGGTCCC 14357 0.11784099730731228 No Hit
CGTGATTCTCGTCCATCCTCCGACCCGGTCCCGCTCCGGGAGACCGGCGCGCCCCCACCGTGGGACGCTTTCCC 14175 0.11634715726343606 No Hit
GGCGAGCCGGGCGGGCGGGCGCGCGCGCGTACGCGCGGGGAGGGCGAGGA 14045 0.11528012866066735 No Hit
CACCGAGAACCGCCTCGCGAGCCCCGGGGCCCCGCCACCGGGGGCCCCGG 13604 0.11166043932358267 No Hit
GGCAGAGACAGAGGCGGCGGCCCGGGGGATCCGGTACCCCCAAGGCACGC 13589 0.1115373206386478 No Hit
CGGGGAAGGCGCGAGAAAGGCGGCCGGCGGGGAAGGGGACGCCACGGGGA 13537 0.11111050919754033 No Hit
CCCGGCGCCGCGGCCACGGGCGCGGCCGGGCGGGCCGCGGGGCGGGCTCC 13513 0.11091351930164456 No Hit
CGGCGGGCGGCGGGCGGGGAAGAGGGCACAGACGGGCGAGGGCCGGGGAC 13502 0.11082323226602565 No Hit
GGCCGGGGAGAGCGAGCGGGGCCGTGCCCGGCGGCGCGGAGCGGCGCGGC 13440 0.11031434170162827 No Hit
GGGCGGGCCGCGGGGCGGGCTCCCGGCCCCGGCCGACGCGCCGCGAGGCG 13134 0.10780272052895727 No Hit
GGGGACGGGTCCGAGGACGCGGCGGCGGAGCCGCCCCGCCCCGACGCGGA 13111 0.10761393854539049 No Hit
GGCGAGGCCGGGGCGACGGGGAAGGCGCGAGAAAGGCGGCCGGCGGGGAA 13017 0.10684239478646541 No Hit
CGGGAATCCGGCCGGCCCCGAAGACGGGGAGCCGGCGCGGCGGGGCCGGA 12869 0.10562762376177487 No Hit
GCTAGAGAAGGCTTTTCTCACCGAGGGTGGGTCACACTCCCCCCACCCGCCAGCCGCTCCTCCTCGGGCCCGC 12865 0.10559479211245891 No Hit
CCGGCGAGGGGGAGAGGCGACGGGAGAGAGAGCGCGCGGCCGACGGCACC 12712 0.10433898152612339 No Hit
GCGGGCTCCCGGCCCCGGCCGACGCGCCGCGAGGCGAGCCGGGCGGGCGG 12229 0.10037455987122114 No Hit
GGGCCTCGGAGGAGGGGCGGCGGGGAGGAGGAGGGGCGCGGGAGCGGCGG 12192 0.10007086711504849 No Hit
I think that if I remove all the overrepresented sequences I will improve the per sequence GC content, but I'm not sure that is the best option... any suggestions or advice?
Thanks!
In general, it is a good idea to figure out why you see the overrepresented sequences before removing them.
Apparently the run was overclustered due to library concentration... so in that case, what is the best choice?
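As a quick first check, the list above contains reads that are entirely N and entirely G. On NextSeq's two-colour chemistry a cycle with no signal is called as G, so poly-G reads usually point to the run rather than the sample. A minimal sketch for counting such reads over the whole file follows; the function name `count_homopolymer_reads` and the file name in the usage line are made up for illustration, and it assumes standard 4-line FASTQ records.

```shell
# Hypothetical helper: count reads that consist only of G or only of N.
# Assumes 4-line FASTQ records (sequence on the 2nd line of each record).
# Reads from the file(s) given as arguments, or from stdin if none are given.
count_homopolymer_reads() {
    awk 'NR % 4 == 2 {
             total++
             if ($0 ~ /^G+$/) polyg++          # all-G read
             else if ($0 ~ /^N+$/) polyn++     # all-N read
         }
         END { printf "total=%d polyG=%d polyN=%d\n", total, polyg, polyn }' "$@"
}
```

For a gzipped file this could be run as `gzip -dc reads.fastq.gz | count_homopolymer_reads`. If the poly-G/poly-N fraction is small, it is probably not worth special handling.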
Is this NextSeq data? "Failing" a test does not automatically make the data bad. If those are bad basecalls due to run overclustering then you would get a lower fraction of reads aligning.
My suggestion is don't mess with the data beyond scanning for and getting rid of adapters/extraneous sequences. If the downstream analysis demonstrates a problem then you can backtrack to diagnose other issues.
Yes, it is NextSeq data. Only one index has been reported in the overrepresented sequences, but adapter content passes (flat). However, a lot of kmers have been reported. So I will do the alignment, run multiqc and see what's going on. Thanks!
Don't depend on FastQC to judge adapter contamination. It does not look at the entire data when reporting various stats (see below). Use a proper scan/trim program like bbduk.sh from BBMap.
I can't figure out how to scan and trim my fastq file with bbduk.sh, since it seems I have to tell it which sequences or kmers I want to remove, and those only appeared in the fastqc report... maybe I'm missing something? Thanks
The idea is that you are not going to remove anything other than Illumina adapters, and only if they are present. If your data has really bad quality (Q10 or less) then, and only then, you may want to do quality-based trimming/filtering (trimq=10). Otherwise something like this should suffice.
If you have PE data
If your data is single-end
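The command lines themselves are missing from this copy of the thread. As a rough sketch of what adapter trimming with bbduk.sh typically looks like, using options described in the BBDuk guide (not necessarily the exact commands originally posted): all file names are placeholders, and `ref=adapters.fa` assumes the adapter file shipped in `bbmap/resources/` has been copied or linked into the working directory.

```shell
# Sketch only: every file name below is a placeholder.
# Assumption: adapters.fa (from bbmap/resources/) is in the working directory.

# Shared adapter-trimming options: trim to the right of an adapter match
# (ktrim=r) using 23-mers, down to 11-mers at read ends (mink=11),
# allowing one mismatch (hdist=1).
trim_opts="ref=adapters.fa ktrim=r k=23 mink=11 hdist=1"

# Single-end data
se_cmd="bbduk.sh in=reads.fastq.gz out=clean.fastq.gz $trim_opts"

# Paired-end data: tpe trims both mates to the same length,
# tbo additionally trims adapters detected by read-pair overlap.
pe_cmd="bbduk.sh in1=r1.fastq.gz in2=r2.fastq.gz out1=clean1.fastq.gz out2=clean2.fastq.gz $trim_opts tpe tbo"

# Print the commands for inspection instead of running them.
printf '%s\n' "$se_cmd" "$pe_cmd"
```

Once the printed command looks right, it can be run with an unquoted `$se_cmd` (or `$pe_cmd`), which works here because none of the placeholder names contain spaces. Quality trimming (`qtrim=rl trimq=10`) would only be appended in the really-bad-quality case mentioned above.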
Once this is done, don't worry about fastqc. Just proceed with your analysis.
bbduk.sh should produce nice stats at the end of the run. Post them here if you want a second opinion.
Thank you very much for the info! Very useful!
Another naive question... should I also use splitnextera.sh? Thanks!
Not unless you have Nextera long mate pair libraries. Do you?
Not for that pool of sequences, but I will have to analyze some paired-end DNA data from Nextera long mate pair libraries.
Then yes. See the guide in bbmap/docs/guides/SplitNexteraGuide.txt for that.