I am working on these files. I don't understand whether these reads cover the complete genome, or whether there are other reads that belong to this particular sample.
My understanding is that the _1 and _2 files together contain the reads for the entire genome of that particular bacterium from a single sample.
How do I go about my assembly now? I know I am missing reads, because the sequence length is only 500 kbp, whereas S. aureus should be about 2.7 Mbp.
It takes 29 seconds to assemble this genome (20 CPUs) with the following statistics:
135 contigs, total 2821177 bp, min 200 bp, max 404505 bp, avg 20897 bp, N50 109762 bp
After removing contigs < 2000 bp, it ends up with 58 contigs and 2788979 bp. That seems to be exactly as expected, so I think something in your procedure wasn't done right.
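For anyone who wants to check numbers like these on their own assembly, here is a minimal sketch of how the summary line can be computed from a list of contig lengths. The function name `assembly_stats` and the toy lengths are made up for illustration; this is not the assembler's own reporting code.

```python
def assembly_stats(lengths):
    """Summarise contig lengths: count, total, min, max, mean and N50."""
    if not lengths:
        raise ValueError("no contigs given")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    # N50: length of the contig at which the running sum, going from
    # the longest contig downwards, first reaches half the total.
    running = 0
    n50 = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "contigs": len(ordered),
        "total": total,
        "min": ordered[-1],
        "max": ordered[0],
        "avg": total // len(ordered),  # integer mean
        "n50": n50,
    }

# Toy example with hypothetical contig lengths (not the real assembly).
stats = assembly_stats([2, 3, 4, 5, 6, 7, 8, 9, 10])
print(stats)
```

With the toy lengths the total is 54, and the three longest contigs (10 + 9 + 8) reach half of that, so the N50 is 8.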
If you want to reproduce what I did, go to this website:
I think you might be getting stuck on less relevant parts of my exercise. The most important point was that nothing is wrong with the data.
Clipped fastq means that the adapters have been removed. Yes, both forward and reverse reads will be interleaved in the same file if you download them the way I suggested.
It is common to remove really small contigs, though you may want to lower the threshold to 1000 bp since this is a single-genome assembly. There isn't going to be much information in the smallest contigs (200 bp) because those contigs can't contain even a single complete gene.
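If your assembler does not have a minimum-length option, the filtering step can be sketched in a few lines of Python. This assumes a plain FASTA file of contigs; the names `read_fasta` and `filter_contigs` and the example records are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def filter_contigs(records, min_len=1000):
    """Keep only contigs that are at least min_len bases long."""
    return [(name, seq) for name, seq in records if len(seq) >= min_len]

# Toy input: one short and one long contig (made-up sequences).
example = [">contig_1", "ACGT" * 50, ">contig_2", "ACGT" * 200, "ACGT" * 100]
kept = filter_contigs(read_fasta(example), min_len=1000)
print([(name, len(seq)) for name, seq in kept])
```

To use it on a real file, pass an open file handle instead of the list, for example `filter_contigs(read_fasta(open("contigs.fasta")))`, where `contigs.fasta` stands in for whatever your assembler wrote out.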
Also, do you think I can reproduce at least most of the data from the paper on just my laptop? It has 4 logical processors (Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz 2.90 GHz) and 8 GB RAM.
I was asked to work on an MRSA bioinformatics project.
The only way to find out is to try. It may work, but if it is not going to, you will find that out quickly (the process would likely crash for lack of memory, since 8 GB may not be enough).
Did you download the complete dataset available from ENA/NCBI SRA? This is an older dataset (from 2012) with a total of 1146212 reads and 150153772 bases. This is a paired-end dataset, meaning the library fragments were sequenced from both ends. These reads should still give about 55x coverage of the 2.7 Mbp Staph genome.
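The coverage figure is just total bases divided by genome size, so it is easy to check for any download. A minimal sketch using the numbers quoted above (the function names are made up for illustration, and 2.7 Mbp is the approximate genome size, not an exact value):

```python
def estimated_coverage(total_bases, genome_size):
    """Average depth of coverage: sequenced bases per genome base."""
    return total_bases / genome_size

def mean_read_length(total_bases, read_count):
    """Average read length in bases."""
    return total_bases / read_count

total_bases = 150_153_772   # bases in the dataset, from the post above
read_count = 1_146_212      # reads in the dataset, from the post above
genome_size = 2_700_000     # approximate S. aureus genome size

coverage = estimated_coverage(total_bases, genome_size)
print(f"{coverage:.1f}x coverage")  # prints "55.6x coverage"
print(f"{mean_read_length(total_bases, read_count):.0f} bp mean read length")
```

If the same calculation on your own files gives a much lower number, part of the dataset is probably missing.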
135 contigs, total 2821177 bp, min 200 bp, max 404505 bp, avg 20897 bp, N50 109762 bp
So this is the same dataset except that you downloaded individual reads while in my earlier suggestion they were interleaved. That shouldn't affect the assembly except to give a slightly different command, and indeed it doesn't.
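To make the interleaved versus separate-files point concrete: in an interleaved FASTQ file the records alternate forward, reverse, forward, reverse, so splitting them back into two sets is mechanical. A rough sketch, assuming standard 4-line FASTQ records (function names and the toy reads are made up):

```python
from itertools import islice

def fastq_records(lines):
    """Yield FASTQ records as 4-tuples (header, sequence, plus, quality)."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield tuple(line.rstrip("\n") for line in record)

def deinterleave(lines):
    """Split interleaved records into (forward, reverse) lists."""
    forward, reverse = [], []
    for index, record in enumerate(fastq_records(lines)):
        # Even-numbered records are read 1, odd-numbered are read 2.
        (forward if index % 2 == 0 else reverse).append(record)
    if len(forward) != len(reverse):
        raise ValueError("odd number of records; file is not interleaved pairs")
    return forward, reverse

# Toy interleaved input with two read pairs (made-up reads).
interleaved = [
    "@r1/1\n", "ACGT\n", "+\n", "IIII\n",
    "@r1/2\n", "TTGA\n", "+\n", "IIII\n",
    "@r2/1\n", "GGCA\n", "+\n", "IIII\n",
    "@r2/2\n", "CCAT\n", "+\n", "IIII\n",
]
fwd, rev = deinterleave(interleaved)
print(len(fwd), len(rev))
```

Either layout contains the same reads, which is why the assembly statistics come out identical.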
I don't think you need to worry about removing adapters.
You can use the default adapters.fa file included in the resources folder of the BBMap suite (the program to use is bbduk.sh), or a program like fastp, which can automatically identify adapters and trim them.
OK, I'll try fastp. I have only used Trimmomatic and cutadapt until now, and they don't identify adapters on their own. The one hint I have is the FastQC graph, which says the adapters are Nextera, but I don't know if I should trust that.
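As a rough illustration of the fastp route, here is a small Python sketch that only builds and prints the command line for a paired-end run. The file names are placeholders, and the helper `fastp_command` is made up; the options shown (`-i`/`-I` for input, `-o`/`-O` for output, `--detect_adapter_for_pe` for paired-end adapter detection) should be checked against `fastp --help` for your installed version.

```python
import shlex

def fastp_command(r1, r2, out1, out2):
    """Build a fastp argument list for paired-end adapter trimming."""
    return [
        "fastp",
        "-i", r1, "-I", r2,          # forward and reverse input reads
        "-o", out1, "-O", out2,      # trimmed forward and reverse output
        "--detect_adapter_for_pe",   # auto-detect adapters for paired-end data
    ]

# Placeholder file names; substitute the real _1 and _2 files.
cmd = fastp_command(
    "sample_1.fastq.gz", "sample_2.fastq.gz",
    "trimmed_1.fastq.gz", "trimmed_2.fastq.gz",
)
print(shlex.join(cmd))
```

The printed line can be pasted into a terminal, or the list can be handed to `subprocess.run(cmd, check=True)` once fastp is installed.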
OK. Problem solved. The problem was on my end. Thank you both for helping!!