Hello all, I am trying to take the complete genome of any organism and give it as input to my Perl script, which generates reads of equal length from random positions in the input. Once the reads are generated, I will give them as input to a prototype assembler that I designed and try to assemble them. My assembler does not yet remove tips or bubbles from the de Bruijn graph constructed from the reads (I will try to add tip and bubble removal later).
To test my assembler, I first downloaded the complete genome of Enterobacteria phage lambda from https://d28rh4a8wq0iu5.cloudfront.net/ads1/data/lambda_virus.fa and generated random reads from the genome using my Perl script. Then I ran my assembler on the simulated reads. It successfully reconstructed the original genome from them.
In the lambda virus genome file (https://d28rh4a8wq0iu5.cloudfront.net/ads1/data/lambda_virus.fa), I could see only one line that starts with the character '>'.
For another experiment, I wanted to test my assembler on a larger complete genome. So I downloaded the complete genome of Drosophila melanogaster, the file GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/215/GCF_000001215.4_Release_6_plus_ISO1_MT/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz).
As with the lambda virus genome file, I expected GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna to have only one line starting with the character '>'. But many lines start with '>'. I could also see that many of the A, T, C, G characters are in lower case, and that there are many 'N's.
Now I would like to know: why does the file ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/215/GCF_000001215.4_Release_6_plus_ISO1_MT/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz contain many lines starting with '>', many lower-case A, T, C, G characters, and many N's?
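A quick way to quantify these three observations is to count them per sequence. In FASTA, each '>' line begins a new sequence record, so a file holding several chromosomes or scaffolds has several of them. Lower-case bases are conventionally used for soft-masked (repeat) regions and N for unknown bases or gaps, though the assembly's own documentation is the place to confirm that for this file. The sketch below is only an illustration: the function names (read_fasta, summarize, open_fasta) and the tiny demo sequences are made up, not part of any library or of the NCBI file.

```python
import gzip
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new '>' line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize(handle):
    """Return one dict per record: id, length, lower-case ACGT count, N count."""
    rows = []
    for header, seq in read_fasta(handle):
        rows.append({
            "id": (header.split() or [""])[0],
            "length": len(seq),
            "lowercase": sum(1 for c in seq if c in "acgt"),
            "n": sum(1 for c in seq if c in "Nn"),
        })
    return rows


def open_fasta(path):
    """Open a plain or gzip-compressed FASTA file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


# Made-up demo data standing in for a real multi-record genome file.
demo = io.StringIO(">seq_demo_1 example\nACGTacgtNN\nACGT\n>seq_demo_2\nnnACGT\n")
for row in summarize(demo):
    print(row)
```

For a real file, something like `summarize(open_fasta("genome.fna.gz"))` would give one row per record; the path here is a placeholder.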
How can I generate simulated reads from any organism's complete genome? Or where can I freely download simulated reads for any organism?
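For reference, the simplest form of read simulation described above (fixed-length reads taken from uniformly random positions) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not my actual Perl script: it produces error-free, forward-strand reads only, upper-cases the sequence, and skips any window containing an N. The function name simulate_reads, its parameters, and the demo sequence are all hypothetical. For a file with many '>' records, it would presumably be applied to each record separately so that no read spans two sequences.

```python
import random


def simulate_reads(genome, read_len, n_reads, seed=None, max_tries=10000):
    """Sample n_reads substrings of length read_len at uniformly random starts.

    Assumptions: reads are error-free and forward-strand only; the genome is
    upper-cased first; windows containing 'N' are rejected and resampled.
    Returns a list of (start, read) tuples.
    """
    if read_len <= 0 or read_len > len(genome):
        raise ValueError("read_len must be between 1 and len(genome)")
    rng = random.Random(seed)
    genome = genome.upper()
    last_start = len(genome) - read_len
    reads = []
    failures = 0  # consecutive rejected windows
    while len(reads) < n_reads:
        start = rng.randint(0, last_start)  # both ends inclusive
        read = genome[start:start + read_len]
        if "N" in read:
            failures += 1
            if failures > max_tries:
                raise RuntimeError("could not find an N-free window")
            continue
        failures = 0
        reads.append((start, read))
    return reads


# Made-up demo sequence with a gap and a lower-case region.
demo_genome = "ACGTTGCANNNNacgtacgtTTGACCA"
for i, (start, read) in enumerate(simulate_reads(demo_genome, 5, 4, seed=1)):
    print(f">read_{i} start={start}")
    print(read)
```

Recording the start position in each header makes it easy to check an assembly against the true layout. Realistic simulators additionally model sequencing errors, both strands, and paired ends, which this sketch leaves out.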
Thanks in advance.
This software is very popular for simulating reads.
Thanks for the answer