Could someone point me to FASTQ short sequence reads and their corresponding assembled FASTA file, so that I can learn assembly from sequencing reads? The data can be from anything, human, insect or plant. I am specifically not looking for huge data sets. Thanks
Only a few people submit their reads as well as their assemblies. SAMN04994921 is a nice example where both a set of Illumina reads and a set of 25 contigs are available.
https://www.ncbi.nlm.nih.gov/biosample/SAMN04994921
There are 41 contigs in this file. ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/580/525/GCF_003580525.1_ASM358052v1/GCF_003580525.1_ASM358052v1_genomic.fna.gz
What does that mean? As I understand it, we create one FASTA file containing the full genome from paired-end reads. But the above file has 41 contigs. Correct me if I am wrong.
I was expecting to get a scaffold from the sequencing reads. I am using SPAdes for assembly. I fetched the read files SRR3528286_1.fastq and SRR3528286_2.fastq and ran SPAdes on them. This gave me one scaffolds.fasta file, which is just a single string of bases intermittently containing N's. Now I want to compare this FASTA file with the one assembled in the project SAMN04994921. But since that FASTA file has 41 contigs, I cannot compare them.
Ah, ok I see.
Yes, that can happen: you're not required to submit your scaffolds; the contigs are the minimum requirement. Is there an option to run SPAdes only up to the contig stage (omitting the scaffolding step)? I can also hardly imagine that SPAdes would output a single scaffold for this assembly.
Is there really only a single sequence in the SPAdes output (= weird), or is it just a single FASTA file (= to be expected)?
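One quick way to answer that question is to count the records in the file rather than eyeballing it. The following is only a minimal sketch in plain Python; the function names `read_fasta` and `summarize_fasta` are made up for illustration and are not part of SPAdes or any other tool. It assumes a standard FASTA layout (header lines starting with `>`), optionally gzipped.

```python
import gzip


def read_fasta(path):
    """Yield (header, sequence) pairs from a plain or gzipped FASTA file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    header, chunks = None, []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize_fasta(path):
    """Return basic statistics: record count, total length, longest, N50, N count."""
    lengths, n_bases = [], 0
    for _, seq in read_fasta(path):
        lengths.append(len(seq))
        n_bases += seq.upper().count("N")
    lengths.sort(reverse=True)
    total = sum(lengths)

    # N50: length of the sequence at which the running sum (longest first)
    # reaches at least half of the total assembly length.
    running, n50 = 0, 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    return {
        "sequences": len(lengths),
        "total_bp": total,
        "longest": lengths[0] if lengths else 0,
        "n50": n50,
        "n_bases": n_bases,
    }
```

Calling `summarize_fasta("scaffolds.fasta")` and `summarize_fasta("GCF_003580525.1_ASM358052v1_genomic.fna.gz")` would then show whether each file holds one sequence or many, and how much of the scaffold file consists of N's.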
That is quite possible. If you use a different assembler you will get a different result. Many other things might also be in play: they might have filtered out some contigs, used different parameters, filtered the input data, ... To more or less mimic what they did, you should read up on their methods and apply those as well (except for the assembler software, that is).
Personally I would also not simply compare the assemblies on length but rather on 'content' (== compare the actual sequence itself). There is software around that can do that.
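To give a feel for what comparing on 'content' means, here is a rough sketch that measures how many k-mers two assemblies share. This is an illustration only, not the method of any particular tool; dedicated programs such as QUAST or MUMmer's dnadiff do this far more thoroughly via alignment. The function names and the default `k=21` are arbitrary choices for the example. K-mers are made strand-independent (canonical), because a contig may be reported on either strand.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_VALID = set("ACGT")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(sequences, k=21):
    """Collect the strand-independent k-mers of an iterable of sequences.

    Each k-mer is stored as the lexicographically smaller of itself and its
    reverse complement. K-mers containing N or other ambiguity codes are skipped.
    """
    kmers = set()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if not set(kmer) <= _VALID:
                continue
            kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


def jaccard(a, b):
    """Jaccard similarity of two sets: shared items divided by all items."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Feeding the sequences of your own assembly and of the submitted one into `canonical_kmers` and then into `jaccard` gives a value between 0 (nothing shared) and 1 (identical k-mer content), regardless of how many contigs each file is split into. Holding all k-mers in a Python set is fine for a bacterial genome but would not scale to large genomes.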
There are 47 tools for assembly evaluation in Omictools.
Also, an interesting one that compares the LTRs reconstructed by different assemblers: Assessing genome assembly quality using the LTR Assembly Index (LAI)
For studying purposes you can try these UGENE NGS tutorials: https://goo.gl/4Kspho or https://goo.gl/cxHCAU.
Then you are probably looking for bacterial genomes.
Could you please provide me with a link to such a data set?
Next question is whether you are looking for short Illumina reads or long PacBio/nanopore reads...
I have gone through this course https://genomics.sschmeier.com/index.html
I used their data to implement the workflow. Now I want some other data sets to start with the assembly.
I am looking for short reads, around 150 bp long, and their corresponding FASTA file, so that I can compare both FASTA files (the given one and the one I assemble from the sequencing reads).
Yes, I am looking for short Illumina reads.
If you download (or plan to use) a certain piece of software, there are usually some test data sets provided with it for trying it out.
OK, so does that mean you're going for SPAdes?
If you want to get other data, have a look at SRA (NCBI) or ENA (EBI); they have usable interfaces to query for the data you want.
EDIT: ah, you want the end result as well. In that case you had better first query for a genome assembly submission and then link through to the actual data associated with it.
Or alternatively, look for a publication about "de novo assembly of the...", which should contain links or accession IDs for the raw data and the assembled sequence.
Guys, I am completely new to this field. I would really appreciate it if you could point me to some bacterial genome sequencing data and its corresponding assembly. Thanks!