Question

Sequence reads and complete assembly

0

6.6 years ago

jeetsahu ▴ 10

Could someone point me to short-read FASTQ files and the corresponding assembled FASTA file, so that I can learn assembly from sequencing reads? The data can be from anything, from human to insect or plant. I am specifically not looking for huge data. Thanks.

assembly sequencing sequence • 4.0k views

ADD COMMENT • link updated 6.6 years ago by oigl ▴ 60 • written 6.6 years ago by jeetsahu ▴ 10

0

I am specifically not looking for huge data.

Then you are probably looking for bacterial genomes.

ADD REPLY • link 6.6 years ago by WouterDeCoster 48k

0

Could you please provide me with a link to such a data set?

ADD REPLY • link 6.6 years ago by jeetsahu ▴ 10

0

The next question is whether you are looking for short Illumina reads or long PacBio/nanopore reads...

ADD REPLY • link 6.6 years ago by WouterDeCoster 48k

0

I have gone through this course https://genomics.sschmeier.com/index.html

I used their data to implement the workflow. Now I want some other data sets to start with the assembly.

I am looking for short reads, around 150 bp long, and their corresponding FASTA file, so that I can compare both FASTA files (the one provided and the one I assemble from the sequencing reads).

ADD REPLY • link 6.6 years ago by jeetsahu ▴ 10

0

Yes, I am looking for short Illumina reads.

ADD REPLY • link 6.6 years ago by jeetsahu ▴ 10

0

If you download (or plan to use) a certain piece of software, there are usually some test data sets provided with it for testing and trying out the software.

ADD REPLY • link 6.6 years ago by lieven.sterck 15k

0

I have gone through this course https://genomics.sschmeier.com/index.html

I used their data to implement the workflow. Now I want some other data sets to start with the assembly.

ADD REPLY • link 6.6 years ago by jeetsahu ▴ 10

0

OK, so does that mean you're going for SPAdes?

If you want other data, have a look at SRA (NCBI) or ENA (EBI); they have usable interfaces to query for the data you want.

EDIT: ah, you want the end result as well. Then you had better first query for a genome assembly submission and then link through to the actual data associated with it.

ADD REPLY • link 6.6 years ago by lieven.sterck 15k

3

Alternatively, look for a publication about "de novo assembly of the...", which should contain links or accession IDs for the raw data and the assembled sequence.

ADD REPLY • link 6.6 years ago by WouterDeCoster 48k

0

Guys, I am completely new to this field. I would really appreciate it if you could point me to some bacterial genome sequencing data and its corresponding assembly. Thanks!

ADD REPLY • link 6.6 years ago by jeetsahu ▴ 10

0

6.6 years ago

oigl ▴ 60

For study purposes, you can try these UGENE NGS tutorials: https://goo.gl/4Kspho or https://goo.gl/cxHCAU.

ADD COMMENT • link 6.6 years ago by oigl ▴ 60

score 3 · Accepted Answer · 2018-10-26

3

6.6 years ago

piet ★ 1.9k

Only a few people submit their reads as well as their assemblies. SAMN04994921 is a nice example where both a set of Illumina reads and a set of 25 contigs are available.

https://www.ncbi.nlm.nih.gov/biosample/SAMN04994921

https://www.ncbi.nlm.nih.gov/sra/SRR3528286

https://www.ncbi.nlm.nih.gov/Traces/wgs/?val=LXWH01#contigs

ADD COMMENT • link 6.6 years ago by piet ★ 1.9k

0

Thanks, I will look into it.

ADD REPLY • link 6.6 years ago by jeetsahu ▴ 10

0

There are 41 contigs in this file: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/580/525/GCF_003580525.1_ASM358052v1/GCF_003580525.1_ASM358052v1_genomic.fna.gz

What does that mean? As I understand it, we create one FASTA file containing the full genome from paired-end reads. But the file above has 41 contigs. Correct me if I am wrong.

ADD REPLY • link 6.6 years ago by jeetsahu ▴ 10

0

yes, so?

This simply means they were able to assemble the genome into 41 contigs. What kind of result/data had you hoped for?

ADD REPLY • link 6.6 years ago by lieven.sterck 15k

0

I was expecting to get a scaffold from the sequencing reads. I am using SPAdes for assembly. I fetched the reads SRR3528286_1.fastq and SRR3528286_2.fastq and ran SPAdes on these two files. This gave me one scaffolds.fasta file, which is just a single string of bases intermittently containing N's. Now I want to compare this FASTA file with the one assembled in the project SAMN04994921, but since that FASTA file has 41 contigs, I cannot compare them.

ADD REPLY • link 6.6 years ago by jeetsahu ▴ 10
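
Editorial sketch: a scaffold is a set of contigs joined by runs of N that stand for gaps of unknown sequence. One way to get contig-like pieces back from a scaffold for comparison is to split it at those runs. The function name `split_scaffold`, the `min_gap` parameter and the example sequence below are made up for illustration; this is a minimal sketch, not part of SPAdes.

```python
import re


def split_scaffold(sequence, min_gap=1):
    """Split a scaffold into contig-like pieces at runs of N.

    min_gap is the shortest run of N (or n) treated as a gap;
    shorter runs are left inside the piece as ambiguous bases.
    """
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    # re.split can return empty strings when the scaffold starts or
    # ends with a gap, so those are filtered out.
    return [piece for piece in gap.split(sequence) if piece]


# Hypothetical scaffold with two gaps.
scaffold = "ACGTACGTNNNNNNNNNNGGCCTTAANNNNNTTGACA"
for index, contig in enumerate(split_scaffold(scaffold), start=1):
    print(index, len(contig), contig)
```

Raising `min_gap` keeps isolated ambiguous bases inside a contig and only breaks at longer gap runs.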

0

Ah, ok I see.

Yes, that can happen. You're not required to submit your scaffolds; the contigs are the minimum requirement. Is there an option to run SPAdes only up to the contig stage (omitting the scaffolding step)? I can also hardly imagine that SPAdes will output a single scaffold for this assembly.

Is there really only a single sequence in the SPAdes output (= weird), or is it just a single FASTA file (= to be expected)?

ADD REPLY • link 6.6 years ago by lieven.sterck 15k

0

I will check whether SPAdes has a contig option. By default, SPAdes gives just a single FASTA file as output. Do you know of any project that has submitted both a FASTA file and sequencing reads to NCBI?

ADD REPLY • link 6.6 years ago by jeetsahu ▴ 10

0

I assume the one piet mentioned is one like that?

How many sequences are there in the SPAdes output file? ( grep -c '>' <fasta-file> )

ADD REPLY • link 6.6 years ago by lieven.sterck 15k
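
Editorial sketch: the `grep -c '>'` check above counts header lines. A small Python equivalent can report the total number of bases at the same time, which helps when comparing two assemblies. The function name `fasta_lengths` and the demo records are hypothetical, and the parser assumes well-formed FASTA (plain or gzipped).

```python
import gzip
import os
import tempfile


def fasta_lengths(path):
    """Return a list of (header, sequence_length) for each FASTA record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    records = []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A header line starts a new record.
                records.append([line[1:], 0])
            elif line and records:
                # Sequence lines add to the most recent record.
                records[-1][1] += len(line)
    return [(name, length) for name, length in records]


# Demo on a tiny made-up FASTA file written to a temporary location.
demo = ">contig_1\nACGTAC\nGT\n>contig_2\nGGCC\n"
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(demo)
records = fasta_lengths(tmp.name)
os.remove(tmp.name)

print(len(records), "sequences,", sum(length for _, length in records), "bases")
```

Calling `fasta_lengths` on a SPAdes output file or on a downloaded `.fna.gz` file should give the same count as the `grep` command, plus the per-sequence lengths.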

0

I am using the same one mentioned by piet.

There are 257 sequences in the output file.

ADD REPLY • link 6.6 years ago by jeetsahu ▴ 10

0

The given fasta file with 41 contigs is 29,305bp shorter than the one obtained by running SPAdes.

ADD REPLY • link 6.6 years ago by jeetsahu ▴ 10
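
Editorial sketch: total length alone does not show how fragmented an assembly is (41 sequences versus 257 here). A common summary of contiguity is N50: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The function name `n50` and the two example length lists are made up for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    total = sum(lengths)
    if total <= 0:
        raise ValueError("need at least one sequence of non-zero length")
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Two hypothetical assemblies with the same total length (400 bases).
even = [100, 100, 100, 100]
skewed = [370, 10, 10, 10]
print(sum(even), n50(even))      # 400 100
print(sum(skewed), n50(skewed))  # 400 370
```

A higher N50 means fewer, longer pieces, but it says nothing about whether those pieces are correct.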

1

That is entirely possible. If you use a different assembler, you will get a different result. Many other things might also be in play: they might have filtered out some contigs, used different parameters, filtered the input data, ... To more or less mimic what they did, you should read their methods and apply those as well (except for the assembler software itself, that is).

Personally, I would also not simply compare the assemblies on length but rather on 'content' (i.e. compare the actual sequence itself). There is software around that can do that.

ADD REPLY • link 6.6 years ago by lieven.sterck 15k
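
Editorial sketch: one crude way to compare two assemblies on content rather than length is to compare their sets of k-mers. The code below computes a Jaccard similarity over canonical k-mers (each k-mer and its reverse complement count as the same k-mer, since contigs can be reported on either strand). This is an illustration only, not what dedicated assembly comparison tools do; the function names and toy sequences are hypothetical, and everything is held in memory, so it only suits small genomes.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_BASES = frozenset("ACGT")


def canonical_kmers(sequences, k=21):
    """Return the set of canonical k-mers found in a list of sequences."""
    found = set()
    for seq in sequences:
        seq = seq.upper()
        for start in range(len(seq) - k + 1):
            kmer = seq[start:start + k]
            # Skip k-mers containing N or other ambiguity codes.
            if not set(kmer) <= _BASES:
                continue
            reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
            found.add(min(kmer, reverse_complement))
    return found


def jaccard(first, second):
    """Shared k-mers divided by all k-mers; 1.0 means identical sets."""
    union = first | second
    return len(first & second) / len(union) if union else 1.0


# Toy example: a sequence and its reverse complement share all k-mers.
assembly_a = ["ACGTTGCAAC"]
assembly_b = ["GTTGCAACGT"]
similarity = jaccard(canonical_kmers(assembly_a, k=4),
                     canonical_kmers(assembly_b, k=4))
print(similarity)
```

A value close to 1 suggests that the two assemblies contain largely the same sequence, even if it is split into a different number of contigs.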

0

How can I make sure that the assembly produced by one assembler is the correct one? Two different assemblers can produce different assemblies. Are there any criteria for deciding whether two different assemblies are similar?

ADD REPLY • link 6.6 years ago by jeetsahu ▴ 10

1

nice topic for a new thread

and if you figure out the answer, let us know as this is probably the million dollar question in the assembly field ;)

ADD REPLY • link 6.6 years ago by lieven.sterck 15k

0

Hahaha... It's still an open question then.

You mentioned software for comparing assemblies. Which software is that?

ADD REPLY • link 6.6 years ago by jeetsahu ▴ 10

1

QUAST. You could also use BUSCO.

ADD REPLY • link 6.6 years ago by GenoMax 151k

1

There are 47 tools for assembly evaluation in Omictools.
Also, an interesting one that compares the LTRs reconstructed by different assemblers: Assessing genome assembly quality using the LTR Assembly Index (LAI)

ADD REPLY • link 6.6 years ago by AK ★ 2.2k