Question

Genome assembly

3.0 years ago

sankkan • 0

Hello,

It is probably a very basic question, yet I struggle to find an answer to it. My lab ordered whole-genome sequencing from a commercial company, and not long ago we received a few .fasta files from them, each containing many short sequences. As I understand it, I now need to map these short sequences to a reference genome to obtain one long sequence. Any ideas how I can do that? I tried a few programs, but they all seem to require .abi files rather than .fasta as input.

assembly virus Genome fasta • 1.7k views

updated 3.0 years ago by Joe 21k • written 3.0 years ago by sankkan • 0

Mapping is just one step of a multi-step RNA-seq analysis. You also have to check the quality of your reads, trim the adapters, etc. It is far from trivial.

Having said that, since you are looking for a mapping tool, have a look at kallisto.

3.0 years ago by official.profile ▴ 20

Thank you for the reply! It's not RNA-seq; we are sequencing viral DNA.

3.0 years ago by sankkan • 0

Are you sure they are FASTA files and not FASTQ files?

Generally speaking, you should be given the FASTQs to run the assembly from (you are doing assembly, not "mapping", at this stage).

I would suggest using a tool like shovill, which will determine some sensible parameters for you (it accepts FASTQs).
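If it is unclear which of the two formats a delivered file is in, the first record usually settles it: FASTA records start with `>` and FASTQ records start with `@`. The sketch below is only a quick check under that assumption, not a validator; the function name `sniff_sequence_format` is made up for this example.

```python
import gzip
from pathlib import Path


def sniff_sequence_format(path):
    """Guess 'fasta' or 'fastq' from the first non-empty line of a file.

    Hypothetical helper: it only inspects the first record marker and
    does not validate the rest of the file.
    """
    path = Path(path)
    # Sequencing deliveries are often gzipped, so handle both cases.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip leading blank lines
            if line.startswith(">"):
                return "fasta"
            if line.startswith("@"):
                return "fastq"
            return "unknown"
    return "unknown"  # empty file
```

Calling `sniff_sequence_format("contigs.fasta")` (the file name is a placeholder) returns one of `"fasta"`, `"fastq"`, or `"unknown"`.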

3.0 years ago by Joe 21k

No, I have FASTA files. As I understand it, these files contain contigs, and one file contains scaffolds. I guess that means they assembled the reads for us. They also sent us a table with information about the assembly (below). Now I need to assemble these sequences into one, but I just can't figure out how.

n    n:500  n:N50  min  N80   N50   N20   E-size  max   sum    name
810  29     8      550  971   1783  3172  1949    3283  40634  K91
635  41     11     504  719   1770  2781  1775    3291  50600  K95
661  34     9      511  816   2078  3180  1912    3299  46070  K99
585  36     10     511  906   1795  2789  1886    3307  49244  K103
530  41     11     506  747   1783  2793  1821    3315  51616  K107
453  30     9      511  968   1787  2797  1924    3323  42359  K111
413  38     11     511  1020  1791  2802  1923    3465  53698  K115
392  34     9      512  1044  2323  3200  2034    3473  49440  K119
302  23     7      511  753   1799  2415  1866    3481  31446  K123
182  26     6      526  706   2331  3489  2215    4718  36379  K127
182  26     6      526  706   2331  3489  2215    4718  36379  scaffolds

updated 3.0 years ago by GenoMax 147k • written 3.0 years ago by sankkan • 0

If they have already assembled these into contigs and/or scaffolds, that's the best you can do with the data you have.

Assuming this isn't what they already did to produce the scaffolds file, all you can do is order your contigs and join them with Ns by aligning them to a reference genome. I don't know what the best tool for this is these days, since you can do basically everything from contigs now, so I will have to let others weigh in on that one.
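To make the "order and join with Ns" idea concrete, here is a minimal sketch. It assumes you have already aligned each contig to the reference with an aligner of your choice and reduced the result to one `(name, reference_start, strand)` tuple per contig. The names `scaffold_contigs` and `placements` and the fixed gap length are all hypothetical; a dedicated reference-guided scaffolder would handle overlaps, unplaced contigs, and realistic gap sizes, which this does not.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def scaffold_contigs(contigs, placements, gap=100):
    """Join contigs in reference order, separated by runs of N.

    contigs    -- dict mapping contig name to its sequence
    placements -- list of (name, reference_start, strand) tuples,
                  assumed to come from your own alignment step
    gap        -- number of Ns between contigs (arbitrary placeholder,
                  not an estimate of the real gap size)
    """
    # Order contigs by where they start on the reference.
    ordered = sorted(placements, key=lambda p: p[1])
    pieces = []
    for name, _start, strand in ordered:
        seq = contigs[name]
        # Contigs that aligned to the minus strand are flipped first.
        pieces.append(reverse_complement(seq) if strand == "-" else seq)
    return ("N" * gap).join(pieces)
```

For example, `scaffold_contigs({"a": "AAAC", "b": "GGT"}, [("a", 500, "+"), ("b", 10, "-")], gap=3)` places `b` first (reverse-complemented) and `a` second, with three Ns in between.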

You will not get a closed, complete genome from this data without doing a hybrid assembly with a long-read technology (and a bit of luck).

I would also check that assembly statistics table and sort it by N50. I don't know what your organism is, but those N50s look awful to me, though that might be my inexperience with viruses showing through.
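Sorting the table by N50 can be done in a few lines. The sketch below copies the values from the table posted above into a string and ranks the assemblies from highest to lowest N50; the helper names `parse_stats` and `sort_by_n50` are invented for this example.

```python
# Values copied from the assembly table posted earlier in the thread.
ASSEMBLY_STATS = """\
n n:500 n:N50 min N80 N50 N20 E-size max sum name
810 29 8 550 971 1783 3172 1949 3283 40634 K91
635 41 11 504 719 1770 2781 1775 3291 50600 K95
661 34 9 511 816 2078 3180 1912 3299 46070 K99
585 36 10 511 906 1795 2789 1886 3307 49244 K103
530 41 11 506 747 1783 2793 1821 3315 51616 K107
453 30 9 511 968 1787 2797 1924 3323 42359 K111
413 38 11 511 1020 1791 2802 1923 3465 53698 K115
392 34 9 512 1044 2323 3200 2034 3473 49440 K119
302 23 7 511 753 1799 2415 1866 3481 31446 K123
182 26 6 526 706 2331 3489 2215 4718 36379 K127
182 26 6 526 706 2331 3489 2215 4718 36379 scaffolds
"""


def parse_stats(text):
    """Turn the whitespace-separated table into a list of dicts."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    records = []
    for row in rows:
        record = dict(zip(header, row))
        # Every column except the assembly name is an integer.
        for key in header:
            if key != "name":
                record[key] = int(record[key])
        records.append(record)
    return records


def sort_by_n50(records):
    """Rank assemblies from highest to lowest N50 (ties keep table order)."""
    return sorted(records, key=lambda r: r["N50"], reverse=True)


if __name__ == "__main__":
    for rec in sort_by_n50(parse_stats(ASSEMBLY_STATS)):
        print(f'{rec["name"]:<10} N50={rec["N50"]:>5} sum={rec["sum"]}')
```

With these numbers the K127 and scaffolds rows come out on top, and they are identical to each other in every column.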

3.0 years ago by Joe 21k

score 0 · Answer 1 · 2021-11-30

3.0 years ago

Gregor Rot ▴ 540

Well, you would usually do de novo assembly from DNA, and you would usually receive FASTQ files (which also contain base qualities) rather than only FASTA files (sequence only).

3.0 years ago by Gregor Rot ▴ 540