What is the format of *.contigs.fasta files?
8.2 years ago
Na Sed ▴ 310

Hi everyone,

I have been given a file named AA.contigs.fasta. Its first lines look like this:

>tig00000000 len=1940327 reads=4609 covStat=3434.17 gappedBases=no class=contig suggestRepeat=no suggestCircular=no
ATCTGCTTCATCCGCATCGAATCACGGGCACTCAGATGATCTCTAGGGCACGACCTAAACCCACCTGACGCGCCATACGAGATGCACCTCCGCCACAAGG
GAAGATGCCCATACCCACTTCCATCTGCATGAATTTGTATTTACCGCGAGCGGCAAAACGCATGTCACAGGCCAAGCAAATTCATGACCGCCACCACGGG
ACAAGCCTTCTAGCTTCGCAATTGTAGCTTGTGGTAGCTTGCTGATTCTTTCGAGAACCGCTTGAAGATCGAGTAACTTCGCTTCCTCGCGAGAAACGGC
CTCTGTCGACATCTCTTTAAGCAATTCGGTATCGTAATGACAAACCCAAATCTCCGGGTTGGCTGATTGGAATACAACCACTTTGACACTACGATCACGT
TCTAATCGCAGTGCTAACCCATTCAAATCCGCCAACATTTCCTGCCCTTGCACGTTTACCGTACCAAAATCAAACGTGACATAAAGAATTGCGTCTTCTT
GCTTTGCAGTAAACGTTTTGTAGCCTTCGTAAGCCATATCCATTTCCTTTTTCCAATAAAATCACTAGGTTGCTATTTTTCAAAGCAACGCAATTAACGT
TACGCCTCTAAAAAACATCAAACAATGACGCATAAAAAGAAACAGTATCTACGAAAACTAAAAGGTGATTTCCTCAATAACGGCTAGCAACAAATCACGT

1- Could you please tell me about the format of the file? For example, what does each row represent? How is a file like this obtained? What does the information in the first line mean?

2- Given this file, how can I calculate the total genomic length of the assembly?

3- Do you know of any references on this material? I am completely unfamiliar with it and want to learn.

Thank you.

next-gen contig fasta • 5.3k views

Wikipedia is often a good place to start:

https://en.wikipedia.org/wiki/FASTA_format


What is the role of 'contigs' in the file name? Also, I have only one file for each genome, and each file has ~60,000 lines. All lines except the first contain only A, C, G, and T.


Did you check the FASTA_format Wikipedia link provided by @Brian above?

  1. It is hard to say exactly how this file was obtained, but if 'contigs' in the file name means what it should, the file was most likely produced by a sequence assembly program.
  2. 1940327 is the length of the sequence you posted above (the len= field in its header line).
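To make the header line easier to read, here is a minimal Python sketch that splits it into the sequence name and its key=value fields. The function name parse_header is made up for illustration, and the sketch assumes every field after the name has the form key=value, as in the header posted above; headers from other tools may be laid out differently.

```python
def parse_header(line):
    """Split a FASTA header into (sequence_id, {key: value}).

    Assumes the header looks like '>name key=value key=value ...',
    as in the example posted in the question.
    """
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    parts = line[1:].split()
    if not parts:
        return "", {}
    seq_id = parts[0]
    # Keep only tokens that actually contain '='; ignore anything else.
    fields = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
    return seq_id, fields


header = (">tig00000000 len=1940327 reads=4609 covStat=3434.17 "
          "gappedBases=no class=contig suggestRepeat=no suggestCircular=no")
seq_id, fields = parse_header(header)
print(seq_id)              # tig00000000
print(int(fields["len"]))  # 1940327
```

All values come back as strings, so numeric fields such as len need an explicit int() or float() conversion.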

The number of lines/rows has no special meaning. The DNA sequence is one continuous string; it has most likely been split across multiple lines (the "rows" you are referring to) for ease of display.
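Because each sequence is one string wrapped over several lines, the total assembly length is just the number of sequence characters summed over all records. A minimal Python sketch, assuming a plain-text (uncompressed) FASTA file; the function name contig_lengths and the tiny in-memory example are made up for illustration:

```python
import io


def contig_lengths(handle):
    """Return {sequence_id: length} for every record in a FASTA handle."""
    lengths = {}
    current = None
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # New record: the ID is the first word after '>'.
            tokens = line[1:].split()
            current = tokens[0] if tokens else ""
            lengths[current] = 0
        elif current is not None:
            # Sequence line: add its characters to the current record.
            lengths[current] += len(line)
    return lengths


# Small in-memory example; a real file would be opened with open(path).
example = io.StringIO(">tig1 len=10\nACGTA\nCGTAC\n>tig2 len=4\nGGCC\n")
lengths = contig_lengths(example)
total_length = sum(lengths.values())
print(lengths)       # {'tig1': 10, 'tig2': 4}
print(total_length)  # 14
```

For a real file, something like `with open("AA.contigs.fasta") as fh: lengths = contig_lengths(fh)` should work. If the headers carry a len= field, the computed lengths can be compared against it as a sanity check.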


The one-line description of the file says that it is a de novo assembled genome. In that case, what is the number of contigs? Is it equal to the number of rows?


The number of contigs is the number of headers. Each starts with a '>' symbol.
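Counting contigs therefore comes down to counting the lines that begin with '>'. A minimal Python sketch of that idea; the function name count_contigs and the in-memory example are made up for illustration:

```python
import io


def count_contigs(handle):
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in handle if line.startswith(">"))


# Two records, one of them wrapped over two sequence lines.
example = io.StringIO(">tig1\nACGT\nACGT\n>tig2\nGG\n")
n_contigs = count_contigs(example)
print(n_contigs)  # 2
```

A result of 0 on a real file would suggest that the file is not in FASTA format, or is compressed.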


@Brian There is no '>' character in the file. Please see one of the files via this Dropbox link: https://www.dropbox.com/s/2f1scrsgk0n2p4c/lbc26_ABCD.contigs.fasta?dl=0


If there is no '>' character, it is not a FASTA file. The example in your first post starts with '>'.
