8.2 years ago
Na Sed
Hi everyone,
I have been given a file named AA.contigs.fasta. Its first lines look like this:
>tig00000000 len=1940327 reads=4609 covStat=3434.17 gappedBases=no class=contig suggestRepeat=no suggestCircular=no
ATCTGCTTCATCCGCATCGAATCACGGGCACTCAGATGATCTCTAGGGCACGACCTAAACCCACCTGACGCGCCATACGAGATGCACCTCCGCCACAAGG
GAAGATGCCCATACCCACTTCCATCTGCATGAATTTGTATTTACCGCGAGCGGCAAAACGCATGTCACAGGCCAAGCAAATTCATGACCGCCACCACGGG
ACAAGCCTTCTAGCTTCGCAATTGTAGCTTGTGGTAGCTTGCTGATTCTTTCGAGAACCGCTTGAAGATCGAGTAACTTCGCTTCCTCGCGAGAAACGGC
CTCTGTCGACATCTCTTTAAGCAATTCGGTATCGTAATGACAAACCCAAATCTCCGGGTTGGCTGATTGGAATACAACCACTTTGACACTACGATCACGT
TCTAATCGCAGTGCTAACCCATTCAAATCCGCCAACATTTCCTGCCCTTGCACGTTTACCGTACCAAAATCAAACGTGACATAAAGAATTGCGTCTTCTT
GCTTTGCAGTAAACGTTTTGTAGCCTTCGTAAGCCATATCCATTTCCTTTTTCCAATAAAATCACTAGGTTGCTATTTTTCAAAGCAACGCAATTAACGT
TACGCCTCTAAAAAACATCAAACAATGACGCATAAAAAGAAACAGTATCTACGAAAACTAAAAGGTGATTTCCTCAATAACGGCTAGCAACAAATCACGT
1- Could you please tell me about the format of this file? For example, what does each row represent? How is such a file obtained? What does the information in the first line mean?
2- Given this file, how can I calculate the total genomic length of the assembly?
3- Do you know of any references on this material? I am completely unfamiliar with this topic and want to learn.
Thank you.
Wikipedia is often a good place to start:
https://en.wikipedia.org/wiki/FASTA_format
What does 'contigs' in the file name mean? Also, I have only one file for each genome, and each file has ~60,000 lines. All lines except the first contain only A, C, G, and T.
Did you check the FASTA format Wikipedia link provided by @Brian above?
Number of lines/rows has no special meaning. The DNA sequence is a continuous string. It has likely been split across multiple lines ("rows" that you are referring to) for ease of display.
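To make the point about line wrapping concrete, here is a minimal Python sketch that joins the wrapped lines back into one sequence per header. The function name `read_fasta` and the tiny example records are made up for illustration; it assumes a plain-text file where headers start with '>' and everything else is sequence.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Hypothetical helper for illustration: sequence lines that follow a
    header are concatenated, so the line wrapping disappears.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []  # drop the leading '>'
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up example: two records, the first wrapped over two lines.
example = [">tig1 len=8", "ATCG", "GGCA", ">tig2 len=3", "TTT"]
records = list(read_fasta(example))
```

With a real file you could pass the open file handle instead of a list, e.g. `with open("AA.contigs.fasta") as handle: records = list(read_fasta(handle))`.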
The one-line description of the file says that it is a de novo assembled genome. In that case, what is the number of contigs? Is it equal to the number of rows?
The number of contigs is the number of headers. Each starts with a '>' symbol.
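As a rough sketch of how this could be counted in Python: the number of contigs is the number of lines starting with '>', and the total assembly length is the summed length of all other lines. The name `assembly_stats` is just an illustration, and the sketch assumes every non-header line is sequence.

```python
def assembly_stats(lines):
    """Return (number_of_contigs, total_length) for an iterable of FASTA lines.

    Hypothetical helper: header lines (starting with '>') are counted as
    contigs, and the stripped length of every other line is added to the total.
    """
    n_contigs = 0
    total_length = 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            n_contigs += 1
        else:
            total_length += len(line)
    return n_contigs, total_length


# Made-up example: two contigs of 6 and 2 bases.
stats = assembly_stats([">a", "ACGT", "AC", ">b", "GG"])

# With a real file (path taken from the first post):
# with open("AA.contigs.fasta") as handle:
#     print(assembly_stats(handle))
```

If the file had only one header, this would report one contig regardless of how many lines the sequence is wrapped over.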
@Brian There is no '>' character in the file. Please see one of the files via the Dropbox link: https://www.dropbox.com/s/2f1scrsgk0n2p4c/lbc26_ABCD.contigs.fasta?dl=0
If there is no '>' character, it is not a FASTA file. Your example in the first post starts with '>'.
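A quick way to check this yourself, as a minimal sketch: look at the first non-blank line and see whether it starts with '>'. The function name `looks_like_fasta` is made up for illustration, and this is only a sanity check, not a full format validation.

```python
def looks_like_fasta(lines):
    """Return True if the first non-blank line starts with '>'.

    Hypothetical helper: a rough sanity check only, it does not
    validate the rest of the file.
    """
    for line in lines:
        if line.strip():
            return line.lstrip().startswith(">")
    return False  # empty input


# Made-up examples.
with_header = looks_like_fasta([">tig00000000 len=1940327", "ATCTGCTT"])
without_header = looks_like_fasta(["ATCTGCTT", "GAAGATGC"])

# With a real file (path taken from the first post):
# with open("AA.contigs.fasta") as handle:
#     print(looks_like_fasta(handle))
```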