Entering edit mode
2.5 years ago
Alewa
▴
170
I'm confused how the contig length is calculated. should it not be equal to the total number of nucleotides in the Mycobacterium tuberculosis genome? seems that's not the case given the demo below
Here's example
Give me snippet of Mycobacterium tuberculosis reference genome
$ esearch -db nuccore -query 'Mycobacterium tuberculosis H37Rv[Organism] AND NC_000962.3[ACCN]' | efetch -format fasta | head
>NC_000962.3 Mycobacterium tuberculosis H37Rv, complete genome
TTGACCGATGACCCCGGTTCAGGCTTCACCACAGTGTGGAACGCGGTCGTCTCCGAACTTAACGGCGACC
CTAAGGTTGACGACGGACCCAGCAGTGATGCTAATCTCAGCGCTCCGCTGACCCCTCAGCAAAGGGCTTG
GCTCAATCTCGTCCAGCCATTGACCATCGTCGAGGGGTTTGCTCTGTTATCCGTGCCGAGCAGCTTTGTC
CAAAACGAAATCGAGCGCCATCTGCGGGCCCCGATTACCGACGCTCTCAGCCGCCGACTCGGACATCAGA
TCCAACTCGGGGTCCGCATCGCTCCGCCGGCGACCGACGAAGCCGACGACACTACCGTGCCGCCTTCCGA
AAATCCTGCTACCACATCGCCAGACACCACAACCGACAACGACGAGATTGATGACAGCGCTGCGGCACGG
GGCGATAACCAGCACAGTTGGCCAAGTTACTTCACCGAGCGCCCGCACAATACCGATTCCGCTACCGCTG
GCGTAACCAGCCTTAACCGTCGCTACACCTTTGATACGTTCGTTATCGGCGCCTCCAACCGGTTCGCGCA
CGCCGCCGCCTTGGCGATCGCAGAAGCACCCGCCCGCGCTTACAACCCCCTGTTCATCTGGGGCGAGTCC
how many characters (Nucleotides) are in the reference genome
$ esearch -db nuccore -query 'Mycobacterium tuberculosis H37Rv[Organism] AND NC_000962.3[ACCN]' | efetch -format fasta | grep -v "^>" | wc -c
4474554
To check contig lemgth, create fasta dictionary and open it to check contig length
$ esearch -db nuccore -query 'Mycobacterium tuberculosis H37Rv[Organism] AND NC_000962.3[ACCN]' | efetch -format fasta > tb_Ref_NC_000962_3.fasta
$ samtools faidx /local/workdir/sam/results/vcf2fasta/tb_Ref_NC_000962_3.fasta
$ cat /local/workdir/sam/results/vcf2fasta/tb_Ref_NC_000962_3.dict
@HD VN:1.6
@SQ SN:NC_000962.3 LN:4411532 M5:57b12ff2773c5fd3a0f879972f176e50 UR:file:/local/workdir/sam/results/vcf2fasta/tb_Ref_NC_000962_3.fasta
that clarifies! thanks GenoMax