Hi all, I have a multi-FASTA file containing around 200 contigs. I want to calculate the base composition, mainly the GC content and length of each contig. Can anyone suggest how to do this using awk or Perl?
Thank you.
While you could write some Perl to do this, if you are only interested in some basic information about the sequences then using an existing tool such as the EMBOSS program infoseq is probably going to be easier. For example, getting the sequence length and GC composition:
$ infoseq -auto -only -accession -length -pgc em_rel_est_env
Accession Length %GC
AB446243 43 55.81
AB446244 174 59.20
AB446245 195 52.31
AB446246 205 61.46
AB446247 133 60.15
AB446248 106 62.26
AB446249 73 63.01
AB446250 216 57.41
...
While this example uses white-space padded columns, the '-nocolumns' and '-delimiter' options can be used to produce a delimited table for easier parsing, and the header line detailing the columns can be disabled using the '-noheading' option.
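As a rough illustration of why the delimited form is easier to parse, here is a small awk sketch that keeps only the records above a %GC threshold. The function name 'high_gc' is made up for this example, and the input rows are hand-written from the table above in the shape I would expect from '-nocolumns -noheading' with a pipe delimiter (name|length|%GC). That layout is an assumption, so check it against your real infoseq output first.

```shell
#!/bin/sh
# high_gc (hypothetical helper): print records whose %GC exceeds a threshold.
# Assumption: three pipe-delimited fields (name|length|%GC), no header line.
# The threshold is the first argument and defaults to 60.
high_gc() {
    awk -F'|' -v min="${1:-60}" '$3 + 0 > min { print $1 "\t" $2 "\t" $3 }'
}

# Demo rows copied by hand from the example table, not real infoseq output.
printf 'AB446243|43|55.81\nAB446246|205|61.46\nAB446249|73|63.01\n' | high_gc 60
# prints (tab-separated):
# AB446246  205  61.46
# AB446249  73   63.01
```

With real data you would pipe the infoseq output into 'high_gc' instead of the printf line.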
If you are interested in extracting other information from the sequences, as a starting point try looking at the other EMBOSS programs: http://emboss.open-bio.org/html/use/apbs02.html
From Perl you could use 'system()' to run an EMBOSS program externally, or you could use the EMBOSS support in BioPerl to run EMBOSS programs (see http://www.bioperl.org/wiki/HOWTO:Beginners#Using_EMBOSS_applications_with_Bioperl).
Alternatively you could use the sequence information support available in BioPerl, see http://www.bioperl.org/wiki/HOWTO:Beginners#Obtaining_basic_sequence_statistics.
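Since the question asked about awk: if installing EMBOSS or BioPerl is not an option, something along these lines may be enough. This is only a minimal sketch, not a replacement for infoseq. The function name 'fasta_stats' is made up for the example, and it assumes plain nucleotide FASTA where the contig name is the first word after '>'. Length counts every character in the sequence lines, so ambiguity codes such as N are included in the denominator, which may give slightly different %GC values than other tools.

```shell
#!/bin/sh
# fasta_stats (hypothetical helper): print name, length and %GC per record.
# Reads the files given as arguments, or standard input if there are none.
fasta_stats() {
    awk '
    function report() {
        # Emit one tab-separated row for the record collected so far.
        if (seen)
            printf "%s\t%d\t%.2f\n", name, len, (len > 0 ? 100 * gc / len : 0)
    }
    /^>/ {
        report()
        name = substr($1, 2)   # first word of the header, without ">"
        len = 0; gc = 0; seen = 1
        next
    }
    {
        seq = $0
        gsub(/[ \t\r]/, "", seq)        # drop whitespace and stray CRs
        len += length(seq)
        gc  += gsub(/[GCgc]/, "", seq)  # gsub returns the number of matches
    }
    END { report() }
    ' "$@"
}

# Tiny demo: contig1 is wrapped over two lines, contig2 has no G or C.
printf '>contig1 demo\nACGT\nGGCC\n>contig2\nATAT\n' | fasta_stats
# prints (tab-separated):
# contig1  8  75.00
# contig2  4  0.00
```

For a real file you would call it as 'fasta_stats contigs.fasta', where the file name is just a placeholder.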
infoseq is working fine, but I have a small problem. My sequences look like this
and I am getting results like
Could you please let me know how I can get the name of each contig in place of the accession number?
From the 'infoseq' usage message (infoseq -help):
The default behaviour of 'infoseq' is to display all of the columns with the appropriate headers, so you can always use something like:
To see what all the columns look like, and then select the appropriate ones for your specific data.