Hi Everyone,
I am new to bioinformatics and I am trying to extract my bins (containing contigs) from a metagenome file. I have the following scripts but when I checked the fasta file for each bin, it doesn't match with the number and contigs IDs within the bin. Could you help me what's wrong with this script? It seems that it can read the contigs with 4 digits (example: Naka_1045) and the remaining contigs containing more than 4 were not included in the fasta file. I also notice that all that it included all the first 2 digits from my contig files (ex, all those that starts with 10 in the case of Naka_1045). I hope you could help me on this. Here's the script given to me:
#!/usr/bin/perl -w
$usage = "perl.exe $0 [filename.txt] [filename.fasta]\n";
unless (@ARGV == 2) {die $usage;}
print "This script puts contigs from a metagenome fasta file into buckets with entries of the same organism(-group).\n";
#13
open INFILE, "$ARGV[0]" or die "cannot open $ARGV[0]\n";
my %contig;
# my @organism;
my %bucket;
$n= 0;
while (<INFILE>)
{
@parameters= split;
$contig{$parameters[0]}= $parameters[1];
# $organism[$n]= $parameters[1];
#28
if (not exists $bucket{$contig{$parameters[0]}})
{
$bucket{$contig{$parameters[0]}}= 1;
}
else
{
$bucket{$contig{$parameters[0]}}= $bucket{$contig{$parameters[0]}} + 1;
}
$n ++;
}
#40
print "$n contigs are listed in $ARGV[0].\n";
print "The metagenome comprises the following groups:\n";
print "\n";
print "Group:\tEntries:\n";
#46
foreach (keys %bucket)
{
print "$_\t$bucket{$_}\n";
}
close INFILE;
open INFILE, "$ARGV[1]" or die "cannot open $ARGV[1]\n";
#57
my @entry;
my %sequence;
$m= 0;
while (<INFILE>)
{
@values= split;
if (exists $values[0])
{
$z= $values[0];
#69
if ($z=~ m/>/)
{
$entry[$m]= $z;
$sequence{$entry[$m]}= "";
# print "$entry[$m]\n";
$m++;
}
else
{
#79 print ".";
if (exists $sequence{$entry[$m - 1]})
{
$sequence{$entry[$m - 1]}= "$sequence{$entry[$m - 1]}$z\n";
# print "$sequence{$entry[$m - 1]}\n";
}
else
{
print "error\n";
}
}
#90
}
}
close INFILE;
$extract= "";
$strain= "";
$matched= 0;
foreach (keys %bucket)
{
$strain= $_;
open OUTFILE, ">$strain-$ARGV[1]" or die "cannot open $strain-$ARGV[1]\n";
#104
for ($a= 0; $a < $m; $a ++)
{
$extract= $entry[$a];
$extract= substr($extract, 1, 13);
#109
if (exists $contig{$extract})
{
if ($contig{$extract} eq $strain)
{
print OUTFILE "$entry[$a]\n";
print OUTFILE "$sequence{$entry[$a]}";
$matched ++;
}
}
}
close OUTFILE;
}
print "$ARGV[1] contains $m entries; $matched thereof match to $ARGV[0].\n";
Please use the formatting bar (especially the
code
option) to present your post better. I've done it for you this time.Thank you for helping me. I just copied it from my file and pasted here. I apologize for not putting it correctly.
Hello mjoval and welcome to biostars,
it would be more helpful if you show us how your input files looks like and how your desired output should look like based on the example input you show us.
fin swimmer
example of input is missing.
My input file in txt contains the different contigs in 1st column and the bin number in the second column like the following:
I would like to have a separate fasta files of the bin# containing the contigs that belong to that bin#. (Ex. Bin#151- containing Contig Naka2016_4, Naka2016_6; Bin#146-containing Naka2016_3,Naka2016_7,Naka2016_14,Naka2016_19). Using the scripts would gave me that fasta file but some of the contigs doesn't correspond to the input list that I have. Thank you in advance for your help. I hope i have given the sample input correctly.