Hello all, I wrote a script below to analyze two file formats from bedtools (-tab
option, -name
option) so it could combine the headers if the sequences match. It works to an extent. The problem I ran into was that if the sequences matched to multiple names it only printed one of the names that correspond to it. I was wondering if anyone had a suggestion to how to approach this. As I want both the position of the sequence and name. Is there an option through bedtools?
My script stores both files into their own hashes and then loops through then if they are equal it is suppose to print out matches in sequences with the appropriate names
sample data: -name output bedtools
>sequence1
AGGT
>sequence2
AAAA
>sequence3
CCCC
sample data: -tab output bedtools
>Chr3
AAAA
>Chr4
ACCT
>Chr2
CCCC
output from script
>sequence2|Chr3
AAAA
>sequence3|Chr2
CCCC
my %sequence;
open(NAMES_FILE, $ARGV[0]) or die "Cannot open the file: $!";
my $hash_key_name;
my $hash_value_name;
while (my $line = <NAMES_FILE>) {
if ($line =~ /^>(\S+)/) {
$hash_key_name = $1;
}
elsif ($line =~ /\S/) {
chomp $line;
$hash_value_name = $line;
$sequence{$hash_key_name} = $hash_value_name;
}
}
my %sequence_2;
open (POSITIONS_FILE, $ARGV[1]) or die "Cannot open the file: $!";
my $hash_key_pos;
my $hash_value_pos;
while (my $line2 = <POSITIONS_FILE>) {
if ($line2 =~ /^>(\S+)/) {
$hash_key_pos = $1;
}
elsif ($line2 =~ /\S/) {
chomp $line2;
$hash_value_pos = $line2;
$sequence_2{$hash_key_pos} = $hash_value_pos;
}
}
foreach $hash_key_pos (keys %sequence_2) {
foreach $hash_key_name (keys %sequence) {
if ($sequence{$hash_key_name} eq $sequence_2{$hash_key_pos})
{
print ">$hash_key_name|$hash_key_pos\n$sequence{$hash_key_name}\n"
}
}
}
if i could also get help on how to format code when posting to biostars - that would be grand. Cheers.