Question

Extract According To Row

0

11.8 years ago

2011101101 ▴ 110

I have a query document in FASTA format. I want to extract the rows of another document whose sequences appear in the query document. Both documents are large.

For example, the query document looks like this:

>1
AAAAAAAAAAAAACAGTTGGCATG
>2
AAAAAAAAAAAAACCGAGTACCGTTCACGCC
>3
AAAAAAAAAAAAACCTTGAAC

The other document looks like this:

motif    MZQ1    MZQ3    MZQ4    MZQ5    MZQ6    MZQ7    MZQ8    MZQ2
AAAAAAAAAAAAAGCTCGGAT    1    0    0    0    0    0    0    0
AAAAAAAAAAAAACAGTTGGCATG    0    0    0    0    0    1    0    0
AAAAAAAAAAAAACCGAGTACCGTTCACGCC    0    0    0    0    0    1    0    0
AAAAAAAAAAAAACCTTGAAC    0    0    0    0    0    0    0    1
AAAAAAAAAAAAACGGGATTC    0    0    0    0    1    0    0    0
AAAAAAAAAAAAACTCAGTTCTGCCT    0    0    0    0    0    1    0    0

The expected result is the following:

motif    MZQ1    MZQ3    MZQ4    MZQ5    MZQ6    MZQ7    MZQ8    MZQ2
AAAAAAAAAAAAACAGTTGGCATG    0    0    0    0    0    1    0    0
AAAAAAAAAAAAACCGAGTACCGTTCACGCC    0    0    0    0    0    1    0    0
AAAAAAAAAAAAACCTTGAAC    0    0    0    0    0    0    0    1

• 3.7k views

ADD COMMENT • link updated 11.8 years ago by Pierre Lindenbaum 164k • written 11.8 years ago by 2011101101 ▴ 110

2

It would be nice if you could try out the suggested solutions and let us know which one performed best. I would be interested to see a timing comparison of the solutions offered by Poe and Pierre Lindenbaum. Thanks.

ADD REPLY • link 11.8 years ago by zx8754 12k

score 3 · Answer 1 · 2013-02-22

3

11.8 years ago

PoGibas 5.1k

head -n 1 another_document > result && grep -v '>' query_document | grep -F -f - another_document >> result

ADD COMMENT • link 11.8 years ago by PoGibas 5.1k

0

What does "grep -F -f - another_document" do?

ADD REPLY • link 11.8 years ago by 2011101101 ▴ 110

1

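grep -F -f reads a list of fixed-string patterns from a file and prints the lines of another file that contain any of them (see http://stackoverflow.com/a/11490467/1286528).
The "-" tells grep to read that pattern list from standard input, which here is the set of motif sequences piped in from the first grep.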
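
To make the matching rule concrete, here is a minimal Python sketch of fixed-string filtering. It is only an illustration: the function name and the sample rows are made up, and tab-separated rows are assumed. Like grep -F, it accepts a pattern anywhere in the line, so a motif that is a prefix of a longer motif also selects the longer row.

```python
def fixed_string_filter(patterns, lines):
    """Yield each line that contains at least one pattern as a literal substring.

    Illustrative stand-in for fixed-string filtering; the name is hypothetical.
    """
    # Skip empty patterns, since an empty string is a substring of every line.
    wanted = [p for p in patterns if p]
    for line in lines:
        if any(p in line for p in wanted):
            yield line


if __name__ == "__main__":
    # Made-up sample data: two query motifs and four tab-separated rows.
    motifs = ["AAAAAAAAAAAAACAGTTGGCATG", "AAAAAAAAAAAAACCTTGAAC"]
    table = [
        "AAAAAAAAAAAAAGCTCGGAT\t1\t0",
        "AAAAAAAAAAAAACAGTTGGCATG\t0\t1",
        "AAAAAAAAAAAAACCTTGAAC\t0\t0",
        "AAAAAAAAAAAAACCTTGAACGG\t1\t1",  # longer motif, selected by substring match
    ]
    for row in fixed_string_filter(motifs, table):
        print(row)
```

If only exact matches on the first column are wanted, comparing the first field against a set of sequences avoids the substring effect.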

ADD REPLY • link 11.8 years ago by PoGibas 5.1k

score 2 · Answer 2 · 2013-02-22

2

11.8 years ago

Pierre Lindenbaum 164k

head -n 1 other.tsv > result
sort -t $'\t' -k1,1 other.tsv | join -t $'\t' -1 1 -2 1 <(grep -v ">" file.fa | sort -u) - >> result

ADD COMMENT • link 11.8 years ago by Pierre Lindenbaum 164k

0

+1 I was just thinking of using Join as well, but then I saw your answer :)

ADD REPLY • link 11.8 years ago by zx8754 12k

score 1 · Answer 3 · 2013-02-22

1

11.8 years ago

Alex Reynolds 36k

Put the FASTA sequences into a hash table, and print out rows from the matrix file if the motif field element is defined in the hash table:

#!/usr/bin/env perl

use strict;
use warnings;

# Usage: filter.pl <fasta> <matrix>
my ($fastaFn, $masterFn) = @ARGV;
die "usage: $0 <fasta> <matrix>\n" unless defined $masterFn;

# Map each FASTA sequence to its header (assumes one sequence line per record).
my %seqs;
my $header;

open my $fasta, '<', $fastaFn or die "could not open FASTA file: $!\n";
while (<$fasta>) {
    chomp;
    if (/^>/) {
        ($header = $_) =~ s/^>//;
    }
    else {
        $seqs{$_} = $header;
    }
}
close $fasta;

open my $master, '<', $masterFn or die "could not open master file for filtering: $!\n";

# Print the header row unchanged (it still carries its own newline).
my $ln = <$master>;
print $ln if defined $ln;

# Print only the rows whose first (motif) field is a known sequence.
while (<$master>) {
    chomp;
    next unless length;
    my ($motif) = split /\t/, $_, 2;
    if (exists $seqs{$motif}) {
        print "$_\n";
    }
}
close $master;

To use it:

$ filter.pl myQuerySeqs.fa myDataMatrix.mtx > myFilteredMatrix.mtx

The file myQuerySeqs.fa is your FASTA file. The myDataMatrix.mtx file is the "master" matrix file that you want to filter on sequences from the FASTA file. Output is sent to myFilteredMatrix.mtx.

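This should be fairly fast, because hash table lookups take constant time on average. The memory cost comes from holding every FASTA sequence in the hash; the matrix file is read one line at a time.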
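
For comparison, the same hash-lookup idea can be sketched in Python with a set. This is a rough sketch rather than a port of the script above: the function names are hypothetical, the matrix is assumed to be tab-separated with the motif in the first column, and unlike the Perl version it joins sequences that are wrapped over several lines.

```python
import sys


def read_fasta_sequences(fasta_lines):
    """Collect sequences from FASTA lines into a set, joining wrapped lines."""
    sequences = set()
    chunks = []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if chunks:
                sequences.add("".join(chunks))
            chunks = []
        elif line:
            chunks.append(line)
    if chunks:
        sequences.add("".join(chunks))
    return sequences


def filter_matrix(matrix_lines, sequences):
    """Yield the header row, then rows whose first tab-separated field is in sequences."""
    lines = iter(matrix_lines)
    header = next(lines, None)
    if header is None:
        return
    yield header.rstrip("\n")
    for line in lines:
        line = line.rstrip("\n")
        # Set membership is a constant-time lookup on average.
        if line.split("\t", 1)[0] in sequences:
            yield line


if __name__ == "__main__" and len(sys.argv) == 3:
    # Arguments: FASTA file, then matrix file (same order as the Perl script).
    with open(sys.argv[1]) as fasta, open(sys.argv[2]) as matrix:
        wanted = read_fasta_sequences(fasta)
        for row in filter_matrix(matrix, wanted):
            print(row)
```

Saved under a name such as filter_rows.py (a made-up name), it would be run as: python filter_rows.py myQuerySeqs.fa myDataMatrix.mtx > myFilteredMatrix.mtx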

ADD COMMENT • link 11.8 years ago by Alex Reynolds 36k

0

My document is very large. How do I get myDataMatrix.mtx?

ADD REPLY • link 11.8 years ago by 2011101101 ▴ 110

0

You already have myDataMatrix.mtx (at least, if I understand your original question correctly).

ADD REPLY • link 11.8 years ago by Alex Reynolds 36k

0

Yes, I understand. Thank you.

ADD REPLY • link 11.8 years ago by 2011101101 ▴ 110

0

Don't forget to accept an answer when you find the right solution for you.

ADD REPLY • link 11.8 years ago by Agatha ▴ 350

score 0 · Answer 4 · 2013-02-21

0

11.8 years ago

Whetting ★ 1.6k

Not the most elegant, but this should do it (unless your Fasta file is too big for memory?)

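from Bio import SeqIO

# Collect the unique sequences in a set for fast membership tests.
tags = {str(seq_record.seq) for seq_record in SeqIO.parse("in.fas", "fasta")}

# Read the table once and print each row whose first column is one of the sequences.
with open("2.txt") as f:
    for line in f:
        line = line.rstrip()
        fields = line.split(None, 1)
        if fields and fields[0] in tags:
            print(line)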

ADD COMMENT • link 11.8 years ago by Whetting ★ 1.6k

0

How do I use it?

ADD REPLY • link 11.8 years ago by 2011101101 ▴ 110

0

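Save the file as "rows.py" and run it from the terminal with "python rows.py". The script expects the FASTA file to be named in.fas and the table to be named 2.txt, both in the current directory.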

ADD REPLY • link 11.8 years ago by Whetting ★ 1.6k

score 0 · Answer 5 · 2013-02-22

0

11.8 years ago

Agatha ▴ 350

I am not sure how big your files are, but if R can handle them, you can use:

library(Biostrings)
# readDNAStringSet() replaces the older read.DNAStringSet()
sequence_data <- readDNAStringSet("file1.fasta")
motifs <- read.table("file2.txt", header = TRUE)
# keep only the rows whose motif is one of the FASTA sequences
tab3 <- subset(motifs, motif %in% as.character(sequence_data))

> tab3
                            motif MZQ1 MZQ3 MZQ4 MZQ5 MZQ6 MZQ7 MZQ8 MZQ2
2        AAAAAAAAAAAAACAGTTGGCATG    0    0    0    0    0    1    0    0
3 AAAAAAAAAAAAACCGAGTACCGTTCACGCC    0    0    0    0    0    1    0    0
4           AAAAAAAAAAAAACCTTGAAC    0    0    0    0    0    0    0    1

ADD COMMENT • link 11.8 years ago by Agatha ▴ 350

0

We can also use merge instead of subset: http://stackoverflow.com/questions/1299871/how-to-join-data-frames-in-r-inner-outer-left-right

ADD REPLY • link 11.8 years ago by zx8754 12k

1

That is true, provided sequence_data (a DNAStringSet object) is first converted to a data frame.

ADD REPLY • link 11.8 years ago by Agatha ▴ 350

score 0 · Answer 6 · 2013-02-22

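#!/usr/bin/perl
use strict;
use warnings;

# Prompt for the three file names.
print "Enter REFERENCE file: ";
chomp(my $ref_file = <STDIN>);
print "Enter QUERY file: ";
chomp(my $query_file = <STDIN>);
print "Enter output file: ";
chomp(my $out_file = <STDIN>);

# Read both input files into memory.
open(my $ref_fh, '<', $ref_file) or die "Cannot open $ref_file: $!\n";
my @ref_lines = <$ref_fh>;
close $ref_fh;

open(my $query_fh, '<', $query_file) or die "Cannot open $query_file: $!\n";
my @query_lines = <$query_fh>;
close $query_fh;

open(my $out_fh, '>', $out_file) or die "Cannot open $out_file: $!\n";

# For each query sequence, print every reference row whose first column equals it.
foreach my $query (@query_lines) {
    chomp $query;
    next if $query eq '' or $query =~ /^>/;    # skip blank lines and FASTA headers
    foreach my $ref (@ref_lines) {
        my ($motif) = split /[\t\n]/, $ref, 2;
        print {$out_fh} $ref if defined $motif and $motif eq $query;
    }
}
close $out_fh;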