Extract subset sequences from fasta file
3
0
Entering edit mode
9.5 years ago
zhwjmch ▴ 170

I would like to extract a subset of fasta files based on the following two criteria: 1) seq length, 2) loc=2R and loc=2L; both of the information are indicated in their ID lines. Thanks in advance :)

Example: two sequences are listed below, I would like to extract all length longer than 300 AND loc=2R OR loc=2L

>FBtr0342981 type=untranslated_region; loc=2R:join(22136968..22137026,22138000..22138251,22162905..22162919);MD5=594e04c2c432646c24f2e6e22febd982; length=326; parent=FBgn0000008; release=r6.06; species=Dmel;
GTTCAATCTTTGTTTTCGTAGCGCGGC...GAAGTGAAGTAACCCATAAAACTAAC
>FBtr0083388 type=untranslated_region; loc=3R:complement(16828004..16828105);MD5=76d226b07edc8a389bce4902b3847b7a; length=102; parent=FBgn0000014; release=r6.06; species=Dmel;
GTTCAATCTTTGTTTTCGTAGCGCGGC...GAAGTGAAGTAACCCATAAAACTAAC
sequence • 4.4k views
ADD COMMENT
1
Entering edit mode
  1. iterate through the lines in the file
  2. if the line starts with > check if length > 300
    • if it is, check if loc=2R or loc=2L and print until the next sequence header >
    • if not, do nothing
ADD REPLY
1
Entering edit mode
9.5 years ago

The BBMap package has a tool that will accomplish the first part:

reformat.sh in=file.fa out=filtered.fa minlen=300

There's another tool that would usually accomplish the second part, but unfortunately the = symbol is reserved. But judging by the headers, this would still work:

filterbyname.sh in=filtered.fa out=filtered2.fa substring=name names=2R,2L

That will give you things containing 2R or 2L, which is I think what you want, rather than 2R and 2L.

ADD COMMENT
0
Entering edit mode

Sure, 2R or 2L, :) Thanks for recommend BBMap package.

ADD REPLY
0
Entering edit mode
9.5 years ago
tomc ▴ 90

Who knows, it might work beyond your sample

#! /usr/bin/awk -f
BEGIN {FS=";"}

/\>.*; loc=2[LR];.*/
{
    for(i=1;i<=NF;i++){
        split($i,a,"=");
        defl[a[1]]=a[2]
    }
    if(300<defl[" length"]){
        record=$0;
        while(getline){
            if(NF<2){record = record "\n" $0}
            else{print record "\n"; NL--;break}
        }
    }
    delete defl;delete a; record=""
}
ADD COMMENT
0
Entering edit mode
9.2 years ago

If your sequence lengths match the description in the defline, you can do:

$ pip install pyfaidx
$ faidx --size-range 300,1000000 --regex loc=2[RL] file.fa
ADD COMMENT

Login before adding your answer.

Traffic: 2501 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6