Sequence extraction based on coordinates
1
0
Entering edit mode
7.5 years ago

I have a file with sequences in first column and start coordinate for these sequences in the 3rd column. I want to extract all rows where the starting loci of sequences vary by 5 nt of each other. Following is an example

  Sequence     Column header_xxx          Start coordinate           Column header_yyy      
    ABC                                           500
    XYZ                                           502
    DEF                                           12050
    PQR                                           12055
    abc                                           400
    mno                                           456

In the above example the script I am trying to come up with should be able to extract rows with seq ABC, XYZ, DEF and PQR, but not abc and mno.

Any solution preferably in bash or perl will be very helpful.

Thanks and regards

sequence • 1.4k views
ADD COMMENT
3
Entering edit mode
7.5 years ago

As a pre-requisite, install BEDOPS to get bedmap and sort-bed in your environment.

  1. Convert the (headerless) file to a pseudo-BED file:

    $ awk 'BEGIN{ OFS="\t"; }{ print "chrPseudo",$3,($3+1),$1 }' seq.txt | sort-bed - > seq.bed
    
  2. Use bedmap to map elements of seq.bed that are within 5 bases of each other and retrieve qualifying seq.bed ID values. Write them out on separate lines, using sort and uniq to build a text file of unique ID names, written to a file called seqIDs.txt:

    $ bedmap --echo-map-id --range 5 --multi-delim '\n' seq.bed | sort | uniq > seqIDs.txt
    
  3. Use grep and the seqIDs.txt file to filter the original seq.txt file:

    $ grep -f seqIDs.txt seq.txt > seqFiltered.txt
    

The file seqFiltered.txt should contain a subset of lines from seq.txt that meet your loci requirements.

Once you see what each step does, you can do all of this in a one-liner:

$ awk 'BEGIN{ OFS="\t"; }{ print "chrPseudo",$3,($3+1),$1 }' seq.txt | sort-bed - | bedmap --echo-map-id --range 5 --multi-delim '\n' - | sort | uniq | grep -f - seq.txt > seqFiltered.txt

This would run faster, by avoiding the creation of intermediate files.

ADD COMMENT
0
Entering edit mode

Thanks a lot for the solution Alex !!

ADD REPLY
0
Entering edit mode

If this answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.

ADD REPLY

Login before adding your answer.

Traffic: 1684 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6