Finding Fasta Coordinates That Do Not Overlap
3
2
Entering edit mode
12.5 years ago
ashtx ▴ 30

Hi all, I am very sorry if this question is very trivial. I have a list of fasta coordinates for the multiple sequence fasta file. For example

Id            start position       End position
1398         4                             8
1398         5                             10
1398         12                            15 
1756         25                            30
1756         28                            35

Is it possible to convert the top table as given below?

ID          start Position        End position
1398        4                               10
1398        12                              15
1756        25                              35

Here is my actual part of the raw data

So basically I am looking for a way that gives me non-overlapping fasta coordinates from the over lapping coordinates. I saw a very similar question on this link : http://www.biostars.org/post/show/7825/how-to-get-non-overlapping-coordinates-from-a-list-that-contains-overlapping-coordinates/ but it does not help in my case as I have unique IDs instead of chromosome numbers. I tried to convert it to bed format using sortBed but it seems to give me this error "Error: malformed GFF entry at line 1. Start was greater than end. Exiting"

I got these columns from Blast results which are based on the contig assembly. Now I want to form a sequence based on those blast results. I hope this question is clear.

fasta • 5.2k views
ADD COMMENT
1
Entering edit mode
ADD REPLY
0
Entering edit mode

I tried but it should not report/output sub-region within the region. It means that, we still get region 5-10 even if it covered under region 4-12. According to his solution, it should take that into account but it does not seem so. I might be missing some tricks.

ADD REPLY
0
Entering edit mode

Could you give an example (editing your question for instance) of what output you get with Benm's solution? I get exactly the solution you describe in your question...

ADD REPLY
0
Entering edit mode

Benm's solution works for some coordinates but then on some parts it does not work. Here is the input file http://pastie.org/4221091 Here is the output file http://pastie.org/4221095

ADD REPLY
0
Entering edit mode

I think I understood. I edited my code and added if ($data[2] < $new[2]) . I think now it should work on your sorted file and will not report so called underlaps. Please have a look.

ADD REPLY
0
Entering edit mode

Vikas, your solution works for some coordinates just as Benm's solution. Please see my input and output file in the previous reply. Thanks again.

ADD REPLY
1
Entering edit mode

There are 2 problems in your input file because of which you are not getting desired results with my code. One is that the file is not sorted and for that I have already mentioned to use sort command on first 2 columns. Second is that you have negative strands also. Do you want to treat them separately or you want to merge them also.

Eg- If you have

829   4    10
829   9     3

You want to merge them or want to keep them separately?

ADD REPLY
0
Entering edit mode

I want to merge them.

ADD REPLY
1
Entering edit mode

Please see my edit and let me know if it works.

ADD REPLY
0
Entering edit mode

It works pretty well. Thank you.

ADD REPLY
0
Entering edit mode

My textfile does not contain any header line. I am using it to illustrate my question. Sorry about the confusion.

ADD REPLY
7
Entering edit mode
12.5 years ago

If you create a tab-delimited file from your first table, you can use the following R code to create a tab-delimited result file that looks like your second table:

library("GenomicRanges")
library("IRanges")

datadir="/your/data/dir/"
setwd(datadir)

#Import regions to exclude
Data2Mergefile="Data2Merge.txt"
Data2Merge=read.table(file=Data2Mergefile, sep="\t", header=TRUE)

#Create GenomicRanges object for manipulation
ranges2merge = GRanges(seqnames = Rle(Data2Merge[,"Id"]),
ranges = IRanges(Data2Merge[,"start position"], end = Data2Merge[,"End position"]), 
strand = Rle(strand("*"), length(rownames(Data2Merge) ) ) )

#Simplify ranges to merge overlaps
reduced=reduce(ranges2merge)

#Reformat and print to file
results=data.frame(Id=as.vector(seqnames(reduced)), Start=start(reduced), End=end(reduced))
write.table(results, file="merged.txt", sep="\t", row.names=FALSE, quote=FALSE)
ADD COMMENT
0
Entering edit mode

This works but I had some coordinates in reverse direction(i.e end position having lower numerical value than the start position) and I got this error, "Error in .Call2("solveuserSEW0", start, end, width, PACKAGE = "IRanges") : solving row 1: negative widths are not allowed".

To get around this problem, I sorted my file such that start position had lower numerical value than the end position. Thanks a lot.

ADD REPLY
0
Entering edit mode

Yes. by convention it is best if coordinates are specified in same direction unless you have a concept of +/- strand for your data. Then I believe + strand coordinates could be forward and - strand in reverse and GenomicRanges/IRanges would know what to do with it.

ADD REPLY
1
Entering edit mode
12.5 years ago
Vikas Bansal ★ 2.4k

EDIT - After OP's input file and comments.

First we will swap the start and end position, if start position is greater than end position and then we will sort this file.

Say this perl script is strand.pl -

use strict;
use warnings;
open my $file, '<',$ARGV[0] or die "Unable to open input file: $!";
my @data;

while (<$file>) {
@data = split;
  if ($data[1] > $data[2]) {

    my $temp = $data[1];
    $data[1] = $data[2];
    $data[2] = $temp;
  }

  print join("\t",@data),"\n";
}

Use -

perl strand.pl input.file |  sort -k1,1n -k2,2n > sorted.file

Now we will merge the overlapping positions.

Then use this perl script say merge.pl -

use strict;
use warnings;
open my $file, '<',$ARGV[0] or die "Unable to open input file: $!";
my @data;

while (<$file>) {

  if (not @data) {
    @data = split;
    next;
  }

  my @new = split;

  if ($new[0] == $data[0] and $new[1] <= $data[2] + 1) {

     if ($data[2] < $new[2]){
     $data[2] = $new[2];
      }
  }
  else {
    print join("\t", @data), "\n";
    @data = @new;
  }

  print join("\t", @data), "\n" if eof $file;

}

Use -

perl merge.pl sorted.file > result.file
ADD COMMENT
1
Entering edit mode

Sorry it does not work but I would love to see the solution using Perl.

ADD REPLY
0
Entering edit mode

Can you please tell me, what error did you get while running Perl script on your sorted file?

ADD REPLY
0
Entering edit mode

I have edited my code. There was error while pasting the code. I don't know in "while" condition "DATA" became "[HTML]". But now I have changed it to file.

ADD REPLY
0
Entering edit mode

Now the code works but it does not solve my problem of getting rid of underlaps. For example if the coordinates for a particular ID are 1to147, it should get rid of other coordinates that are from 20-50 because I have already have those covered in 1-147. Thanks.

ADD REPLY
1
Entering edit mode
12.5 years ago
Rm 8.3k

if I understood your Question correctly: 1.Add hash to the header line.

2.check to see if start positions ($2) is less than End position($3):

you can do that simply by using awk:

awk ' BEGIN {OFS = FS = "\t"};{if($2 < $3)print ; else print $1, $3, $2}'
ADD COMMENT

Login before adding your answer.

Traffic: 1308 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6