Question

How Can I Convert A Bed File Into A Tab File With Paired End Reads On The Same Row?

12.8 years ago

Luke ▴ 240

Hi guys!

I have a file like this one (obtained from bed file using awk):

scaffold00002b  209798  209823  HWUSI-EAS1825_0024_FC:8:1:6009:1105
scaffold00002b  209802  209838  HWUSI-EAS1825_0024_FC:8:1:6009:1105  
scaffold00002d  43627   43652   HWUSI-EAS1825_0024_FC:8:1:8703:1105
scaffold00008e  22741   22767   HWUSI-EAS1825_0024_FC:8:1:14128:1104
scaffold00008e  22740   22768   HWUSI-EAS1825_0024_FC:8:1:14128:1104

(Note that rows 1-2 and rows 4-5 share the same value in the 4th field.)

I wish to convert it to a tab file like this one:

HWUSI-EAS1825_0024_FC:8:1:6009:1105  scaffold00002b  209798  209823  scaffold00002b  209802  209838
HWUSI-EAS1825_0024_FC:8:1:8703:1105  scaffold00002d  43627   43652
HWUSI-EAS1825_0024_FC:8:1:14128:1104 scaffold00008e  22741   22767   scaffold00008e  22740   22768

in which the fields of all lines sharing the same value in column $4 are printed on a single row.

Since rows with the same 4th field are always consecutive, I tried to test whether the 4th field of the previous row is equal to the 4th field of the current row, and to repeat this test over all the rows of my input file.

BUT...

unfortunately I have no idea how to print the fields of the current row alongside the fields of the previous row (when the "==" condition is satisfied).

Any idea?

Thanks in advance,
Luke

bed • 4.4k views

score 1 · Answer 1 · 2012-01-27

I created a file, 'test.txt', that contains your data as shown above (tab-delimited). Here is a quick Python solution:

#!/usr/bin/env python
import csv

with open('test.txt', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    prevrow = None
    for row in reader:
        if prevrow is None:
            # nothing pending: hold this row until the next one is seen
            prevrow = row
        elif row[3] == prevrow[3]:
            # same read name as the held row: print the pair on one line
            print("%s\t%s" % (prevrow[3], "\t".join(prevrow[:3] + row[:3])))
            prevrow = None
        else:
            # different read name: the held row is a single read
            print("%s\t%s" % (prevrow[3], "\t".join(prevrow[:3])))
            prevrow = row
    if prevrow is not None:
        # flush a single read left over at the end of the file
        print("%s\t%s" % (prevrow[3], "\t".join(prevrow[:3])))

Output is:

HWUSI-EAS1825_0024_FC:8:1:6009:1105 scaffold00002b  209798  209823  scaffold00002b  209802  209838
HWUSI-EAS1825_0024_FC:8:1:8703:1105 scaffold00002d  43627   43652
HWUSI-EAS1825_0024_FC:8:1:14128:1104    scaffold00008e  22741   22767   scaffold00008e  22740   22768
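
Since rows with the same 4th field are always consecutive, the same idea can probably also be written with itertools.groupby, which collects runs of adjacent rows that share a key. This is only a rough sketch: the function name merge_consecutive and the inline rows list are made up for illustration (the rows are copied from the question), and it assumes every non-empty row has at least four fields.

```python
from itertools import groupby


def merge_consecutive(rows):
    """Yield one merged row per run of consecutive rows sharing field 4."""
    # skip empty rows (e.g. blank lines returned by csv.reader)
    rows = (r for r in rows if r)
    # strip the key so trailing whitespace does not split a pair
    for name, group in groupby(rows, key=lambda r: r[3].strip()):
        merged = [name]
        for row in group:
            merged.extend(row[:3])
        yield merged


# illustrative in-memory input, copied from the question
rows = [
    ['scaffold00002b', '209798', '209823', 'HWUSI-EAS1825_0024_FC:8:1:6009:1105'],
    ['scaffold00002b', '209802', '209838', 'HWUSI-EAS1825_0024_FC:8:1:6009:1105'],
    ['scaffold00002d', '43627', '43652', 'HWUSI-EAS1825_0024_FC:8:1:8703:1105'],
    ['scaffold00008e', '22741', '22767', 'HWUSI-EAS1825_0024_FC:8:1:14128:1104'],
    ['scaffold00008e', '22740', '22768', 'HWUSI-EAS1825_0024_FC:8:1:14128:1104'],
]

for merged in merge_consecutive(rows):
    print('\t'.join(merged))
```

To run it on a file, pass csv.reader(f, delimiter='\t') in place of the rows list. Unlike the pairwise version, this does not assume at most two rows per read name: any number of consecutive rows with the same name end up on one line.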