Extract start coordinate from read 1 and end coordinate from read 2 of paired end reads.
5
0
Entering edit mode
6.3 years ago
rjobmc • 0

Hi there. Im looking to extract chr, start coordinate of read 1 and the end coordinate of read 2 of paired-end NGS into one "bed" file. I have over 7 million reads and I am not sure that every paired-end read has a "pair". I have used bamtobed function of bedtools and sorted the file by read info (eg M01269....). Here is an example of what I need, For the following:

chrII   404128  404259  M01269:176:000000000-BW364:1:1101:10000:12221/1 60  -
chrII   404126  404251  M01269:176:000000000-BW364:1:1101:10000:12221/2 60  +
chrVII  350990  351120  M01269:176:000000000-BW364:1:1101:10000:24715/1 60  -
chrVII  350971  351093  M01269:176:000000000-BW364:1:1101:10000:24715/2 60  +
chrXII  527617  527747  M01269:176:000000000-BW364:1:1101:10000:26164/1 60  +
chrXII  527627  527753  M01269:176:000000000-BW364:1:1101:10000:26164/2 60  -
chrVII  826318  826449  M01269:176:000000000-BW364:1:1101:10000:8567/1  60  +
chrVII  826335  826461  M01269:176:000000000-BW364:1:1101:10000:8567/2  60  -
chrXII  880431  880562  M01269:176:000000000-BW364:1:1101:10001:14255/1 60  +
chrXII  880448  880574  M01269:176:000000000-BW364:1:1101:10001:14255/2 60  -

I need:

chrII    404128 404251
chrVII  350990  351093
chrXII  527617  527753
chrVII  826318 . 826461
chrXII  880431  880574

Any help would be very appreciated. Thanks

UNIX bedtools • 3.4k views
ADD COMMENT
1
Entering edit mode

Hello rjobmc,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

Thank you!

ADD REPLY
0
Entering edit mode

Hello Could you please explain what 60 and - or + stand for ?

ADD REPLY
0
Entering edit mode

probably length of the read and strand (+ and -) rania.hamdy1

ADD REPLY
2
Entering edit mode
6.3 years ago
$ sed 's/\/.*//g' test.txt | datamash -sg 1,4 min 2 max 3 | cut -f1,3,4
chrI    12085   12225
chrI    329 456


$ cat test.txt 
chrI    12085   12215   M01269:176:000000000-BW364:1:2114:9178:18263/1  60  +
chrI    12099   12225   M01269:176:000000000-BW364:1:2114:9178:18263/2  60  -
chrI    329 456 M01269:176:000000000-BW364:1:2114:9317:17550/1  60  +
chrI    330 456 M01269:176:000000000-BW364:1:2114:9317:17550/2  60  -
ADD COMMENT
0
Entering edit mode

$ awk -v OFS="\t" '$4 ~ "/1" {print $1,$2,$3}' test.txt Unfortunately this didnt work, it just pulls out the end coordinate of read 1, when I need the read 2 end coordinate

ADD REPLY
0
Entering edit mode
$ sed 's/\/.*//g' test.txt | datamash -sg 1,4 min 2 max 3 | cut -f1,3,4

Worked perfectly! Thanks so much.

ADD REPLY
1
Entering edit mode

btw, cross check if this holds true if your r1 is on - instead of +. rjobmc

ADD REPLY
1
Entering edit mode

assumption is that R1 and R2 are consecutive lines in input and if you want to sort by chromosome, add sort -k1,1 at the end.:

$ sed 's/\/.*//g' test.txt | datamash -g1,4 first 2 last 3 | cut -f1,3,4
chrII   404128  404251
chrVII  350990  351093
chrXII  527617  527753
chrVII  826318  826461
chrXII  880431  880574

$ cat test.txt 
chrII   404128  404259  M01269:176:000000000-BW364:1:1101:10000:12221/1 60  -
chrII   404126  404251  M01269:176:000000000-BW364:1:1101:10000:12221/2 60  +
chrVII  350990  351120  M01269:176:000000000-BW364:1:1101:10000:24715/1 60  -
chrVII  350971  351093  M01269:176:000000000-BW364:1:1101:10000:24715/2 60  +
chrXII  527617  527747  M01269:176:000000000-BW364:1:1101:10000:26164/1 60  +
chrXII  527627  527753  M01269:176:000000000-BW364:1:1101:10000:26164/2 60  -
chrVII  826318  826449  M01269:176:000000000-BW364:1:1101:10000:8567/1  60  +
chrVII  826335  826461  M01269:176:000000000-BW364:1:1101:10000:8567/2  60  -
chrXII  880431  880562  M01269:176:000000000-BW364:1:1101:10001:14255/1 60  +
chrXII  880448  880574  M01269:176:000000000-BW364:1:1101:10001:14255/2 60  -

BTW, what would happen to multi mapping reads?

ADD REPLY
0
Entering edit mode

I checked

btw, cross check if this holds true if your r1 is on - instead of +

And it worked perfectly.

BTW, what would happen to multi mapping reads?

This was just an example but your command has worked perfectly for all reads (multi and unique). Thanks again :)

ADD REPLY
0
Entering edit mode

Hello again,

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

Upvote|Bookmark|Accept

ADD REPLY
1
Entering edit mode
6.3 years ago
jomo018 ▴ 730

Python script will do it, just change ./results and ./inputFile as required. Single reads will output start and end of that single read. Blank lines (as in your input example) are not handled.

with open("./results",'w') as fho:
    with open("./inputFile",'r') as fhi:
        chr,startA,endA,idA,score,strand=fhi.readline().strip().split()
        for line in fhi:
            chr,startB,endB,idB,score,strand=line.strip().split()
            if idA[:-2]==idB[:-2]:
                endA=endB
                continue
            fho.write(chr+"\t"+startA+"\t"+endA+"\n")
            idA=idB
            startA=startB
            endA=endB   
        fho.write(chr+"\t"+startA+"\t"+endA+"\n")
ADD COMMENT
0
Entering edit mode
6.3 years ago
btsui ▴ 300
cut -d' ' -f1-3 inputfile > outputfile
ADD COMMENT
0
Entering edit mode
cut -d' ' -f1-3 inputfile > outputfile

This did not work

ADD REPLY
0
Entering edit mode
6.3 years ago

Try this:

$ sort -k4,4 inputfile|paste - -|cut -f1,2,9|sort -k1,1V -k2,2g -k3,3g > outputfile
  • sort -k4,4 inputfile sort by read name. This ensures that read1 is directly followed by read 2
  • paste - - puts two lines in one. Doing so the data for a read pair is in one line
  • cut -f1,2,9 selects the columns 1, 2 and 9 which contains chr, start read1 and end read2
  • sort -k1,1V -k2,2g -k3,3g sort by chrom, start, end

fin swimmer

ADD COMMENT
0
Entering edit mode
$ sort -k4,4 inputfile|paste - -|cut -f1,2,9 > outputfile

Thanks fin swimmer, but it has not worked. The reads are sorted by reads and not chr, start, end and so they are not in numerical/chromosomal order.

Is there any way to check that the read pair "code" is the same, ie its the /2 of the /1 before extracting the start and end coordinates?

ADD REPLY
0
Entering edit mode

The reads are sorted by reads and not chr, start, end and so they are not in numerical/chromosomal order.

Just add one more sorting step. I edited my answer and also add some comments.

ADD REPLY
0
Entering edit mode
6.3 years ago

using bioalcidaejdk http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html and a BAM sorted on queryname.

ADD COMMENT

Login before adding your answer.

Traffic: 1614 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6