Extract start coordinate from read 1 and end coordinate from read 2 of paired end reads.
5
0
Entering edit mode
6.4 years ago
rjobmc • 0

Hi there. Im looking to extract chr, start coordinate of read 1 and the end coordinate of read 2 of paired-end NGS into one "bed" file. I have over 7 million reads and I am not sure that every paired-end read has a "pair". I have used bamtobed function of bedtools and sorted the file by read info (eg M01269....). Here is an example of what I need, For the following:

chrII   404128  404259  M01269:176:000000000-BW364:1:1101:10000:12221/1 60  -
chrII   404126  404251  M01269:176:000000000-BW364:1:1101:10000:12221/2 60  +
chrVII  350990  351120  M01269:176:000000000-BW364:1:1101:10000:24715/1 60  -
chrVII  350971  351093  M01269:176:000000000-BW364:1:1101:10000:24715/2 60  +
chrXII  527617  527747  M01269:176:000000000-BW364:1:1101:10000:26164/1 60  +
chrXII  527627  527753  M01269:176:000000000-BW364:1:1101:10000:26164/2 60  -
chrVII  826318  826449  M01269:176:000000000-BW364:1:1101:10000:8567/1  60  +
chrVII  826335  826461  M01269:176:000000000-BW364:1:1101:10000:8567/2  60  -
chrXII  880431  880562  M01269:176:000000000-BW364:1:1101:10001:14255/1 60  +
chrXII  880448  880574  M01269:176:000000000-BW364:1:1101:10001:14255/2 60  -

I need:

chrII    404128 404251
chrVII  350990  351093
chrXII  527617  527753
chrVII  826318 . 826461
chrXII  880431  880574

Any help would be very appreciated. Thanks

UNIX bedtools • 3.5k views
ADD COMMENT
1
Entering edit mode

Hello rjobmc,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

Thank you!

ADD REPLY
0
Entering edit mode

Hello Could you please explain what 60 and - or + stand for ?

ADD REPLY
0
Entering edit mode

probably length of the read and strand (+ and -) rania.hamdy1

ADD REPLY
2
Entering edit mode
6.4 years ago
$ sed 's/\/.*//g' test.txt | datamash -sg 1,4 min 2 max 3 | cut -f1,3,4
chrI    12085   12225
chrI    329 456


$ cat test.txt 
chrI    12085   12215   M01269:176:000000000-BW364:1:2114:9178:18263/1  60  +
chrI    12099   12225   M01269:176:000000000-BW364:1:2114:9178:18263/2  60  -
chrI    329 456 M01269:176:000000000-BW364:1:2114:9317:17550/1  60  +
chrI    330 456 M01269:176:000000000-BW364:1:2114:9317:17550/2  60  -
ADD COMMENT
0
Entering edit mode

$ awk -v OFS="\t" '$4 ~ "/1" {print $1,$2,$3}' test.txt Unfortunately this didnt work, it just pulls out the end coordinate of read 1, when I need the read 2 end coordinate

ADD REPLY
0
Entering edit mode
$ sed 's/\/.*//g' test.txt | datamash -sg 1,4 min 2 max 3 | cut -f1,3,4

Worked perfectly! Thanks so much.

ADD REPLY
1
Entering edit mode

btw, cross check if this holds true if your r1 is on - instead of +. rjobmc

ADD REPLY
1
Entering edit mode

assumption is that R1 and R2 are consecutive lines in input and if you want to sort by chromosome, add sort -k1,1 at the end.:

$ sed 's/\/.*//g' test.txt | datamash -g1,4 first 2 last 3 | cut -f1,3,4
chrII   404128  404251
chrVII  350990  351093
chrXII  527617  527753
chrVII  826318  826461
chrXII  880431  880574

$ cat test.txt 
chrII   404128  404259  M01269:176:000000000-BW364:1:1101:10000:12221/1 60  -
chrII   404126  404251  M01269:176:000000000-BW364:1:1101:10000:12221/2 60  +
chrVII  350990  351120  M01269:176:000000000-BW364:1:1101:10000:24715/1 60  -
chrVII  350971  351093  M01269:176:000000000-BW364:1:1101:10000:24715/2 60  +
chrXII  527617  527747  M01269:176:000000000-BW364:1:1101:10000:26164/1 60  +
chrXII  527627  527753  M01269:176:000000000-BW364:1:1101:10000:26164/2 60  -
chrVII  826318  826449  M01269:176:000000000-BW364:1:1101:10000:8567/1  60  +
chrVII  826335  826461  M01269:176:000000000-BW364:1:1101:10000:8567/2  60  -
chrXII  880431  880562  M01269:176:000000000-BW364:1:1101:10001:14255/1 60  +
chrXII  880448  880574  M01269:176:000000000-BW364:1:1101:10001:14255/2 60  -

BTW, what would happen to multi mapping reads?

ADD REPLY
0
Entering edit mode

I checked

btw, cross check if this holds true if your r1 is on - instead of +

And it worked perfectly.

BTW, what would happen to multi mapping reads?

This was just an example but your command has worked perfectly for all reads (multi and unique). Thanks again :)

ADD REPLY
0
Entering edit mode

Hello again,

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

Upvote|Bookmark|Accept

ADD REPLY
1
Entering edit mode
6.4 years ago
jomo018 ▴ 730

Python script will do it, just change ./results and ./inputFile as required. Single reads will output start and end of that single read. Blank lines (as in your input example) are not handled.

with open("./results",'w') as fho:
    with open("./inputFile",'r') as fhi:
        chr,startA,endA,idA,score,strand=fhi.readline().strip().split()
        for line in fhi:
            chr,startB,endB,idB,score,strand=line.strip().split()
            if idA[:-2]==idB[:-2]:
                endA=endB
                continue
            fho.write(chr+"\t"+startA+"\t"+endA+"\n")
            idA=idB
            startA=startB
            endA=endB   
        fho.write(chr+"\t"+startA+"\t"+endA+"\n")
ADD COMMENT
0
Entering edit mode
6.4 years ago
btsui ▴ 300
cut -d' ' -f1-3 inputfile > outputfile
ADD COMMENT
0
Entering edit mode
cut -d' ' -f1-3 inputfile > outputfile

This did not work

ADD REPLY
0
Entering edit mode
6.4 years ago

Try this:

$ sort -k4,4 inputfile|paste - -|cut -f1,2,9|sort -k1,1V -k2,2g -k3,3g > outputfile
  • sort -k4,4 inputfile sort by read name. This ensures that read1 is directly followed by read 2
  • paste - - puts two lines in one. Doing so the data for a read pair is in one line
  • cut -f1,2,9 selects the columns 1, 2 and 9 which contains chr, start read1 and end read2
  • sort -k1,1V -k2,2g -k3,3g sort by chrom, start, end

fin swimmer

ADD COMMENT
0
Entering edit mode
$ sort -k4,4 inputfile|paste - -|cut -f1,2,9 > outputfile

Thanks fin swimmer, but it has not worked. The reads are sorted by reads and not chr, start, end and so they are not in numerical/chromosomal order.

Is there any way to check that the read pair "code" is the same, ie its the /2 of the /1 before extracting the start and end coordinates?

ADD REPLY
0
Entering edit mode

The reads are sorted by reads and not chr, start, end and so they are not in numerical/chromosomal order.

Just add one more sorting step. I edited my answer and also add some comments.

ADD REPLY
0
Entering edit mode
6.4 years ago

using bioalcidaejdk http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html and a BAM sorted on queryname.

final List<SAMRecord> buffer= new ArrayList<>();
for(;;)
{
final SAMRecord rec = iter.hasNext()?iter.next():null;
if(rec!=null && (rec.isSecondaryOrSupplementary() || rec.getReadUnmappedFlag())) continue;
if(rec==null || (!buffer.isEmpty() && !buffer.get(0).getReadName().equals(rec.getReadName())))
{
List<Interval> intervals = new ArrayList<>();
for(final SAMRecord rec2:buffer)
{
Interval curr = new Interval(rec2.getContig(),rec2.getStart(),rec2.getEnd());
while(curr!=null)
{
int i=0;
for(i=0;i< intervals.size();++i)
{
if(curr.getContig().equals(intervals.get(i).getContig()))
{
Interval prev = intervals.remove(i);
curr = new Interval(curr.getContig(),
Math.min(curr.getStart(),prev.getStart()),
Math.max(curr.getEnd(),prev.getEnd())
);
break;
}
}
if(i==intervals.size())
{
intervals.add(curr);
curr=null;
}
}
}
for(int x=0;x< intervals.size();++x)
{
final Interval i = intervals.get(x);
out.println(i.getContig()+"\t"+(i.getStart()-1)+"\t"+i.getEnd()+"\t["+(x+1)+"]\t"+buffer.get(0).getReadName());
}
if(rec==null) break;
buffer.clear();
}
buffer.add(rec);
}
java -jar picard.jar SortSam I=src/test/resources/S1.bam O=/dev/stdout SO=queryname |\
java -jar dist/bioalcidaejdk.jar -F BAM -f biostar335056.code
RF01 1016 1539 [1] RF01_1017_1539_0:0:0_1:0:0_b
RF01 1022 1574 [1] RF01_1023_1574_3:0:0_1:0:0_90
RF01 1026 1600 [1] RF01_1027_1600_0:0:0_1:0:0_57
RF01 101 665 [1] RF01_102_665_1:0:0_1:0:0_71
RF01 1029 1585 [1] RF01_1030_1585_1:0:0_0:0:0_76
RF01 1040 1644 [1] RF01_1041_1644_2:0:0_0:0:0_40
RF01 1060 1523 [1] RF01_1061_1523_0:0:0_1:0:0_72
RF01 1094 1557 [1] RF01_1095_1557_2:0:0_1:0:0_4a
RF01 109 504 [1] RF01_110_504_2:0:0_1:0:0_5d
RF01 1117 1633 [1] RF01_1118_1633_2:0:0_1:0:0_8a
RF01 1142 1588 [1] RF01_1143_1588_0:0:0_1:0:0_35
RF01 1142 1646 [1] RF01_1143_1646_1:0:0_1:0:0_19
RF01 1163 1776 [1] RF01_1164_1776_1:0:0_1:0:0_97
RF01 1168 1612 [1] RF01_1169_1612_2:0:0_0:0:0_7d
(...)
view raw example.sh hosted with ❤ by GitHub

ADD COMMENT

Login before adding your answer.

Traffic: 3871 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6