Extract start coordinate from read 1 and end coordinate from read 2 of paired end reads.

0

Entering edit mode

7.0 years ago

rjobmc • 0

Hi there. Im looking to extract chr, start coordinate of read 1 and the end coordinate of read 2 of paired-end NGS into one "bed" file. I have over 7 million reads and I am not sure that every paired-end read has a "pair". I have used bamtobed function of bedtools and sorted the file by read info (eg M01269....). Here is an example of what I need, For the following:

chrII   404128  404259  M01269:176:000000000-BW364:1:1101:10000:12221/1 60  -
chrII   404126  404251  M01269:176:000000000-BW364:1:1101:10000:12221/2 60  +
chrVII  350990  351120  M01269:176:000000000-BW364:1:1101:10000:24715/1 60  -
chrVII  350971  351093  M01269:176:000000000-BW364:1:1101:10000:24715/2 60  +
chrXII  527617  527747  M01269:176:000000000-BW364:1:1101:10000:26164/1 60  +
chrXII  527627  527753  M01269:176:000000000-BW364:1:1101:10000:26164/2 60  -
chrVII  826318  826449  M01269:176:000000000-BW364:1:1101:10000:8567/1  60  +
chrVII  826335  826461  M01269:176:000000000-BW364:1:1101:10000:8567/2  60  -
chrXII  880431  880562  M01269:176:000000000-BW364:1:1101:10001:14255/1 60  +
chrXII  880448  880574  M01269:176:000000000-BW364:1:1101:10001:14255/2 60  -

I need:

chrII    404128 404251
chrVII  350990  351093
chrXII  527617  527753
chrVII  826318 . 826461
chrXII  880431  880574

Any help would be very appreciated. Thanks

UNIX bedtools • 4.2k views

ADD COMMENT • link updated 7.0 years ago by finswimmer 16k • written 7.0 years ago by rjobmc • 0

1

Entering edit mode

Hello rjobmc,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

Thank you!

ADD REPLY • link 7.0 years ago by finswimmer 16k

0

Entering edit mode

Hello Could you please explain what 60 and - or + stand for ?

ADD REPLY • link 6.7 years ago by rania.hamdy1 • 0

0

Entering edit mode

probably length of the read and strand (+ and -) rania.hamdy1

ADD REPLY • link 6.7 years ago by cpad0112 21k

2

Entering edit mode

7.0 years ago

cpad0112 21k

$ sed 's/\/.*//g' test.txt | datamash -sg 1,4 min 2 max 3 | cut -f1,3,4
chrI    12085   12225
chrI    329 456


$ cat test.txt 
chrI    12085   12215   M01269:176:000000000-BW364:1:2114:9178:18263/1  60  +
chrI    12099   12225   M01269:176:000000000-BW364:1:2114:9178:18263/2  60  -
chrI    329 456 M01269:176:000000000-BW364:1:2114:9317:17550/1  60  +
chrI    330 456 M01269:176:000000000-BW364:1:2114:9317:17550/2  60  -

ADD COMMENT • link 7.0 years ago by cpad0112 21k

0

Entering edit mode

$ awk -v OFS="\t" '$4 ~ "/1" {print $1,$2,$3}' test.txt Unfortunately this didnt work, it just pulls out the end coordinate of read 1, when I need the read 2 end coordinate

ADD REPLY • link 7.0 years ago by rjobmc • 0

0

Entering edit mode

$ sed 's/\/.*//g' test.txt | datamash -sg 1,4 min 2 max 3 | cut -f1,3,4

Worked perfectly! Thanks so much.

ADD REPLY • link 7.0 years ago by rjobmc • 0

1

Entering edit mode

btw, cross check if this holds true if your r1 is on - instead of +. rjobmc

ADD REPLY • link 7.0 years ago by cpad0112 21k

1

Entering edit mode

assumption is that R1 and R2 are consecutive lines in input and if you want to sort by chromosome, add sort -k1,1 at the end.:

$ sed 's/\/.*//g' test.txt | datamash -g1,4 first 2 last 3 | cut -f1,3,4
chrII   404128  404251
chrVII  350990  351093
chrXII  527617  527753
chrVII  826318  826461
chrXII  880431  880574

$ cat test.txt 
chrII   404128  404259  M01269:176:000000000-BW364:1:1101:10000:12221/1 60  -
chrII   404126  404251  M01269:176:000000000-BW364:1:1101:10000:12221/2 60  +
chrVII  350990  351120  M01269:176:000000000-BW364:1:1101:10000:24715/1 60  -
chrVII  350971  351093  M01269:176:000000000-BW364:1:1101:10000:24715/2 60  +
chrXII  527617  527747  M01269:176:000000000-BW364:1:1101:10000:26164/1 60  +
chrXII  527627  527753  M01269:176:000000000-BW364:1:1101:10000:26164/2 60  -
chrVII  826318  826449  M01269:176:000000000-BW364:1:1101:10000:8567/1  60  +
chrVII  826335  826461  M01269:176:000000000-BW364:1:1101:10000:8567/2  60  -
chrXII  880431  880562  M01269:176:000000000-BW364:1:1101:10001:14255/1 60  +
chrXII  880448  880574  M01269:176:000000000-BW364:1:1101:10001:14255/2 60  -

BTW, what would happen to multi mapping reads?

ADD REPLY • link 7.0 years ago by cpad0112 21k

0

Entering edit mode

I checked

btw, cross check if this holds true if your r1 is on - instead of +

And it worked perfectly.

BTW, what would happen to multi mapping reads?

This was just an example but your command has worked perfectly for all reads (multi and unique). Thanks again :)

ADD REPLY • link 7.0 years ago by rjobmc • 0

0

Entering edit mode

Hello again,

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

Upvote|Bookmark|Accept

ADD REPLY • link 7.0 years ago by finswimmer 16k

1

Entering edit mode

7.0 years ago

jomo018 ▴ 730

Python script will do it, just change ./results and ./inputFile as required. Single reads will output start and end of that single read. Blank lines (as in your input example) are not handled.

with open("./results",'w') as fho:
    with open("./inputFile",'r') as fhi:
        chr,startA,endA,idA,score,strand=fhi.readline().strip().split()
        for line in fhi:
            chr,startB,endB,idB,score,strand=line.strip().split()
            if idA[:-2]==idB[:-2]:
                endA=endB
                continue
            fho.write(chr+"\t"+startA+"\t"+endA+"\n")
            idA=idB
            startA=startB
            endA=endB   
        fho.write(chr+"\t"+startA+"\t"+endA+"\n")

ADD COMMENT • link 7.0 years ago by jomo018 ▴ 730

0

Entering edit mode

7.0 years ago

btsui ▴ 300

cut -d' ' -f1-3 inputfile > outputfile

ADD COMMENT • link updated 7.0 years ago by finswimmer 16k • written 7.0 years ago by btsui ▴ 300

0

Entering edit mode

cut -d' ' -f1-3 inputfile > outputfile

This did not work

ADD REPLY • link updated 7.0 years ago by finswimmer 16k • written 7.0 years ago by rjobmc • 0

0

Entering edit mode

7.0 years ago

finswimmer 16k

Try this:

$ sort -k4,4 inputfile|paste - -|cut -f1,2,9|sort -k1,1V -k2,2g -k3,3g > outputfile

sort -k4,4 inputfile sort by read name. This ensures that read1 is directly followed by read 2
paste - - puts two lines in one. Doing so the data for a read pair is in one line
cut -f1,2,9 selects the columns 1, 2 and 9 which contains chr, start read1 and end read2
sort -k1,1V -k2,2g -k3,3g sort by chrom, start, end

fin swimmer

ADD COMMENT • link 7.0 years ago by finswimmer 16k

0

Entering edit mode

$ sort -k4,4 inputfile|paste - -|cut -f1,2,9 > outputfile

Thanks fin swimmer, but it has not worked. The reads are sorted by reads and not chr, start, end and so they are not in numerical/chromosomal order.

Is there any way to check that the read pair "code" is the same, ie its the /2 of the /1 before extracting the start and end coordinates?

ADD REPLY • link 7.0 years ago by rjobmc • 0

0

Entering edit mode

The reads are sorted by reads and not chr, start, end and so they are not in numerical/chromosomal order.

Just add one more sorting step. I edited my answer and also add some comments.

ADD REPLY • link 7.0 years ago by finswimmer 16k

0

Entering edit mode

7.0 years ago

Pierre Lindenbaum 166k

using bioalcidaejdk http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html and a BAM sorted on queryname.

	final List<SAMRecord> buffer= new ArrayList<>();
	for(;;)
	{
	final SAMRecord rec = iter.hasNext()?iter.next():null;
	if(rec!=null && (rec.isSecondaryOrSupplementary() \|\| rec.getReadUnmappedFlag())) continue;
	if(rec==null \|\| (!buffer.isEmpty() && !buffer.get(0).getReadName().equals(rec.getReadName())))
	{
	List<Interval> intervals = new ArrayList<>();
	for(final SAMRecord rec2:buffer)
	{
	Interval curr = new Interval(rec2.getContig(),rec2.getStart(),rec2.getEnd());
	while(curr!=null)
	{
	int i=0;
	for(i=0;i< intervals.size();++i)
	{
	if(curr.getContig().equals(intervals.get(i).getContig()))
	{
	Interval prev = intervals.remove(i);
	curr = new Interval(curr.getContig(),
	Math.min(curr.getStart(),prev.getStart()),
	Math.max(curr.getEnd(),prev.getEnd())
	);
	break;
	}
	}
	if(i==intervals.size())
	{
	intervals.add(curr);
	curr=null;
	}
	}
	}


	for(int x=0;x< intervals.size();++x)
	{
	final Interval i = intervals.get(x);
	out.println(i.getContig()+"\t"+(i.getStart()-1)+"\t"+i.getEnd()+"\t["+(x+1)+"]\t"+buffer.get(0).getReadName());
	}
	if(rec==null) break;
	buffer.clear();
	}
	buffer.add(rec);
	}

view raw biostar335056.code hosted with ❤ by GitHub

	java -jar picard.jar SortSam I=src/test/resources/S1.bam O=/dev/stdout SO=queryname \|\
	java -jar dist/bioalcidaejdk.jar -F BAM -f biostar335056.code

	RF01 1016 1539 [1] RF01_1017_1539_0:0:0_1:0:0_b
	RF01 1022 1574 [1] RF01_1023_1574_3:0:0_1:0:0_90
	RF01 1026 1600 [1] RF01_1027_1600_0:0:0_1:0:0_57
	RF01 101 665 [1] RF01_102_665_1:0:0_1:0:0_71
	RF01 1029 1585 [1] RF01_1030_1585_1:0:0_0:0:0_76
	RF01 1040 1644 [1] RF01_1041_1644_2:0:0_0:0:0_40
	RF01 1060 1523 [1] RF01_1061_1523_0:0:0_1:0:0_72
	RF01 1094 1557 [1] RF01_1095_1557_2:0:0_1:0:0_4a
	RF01 109 504 [1] RF01_110_504_2:0:0_1:0:0_5d
	RF01 1117 1633 [1] RF01_1118_1633_2:0:0_1:0:0_8a
	RF01 1142 1588 [1] RF01_1143_1588_0:0:0_1:0:0_35
	RF01 1142 1646 [1] RF01_1143_1646_1:0:0_1:0:0_19
	RF01 1163 1776 [1] RF01_1164_1776_1:0:0_1:0:0_97
	RF01 1168 1612 [1] RF01_1169_1612_2:0:0_0:0:0_7d
	(...)

view raw example.sh hosted with ❤ by GitHub

ADD COMMENT • link 7.0 years ago by Pierre Lindenbaum 166k

Login before adding your answer.