Efficiently join paired-end read coordinates in the same line?
2
1
Entering edit mode
10.0 years ago

Dear community,

I have a huge paired-end HiC dataset (BAM format) which I want to format like this way:

HWI-D00283:117:C5KKJANXX:2:1101:1139:77789    chr6    153338506    153338556    37    +    chr6    153338031    153338081    37    -
HWI-D00283:117:C5KKJANXX:2:1101:1139:77856    chr6    149915169    149915219    37    -    chr6    149914908    149914958    37    +
HWI-D00283:117:C5KKJANXX:2:1101:1139:79414    chr4    184474969    184475019    37    -    chr4    184474811    184474861    37    +
HWI-D00283:117:C5KKJANXX:2:1101:1139:81280    chr6    153641723    153641773    37    -    chr6    153641551    153641601    37    +
HWI-D00283:117:C5KKJANXX:2:1101:1139:81917    chr8    87070282    87070332    37    -    chr8    87069851    87069901    37    +
HWI-D00283:117:C5KKJANXX:2:1101:1139:82575    chr17    56970884    56970934    37    -    chr6    151400450    151400500    37    -
HWI-D00283:117:C5KKJANXX:2:1101:1139:86642    chr6    150043041    150043091    37    -    chr6    150042915    150042965    37    +

This is an example which I obtained by first converting the BAM format to BED and separating each mate into different files and then with a AWK command joined the mates.

This is the awk command I used:

awk 'NR==FNR {h[$4] = $1"\t"$2"\t"$3"\t"$5"\t"$6; next} {OFS="\t"; print $4,$1,$2,$3,$5,$6,h[$4]}' mate1 mate2

This command worked fine with a small dataset (1M, 10M reads), but when I tried with 200M reads file, it crashes because memory reasons I suppose. Is there a way to efficiently join paired-end reads as I showed in my example?

Thanks!

HiC paired-end 4C 3C • 2.8k views
ADD COMMENT
3
Entering edit mode
10.0 years ago

extract the mapped reads F, sort on name:

samtools view -f 64 -F 3976 your.bam | cut -f LC_ALL=C sort -k1,1 > F.txt

extract the mapped read R, sort on name:

samtools view -f 128 -F 3976 your.bam | LC_ALL=C sort -k1,1 > R.txt

join both files and format the output with awk:

LC_ALL=C join -t '    ' -1 1 -2 1 F.txt R.txt | awk -f your.script > result.txt
ADD COMMENT
1
Entering edit mode

-f 64 -F 3912 won't work very well :) I assume you mean -f 64 -F 3976

ADD REPLY
0
Entering edit mode

yes, thanks !

ADD REPLY
0
Entering edit mode

Thank you! Worked fine!

ADD REPLY
1
Entering edit mode
10.0 years ago

If you're OK with query-sorting the BAM files first, then something like the following script using pysam should work (this is untested)

#!/usr/bin/env python
import pysam
fp = pysam.Samfile("foo.bam", "rb")
of = open("somefile.txt", "w")
for b in fp.fetch() :
    if b.mate_is_unmapped :
        continue
    if b.is_read1 :
        of.write("%s\t%s\t%i\t%i\t%i" % (b.qname, fp.getrname(b.tid), b.pos, b.aend, b.mapq))
        if b.is_reverse :
            of.write("\t-")
        else :
            of.write("\t+")
    else :
        of.write("\t%s\t%i\t%i\t%i" % (fp.getrname(b.tid), b.pos, b.aend, b.mapq))
        if b.is_reverse :
            of.write("\t-\n")
        else :
            of.write("\t+\n")
of.close()
fp.close()
ADD COMMENT
0
Entering edit mode

you should check that the reads are properly paired (<- problem with discordant pairs)

ADD REPLY
0
Entering edit mode

Discordant pairs are largely the species of interest with HiC (or at least I'd think so...I've never done it).

ADD REPLY

Login before adding your answer.

Traffic: 2887 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6