Dear community,
I have a huge paired-end Hi-C dataset (BAM format) that I want to reformat like this:
HWI-D00283:117:C5KKJANXX:2:1101:1139:77789 chr6 153338506 153338556 37 + chr6 153338031 153338081 37 -
HWI-D00283:117:C5KKJANXX:2:1101:1139:77856 chr6 149915169 149915219 37 - chr6 149914908 149914958 37 +
HWI-D00283:117:C5KKJANXX:2:1101:1139:79414 chr4 184474969 184475019 37 - chr4 184474811 184474861 37 +
HWI-D00283:117:C5KKJANXX:2:1101:1139:81280 chr6 153641723 153641773 37 - chr6 153641551 153641601 37 +
HWI-D00283:117:C5KKJANXX:2:1101:1139:81917 chr8 87070282 87070332 37 - chr8 87069851 87069901 37 +
HWI-D00283:117:C5KKJANXX:2:1101:1139:82575 chr17 56970884 56970934 37 - chr6 151400450 151400500 37 -
HWI-D00283:117:C5KKJANXX:2:1101:1139:86642 chr6 150043041 150043091 37 - chr6 150042915 150042965 37 +
I obtained this example by first converting the BAM file to BED, separating the mates into two files, and then joining the mates with an awk command.
This is the awk command I used:
awk 'NR==FNR {h[$4] = $1"\t"$2"\t"$3"\t"$5"\t"$6; next} {OFS="\t"; print $4,$1,$2,$3,$5,$6,h[$4]}' mate1 mate2
This command worked fine with small datasets (1M and 10M reads), but when I tried it on a file with 200M reads it crashed, I suppose because it ran out of memory. Is there a way to efficiently join paired-end reads as shown in my example?
Thanks!
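One possible lower-memory route, sketched here rather than tested on 200M reads: the awk command above keeps every mate1 record in the array h, so memory grows with the file size. Sorting both BED files by read name and merging them with join avoids that, because sort can spill to temporary files on disk and join only streams through the two sorted inputs. The function name join_mates is hypothetical, and the sketch assumes tab-separated BED6 files with the read name in column 4, identical names in both files (no /1 or /2 suffix), and at most one record per read in each file. Unlike the awk version, reads found in only one file are dropped.

```shell
# join_mates MATE1 MATE2
# Prints: name, then chrom/start/end/score/strand of the mate2 record,
# then the same five fields of the mate1 record (same column order as the awk command).
join_mates() {
    mate1=$1
    mate2=$2
    tab=$(printf '\t')
    tmpdir=$(mktemp -d) || return 1

    # Sort each file on the read name (column 4). LC_ALL=C makes sort and
    # join agree on the ordering, which join requires.
    LC_ALL=C sort -t "$tab" -k4,4 "$mate1" > "$tmpdir/m1.sorted"
    LC_ALL=C sort -t "$tab" -k4,4 "$mate2" > "$tmpdir/m2.sorted"

    # Merge on column 4 of both files; -o selects the output columns
    # (0 is the join field, F.N is column N of file F).
    LC_ALL=C join -t "$tab" -1 4 -2 4 \
        -o 0,2.1,2.2,2.3,2.5,2.6,1.1,1.2,1.3,1.5,1.6 \
        "$tmpdir/m1.sorted" "$tmpdir/m2.sorted"

    rm -rf "$tmpdir"
}
```

Usage would be something like: join_mates mate1 mate2 > pairs.txt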
-f 64 -F 3912
won't work very well :) I assume you mean -f 64 -F 3976
Yes, thanks!
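For anyone wondering why the first pair of flags fails: in samtools view, -f keeps only reads that have all the given bits set and -F drops reads that have any of the given bits set. 3912 contains bit 64 (first in pair), so -f 64 -F 3912 requires and excludes the same bit and nothing can pass. 3976 swaps that bit for 128 (second in pair). A small sketch to check this kind of thing by hand; the helper names flag_bits and flags_conflict are made up for illustration and are not part of samtools:

```shell
# flag_bits N: print the individual bit values that are set in N.
flag_bits() {
    n=$1
    bit=1
    out=""
    while [ "$bit" -le "$n" ]; do
        if [ $(( n & bit )) -ne 0 ]; then
            out="${out:+$out }$bit"
        fi
        bit=$(( bit * 2 ))
    done
    printf '%s\n' "$out"
}

# flags_conflict REQUIRE EXCLUDE: exit status 0 if some bit is both
# required (-f) and excluded (-F), non-zero otherwise.
flags_conflict() {
    [ $(( $1 & $2 )) -ne 0 ]
}

flag_bits 3912   # 8 64 256 512 1024 2048
flag_bits 3976   # 8 128 256 512 1024 2048
```

So flags_conflict 64 3912 succeeds (there is a conflict), while flags_conflict 64 3976 does not.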
Thank you! Worked fine!