Hi Everyone,
I'm trying to split a .bam file by a specific tag but am running into memory issues because the .bam file is position sorted and not tag sorted.
I'd like to sort by a tag called 'BX' and cannot find any information in the samtools or bamtools documentation on this. If anyone has any suggestions it would be greatly appreciated.
What are all possible values of the tag BX? AFAIR, you can split only on RG tag using samtools, not on any arbitrary tag. But it should not be difficult as I guess you can easily "grep" each of the tags individually if they are not a lot of them, and direct output to different files.
BTW, what commands have you tried?
There are many values for BX so a grep approach would be very slow. I've done
bamtools split -in myfile.bam -tag BX
This splits the original .bam but only writes to a few of them because I'm running into this memory issue which I think is because the file is not correctly sorted by 'BX'
I'm unable to understand your reasoning :-(
So how did you deduce that 1) it's a memory issue and 2) that the bam needs to be TAG-sorted as you mentioned previously. The doc doesnt say anything about the sorting requirement AFAI can see.
Sorry this seems to be unclear. I thought I might be running into memory issues because I'm able to make the files to write to but only very few get written before the error "unable to open file tagxxx.bam for writting". My thought was that sorting could be one way to overcome the issue but I may be wrong.
Are you sure that some of rhe split bam files are generated and are correct. Normally a 'file not open for writing' problem is related to file permissions, not memory. Do you have permissions to create file in the directory where bams are supposed to be created? My hunch is that there is some silly mistake and it's not a memory issue
I believe that his happens as you correctly mention because the file is not sorted by your tag but not because of memory.
It seems to try to open a new file when the tag is encountered.
When you run on your unsorted (by tag) file, it tries to open for writing a file that has an already open file handler and breaks.
So sorting might fix your issue (or a change in the code to append, instead of opening a new file every time).
I managed to make it work in the end -k1,1g -k4.4,1n -k5,1n must be -k1,1 -k4.4,4g -k5,5n. I don't know why I had to remove the "g" in front of -k1,1 to make this work. The chromosomes are not sorted numerically but this is not really important in my case.
split or sort, what is your problem? Please specify your need by an example giving a couple of initial lines from the bam-files
Ultimately I want to split by tag but I believe I am running into memory problems because it's not sorted by the tag itself.
Here is the first line of my .bam file
ST-K00126:303:HCJNFBBXX:5:2111:5030:25791 65 chr1 9999 0 11S66M50S chr5 18606706 0 TTCTTTTTCTGGATAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAGAGATAACTATTGATACAACACCTTCATGACCCTAAGGTACTATCATAG JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJFFJJJFJJAJFJFJJJJJFFJJAF7JJJJFJJJJAJJJFJJJA<F<A<- NM:i:0 MD:Z:66 AS:i:66 XS:i:65 SA:Z:chr5,18606884,-,52M75S,0,0; RX:Z:TCCACACGTCCGGTTG QX:Z:AAFFFJJJJJJJJAFJ BX:Z:TCCACACGTCCGGTTG-1 BC:Z:AACCGTAA QT:Z:AAFFFFFJ RG:Z:35127:35127:1:HCJNFBBXX:5
What are all possible values of the tag BX? AFAIR, you can split only on RG tag using samtools, not on any arbitrary tag. But it should not be difficult as I guess you can easily "grep" each of the tags individually if they are not a lot of them, and direct output to different files. BTW, what commands have you tried?
There are many values for BX so a grep approach would be very slow. I've done
bamtools split -in myfile.bam -tag BX
This splits the original .bam but only writes to a few of them because I'm running into this memory issue which I think is because the file is not correctly sorted by 'BX'
If that works, wouldnt it be a good idea to split your data file into different sub-parts. Then join all your split files later
I'm unable to understand your reasoning :-( So how did you deduce that 1) it's a memory issue and 2) that the bam needs to be TAG-sorted as you mentioned previously. The doc doesnt say anything about the sorting requirement AFAI can see.
Sorry this seems to be unclear. I thought I might be running into memory issues because I'm able to make the files to write to but only very few get written before the error "unable to open file tagxxx.bam for writting". My thought was that sorting could be one way to overcome the issue but I may be wrong.
Are you sure that some of rhe split bam files are generated and are correct. Normally a 'file not open for writing' problem is related to file permissions, not memory. Do you have permissions to create file in the directory where bams are supposed to be created? My hunch is that there is some silly mistake and it's not a memory issue
Yes, I have permissions and a small number of the files are being written.
I believe that his happens as you correctly mention because the file is not sorted by your tag but not because of memory. It seems to try to open a new file when the tag is encountered. When you run on your unsorted (by tag) file, it tries to open for writing a file that has an already open file handler and breaks. So sorting might fix your issue (or a change in the code to append, instead of opening a new file every time).