I have a somewhat high content of mitochondrial RNA in my RNA-seq experiment. Is there a way to use samtools to remove alignments to the 'MT' chromosome and keep all the rest?
I'm considering using samtools view in combination with awk but perhaps there's a better/cleaner solution?
I noticed this answer after I added mine - much neater!
ADD REPLY
• link
updated 2.6 years ago by
Ram
44k
•
written 9.8 years ago by
Ian
6.1k
0
Entering edit mode
I have a 10GB BAM file of human transcriptome, I split it into chromosome-wise and all the splited BAM file s' size is less than 10GB it means some data has been lost. Can you please explain it
so does this (samtools view -b input.bam chr1 chr2 chr3 chr4 > output.bam) remove the listed chrosomes or actually create a new file only with those chr?
I routinely cleanse my SAM files of chrM, and unassembled "random" contigs before running ChIP-seq analysis. I use 'sed' on the SAM file. Although you could be clever and do this via 'samtools view' without the need for creating an intermediate SAM file :)
sed '/chrM/d;/random/d;/chrUn/d' < file.sam > file_filtered.sam
ADD COMMENT
• link
updated 2.6 years ago by
Ram
44k
•
written 9.8 years ago by
Ian
6.1k
Pierre, I tried using your tool (well the Picardtools FilterSamReads) and it keeps failing with the error:
ERROR 2016-08-31 12:31:56 FilterSamReads Failed to filter Normal.sort.rmdup.bam
java.lang.IllegalStateException: Records B05AGABXX110313:4:1207:12805:19638 (chr1:10,001) should come after B0590ABXX110313:6:2208:18151:161276 (chr1:10,002) when sorting with htsjdk.samtools.SAMRecordQueryNameComparator
These bam files were sorted with Picardtools SortSam and duplicates removed. Any recommendations? Should I just try the tool on its own and not the Picardtools version? Thanks.
Thanks. Yeah any idea on how or why a bam would be sorted incorrectly? I mean I used Picardtools to sort it. I suppose I could try sorting it with samtools...
Surprisingly, there's no way to do that with the stock version of samtools. You can do this with bedtools by intersecting with a BED file containing all of the chromosomes except MT. That might be faster than awk, depending on how bedtools implemented that.
BTW, in the off chance that you need to do this with a very large number of samples such that any awk/bedtools-based solution is too slow, this would be pretty easy to code in C with HTSlib (just let me know, I could write such a program in a couple minutes).
I've been struggling with this and don't want to repeat a question so therefore posting here.
I've tried all the solutions and none of them are working for me. Every time I try these, if I then run samtools idxstats to see if it works, nothing changes. I even tried the long solution from matted of samtools view and typing out every single chromosome I want to keep (1-22 & X/Y) only to find that I just can't get rid of the chrM, all the chrUn etc etc. I also tried this
which also didn't work to get rid of chrUn. So if starting with a bam file with all these chrUn_g* and chrM etc etc in, how can I simply get rid of them? I'll happily take a dirty messy bash one liner if it gets rid of them.
I know I'm probably doing something stupid as I'm a wet lab scientist and really don't know how to do bioinformatics properly. Unfortunately like others i'm being thrown into it with no time to learn properly and with no-one to teach me. I'm sure you real bioinformaticians out there are sick of newbies not having a clue.
I don't really know what I expect as I clearly don't know what the hell I'm doing. This is what happens through teaching myself from forums. I can ask a new question but I know it annoys people when questions get repeated. What I really want is a command to use if I have a bam file that contains chrM, chrUn_gl***, chr*_gl****_random alignments in so as to get rid of said alignments and keep the "good stuff" I need. This appears to be the same as the question here but these solutions aren't working for me. And when I said it didn't work, I meant that every time I tried the solutions and then went back to samtools idxstats on the "new" bam files I'd created, absolutely nothing had changed from the original. It still contained all the chrM etc etc in. I appreciate your patience and thanks for your input. I'm afraid patience is something I've run out of with this stuff.
It looks like you forgot "xargs" before the second samtools command, in the example you posted. It's an easy thing to get wrong, but it does matter here because it's what takes the valid chromosome names and uses them as filtering arguments in the samtools view call. Otherwise, it's not filtering at all (since it doesn't get the chromosome name list), and the output will be the same as the input (as you see).
ADD REPLY
• link
updated 5.0 years ago by
Ram
44k
•
written 8.9 years ago by
matted
7.8k
0
Entering edit mode
Hi thanks for the reply. I've managed to get the sed solution working but it's long and cumbersome so keen to use a quicker and easier way. So a couple of questions on your solution that uses xargs:-
I've heard GNU parallel can replace xargs and I already have parallel. Can this be used instead of xargs in this solution somehow?
If I wanted to remove chrM and chrUn etc etc at the same time in this command, how would it be written? I'm guessing grep -v <something>???
Is there any way to run this on multiple samples at the same time or do I have to run them one by one?
Thanks again. You guys might be saving me from throwing my mac out of a window. Along with my PI for piling bioinformatics on a cell biologist.
parallel vs. xargs: parallel is usually used to execute multiple commands at once on different parts of the input, which isn't what you want here. There's probably a way to get it to work, but I'm not an expert with parallel. On most systems, xargs is already installed (my Mac has xargs but not parallel, for instance).
multiple filtering: sure, you could combine multiple filtering steps with pipes (grep -v BAD1 | grep -v BAD2 | grep -v BAD3 ...). It's probably cleaner to make a BED file with chromosomes (with coordinate ranges say 1 to 1000000000, to get everything) you want to keep, and use the samtools view -L option with that file.
You can definitely run this on multiple samples at once, and there are a lot of ways to do this (parallel would be good for this, or xargs, or read up on shell scripting with loops).
Overall, I would suggest building up what you're trying to do step by step, to make sure everything makes sense as you go along. Unfortunately my original one-liner was ugly and complicated, and it fails in unexpected ways if you don't get it exactly right. But basically the first part (samtools idxstats input.bam | cut -f 1 | grep -v MT) is only to build a list of chromosomes to keep, and the second part (with xargs) is just a trick to pass that list into samtools view. You can try simpler ways to make the list of valid chromosomes, and make sure that works first, perhaps on a small subset of the whole bam (see samtools view -s). Maybe also experiment with samtools view -c, which will just give the number of reads matching your filtering conditions. Then you can quickly see if your filtering is doing anything at all, or roughly matching what you'd expect.
ADD REPLY
• link
updated 5.0 years ago by
Ram
44k
•
written 8.9 years ago by
matted
7.8k
1
Entering edit mode
Thank you so much for the advice. Much appreciated. Yep I think much reading is needed. Or a degree in bioinformatics would be great haha!
I noticed this answer after I added mine - much neater!
I have a 10GB BAM file of human transcriptome, I split it into chromosome-wise and all the splited BAM file s' size is less than 10GB it means some data has been lost. Can you please explain it
Check the number of reads per file, not the size, as it depends on the compression level.
I'd count the lines, instead of looking at file sizes.
It's also possible your original BAM kept the unmapped reads, and when you split the BAM into chromosomes the unmapped reads had nowhere to go.
Hi, is there a way to remove all entries after chromosome Y? I am trying to do this in a bedgraph file but the same will suffice in a BAM file.
Thanks in advance.
so does this (samtools view -b input.bam chr1 chr2 chr3 chr4 > output.bam) remove the listed chrosomes or actually create a new file only with those chr?