Remove mitochondrial reads from BAM files
6
12
Entering edit mode
9.8 years ago
enricoferrero ▴ 910

Hi,

I have a somewhat high content of mitochondrial RNA in my RNA-seq experiment. Is there a way to use samtools to remove alignments to the 'MT' chromosome and keep all the rest?

I'm considering using samtools view in combination with awk but perhaps there's a better/cleaner solution?

Thanks!

awk samtools BAM RNA-Seq • 37k views
ADD COMMENT
32
Entering edit mode
9.8 years ago
matted 7.8k

You can use samtools to do it by selecting particular chromosomes and never leaving the binary format, for example:

samtools view -b input.bam chr1 chr2 chr3 chr4 > output.bam

That's a bit unwieldy to list all non-mitochondrial chromosomes manually, so you could try an ugly bash one-liner like:

samtools idxstats input.bam | cut -f 1 | grep -v MT | xargs samtools view -b input.bam > output.bam
ADD COMMENT
0
Entering edit mode

I noticed this answer after I added mine - much neater!

ADD REPLY
0
Entering edit mode

I have a 10GB BAM file of human transcriptome, I split it into chromosome-wise and all the splited BAM file s' size is less than 10GB it means some data has been lost. Can you please explain it

ADD REPLY
0
Entering edit mode

Check the number of reads per file, not the size, as it depends on the compression level.

ADD REPLY
0
Entering edit mode

I'd count the lines, instead of looking at file sizes.

ADD REPLY
0
Entering edit mode

It's also possible your original BAM kept the unmapped reads, and when you split the BAM into chromosomes the unmapped reads had nowhere to go.

ADD REPLY
0
Entering edit mode

Hi, is there a way to remove all entries after chromosome Y? I am trying to do this in a bedgraph file but the same will suffice in a BAM file.

Thanks in advance.

ADD REPLY
0
Entering edit mode

so does this (samtools view -b input.bam chr1 chr2 chr3 chr4 > output.bam) remove the listed chrosomes or actually create a new file only with those chr?

ADD REPLY
12
Entering edit mode
9.8 years ago
Ian 6.1k

I routinely cleanse my SAM files of chrM, and unassembled "random" contigs before running ChIP-seq analysis. I use 'sed' on the SAM file. Although you could be clever and do this via 'samtools view' without the need for creating an intermediate SAM file :)

sed '/chrM/d;/random/d;/chrUn/d' < file.sam > file_filtered.sam
ADD COMMENT
0
Entering edit mode

this approach worked the best for me. thanks!

ADD REPLY
5
Entering edit mode
9.8 years ago

Using my tool samjs

java -jar samjs.jar  -e 'record.getReadUnmappedFlag() || !record.getReferenceName().equals("MT")' in.bam
ADD COMMENT
0
Entering edit mode

Pierre, I tried using your tool (well the Picardtools FilterSamReads) and it keeps failing with the error:

ERROR   2016-08-31 12:31:56 FilterSamReads  Failed to filter Normal.sort.rmdup.bam
java.lang.IllegalStateException: Records B05AGABXX110313:4:1207:12805:19638 (chr1:10,001) should come after B0590ABXX110313:6:2208:18151:161276 (chr1:10,002) when sorting with htsjdk.samtools.SAMRecordQueryNameComparator

These bam files were sorted with Picardtools SortSam and duplicates removed. Any recommendations? Should I just try the tool on its own and not the Picardtools version? Thanks.

ADD REPLY
0
Entering edit mode

what's the version of picard ?

ADD REPLY
0
Entering edit mode

Thanks for the reply, It's Picard 2.1.1

ADD REPLY
0
Entering edit mode

strange. It looks like your bam is wrongly sorted. No idea.

ADD REPLY
0
Entering edit mode

Thanks. Yeah any idea on how or why a bam would be sorted incorrectly? I mean I used Picardtools to sort it. I suppose I could try sorting it with samtools...

ADD REPLY
3
Entering edit mode
7.1 years ago

This solution will keep the headers intact so you can extract fastq file

samtools view -h this.bam | awk '{if($3 != "chrM" && $3 != "chrUn"){print $0}}' | samtools view -Shb - > this.filter.bam
ADD COMMENT
0
Entering edit mode

shorter:

awk  '($3 != "chrM" && $3 != "chrUn")'
ADD REPLY
0
Entering edit mode

faster:

mawk  '($3 != "chrM" && $3 != "chrUn")'

=)

ADD REPLY
0
Entering edit mode

can i do it for multiple bam files in a folder using parallel ?

"samtools view -h this.bam | awk '{if($3 != "chrM" && $3 != "chrUn"){print $0}}' | samtools view -Shb - > this.filter.bam
ADD REPLY
2
Entering edit mode
9.8 years ago

Surprisingly, there's no way to do that with the stock version of samtools. You can do this with bedtools by intersecting with a BED file containing all of the chromosomes except MT. That might be faster than awk, depending on how bedtools implemented that.

BTW, in the off chance that you need to do this with a very large number of samples such that any awk/bedtools-based solution is too slow, this would be pretty easy to code in C with HTSlib (just let me know, I could write such a program in a couple minutes).

ADD COMMENT
0
Entering edit mode

Of course, the bedtools solution is essentially just same as giving samtools a BED file, so there's probably no gain there.

ADD REPLY
0
Entering edit mode
8.9 years ago
nash.claire ▴ 510

Hi,

I've been struggling with this and don't want to repeat a question so therefore posting here.

I've tried all the solutions and none of them are working for me. Every time I try these, if I then run samtools idxstats to see if it works, nothing changes. I even tried the long solution from matted of samtools view and typing out every single chromosome I want to keep (1-22 & X/Y) only to find that I just can't get rid of the chrM, all the chrUn etc etc. I also tried this

samtools idxstats input.bam | grep -v chrUn_* | samtools view -b input.bam > output.bam

which also didn't work to get rid of chrUn. So if starting with a bam file with all these chrUn_g* and chrM etc etc in, how can I simply get rid of them? I'll happily take a dirty messy bash one liner if it gets rid of them.

I know I'm probably doing something stupid as I'm a wet lab scientist and really don't know how to do bioinformatics properly. Unfortunately like others i'm being thrown into it with no time to learn properly and with no-one to teach me. I'm sure you real bioinformaticians out there are sick of newbies not having a clue.

ADD COMMENT
0
Entering edit mode

you should ask this as a new question or explain what 'didn't work'

also what do you expect from:

samtools idxstats input.bam | grep -v chrUn_* | samtools view -b input.bam > output.bam

as far as I know , the output of this is just the output of

samtools view -b input.bam > output.bam

which is just a copy of input.bam to output.bam

ADD REPLY
0
Entering edit mode

I don't really know what I expect as I clearly don't know what the hell I'm doing. This is what happens through teaching myself from forums. I can ask a new question but I know it annoys people when questions get repeated. What I really want is a command to use if I have a bam file that contains chrM, chrUn_gl***, chr*_gl****_random alignments in so as to get rid of said alignments and keep the "good stuff" I need. This appears to be the same as the question here but these solutions aren't working for me. And when I said it didn't work, I meant that every time I tried the solutions and then went back to samtools idxstats on the "new" bam files I'd created, absolutely nothing had changed from the original. It still contained all the chrM etc etc in. I appreciate your patience and thanks for your input. I'm afraid patience is something I've run out of with this stuff.

ADD REPLY
0
Entering edit mode

looking at my answer A: Remove mitochondrial reads from BAM files

that would be:

java -jar samjs.jar  -e 'record.getReadUnmappedFlag() || !record.getReferenceName().equals("chrM") || !record.getReferenceName().contains("_gl")' in.bam > out.sam
ADD REPLY
0
Entering edit mode

It looks like you forgot "xargs" before the second samtools command, in the example you posted. It's an easy thing to get wrong, but it does matter here because it's what takes the valid chromosome names and uses them as filtering arguments in the samtools view call. Otherwise, it's not filtering at all (since it doesn't get the chromosome name list), and the output will be the same as the input (as you see).

ADD REPLY
0
Entering edit mode

Hi thanks for the reply. I've managed to get the sed solution working but it's long and cumbersome so keen to use a quicker and easier way. So a couple of questions on your solution that uses xargs:-

  • I've heard GNU parallel can replace xargs and I already have parallel. Can this be used instead of xargs in this solution somehow?

  • If I wanted to remove chrM and chrUn etc etc at the same time in this command, how would it be written? I'm guessing grep -v <something>???

  • Is there any way to run this on multiple samples at the same time or do I have to run them one by one?

Thanks again. You guys might be saving me from throwing my mac out of a window. Along with my PI for piling bioinformatics on a cell biologist.

ADD REPLY
0
Entering edit mode

parallel vs. xargs: parallel is usually used to execute multiple commands at once on different parts of the input, which isn't what you want here. There's probably a way to get it to work, but I'm not an expert with parallel. On most systems, xargs is already installed (my Mac has xargs but not parallel, for instance).

multiple filtering: sure, you could combine multiple filtering steps with pipes (grep -v BAD1 | grep -v BAD2 | grep -v BAD3 ...). It's probably cleaner to make a BED file with chromosomes (with coordinate ranges say 1 to 1000000000, to get everything) you want to keep, and use the samtools view -L option with that file.

You can definitely run this on multiple samples at once, and there are a lot of ways to do this (parallel would be good for this, or xargs, or read up on shell scripting with loops).

Overall, I would suggest building up what you're trying to do step by step, to make sure everything makes sense as you go along. Unfortunately my original one-liner was ugly and complicated, and it fails in unexpected ways if you don't get it exactly right. But basically the first part (samtools idxstats input.bam | cut -f 1 | grep -v MT) is only to build a list of chromosomes to keep, and the second part (with xargs) is just a trick to pass that list into samtools view. You can try simpler ways to make the list of valid chromosomes, and make sure that works first, perhaps on a small subset of the whole bam (see samtools view -s). Maybe also experiment with samtools view -c, which will just give the number of reads matching your filtering conditions. Then you can quickly see if your filtering is doing anything at all, or roughly matching what you'd expect.

ADD REPLY
1
Entering edit mode

Thank you so much for the advice. Much appreciated. Yep I think much reading is needed. Or a degree in bioinformatics would be great haha!

ADD REPLY

Login before adding your answer.

Traffic: 1770 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6