Getting Unmapped Reads: Comparing Fastq To Bam
4
8
Entering edit mode
13.0 years ago
User 9996 ▴ 840

given a FASTQ file and a BAM file of aligned reads, is there an efficient way to get all FASTQ reads that are in the original FASTQ but not in the BAM? Perhaps using bedtools. i.e.:

unmapped_script original.fastq aligned.bam > unmapped.fastq

should create an unmapped.fastq file, which is a subset of original.fastq containing only those entries that do not appear in aligned.bam

thank you.

fastq samtools bedtools bam • 21k views
ADD COMMENT
2
Entering edit mode

Every BAM file I've seen contains all reads, including unmapped ones. Are you sure that unmapped reads aren't present in your BAM file? Or am I misundersatnding

ADD REPLY
1
Entering edit mode

I've mainly used BWA, so that's been my experience. I just checked bowtie, and if you run it as /path/to/bowtie --sam <genome> <input_file>, all reads are still reported, including unmapped reads (FLAG is 4, RNAME is *, POS and MAPQ are 0)

ADD REPLY
0
Entering edit mode

bwa definately included unmapped reads, I'm not sure bowtie does by default.

ADD REPLY
0
Entering edit mode

These BAMs are produced by Tophat. My understanding is that Tophat only reports aligned reads from FASTQ, isn't that right?

ADD REPLY
0
Entering edit mode

I've never used TopHat, but a quick peek online makes it look like they only report mapped reads. First time I've seen that...

ADD REPLY
0
Entering edit mode

TopHat only puts in the mapped reads - it'll put them in multiple times if they map to many locations as well - so you may need to pipe column 1 into sort/uniq UNIX pipes to get unique mapped read names.

ADD REPLY
8
Entering edit mode
13.0 years ago
brentp 24k

If you sort the mapped BAM names and the read-names, you can use linux join to print out unpaired--e.g. unmapped reads.

FQ=$1
BAM=$2

# print out mapped headers
samtools view -F 4 $BAM | awk '{ print "@"$1"/1" }' | sort -u > mapped.txt
# grap only the fq headers.
awk '(NR % 4 == 1)' $FQ |  sort -u > reads.txt
# join and print only un-paired (-v)
join -v 1 reads.txt mapped.txt > unmapped.txt

call like::

bash unmapped.sh input.fq mapped.bam

and the unmapped reads will be in unmapped.txt, from there, you can use grep -f to get back a fastq file. Note, you may have to change the awk command (currently '{ print "@"$1"/1" }') to either not add the trailing "/1" or to add "/2" depending on the read end (or depending on the samtools flags.

ADD COMMENT
0
Entering edit mode

what is reads/WT... etc? in here? SHould that be changed to $FQ?

ADD REPLY
0
Entering edit mode

yes. thanks....

ADD REPLY
0
Entering edit mode

won't sort -u run into the same issues I point below? where sorting with unix is different from samtools sort -n? I can't get them to sort the same way it's very frustrating

ADD REPLY
0
Entering edit mode

No, in this example brentp is using unix's sort for both files.

ADD REPLY
6
Entering edit mode
13.0 years ago
Mitch Bekritsky ★ 1.3k

Get the query names from the BAM file:

samtools view <input file> | cut -f1 | sort > BAM_headers.txt

Get the query names from your FASTQ file (assuming your read length is 101bp):

awk '$0 ~ /^@/ && length($0) < 101' <fastq_file> | sed 's/^@//' | sort > FASTQ_headers.txt

Use diff to compare the two files.

diff BAM_headers.txt FASTQ_headers.txt

Since all reads in the FASTQ headers file will also be in the BAM headers file, diff should only show query names that are in the FASTQ file that are not in the BAM file. The output should show something like this:

7a8,9
> D74RYQN1_0189:1:2208:17093:200915#0
> D74RYQN1_0189:1:2208:17337:200823#0
8a11,12
> D74RYQN1_0189:1:2208:19283:200878#0
> D74RYQN1_0189:1:2208:19604:200869#0

That's your list of missing read names. You can parse the diff output to get the read names without leading '>' symbols, then use grep to get the actual reads from the FASTQ file if you'd like. If you want to use this approach and need help figuring out the last two parts, let me know.

ADD COMMENT
1
Entering edit mode

Do you think rlong's algorithm will work? It seems to me you have to have the FASTQ names sorted as well... I'd like to avoid awk and do it from within python if possible

ADD REPLY
0
Entering edit mode

I do like your approach by the way, but it will have to take into account the trailing / or # signs that are used for paired end reads, since many alignment programs drop these. for example tophat will drop the /1 or /2 of the FASTQ read mate 1 and read mate 2 and only report the name that comes before the '/'. it's quite annoying

ADD REPLY
0
Entering edit mode

For rlong's approach to work, you would need to sort the FASTQ headers as well. Otherwise, you'd have to store the BAM headers as a dictionary or a hash, then as you read the FASTQ file, check to see if it occurs in the dictionary. This would be extremely memory inefficient, since you'd have to hold all the BAM files in memory. Sadly, my python expertise is limited. All the parsing is definitely doable in python, but I don't know if there's an implementation of diff in python. If the output of TopHat chops off the '/' or '#', you can use sed to get rid of them. See next comment for that

ADD REPLY
0
Entering edit mode

sed 's/[#/].*$//' FASTQ_headers.txt (or a python regular expression equivalent) would get rid of any # or / symbols in the read. Be careful, because if they occur anywhere else in the read (they shouldn't), then it will truncate from there. Also, in my previous comment, I meant that rlong's approach would have to hold all the BAM query names in memory, not the whole file.

ADD REPLY
0
Entering edit mode

One more thing...will tophat output reads where only one paired end maps?

ADD REPLY
0
Entering edit mode

I like this solution. You only traverse each input file once, limiting IO. Then if you UNKNOWN cares about the contents of the reads, and not just the read names, you have final traversal of the fastq to drop out the reads that match your list of those in the fastq but not the bam.

ADD REPLY
0
Entering edit mode

@mitch: i am running it on each mate separately so it's not an issue

ADD REPLY
0
Entering edit mode

Thanks! It comes from long hours trying to find BASH commands that do things I need without me needing to do lots of coding.

ADD REPLY
0
Entering edit mode

the mate id would not be a problem if you run it on each end separately. if i could find a way to efficiently sort the FASTQ by entry id i think rlong's approach would work, since sorting the IDs or sorting the ids where each ends in a "/1" is the same thing.

ADD REPLY
0
Entering edit mode

Great to know @google. If you were mapping paired-end reads together, there's another bash trick that would tell you if one or both reads didn't map, but it's not necessary here.

ADD REPLY
0
Entering edit mode

The sort for reads ending in a '/1' isn't the problem, from my view. It's the string comparison with the BAM file, which (from what I've seen) won't contain the trailing '/1', and so diff will not be able to match any read headers between the FASTQ and BAM files

ADD REPLY
0
Entering edit mode

Thanks Mitch. I tried the solution I described here -- but it's pretty slow, it takes 10 minutes per fastq file. Is that expected??

ADD REPLY
0
Entering edit mode

Unfortunately, yes; for large files, sorting can take a bit of time. Depending on the sorting algorithm used and how many entries you're trying to sort, 10 minutes isn't terrible, even if it seems like it is ;)

ADD REPLY
0
Entering edit mode

great thanks for all the comments mitch and rlong, was very helpful!

ADD REPLY
0
Entering edit mode

My pleasure! Happy to help...

ADD REPLY
0
Entering edit mode

Just when I thought it was working.. I get errors like: "sort: write failed: /tmp/sortV2AYRD: No space left on device" -- is that normal? how can it be avoided?

ADD REPLY
0
Entering edit mode

Are you sorting in python or on the command line in bash? It sounds like something is out of space. Either it's your hard drive (check with df if you're on unix), or the sort is running out of RAM when it loads all the read headers into memory. If your hard drive still has lots of space, your best bet is to go with command line sort or samtools sort. When they encounter a file that's too large, they'll split the file into smaller chunks, sort the smaller chunks, then merge them together. It would be a pain to code that yourself if you're trying to get it done on a Sunday night.

ADD REPLY
0
Entering edit mode

that error actually looks like it's when the command line sort is running out of space. How big are your header files from the bam or fastq?

ADD REPLY
0
Entering edit mode

i've managed to get it to work the problem now is that the downstream step that rlong suggested doesn't work. i think sort from unix sorts differently than sort -n in bamtools?

ADD REPLY
0
Entering edit mode

From the samtools man page, sort -n sorts by read header, which should give the same output. When you say it doesn't work, does the file not look sorted? Or is there some other issue?

ADD REPLY
0
Entering edit mode

The file gets sorted by sort, but it thinks that "@42EBKAAXX090828:6:73:204:1871/2" comes before "@42EBKAAXX090828:6:1:270:128/2", for example. not sure why. does unix "sort" do something funny with alphanumeric strings?

ADD REPLY
0
Entering edit mode

Example:

$ cat test.fq | perl mergelines.pl | sort --stable -k1,1 | perl splitlines.pl > sorted
$ head sorted
@42EBKAAXX090828:6:100:1699:328/2
@42EBKAAXX090828:6:10:1077:1883/2

as you can see it thinks that "@42EBKAAXX090828:6:100:1699:328/2" comes before "@42EBKAAXX090828:6:10:1077:1883/2"

ADD REPLY
0
Entering edit mode

It's an alphanumeric sort, so it is treating the query names as ASCII values. The sort you're seeing is because 42EBKAAXX090828:6:100:1699:328/2 has a '0' in the same position that 42EBKAAXX090828:6:10:1077:1883/2 has a ':', and ':' comes after '0'. Samtools appears to split the query name by ':' and sort, which will be different than unix sort output. Use this command: sort -t ':' -k 1,1 -k2,2n -k3,3n -k4,4n -k 5,5 (split by ':', numeric sort on fields 2,3,4, alphanumeric sort on 1 and 5) if you want to use samtools sort, or you can just use unix sort on both files.

ADD REPLY
0
Entering edit mode

@Mitch: is your sort command equivalent to using -V to "sort"? thanks.

ADD REPLY
0
Entering edit mode

I looked in my local unix sort and in samtools sort and I don't see -V anywhere. What sort binary are you using that has that option?

ADD REPLY
3
Entering edit mode
13.0 years ago
Rm 8.3k

Follow this "seqanswers" thread to extract unmapped fastqs using original Fastq and bam file (especially from output from tophat or other programms which does not include unmapped reads in bam file)

to extract unmapped reads from bam file (from newer versions of bwa etc...) which includes unmapped reads follow this.

ADD COMMENT
0
Entering edit mode

it seems like the perl script in the first forum link would be inefficient memory wise, since it has to load all the IDs into memory - no?

ADD REPLY
0
Entering edit mode

I dont think IDs list will eat up much of the memory...I have used the script it runs pretty fast...

ADD REPLY
2
Entering edit mode
13.0 years ago
Rlong ▴ 340

One approach would be to sort your bam by read name:

samtools sort -n your.bam

[EDIT: thanks to unknown] The fastq will also need to be sorted by read_names for this to work.

Open a samtools view of the bam in a pipe, and step through the fastq file record by record. If the read-name of your current record sorts to before the current read_name in your samtools view pipe, then that fastq line is not in your bam. If they match, you have a read that exists in both files. If the the read_name from your fastq file sorts to after the current read_name in your samtools view pipe, you have reads in your bam and not in your fastq.

ADD COMMENT
1
Entering edit mode

Doesn't the FASTQ have to be sorted too for this to work?

ADD REPLY
0
Entering edit mode

What is the computational complexity of the algorithm samtools sort -n uses?

ADD REPLY
0
Entering edit mode

Weird, I had to go to the source to find this. Not sure if I just didn't hit right search terms or what. Samtools sort looks like it leans on Heng Li's implementation of mergesort, his ks_mergesort, located in ksort.h.

ADD REPLY
0
Entering edit mode

How can I sort the FASTQ file by ID? I'd prefer to use Python... can BioPython do this??

ADD REPLY
0
Entering edit mode

I still think this approach doesn't work. If you have repetitions in the BAM file, how could it work?

ADD REPLY
0
Entering edit mode

also, if sorting with unix command sort, do i have to use sort -n?

ADD REPLY
0
Entering edit mode

I'm not sure about unix sort. Indeed it would have to mimic the sorting of samtools sort. Regarding the repetitions in the bam, that shouldn't matter. Let's say you have a read name from your fastq, for simplicity we'll say it is "aab". So if you have two "aab" reads in your bam, which you certainly will if this is paired end data, then when you hit "aab" in your fastq, three things can happen. If the last read name from the bam is "aaa", grab another read from the bam pipe. If this is also "aaa", draw another. If you get "aac", there are no "aab" reads in the bam. --- continued ---

ADD REPLY
0
Entering edit mode

if you get "aab" from the bam, you grab another read from the fastq. Suppose it is "aac". Then you will compare "aab" with "aac" and draw another read from the bam pipe. If the next read is "aab" as well, no problem, you'll keep drawing them until you get "aac" or something beyond it.

ADD REPLY
0
Entering edit mode

I can't get unix "sort" to mimic the samtools sort -n.... any ideas on this?

ADD REPLY
0
Entering edit mode

when do you advance to next read?

ADD REPLY

Login before adding your answer.

Traffic: 2441 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6