Pair reads with BAM headers and number of mapped reads
1
0
Entering edit mode
8.0 years ago
jvuo ▴ 20

Hi,

I'm trying to combine the FASTA headers from my reads file with the matched headers from my BAM file and the numbers of mapped reads. I have a FASTA file with 15k reads that I aligned against an other FASTA file containing 1M markers.

I extracted the mapped reads using:

samtools view -F4 markers.sorted.bam > markers.mapped.bam

But running:

samtools view -H markers.mapped.bam

returns a lot more seqs than expected, I probably don't fully grasp the BAM format.

@HD     VN:1.0  SO:coordinate
@SQ     SN:gi|345004010|ref|NC_015954.1|:c1336247-1335993       LN:255
@SQ     SN:gi|345004010|ref|NC_015954.1|:2775135-2775383        LN:249
@SQ     SN:gi|345004010|ref|NC_015954.1|:c1686708-1686454       LN:255
@SQ     SN:gi|345004010|ref|NC_015954.1|:c419000-417852 LN:1149
@SQ     SN:gi|345004010|ref|NC_015954.1|:c488066-486981 LN:1086
@SQ     SN:gi|345004010|ref|NC_015954.1|:2347131-2347793        LN:663
.. etc ..

The mapping was done with GraphMap and output as SAM I converted it to BAM via:

samtools view -bS markers.sam > markers.bam

Finally I'm looking for a way to count the number of mapped reads per marker.

Thanks.

Assembly BAM SAMTools • 2.2k views
ADD COMMENT
0
Entering edit mode

I find it very upsetting when people ask questions about data but do not share the data. It is like being told about a party i'm not invited too.

Please post the header, or a snippit of it :)

EDIT: Also would be nice to know what tool did the mapping, although i hope it will be obvious from the header.

ADD REPLY
1
Entering edit mode

Oh dear, he knows about the party :-/

ADD REPLY
0
Entering edit mode

Hahah - i'll bring a data salad and a 6-pack of regex beer.

ADD REPLY
1
Entering edit mode

Could you give me time.sleep(300), I'm reading the man pages.

ADD REPLY
1
Entering edit mode

Hahaha, i can't believe such a publication exists! And copyright no less. You can't tell other bioinformaticians these secrets without attribution!

ADD REPLY
0
Entering edit mode

Let's organize a biostars new years party. Can you make && make install the invitations?

ADD REPLY
1
Entering edit mode

I updated the post with some sample data

ADD REPLY
1
Entering edit mode

Nice one - so yeah as abascalfederico says, you should have around 1 million 'contigs' in there, which isn't giong to be a bag of fun to deal with in anything other than SQL. I wrote a tool that might help, but I have also never asked it to deal with a million anythings before, so we'll see. The good news is that it's easy to use.

  1. Download SeQC from https://github.com/JohnLonginotto/SeQC and unzip it to a folder of junk.
  2. From the terminal run python /path/to/SeQC/SeQC.js.py --analysis RNAME --input /path/to/your.bam
  3. Wait. Perhaps check your PC's RAM usage in htop if you have it installed.
  4. Before it finishes it should say something like Checking index on 633a3b0d8ab37f7cac048fa31d8ca554_RNAME
  5. After it's done they'll be a file called myProject.SeQC, run sqlite3 /path/to/myProject.SeQC
  6. In there type, and don't forget the dots, .headers on, then .mode columns, and finally:
  7. SELECT * from '633a3b0d8ab37f7cac048fa31d8ca554_RNAME' order by counts DESC limit 10;

Of course you can then write it out to CSV or whatever you want, but that should get you started.

Oh, if your BAM isn't already filtered, used --analysis RNAME FLAG, and then we can filter it in SQL

ADD REPLY
1
Entering edit mode

And here I was hoping for an easy fix. I will try your tool tomorrow and report back. Thanks.

ADD REPLY
0
Entering edit mode

Did you ever get it to work? :)

ADD REPLY
2
Entering edit mode
8.0 years ago
abascalfederico ★ 1.2k

With samtools -H you are asking for the header of the bam file. In that header one of the things shown is the list of all the contigs you aligned to. If you aligned against a file with 1M sequences, I'd expect 1M "@SQ" entries in the BAM header. Is that what you get?

ADD COMMENT
0
Entering edit mode

It does:

samtools view -H markers.mapped.bam | wc -l

returns a little over a 1M lines. How can I get only the mapped headers?

ADD REPLY
0
Entering edit mode

The "mapped headers" (i.e. the names of the reads that were mapped) can be obtained with:

samtools view markers.mapped.bam | cut -f1

If you needed a non-redundant list:

samtools view markers.mapped.bam | cut -f1 | sort | uniq > unique.list

Is that what you wanted?

ADD REPLY

Login before adding your answer.

Traffic: 2350 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6