How to reduce cross mapping alignments with paired-end reads ?
3
0
Entering edit mode
7.2 years ago
David ▴ 240

Hi,

I´m trying to align paired-end reads (illumina) coming from a metagenomics experiment to a custom database of bacterial species.

I´m using bwa-mem (althought could change program if required) to align reads to my genomes. It works pretty well but i have some reads mapping to two different bacterial species. I know this is normal and occurs because those genomes a very close related.

In my case i have Species-A mapped by 2200 reads where as close related speciesB is mapped by 800reads. Only speciesA is present in my sample so clearly speciesB is a false positive.

Is there any way to try to reduce false positives by tuning bwa in order to minimize the impact of cross mapping reads ??

thanks , david

bwa genome dnaseq • 2.6k views
ADD COMMENT
1
Entering edit mode

In case you are absolutely sure than B is not in the sample, why not removing B from the database?

ADD REPLY
0
Entering edit mode

Agree with ATPoint , why do you align against Species B ref if it's not possible to find it ?

You can still change your parameters to filter reads with Mismatch penalty , but you will got less aligned reads in global mapping.

ADD REPLY
0
Entering edit mode

I´m testing my pipeline with a known MOCK but the idea is to apply the same pipeline to unknown metagenomic samples so removing species from my database makes no sense.

ADD REPLY
1
Entering edit mode
7.2 years ago
GenoMax 147k

BBSplit from BBMap is designed for this type of application. Check this SA thread for information. You can choose how to handle the multi-mapping reads (ones that map across genomes). We have used this tool with a pool of 20 species with great results.

ADD COMMENT
0
Entering edit mode
7.2 years ago

The choice of aligner should be made based on the read lengths that you've got. BWA mem is optimised for reads > 70bp. If you've got shorter reads, then you should consider bowtie/bowtie2 or the older BWA algorithm.

If you wanted to attempt to increase specificity of mapping for the species that you know was in your sample(s), then you could eliminate reads that are below a certain length (say, 70bp), which can be done using cutadapt or Trim Galore!, and then perform the alignment. Longer read lengths result in more specific matching, of course.

Also it would be useful to run FastQC on your reads in order to see how the base quality looks across the reads. If it's poor base quality, then they'll map to dozens of species even outside your database due to false base calls. Generally, you should only have bases with Phred-scaled qualities > 20 or 30 at the read ends.

With bowtie/bowtie2, you can also add a useful set of parameters to ensure that only uniquely-mapped reads are retained. Taking it direct from my notes:

Use -m 1 and --best options to only keep uniquely matched reads (and the one with the 'best' MAPQ)

A few things to consider and try.

Kevin

ADD COMMENT
1
Entering edit mode

Thanks for the update kevin,

It´s a 2x150bp experiment. I´ve done all the above trimming and QC on the reads prior to mapping them on the database.

Thanks for the bowtie2 notes , i might consider give it a try.

david

ADD REPLY
0
Entering edit mode

Did you check if your pairs map on the same ref ?

ADD REPLY
0
Entering edit mode
7.2 years ago
David ▴ 240

Yes, they both map, genomes are very close so i suspect there is no real solution ???

ADD COMMENT
0
Entering edit mode

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

This comment should go up against @Titus's.

ADD REPLY
0
Entering edit mode

Yes there is plenty of solution :

  • You can make a filter on the % of coverage of your genomes ( in your case you will get only this gene covered on your Species B reference isn't it ? )

  • You can select for each ref of your species only specific part etc ...

The solution will depends on what you want at the end.

By the way i m not sure "false positive" in your title are appropriate if your sequences are identical...

ADD REPLY
0
Entering edit mode

Yes , that´s what i thought, working with a certain % coverage, it makes a lot of sense ??

Is there any database that list all ncbi genome lengths for bacterial ?? or should i create it myself?

Agree with the title, might not be appropriate

ADD REPLY
0
Entering edit mode

Well it make sense if in your genome you have only this gene covered , depends on the genome size so in % coverage etc ...

I thinks genome lengths bacterial are available (for the most popular). If you have fasta you can simply calculate the length :)

ADD REPLY

Login before adding your answer.

Traffic: 1860 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6