We are looking for pathogen miRNAs in host tissues among reads that couldn’t map to the host genome. The relevant commands are below;
mapper.pl "$file"_HOST_UNMAPPED.fq -e -h -m -p $PathogenGenomeBasename -t "$file"_HOST_UNMAPPED_vs_PATHOGEN.arf -v -s "$file"_HOST_UNMAPPED_vs_PATHOGEN_COLLAPSED.fa
miRDeep2.pl "$file"_HOST_UNMAPPED_vs_PATHOGEN_COLLAPSED.fa $PathogenGenomeBasename.fa HOST_UNMAPPED_vs_PATHOGEN.arf $PathogenMatureMiRNA $PathogenMiRNA $PathogenPrecursorMiRNA -c -d -g -1
According to this analysis resulting from the above, we are getting miRNAs in our uninfected controls. When I look into the log file, I can see the following for the mapper.pl stage;
parsing fastq to fasta format
discarding short reads
collapsing reads
mapping reads to genome index
trimming unmapped nts in the 3' ends
Log file for this run is in mapper_logs and called mapper.log_9529
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 2290436 27708 2262728 1.210 98.790
seq: 2290436 27708 2262728 1.210 98.790
And this for the miRDeep2.pl stage;
Converting input files
building bowtie index
mapping mature sequences against index
mapping read sequences against index
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 2290420 24913 2265507 1.088 98.912
seq: 2290420 24913 2265507 1.088 98.912
Surely the total reads for miRDeep2.pl to start with should be 27,708, and not 2,290,420? If so, could anyone advise how to adjust the above commands to fix this? My understanding is that mirdeep2.pl takes the reads that mapper.pl maps, and tries to identify miRNAs from those mappings if they satisfy the various criteria. Is this correct?
My assumption is that with millions of reads going into the miRDeep2.pl stage (instead of the ~20,000), this is generating lots of false positives.
Your input into
miRDeep2.pl
is just the processed reads, e.g. those filtered for one reason or another. This does not include any information about whether they were properly mapped or not.Ah ok, I thought that the collapsed reads fasta only included reads that had mapped. What options can I use to generate a fasta of mapped reads following mapper.pl? Is it possible to extract mapped reads in a fasta format from the arf file?
I'm not sure it matters, since it'll only likely change the %mapped. The information you get out of the miRDeep2 pipeline will be identical other than these QC metrics.