I'm trying to create a piRNA count table from samples enriched for small RNAs (~31 bases) using the piRBase reference. Despite all my reading I haven't found a simple way to do that.
At first I tried to quantify piRNAs with a strategy similar to the one used for miRNAs, by aligning trimmed reads with Bowtie to piRBase (instead of miRBase), but I underestimated the multimapping problem: more than 99% of the aligned reads multimap to the reference even when requiring 0 mismatches (bowtie -v 0). Indeed, the Drosophila piRNA reference contains >41 million sequences.
Among all the perfect matches, I would like to prioritize the reference sequences that have no additional bases.
For example, the read TGCTTGGACTACATATGGTTGAGGGTTGTA perfectly matches many piRNAs in piRBase:
In that case I would like to count only piR-dme-179 because it has no additional bases. Are there tools that could do this efficiently? Maybe using a reciprocal matching criterion.
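The "no additional bases" criterion amounts to exact equality between the read and the reference sequence, rather than a substring match. A minimal Python sketch (the second candidate name and sequence are hypothetical, only for illustration):

```python
read = "TGCTTGGACTACATATGGTTGAGGGTTGTA"

# Hypothetical set of piRBase candidates that contain the read.
candidates = {
    "piR-dme-179": "TGCTTGGACTACATATGGTTGAGGGTTGTA",   # identical to the read
    "piR-dme-XXX": "TGCTTGGACTACATATGGTTGAGGGTTGTAA",  # hypothetical: one extra base
}

# Keep only references whose sequence equals the read exactly,
# i.e. the reference has no additional bases.
exact_matches = [name for name, seq in candidates.items() if seq == read]
```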
Maybe you can create a new FASTA file with your "unique" piRNA sequences and use it as a "reference genome" to perform the alignment with Bowtie. Then you can obtain piRNA count data, for example with the summarizeOverlaps R function.
The other references/databases have fewer piRNAs, but they still contain sequences that are subsequences of others, so they don't solve the multi-mapping problem when using aligners like Bowtie. Most papers/pipelines I have seen map to the genome, which is even worse in terms of multi-mapping, and I'm not trying to discover new piRNAs. I ended up writing a custom script to quantify piRBase piRNAs (in my case >95% of the FASTQ sequences found a perfect match).
In case someone is interested, here is how I did it (requires at least 10 GB of RAM):
Parse the piRBase FASTA to create a map <sequence> -> <sequence name> (e.g. with a dict in Python or a named list in R)
Parse the sample FASTQ:
if you have UMIs to account for PCR bias, create a map <sequence> -> set of seen UMIs: the count will be the number of unique UMIs
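The steps above could be sketched in Python roughly as follows (a minimal exact-match counter; file paths are placeholders and the UMI variant is left out for brevity):

```python
import gzip
from collections import defaultdict

def read_fasta(path):
    """Yield (name, sequence) pairs from a plain or gzipped FASTA file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        name, seq = None, []
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(seq)
                name, seq = line[1:].split()[0], []
            else:
                seq.append(line)
        if name is not None:
            yield name, "".join(seq)

def count_pirnas(fasta_path, fastq_path):
    """Count FASTQ reads that exactly match a piRBase sequence."""
    # Step 1: map <sequence> -> <sequence name> (this is what needs the RAM).
    seq_to_name = {seq: name for name, seq in read_fasta(fasta_path)}
    # Step 2: parse the sample FASTQ and look up each read.
    counts = defaultdict(int)
    opener = gzip.open if fastq_path.endswith(".gz") else open
    with opener(fastq_path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:  # sequence line of each 4-line FASTQ record
                name = seq_to_name.get(line.strip())
                if name is not None:
                    counts[name] += 1
    return counts
```

For the UMI variant, `counts[name]` would instead hold a set of seen UMIs and the final count would be the size of that set.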
Please do not delete threads once they have received a comment/answer. While we do not encourage this, you could accept your own answer to provide closure to this thread. Ideally, if you could share the custom script via a GitHub gist or by including it here, that would be great.
Respectfully, I think you're trying to reinvent the wheel.
Sometimes other DBs have fewer sequences because they have been curated, which means all redundancy, or as much redundancy as possible, has been discarded. No DB is perfect, but that does not mean they are useless.
they still have sequences that are subsequences of others
In my opinion, that does not mean they are wrong/redundant (even less so for piRNAs). If you are not interested in previously undescribed piRNAs, you can use curated databases and standard pipelines; in your case, clustering (local/global alignments) would probably be a good option.
Search against the piRBase, generate the piRNA count table.
## extract the piRNAs
# big.fa.gz, is the piRBase file (eg: 42 million records)
# small.fa.gz, is the query
$ seqkit grep -s -f <(seqkit seq -s small.fa.gz) big.fa.gz > out.fa
## custom script
# extract the read count for each sequence from small.fa.gz
piRBase contains too many sequences (~42 million records for dme, v2.0), and we always have 10+ million reads per small RNA sample. It is not practical to search a 42-million-record reference with a 10+-million-read query.
So I tried to reduce the query size as much as possible.
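One way to reduce the query size is to collapse the FASTQ into unique sequences with their read counts before searching, similar to what fastx_collapser produces. A sketch in Python (the `>seqN_xCOUNT` header convention is an assumption, borrowed from the collapsed-FASTA style):

```python
import gzip
from collections import Counter

def collapse_fastq(path):
    """Count occurrences of each unique sequence in a plain or gzipped FASTQ."""
    counts = Counter()
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:  # sequence line of each 4-line FASTQ record
                counts[line.strip()] += 1
    return counts

def write_collapsed_fasta(counts, out_path):
    """Write unique sequences as FASTA, embedding the read count in each header."""
    with open(out_path, "w") as out:
        for n, (seq, count) in enumerate(counts.most_common(), start=1):
            out.write(f">seq{n}_x{count}\n{seq}\n")
```

The resulting small FASTA can then be used as the query for seqkit grep, and the counts recovered from the headers afterwards.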
Did you try a manually curated database, as described here?
Past thread that may be useful: Suggest Pirna Ngs Small Rna Analysis Pipelines/Workflows