How to quantify piRNAs ?
3
0
Entering edit mode
3.5 years ago
Tom • 0

Hi

I'm trying to create a piRNA count table from samples enriched with small RNAs (~31 bases) using the piRBase reference. Despite all my readings I haven't find a simple way to do that.

In the first place I tried to quantifiy piRNAs with a similar strategy as miRNAs by aligning trimmed reads with Bowtie to piRBase (instead of miRBase) but I underestimated the multimapping problem: more than 99% of the aligning reads multimap the reference even when requiring 0 mistmatch (bowtie -v 0). Indeed the drosophila piRNA reference contains >41 million sequences.

Among all the perfect matches I would be interested in prioritizing the reference sequences that have no additionnal bases.

For example the read TGCTTGGACTACATATGGTTGAGGGTTGTA perfectly matches many piRNAs in the piRBase:

>piR-dme-179
TGCTTGGACTACATATGGTTGAGGGTTGTA
>piR-dme-183
CTGCTTGGACTACATATGGTTGAGGGTTGTA
>piR-dme-800
TGCTTGGACTACATATGGTTGAGGGTTGTAA
>piR-dme-46776
ATGCTTGGACTACATATGGTTGAGGGTTGTA
>piR-dme-48715
GCTGCTTGGACTACATATGGTTGAGGGTTGTA
...

For that case I would like to count only piR-dme-179 because it has no additionnal bases. Are there tools that could help efficiently do that ? Maybe using a reciprocal matching criterion.

I hope you can help, thank you

piRNA piRBase • 2.2k views
ADD COMMENT
0
Entering edit mode

did you try a manually curated database? as described here

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

Maybe you can create a new fasta file with your "unique" piRNA fasta sequences and use it as a "reference genome" to perform the alignment with Bowtie. Then, you can obtain counts piRNA data for example with summarizeOverlaps R function.

ADD REPLY
0
Entering edit mode
3.5 years ago
Tom • 0

Thank you.

The other references/databases have less piRNAs but they still have sequences that are subsequences of others, therefore it doesn't solve the multi-mapping problem when using aligners like Bowtie. In most papers/pipelines I have seen they map to the genome which is worse in terms of multi-mapping and I'm not trying to discover new piRNAs. I ended up writting a custom script to quantify piRBase piRNAs (in my case >95% of the FASTQ sequences found a perfect match).

In case someone is interested, the way I did (requires at least 10GB of RAM):

  1. Parse the piRBase FASTA to create a map:<sequence> -> <sequence name> (e.g. with a dict in Python or a list in R)
  2. Parse the sample FASTQ:
    • if you have UMIs to account for PCR bias, create a map <sequence> -> seen UMIs: the count will be the number of unique UMIs
    • otherwise map <sequence> -> number of appearances
  3. Write the count table from these 2 mappings
ADD COMMENT
0
Entering edit mode

Please do not delete threads once they have received a comment/answer. While we do not encourage this you could accept your own answer to provide closure to this thread. Ideally if you can share the custom script via a GitHub gist or by including it here it would be great.

ADD REPLY
0
Entering edit mode

Respectfully I think you're trying to reinvent the wheel. Sometimes other DBs have fewer sequences because they have been curated, which means all redundancy or as much redundancy as possible has been discarded. There are no perfect DBs but it does not mean they are useless.

they still have sequences that are subsequences of others

In my opinion, it does not mean they are wrong/redundant (and even less regarding pirnas). If you are not interested in non-previously described piRNAs you can use curated databases and standard pipelines, in your case probably using clustering (local/global alignments) would be a good option.

ADD REPLY
0
Entering edit mode
3.1 years ago

All of the sequences linked above and annotated as piRNA are most likely 2S rRNA derived fragments and not piRNA: https://www.ncbi.nlm.nih.gov/nuccore/V00236

ADD COMMENT
0
Entering edit mode
3.1 years ago
wm ▴ 570

In my case, I use the following way to quantify piRNA counts;

  1. Collapse reads (and remove UMIs, if exists), here I have the read count
  2. (optional) remove reads by size (eg: exclude reads less than 23nt, that could be siRNAs, miRNAs, ...)
  3. (optional) remove reads by read count (eg: exclude read_count == 1)
  4. Search against the piRBase, generate the piRNA count table.
## extract the piRNAs
# big.fa.gz, is the piRBase file (eg: 42 million records)  
# small.fa.gz, is the query, 
$ seqkig grep -s -f <(seqkit seq -s small.fa) big.fa.gz > out.fa

## custome script
# extract the read count for each sequence from (small.fa.gz)

The piRBase including too many sequences (~42 million records for dme, v2.0); Always we have 10+ million reads for each small RNA sample. It is not a good idea to search against 42 records with many 10+ records query.
So I tried to reduce the query size as much as possible

ADD COMMENT

Login before adding your answer.

Traffic: 2577 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6