How to filter .fa file by gene_biotype
2.5 years ago
bioinfo ▴ 150

I have a FASTA file that I plan to use to build a kallisto index. I would like to filter it and remove every entry whose gene_biotype and transcript_biotype are rRNA.

I tried to import my FASTA file into R using read.fasta, but it is read in as a list and I cannot filter it. Is there an easy way to filter the file? ensembldb seems very useful, but I do not want to download the file from Ensembl. I want to import the file that I already have, because I merged the cDNA and ncRNA files from Ensembl and will use the merged file to build the kallisto index.

Thank you

rna-seq kallisto • 921 views

When creating a kallisto index using the recommended approach (kb-python), you can pass --exclude-attribute to kb ref to exclude certain features from the reference (e.g. --exclude-attribute gene_biotype:lincRNA will exclude lincRNAs).

You should filter out things that cause multimapping (like read-through transcripts) and/or things that should never be found in your RNA-seq reads, since kallisto's index takes considerably more memory the more sequences you add to it.

If there is a lot of rRNA contamination, it's probably not the best idea to filter them out at this stage as explained in the other thread.


This may be a duplicate of "How to create kallisto index that does not include ribosomal genes"; either way, that thread may be helpful. The main question is why you want to filter rRNA at this stage: it is likely better to do this at a later stage in your experiment, if required.


Multiple people in this and the previous thread suggested not removing rRNA, so it is up to you whether to follow that advice. On the technical side, since you mention R, a FASTA file is best read with readDNAStringSet from Biostrings, which creates an easy-to-manipulate object. The easiest route is probably to get the names of the transcripts you want to remove from a GTF, since it explicitly contains the biotype, and then remove those entries from the DNAStringSet; see also https://www.rdocumentation.org/packages/Biostrings/versions/2.40.2/topics/XStringSet-io
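
For anyone who prefers to do this outside R, here is a rough sketch of the same GTF-based idea in plain Python (standard library only). It is not the Biostrings route described above, just an illustration. The function names are made up for this example, and it assumes an Ensembl-style GTF (attributes such as transcript_id "..."; gene_biotype "..."; transcript_biotype "...";) and FASTA headers that start with the transcript ID, optionally followed by a .version suffix.

```python
import re

# Matches GTF attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def biotype_transcript_ids(gtf_lines, biotypes=("rRNA",)):
    """Collect transcript IDs whose gene or transcript biotype is in `biotypes`.

    Assumes Ensembl-style GTF attributes (an assumption, check your file).
    """
    drop = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only look at transcript records
        attrs = dict(ATTR_RE.findall(fields[8]))
        tx_id = attrs.get("transcript_id")
        if tx_id is None:
            continue
        if (attrs.get("gene_biotype") in biotypes
                or attrs.get("transcript_biotype") in biotypes):
            drop.add(tx_id)
    return drop


def filter_fasta(fasta_lines, drop_ids):
    """Yield FASTA lines, skipping records whose ID is in `drop_ids`.

    The ID is the first word of the header with any ".version" suffix
    removed, because Ensembl FASTA IDs are versioned while the GTF
    transcript_id attribute is not.
    """
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            seq_id = parts[0].split(".")[0] if parts else ""
            keep = seq_id not in drop_ids
        if keep:
            yield line


# Example usage (file names are placeholders):
# with open("annotation.gtf") as gtf:
#     drop = biotype_transcript_ids(gtf)
# with open("merged.fa") as fa, open("filtered.fa", "w") as out:
#     out.writelines(filter_fasta(fa, drop))
```

Ensembl cDNA/ncRNA FASTA headers usually also carry gene_biotype: and transcript_biotype: tags, so matching on the header text directly is another option if you do not have a matching GTF at hand.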
