Question

Using transdecoder to filter a fasta file with preserving whole transcripts

0

Entering edit mode

3.2 years ago

poecile.pal ▴ 50

Good evening,

I have .fasta file transcripts_longestiso.fasta (output of rnaSPAdes) that looks like this

>NODE_1_length_24937_cov_13085.239743_g0_i0
CCTTTTTATTTCTCATCAAATGAATGGCATCTTCTTCTGGAA...
>NODE_7_length_14915_cov_350.128521_g1_i0
AAATAGAAGGGTCT...

etc

I need to discard transcripts with ORF length less than 100 amino acids (300 nucleotides). To do this, I launched

TransDecoder.LongOrfs -t transcripts_longestiso.fasta

I got longest_orfs.pep containing the predicted amino acid sequences, if I understand correctly. Like this:

>NODE_1_length_24937_cov_13085.239743_g0_i0.p1 type:complete len:247 gc:universal NODE_1_length_24937_cov_13085.239743_g0_i0:9145-9885(+)
MRPLEAP...
>NODE_1_length_24937_cov_13085.239743_g0_i0.p2 type:complete len:247 gc:universal NODE_1_length_24937_cov_13085.239743_g0_i0:15793-15053(-)
MRPLE...

etc

And longest_orfs.gff3

NODE_1_length_24937_cov_13085.239743_g0_i0      transdecoder    gene    1       24937   .       +       .
       ID=GENE.NODE_1_length_24937_cov_13085.239743_g0_i0~~NODE_1_length_24937_cov_13085.239743_g0_i0.p1;Name=ORF%20type%3Acomplete%20len%3A247%20%28%2B%29
NODE_1_length_24937_cov_13085.239743_g0_i0      transdecoder    mRNA    1       24937   .       +       .
       ID=NODE_1_length_24937_cov_13085.239743_g0_i0.p1;Parent=GENE.NODE_1_length_24937_cov_13085.239743_g0_i0~~NODE_1_length_24937_cov_13085.239743_g0_i0.p1;Name=ORF%20type%3Acomplete%20len%3A247%20%28%2B%29

And other files that are transdecoder output.

But I would like to get the same fasta file at the output, only filtered. With whole transcripts and in a nucleotide record. But those transcripts that do not meet my condition should not be included in this file.

For example, if in NODE_1_length_24937_cov_13085.239743_g0_i0 transcript transdecoder found an ORF with appropriate length, and in NODE_7_length_14915_cov_350.128521_g1_i0 - did not find it, then the output fasta file should be

>NODE_1_length_24937_cov_13085.239743_g0_i0
CCTTTTTATTTCTCATCAAATGAATGGCATCTTCTTCTGGAA...

Could you please tell me how to do this?

Thank you in advance!

Best regards, Poecile

transcriptome assembly rna-seq fasta transdecoder • 1.3k views

ADD COMMENT • link updated 3.2 years ago by cpad0112 21k • written 3.2 years ago by poecile.pal ▴ 50

0

Entering edit mode

It's tl;dr and confusing.

ADD REPLY • link 3.2 years ago by cpad0112 21k

score 2 · Accepted Answer · 2022-03-04

I solved the problem on my own, and I hope this will help someone too)

1) Extract the first column with seqid from pep file (without > at the beginning and without .pN at the end)

$ sed -n '/^>/p' longest_orfs.pep > longest_orfs_ids.pep
$ cut -f1 -d ' ' < longest_orfs_ids.pep > longest_orfs_seqids_frompep.txt
$ cut -f1 -d 'p' < longest_orfs_seqids_frompep.txt > longest_orfs_seqids_withoutpN_frompep.txt
$ sed 's/^.//;s/.$//' longest_orfs_seqids_withoutpN_frompep.txt > longest_orfs_seqids_withoutpN_withoutdotsandticks_frompep.txt

2) It turns out a lot of duplicates. Remove them.

$ sort longest_orfs_seqids_withoutpN_withoutdotsandticks_frompep.txt | uniq > longest_orfs_seqids_withoutpN_withoutdotsandticks_frompep_uniq.txt

3) Extract from fasta transcripts listed in longest_orfs_seqid_uniq.txt

$ seqtk subseq transcripts_longestiso.fasta ./transcripts_longestiso.fasta.transdecoder_dir/longest_orfs_seqids_withoutpN_withoutdotsandticks_frompep_uniq.txt > transcripts_longestiso_longorfs.fasta