I downloaded the full set of human coding sequences (Homo_sapiens.GRCh38.cds.all.fa) from Ensembl but have not been able to filter it. I initially thought I could find scripts online, but I have not come across any so far. I need help with the following: (1) extracting the longest CDS transcript per gene from Homo_sapiens.GRCh38.cds.all.fa, and (2) removing the pseudogenes.
I want to use the final output file as a reference database to find orthologous coding sequences in other mammalian taxa.
I am a beginner in bioinformatics, especially with big data, but I can find my way around Ubuntu and a Vagrant VM.
Thank you
See if one of the answers here helps: How to extract the longest isoform from multi fasta file
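If none of those work for your file, here is a minimal Python sketch of the same idea (standard library only). It assumes the usual Ensembl CDS header layout, where each header carries `gene:`, `gene_biotype:` and `transcript_biotype:` fields, and it treats any record whose biotype contains "pseudogene" as a pseudogene. The function names and the script name in the usage line are made up for illustration, so check a few headers with `grep '>' | head` before relying on it.

```python
import re
import sys

# Assumed Ensembl header fields, e.g.
# >ENST... cds chromosome:GRCh38:... gene:ENSG....N gene_biotype:... transcript_biotype:...
GENE_RE = re.compile(r"\bgene:(\S+)")
BIOTYPE_RE = re.compile(r"\b(?:gene|transcript)_biotype:(\S+)")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def longest_cds_per_gene(records):
    """Return {gene_id: (header, sequence)} keeping the longest CDS per gene.

    Records without a gene: field, or with a biotype containing
    'pseudogene', are skipped. Ties keep the first transcript seen.
    """
    best = {}
    for header, seq in records:
        gene = GENE_RE.search(header)
        if gene is None:
            continue
        if any("pseudogene" in b for b in BIOTYPE_RE.findall(header)):
            continue
        gene_id = gene.group(1).split(".")[0]  # drop the version suffix
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (header, seq)
    return best


def write_fasta(best, out, width=60):
    """Write the selected records with full headers, wrapped at `width`."""
    for header, seq in best.values():
        out.write(f">{header}\n")
        for i in range(0, len(seq), width):
            out.write(seq[i:i + width] + "\n")


if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as fin, open(sys.argv[2], "w") as fout:
        write_fasta(longest_cds_per_gene(read_fasta(fin)), fout)
```

You would run it as something like `python longest_cds.py Homo_sapiens.GRCh38.cds.all.fa longest_cds.fa`. Note that "longest" here just means the most nucleotides in the CDS record; if you would rather keep only `protein_coding` genes, change the biotype test accordingly.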
You could also look at the MANE project as an alternative to get:
"One high-quality representative transcript per protein-coding gene that is well-supported by experimental data and represents the biology of the gene."
Isn't the MANE project still incomplete? I actually looked at GRCh38.v0.93.select_ensembl_rna.fna.gz from MANE; it has fewer than 18,000 CDS, fewer than I would have expected.
Concerning the link "How to extract the longest isoform from multi fasta file": after running the code, all I got was the trimmed headers of all the original CDS in the input file, as shown below: