How to extract the longest CDS from Homo_sapiens.GRCh38.cds.all.fa?
3.8 years ago
Enhancer • 0

I downloaded the full set of human coding sequences (Homo_sapiens.GRCh38.cds.all.fa) from Ensembl but have not been able to filter it. I expected to find a script online but have not come across one so far. I need help with the following: (1) extracting the longest CDS transcript for each gene from Homo_sapiens.GRCh38.cds.all.fa, and (2) removing the pseudogenes.

I want to use the final output file as a reference database to find orthologous coding sequences in other mammalian taxa.

I am a beginner in bioinformatics, especially with big data, but I can find my way around Ubuntu and Vagrant VM.

Thank you

python script biopython ensembl • 1.5k views

See if one of the answers here helps: How to extract the longest isoform from multi fasta file

You could also look at the MANE project as an alternative. It aims to provide "one high-quality representative transcript per protein-coding gene that is well-supported by experimental data and represents the biology of the gene."


Isn't the MANE project still incomplete? I looked at GRCh38.v0.93.select_ensembl_rna.fna.gz from MANE, and it has fewer than 18,000 CDS, which is fewer than I would have expected.

Concerning the link "How to extract the longest isoform from multi fasta file": after running the code, all I got was a trimmed header for every CDS in the input file, as shown below:

ENST00000000233 10.cds chromosome:GRCh38:7:127588411:127591700:1 gene:ENSG00000004059
ENST00000000412 8.cds chromosome:GRCh38:12:8940361:8949645:-1 gene:ENSG00000003056
ENST00000000442 11.cds chromosome:GRCh38:11:64305524:64316743:1 gene:ENSG00000173153
ENST00000001008 6.cds chromosome:GRCh38:12:2794970:2805423:1 gene:ENSG00000004478
ENST00000001146 7.cds chromosome:GRCh38:2:72129238:72147862:-1 gene:ENSG00000003137
ENST00000002125 9.cds chromosome:GRCh38:2:37231658:37249160:1 gene:ENSG00000003509
ENST00000002165 11.cds chromosome:GRCh38:6:143494812:143511720:-1 gene:ENSG00000001036
ENST00000002501 11.cds chromosome:GRCh38:16:90004871:90019456:-1 gene:ENSG00000003249
ENST00000002596 6.cds chromosome:GRCh38:4:11393150:11428894:-1 gene:ENSG00000002587
ENST00000002829 8.cds chromosome:GRCh38:3:50155324:50189075:1 gene:ENSG00000001617
ENST00000003084 11.cds chromosome:GRCh38:7:117480025:117668665:1 gene:ENSG00000001626
ENST00000003100 13.cds chromosome:GRCh38:7:92112153:92134477:-1 gene:ENSG00000001630
ENST00000003302 8.cds chromosome:GRCh38:11:113797874:113875570:-1

3.7 years ago
Enhancer • 0

This link could help someone later. However, make sure to modify and adapt it to the nature of your file, especially the format of the sequence header.
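For anyone landing here, below is a minimal sketch of one way to do both steps with only the Python standard library. It is not the code behind the link above. It assumes each header carries space-separated `gene:` and `gene_biotype:` fields, as Ensembl CDS headers normally do, and it treats "remove the pseudogenes" as "keep only records whose `gene_biotype` is `protein_coding`". The function names (`read_fasta`, `header_fields`, `longest_cds_per_gene`, `write_longest_cds`) are made up for this example, so check them against your own headers before trusting the output.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def header_fields(header):
    """Collect 'key:value' tokens from a header; the first occurrence of a key wins."""
    fields = {}
    for token in header.split():
        key, sep, value = token.partition(":")
        if sep and key not in fields:
            fields[key] = value
    return fields


def longest_cds_per_gene(records, keep_biotype="protein_coding"):
    """Return {gene_id: (header, sequence)} holding the longest CDS per gene.

    Assumption: records without a 'gene:' field are skipped, and records whose
    'gene_biotype:' differs from keep_biotype (or is missing) are dropped.
    Pass keep_biotype=None to disable the biotype filter.
    """
    best = {}
    for header, seq in records:
        fields = header_fields(header)
        gene = fields.get("gene")
        if gene is None:
            continue
        if keep_biotype is not None and fields.get("gene_biotype") != keep_biotype:
            continue
        # On ties, the transcript seen first is kept.
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header, seq)
    return best


def write_longest_cds(in_path, out_path, keep_biotype="protein_coding"):
    """Read a CDS FASTA file and write one (longest) record per gene."""
    with open(in_path) as fin:
        best = longest_cds_per_gene(read_fasta(fin), keep_biotype)
    with open(out_path, "w") as fout:
        for header, seq in best.values():
            fout.write(f">{header}\n{seq}\n")
```

To use it, call something like `write_longest_cds("Homo_sapiens.GRCh38.cds.all.fa", "longest_cds.fa")`, where the output name is just a placeholder. Note that "longest" here means the longest nucleotide sequence in the file, which is not necessarily the most biologically representative transcript.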
