Question

How to extract the longest CDS from Homo_sapiens.GRCh38.cds.all.fa?

0


3.8 years ago

Enhancer • 0

I downloaded the file of all human coding sequences (Homo_sapiens.GRCh38.cds.all.fa) from Ensembl but have not been able to filter it. I expected to find scripts online but have not come across any so far. I need help with the following: (1) extracting the longest CDS transcript per gene from Homo_sapiens.GRCh38.cds.all.fa; (2) removing the pseudogenes.

I want to use the final output file as a reference database to find orthologous coding sequences in other mammalian taxa.

I am a beginner in bioinformatics, especially with big data, but I can find my way around Ubuntu and a Vagrant VM.

Thank you

python script biopython ensembl • 1.5k views

ADD COMMENT • link 3.7 years ago by Enhancer • 0

1


See if one of the answers here helps: How to extract the longest isoform from multi fasta file

You could also look at the MANE project as an alternative, to get "one high-quality representative transcript per protein-coding gene that is well-supported by experimental data and represents the biology of the gene."

ADD REPLY • link 3.8 years ago by GenoMax 147k

0


Isn't the MANE project still incomplete? I looked at GRCh38.v0.93.select_ensembl_rna.fna.gz from MANE; it has fewer than 18,000 CDS, fewer than I would have expected.

Concerning the link "How to extract the longest isoform from multi fasta file": after running the code, all I got was the trimmed headers of all the initial CDS in the input file, as shown below:

ENST00000000233 10.cds chromosome:GRCh38:7:127588411:127591700:1 gene:ENSG00000004059
ENST00000000412 8.cds chromosome:GRCh38:12:8940361:8949645:-1 gene:ENSG00000003056
ENST00000000442 11.cds chromosome:GRCh38:11:64305524:64316743:1 gene:ENSG00000173153
ENST00000001008 6.cds chromosome:GRCh38:12:2794970:2805423:1 gene:ENSG00000004478
ENST00000001146 7.cds chromosome:GRCh38:2:72129238:72147862:-1 gene:ENSG00000003137
ENST00000002125 9.cds chromosome:GRCh38:2:37231658:37249160:1 gene:ENSG00000003509
ENST00000002165 11.cds chromosome:GRCh38:6:143494812:143511720:-1 gene:ENSG00000001036
ENST00000002501 11.cds chromosome:GRCh38:16:90004871:90019456:-1 gene:ENSG00000003249
ENST00000002596 6.cds chromosome:GRCh38:4:11393150:11428894:-1 gene:ENSG00000002587
ENST00000002829 8.cds chromosome:GRCh38:3:50155324:50189075:1 gene:ENSG00000001617
ENST00000003084 11.cds chromosome:GRCh38:7:117480025:117668665:1 gene:ENSG00000001626
ENST00000003100 13.cds chromosome:GRCh38:7:92112153:92134477:-1 gene:ENSG00000001630
ENST00000003302 8.cds chromosome:GRCh38:11:113797874:113875570:-1

ADD REPLY • link 3.8 years ago by Enhancer • 0

score 0 · Answer 1 · 2021-03-10

0


3.7 years ago

Enhancer • 0

This link could help someone later. However, be sure to modify and adapt it to the nature of your file, especially the sequence header.
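For anyone adapting a script to their own headers, here is a minimal sketch of the general idea in plain Python (standard library only; Biopython's SeqIO would do the reading just as well). It is not the code from the linked thread. It assumes the untrimmed Ensembl headers carry a `gene:` field and a `gene_biotype:` field; check your own file with `grep '>' Homo_sapiens.GRCh38.cds.all.fa | head` and adjust the two regular expressions if they do not. The function names and the output name `longest_cds.fa` are made up for illustration, and keeping only `gene_biotype:protein_coding` is just one possible way to drop pseudogenes.

```python
import re

# Assumed header layout (verify against your file):
# >ENST... cds chromosome:... gene:ENSG... gene_biotype:protein_coding ...
GENE_RE = re.compile(r"\bgene:(\S+)")
BIOTYPE_RE = re.compile(r"\bgene_biotype:(\S+)")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_cds_per_gene(records, keep_biotype="protein_coding"):
    """Return {gene_id: (header, sequence)} with the longest CDS per gene.

    Records whose header lacks a gene ID, or whose gene_biotype is not
    keep_biotype, are skipped. On a length tie the first record wins.
    """
    best = {}
    for header, seq in records:
        gene = GENE_RE.search(header)
        biotype = BIOTYPE_RE.search(header)
        if gene is None or biotype is None:
            continue
        if biotype.group(1) != keep_biotype:
            continue
        gene_id = gene.group(1)
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (header, seq)
    return best


def write_fasta(best, handle, width=60):
    """Write the selected records to an open file handle, wrapped at width."""
    for header, seq in best.values():
        handle.write(f">{header}\n")
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + "\n")


def filter_file(in_path, out_path):
    """Hypothetical convenience wrapper: read in_path, write out_path."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        write_fasta(longest_cds_per_gene(read_fasta(fin)), fout)
```

Calling `filter_file("Homo_sapiens.GRCh38.cds.all.fa", "longest_cds.fa")` would then write one record per gene. The full original header is kept on output, so the sequences are written along with it rather than the header alone. Note that "longest" here means the longest CDS in nucleotides, which is not necessarily the most biologically representative transcript.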

ADD COMMENT • link 3.7 years ago by Enhancer • 0