8 months ago
O.rka
https://genome.jgi.doe.gov/portal/pages/dynamicOrganismDownload.jsf?organism=Phypa1_1
I have 3 files:
- Assembly: Physcomitrella_patens.1_1.allmasked.gz
- GFF: Phypa1_1.FilteredModels.gff.gz
- Proteins: proteins.Phypa1_1.FilteredModels.fasta.gz
There are no CDS sequences, so I'm going to try to recreate both the protein file and the CDS file. However, when I use AGAT to extract the sequences, I get nonsense:
seqkit seq Physcomitrella_patens.1_1.allmasked.gz > Physcomitrella_patens.1_1.allmasked.fasta
agat_sp_extract_sequences.pl -f Physcomitrella_patens.1_1.allmasked.fasta -g Phypa1_1.FilteredModels.gff.gz -c agat_config.yaml -p -o test_proteins.fasta
My proteins are full of stop codons:
>agat-rna-1 gene=agat-gene-15249 seq_id=scaffold_1 type=cds
*SSKLQKHRAQVEHSVAHVHA*M*RWDFFGRGLGYGEEARRRNSTSFISHCRRDALQNFY
HSFTFEMRTKSVPTGDHTPEGT*DSYPLRLMGAKKAW*SVSKFTVVSHLR*QMQTLNAGT
QSRVIVGRRNNSVTELLHGMEIN*WFSTQSVRSLVAAKLQISSGGRKIRLR*ASGLILFP
LIG*Q*T*EIRLLFLPVINPLQALRPSPKSTLGKDMIHCEGQ*LLQADVMI*KRNYISMM
QVQ*LEYHCGSFRAIELLLATHLIP*WQPVQLIGA
The actual sequence for this record should be the following:
>jgi|Phypa1_1|63627|fgenesh1_pg.scaffold_1000001
MKFKAAKAQSPSGTFCGSCACMNVKMGFFWTGVGLWGRSKEEKQHKLHKSLSKRCIAEFL
PQFHIRDADEVRSNRRPYTGGDVRLLPTEVNGGEEGLVICLQVHRSLPSSVADADSECWD
AIPRYRWKKEQLSHRVVARDGNQLMVFDAICEVTGCCKAANIFRRSEDSVKVSFRLDFIS
AYRVTMNLRNSIVVPPCDQSTASITTLSEIHSRQRHDPLRGSVIAASRCHDLKKELHFHD
ASPITRVPLWQLSGHRVASCNPSHSLMAASSVNWS*
The genetic code is trans_table 1
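One way to narrow down whether output like this comes from a wrong strand or reading frame is to translate the extracted CDS nucleotide sequence in all six frames and count internal stops. The following is a minimal sketch under the standard code (trans_table 1); the function names and the toy sequence are illustrative assumptions and are not taken from AGAT or from the Phypa1_1 data.

```python
from itertools import product

# NCBI translation table 1 (standard code), codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (upper-cased)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame(seq):
    """Map (strand, offset) to the translation of that reading frame."""
    frames = {}
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[(strand, offset)] = translate(s[offset:])
    return frames


def internal_stops(protein):
    """Count stop codons, ignoring any trailing stop(s)."""
    return protein.rstrip("*").count("*")


if __name__ == "__main__":
    # Toy sequence for illustration only (hypothetical, not from the assembly).
    cds = "ATGAAATTTAAAGCTTAA"
    for (strand, offset), prot in six_frame(reverse_complement(cds)).items():
        print(strand, offset, internal_stops(prot), prot)
```

If exactly one frame is free of internal stops and matches the JGI protein, the problem is more likely in how strand or phase is being read from the GFF than in the assembly itself.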
Are you sure the FASTA file is in sync with the annotation file? I.e., was the GFF/GTF annotation produced using this FASTA?
It's difficult to be 100% sure, but there's only one genome assembly. I've updated the question with the screenshot.
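A rough consistency check between the assembly and the annotation can be sketched as follows: every sequence ID in the GFF should exist in the FASTA, and every feature should fall within its scaffold's length. This is only a sanity check under the assumption of a tab-separated GFF with at least 8 columns; the function names are hypothetical, and passing it does not prove the two files belong together.

```python
def fasta_lengths(lines):
    """Return {sequence_id: length} from an iterable of FASTA lines."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            current = line[1:].split()[0]
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def check_gff(lines, lengths):
    """Return a list of human-readable problems found in the GFF lines."""
    problems = []
    for n, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            problems.append(f"line {n}: fewer than 8 tab-separated columns")
            continue
        seqid, start, end = fields[0], int(fields[3]), int(fields[4])
        if seqid not in lengths:
            problems.append(f"line {n}: {seqid} not found in FASTA")
        elif start < 1 or start > end or end > lengths[seqid]:
            problems.append(
                f"line {n}: {start}-{end} outside {seqid} (length {lengths[seqid]})"
            )
    return problems
```

Both functions take any iterable of text lines, so the compressed files can be read directly with `gzip.open(path, "rt")` from the Python standard library.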