Entering edit mode
2.2 years ago
yoser4
▴
10
Hello all,
I found 14 snp sites that I'm interested in that fall in the CDS region (I have a vcf file containing snp site information), and I'm trying to use bcftools csq to analyze the effect of gene mutations on protein levels.
The code is as follows:
bcftools csq -f csq.fa -g csq.gff3 csq.vcf > csq.out
But here I have a problem. The reference gene (fa) and annotation file (gtf) I used are from ncbi, so I downloaded the gff file from ncbi, and bcftools csq reads the gff3 file of ensemble type by default, and an error is generated here:
Parsing GCF_016772045.1_ARS-UI_Ramb_v2.0_genomic.gff ...
ignored: chr1 RefSeq region 1 278617202 . + . ID=NC_056054.1:1..278617202;Dbxref=taxon:9940;Name=1;breed=Rambouillet;chromosome=1;collection-date=2015-03-01;dev-stage=adult;gbkey=Src;genome=chromosome;mol_type=genomic DNA;sex=female;strain=OAR_USU_Benz2616;tissue-type=Blood%2C lung%2C liver
ignored: chr1 Gnomon pseudogene 42249 46639 . - . ID=gene-LOC114110831;Dbxref=GeneID:114110831;Name=LOC114110831;gbkey=Gene;gene=LOC114110831;gene_biotype=pseudogene;pseudo=true
[csq.c:716 gff_id_parse] Could not parse the line, "Parent=transcript:" not present: chr1 Gnomon exon 42249 43660 . - . ID=id-LOC114110831;Parent=gene-LOC114110831;Dbxref=GeneID:114110831;gbkey=exon;gene=LOC114110831;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 16%25 coverage of the annotated genomic feature by RNAseq alignments;pseudo=true
I guess it is because the gff of ncbi and the gff of ensenbl are in different formats. Here I would like to ask:
- How should I modify my gff file from ncbi to make it into ensemble format
- Or, does anyone know how ensemble generates the gff3 file, I can make it myself
- Or, does anyone know if there are other methods to analyze the effect of gene mutation on protein level
Could you display the line containing
ID=id-LOC114110831;Parent=gene-LOC114110831
and the surrounding lines?Are you talking about the content of the file? it looks like this
From the manual:
So yes the easiest would be to shift to use the GFF from ensembl => http://www.ensembl.org/info/data/ftp/index.html/
Thank you, Juke. But the problem I encountered is that ensembl does not currently update the reference genome and index of sheep ramb2.0. Currently I can only download GFF files to v2.0 from ncbi. Because my project requires me to use version 2.0. This leaves me with no choice.
I think my only solution at the moment is to convert the format. I'm new to bioinformatics and I don't know what I should do now
You might find some help using AGAT. e.g. you could use
agat_sp_separate_by_record_type.pl
and then analyse only protein coding genes.sorry for the late reply. I tried the method you mentioned. But there are a series of errors. I think it's still because the GFF file format of the ramb v2.0 version I'm using is so special that AGAT doesn't recognize it.
---------------------------- Check1: feature types ----------------------------- ----------------------------------- ontology ----------------------------------- INFO - Feature types not expected by the GFF3 specification:
------------------------------ Check2: duplicates ------------------------------ None found ------------------------------ done in 0 seconds -------------------------------
-------------------------- Check3: sequential bucket --------------------------- None found ------------------------------ done in 0 seconds -------------------------------
--------------------------- Check4: l2 linked to l3 ---------------------------- 3627 cases fixed where L3 features have parent feature(s) missing ------------------------------ done in 1 seconds -------------------------------
--------------------------- Check5: l1 linked to l2 ---------------------------- 119 cases fixed where L2 features have parent features missing ------------------------------ done in 0 seconds -------------------------------
--------------------------- Check6: remove orphan l1 --------------------------- We remove only those not supposed to be orphan 5914 cases removed where L1 features do not have children (while they are suposed to have children). ------------------------------ done in 1 seconds -------------------------------
------------------------- Check7: all level3 locations ------------------------- ------------------------------ done in 23 seconds ------------------------------
------------------------------ Check8: check cds ------------------------------- No problem found ------------------------------ done in 0 seconds -------------------------------
----------------------------- Check9: check exons ------------------------------ 13 exons created that were missing No exons locations modified No supernumerary exons removed No level2 locations modified ------------------------------ done in 36 seconds ------------------------------
----------------------------- Check10: check utrs ------------------------------ 166991 UTRs created that were missing No UTRs locations modified No supernumerary UTRs removed ------------------------------ done in 30 seconds ------------------------------
------------------------ Check11: all level2 locations ------------------------- No problem found ------------------------------ done in 22 seconds ------------------------------
------------------------ Check12: all level1 locations ------------------------- We fixed 49 wrong level1 location cases ------------------------------ done in 1 seconds -------------------------------
---------------------- Check13: remove identical isoforms ---------------------- None found ------------------------------ done in 85 seconds ------------------------------
=> OmniscientI total time: 519 seconds GFF3 file parsed Job done in 609 seconds
There are two main types of errors (similar to them appear so many times that I can't list them all): WARNING level1: This feature level1 is not a duplicate but has an ID already used (original id: 5772b24a-4ae1-4a4e-9b96-a403ef75d22a). AGAT does not deal with that currently. @ the feature is: NC_056054.1 RefSeq cDNA_match 147143 148998 . - . Gap "M675 I1 M152 D1 M48 I2 M694 D1 M285" ; ID "nbis-cdna_match-1" ; Target "XM_042239135.1 1456 3312 +" ; for_remapping 2 ; gap_count 4 ; num_ident 3309 ; num_mismatch 0 ; pct_coverage "99.9094" ; pct_coverage_hiqual "99.9094" ; pct_identity_gap "99.8491" ; pct_identity_ungap 100 ; rank 1 Indeed we changed the ID for this feature l1 to be uniq but we do not change the parent attribute for the subfeatures because we do not know to which L1 they are really linked to. As it is now we will probably end up with chimeric records.
3 warning messages: WARNING level1: This feature level1 is not a duplicate but has an ID already used (original id: f4c81044-5a57-42df-ae1f-5deecdf5a598). AGAT does not deal with that currently.
Sounds fine as long as it touch only features you do not care e.g cDNA_match.
c_gene_segment.gff
lnc_rna.gff
mrna.gff
region.gff
rrna.gff
snorna.gff
transcript.gff
v_gene_segment.gff guide_rna.gff
mirna.gff
primary_transcript.gff rna.gff
sequence_feature.gff
snrna.gff
trna.gff
Thanks. Currently I get the following output file. But I didn't find an explanation on what these output files mean on the AGAT website. So that I don't know what some output files represent, what output files can I use? Where can I see an explanation about these output files? I think I'm more interested in CDS
Those are the different type of features described in your original file. Only mrna.gff and maybe transcript.gff should contain CDS features.
thank you , Juke