also I have tried to use cufflinks's gffread like this: gffread file.gff3 -g genome.fa -y proteins.fa but the proteins sequence are nonsense, although the lengths are consistent.. (Am I doing something wrong?)
edit: as well as bedtools getfasta, but I get each element's sequence separately (see my comments below)
Isn't there something more simple and reliable than the above custom scripts?
I did, see my answer to WouterDeCoster below. I cannot get per-transcript CDS sequences, I just get each element (gene, transcript, start/end codons, and each CDS fragment) separately. Is this the expected result?
Are you happy with R/Bioconductor? There is a solution here which could solve your problem. Look for item #6 of 'Alignments (and genomic annotations)'. I will transcribe the relevant (but incomplete) bits of code here for convenience, but all credits go to Martin Morgan.
## cds for all transcripts
cdsByTx <- cdsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "tx")
## reference genome
library(BSgenome.Hsapiens.UCSC.hg19)
dna <- extractTranscriptSeqs(BSgenome.Hsapiens.UCSC.hg19, cdsByTx)
protein <- translate(dna)
Assuming the gff3 is well formatted, that is, contains the CDS information, yes. Of course, when it comes to bioinformatics assuming a file is well formatted and that things will go as expected it a big ask so ymmv.
Thanks, but actually, I have also tried getfasta. And I can only get the sequences of each element separately.
Would the -split option do what I want? Apparently it will only work if the input is BED. How do I convert my GFF3 to BED? Which other tool do I have to use besides getfasta?
edit: The converter is bedops's gff2bed, I suppose. It didn't help, though (same result).
I can do that to have all the CDS and just them (that's part of the "shell tricks" in my answer below), but then I would have to piece them together. Definitely possible but not simple either.
As I understand the question of n.caillou, he wants to extract coding sequences, not transcript sequence (i.e. he needs to paste together exons for multi-exon genes)...
The xtractore program from the AEGeAn Toolkit should work for this. Set --type=CDS on the command line. For CDSs and other discontiguous features, xtractore pastes the fragments together correctly, taking into account strand, etc.
I used xtractore to extract CDS from a gff file. It does extract CDS taking into account strand and so on, but it does not paste CDS fragments belonging to the same gene together. Am I missing sth?
Also, the -w option does not seem to work properly:
Setting -w to zero gives me an empty output fasta file. In the documentation it is mentioned:
"-w|--width: INT width of each line of sequence in the Fasta
output; default is 80; set to 0 for no
formatting"
Anyways, thanx for your program, it would take 2-3 days to write a CDS extractor myself :P
I'm not sure I understand your constraints. If you are uncomfortable evaluating custom Perl/Python scripts and if installing new software is impractical for you, you may be in the wrong field!
I'm not trying to be witty or smug, I'm being completely frank. The bioinformatics state of the art unfortunately provides very few clean solutions; almost always you must get your hands a bit dirty to piece together a working solution.
You can also get complete sequences of (converted) proteins using prot instead of using CDS or cDNAs using cDNA or genes using gene.
When you get the sequence for each CDS separately for instance with bedtools getfasta, first of all you get each piece of a CDS separately rather than getting the CDS sequences of a gene combined, and second for genes with isoforms, you can get the same region more than once in overlapping sequences which is not good if you are trying to analyze the coding region of a genome. As far as I managed to understand, this script gets the longest isoform version possible for an exon, and merges them for each gene.
The script is designed for the EVidenceModeler pipeline, so it doesn't have to work with the gffs coming from all different annotation pipelines, but so far it was working well for me, see this post if you also run into the error that I just faced with one gff from RefSeq:
https://github.com/EVidenceModeler/EVidenceModeler/issues/45
Yes :D I found this thread via google looking for an alternative when I ran into an error with the EVM script I posted, and noticed no one else posted about it. Very good and useful script, kudos to Brian indeed. I will also try AGAT next time too, thanks!
Did you try bedtools getfasta ?
I did, see my answer to WouterDeCoster below. I cannot get per-transcript CDS sequences, I just get each element (gene, transcript, start/end codons, and each CDS fragment) separately. Is this the expected result?
I see what you mean. This might help you A: Gencode Gtf To Bed12 Or Psl