Dear all,
I want know if there is a direct method or command which can get all the fasta sequences only for the significant genes derived by cuffdiff.
Dear all,
I want know if there is a direct method or command which can get all the fasta sequences only for the significant genes derived by cuffdiff.
Do you want the fasta of the gene or the fasta of the transcript? If you want the fasta of the gene, how are you defining that? Is it the merge of all the transcripts? The intersection? The sequence of all transcripts that are part of the gene, as separate records?
I'll talk you through transcripts for now because its easier.
The first thing that you are going to need is a list of the gene ids for the differentially expressed genes. You can get these from the transcript_exp.diff
file output by cuffdiff using awk. Something along the lines of
awk '($12<0.05) & ($6=="OK") {print $1}' transcript_exp.diff > significant_transcript_ids.txt
Now you want to get the fasta for these. If the transcript IDs are standard IDs, like ENSEMBL IDs, then Biomart is probably your best bet. If you want to do it programatically then I suggest biomaRt from bioconductor. Alternatively, you could use samtools faidx
as suggested above on the ENSEMBL fasta files.
If your transcripts IDs aren't standard transcript IDs. For example if they are transcripts you've assembled yourself using cufflinks, then things might be a bit harder. First you need a file which contains the transcripts you want. A rough and ready why of doing this would be with grep:
grep -f signifcant_transcript_ids.txt geneset.gtf > significant_transcripts.gtf
A less error prone way might be with gtf2gtf from cgat
cgat gtf2gtf -I geneset.gtf --method=filter --filter-method=transcript --map-tsv-file=significant_transcript_ids.txt > significant_transcripts.gtf
You then need to fetch the fasta sequence for these intervals. bedtools getfasta
can do that, but if you pass it a gtf file, I think it will give you a seperate fasta record for each exon. You could convert to bed12 and then use bedtools getfasta
. Alternatively use gff2fasta
from cgat
. You'll need to index your genome first with samtools faidx
or cgat index_genome
cgat gff2fasta -I signficant_transcripts.gtf --genome-file=genome.fa --is-gtf > significant_transcripts.fasta.
HI,
try to make use of samtools faidx
which can get you the fasta sequences for your signifcant genes ot bedtools getfasta
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.