Question

Getting fasta sequence of significant transcripts/genes after cuffdiff

0

Entering edit mode

8.7 years ago

dasdevashishdas ▴ 120

Dear all,

I want know if there is a direct method or command which can get all the fasta sequences only for the significant genes derived by cuffdiff.

RNA-Seq cuffdiff cufflink fasts • 2.6k views

ADD COMMENT • link updated 8.7 years ago by i.sudbery 21k • written 8.7 years ago by dasdevashishdas ▴ 120

score 3 · Answer 1 · 2016-08-26

Do you want the fasta of the gene or the fasta of the transcript? If you want the fasta of the gene, how are you defining that? Is it the merge of all the transcripts? The intersection? The sequence of all transcripts that are part of the gene, as separate records?

I'll talk you through transcripts for now because its easier.

The first thing that you are going to need is a list of the gene ids for the differentially expressed genes. You can get these from the transcript_exp.diff file output by cuffdiff using awk. Something along the lines of

awk '($12<0.05) & ($6=="OK") {print $1}' transcript_exp.diff > significant_transcript_ids.txt

Now you want to get the fasta for these. If the transcript IDs are standard IDs, like ENSEMBL IDs, then Biomart is probably your best bet. If you want to do it programatically then I suggest biomaRt from bioconductor. Alternatively, you could use samtools faidx as suggested above on the ENSEMBL fasta files.

If your transcripts IDs aren't standard transcript IDs. For example if they are transcripts you've assembled yourself using cufflinks, then things might be a bit harder. First you need a file which contains the transcripts you want. A rough and ready why of doing this would be with grep:

grep -f signifcant_transcript_ids.txt geneset.gtf > significant_transcripts.gtf

A less error prone way might be with gtf2gtf from cgat

cgat gtf2gtf -I geneset.gtf --method=filter --filter-method=transcript --map-tsv-file=significant_transcript_ids.txt > significant_transcripts.gtf

You then need to fetch the fasta sequence for these intervals. bedtools getfasta can do that, but if you pass it a gtf file, I think it will give you a seperate fasta record for each exon. You could convert to bed12 and then use bedtools getfasta. Alternatively use gff2fasta from cgat. You'll need to index your genome first with samtools faidx or cgat index_genome

cgat gff2fasta -I signficant_transcripts.gtf --genome-file=genome.fa --is-gtf > significant_transcripts.fasta.

score 1 · Answer 2 · 2016-08-26

1

Entering edit mode

8.7 years ago

EVR ▴ 610

HI,

try to make use of samtools faidx which can get you the fasta sequences for your signifcant genes ot bedtools getfasta

ADD COMMENT • link 8.7 years ago by EVR ▴ 610