Question

Question about multi FASTA input for eXpress

9.7 years ago

bharata1803 ▴ 560

Hello,

What does item no. 1 of this explanation from the eXpress website mean?

eXpress requires two input files:

1. A multi-FASTA file containing the transcript sequences. If the transcriptome of your organism is not annotated, you can generate this file from your sequencing reads using a de novo transcriptome assembler such as Trinity, Oases, or Trans-ABySS. If your organism has a reference genome you can assemble transcripts directly from mapped reads using Cufflinks. If your genome is already annotated (in GTF/GFF), you can generate a multi-FASTA file using the UCSC Genome Browser by uploading your annotation as a track and downloading the sequences under the "Tables" tab.
2. Read alignments to the multi-FASTA file in SAM or BAM format. These can either be stored in a file or streamed directly from an aligner. It is important that you allow as many multi-mappings as possible. You can also allow many mismatches during mapping since eXpress builds an error model to probabilistically assign the reads, although this will increase mapping time. If you are combining reads from several library preparations or from sequencing runs using different read lengths, please see the Manual for important details on how the alignments should be input.

I want to use the human genome hg38. I downloaded the genome sequence in FASTA format and the gene annotation file in GTF format, but these two files do not seem to work with eXpress and I don't know what to do. Item no. 1 also says I can upload the annotation to the UCSC Genome Browser and download the sequences, but I don't know how to do that. So which file should I use as the multi-FASTA file for no. 1? Besides that, can someone explain the difference between a transcriptome sequence and a genome sequence? Thank you in advance.

RNA-Seq eXpress • 3.2k views

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by bharata1803 ▴ 560

You will need the FASTA file for the transcripts, not the genome. You can get the cDNA (transcript) FASTA from Ensembl.

The main difference between a transcriptome file and a genome file is that the genome file contains the genomic sequence as is, with one entry per chromosome or contig, whereas in a transcriptome file each entry is an individual transcript that can be produced from the genome.

For example, if we have the following in the genome:

Gene A                | Exon 1 |--------| Exon2 |--------| Exon3 |

The transcript sequence file might represent it like this:

Transcript 1 (of gene A)  |Exon1||Exon2|
Transcript 2 (of gene A)  |Exon1||Exon3|
Transcript 3 (of gene A)  |Exon2||Exon3|
Transcript 4 (of gene A)  |Exon1||Exon2||Exon3|

The genome sequence file, by contrast, just contains the whole sequence, including the introns.
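To make the diagram concrete, here is a minimal Python sketch of how transcript entries relate to a genome entry. Everything in it is made up for illustration: the `chrToy` sequence, the exon coordinates, and the helper names `splice` and `to_multi_fasta` are assumptions, not anything eXpress or Ensembl provides. It also ignores strand, so a real tool would additionally reverse-complement transcripts on the minus strand.

```python
# Toy illustration only: the sequence and exon coordinates are invented.
genome = {"chrToy": "AAAACCCCGGGGTTTTACGTACGT"}

# Exons of a hypothetical gene A as 0-based, half-open (start, end) intervals.
# The stretches between them (CCCC and TTTT) play the role of introns.
exons = {
    "Exon1": ("chrToy", 0, 4),
    "Exon2": ("chrToy", 8, 12),
    "Exon3": ("chrToy", 16, 24),
}

# Each transcript is an ordered list of exons, as in the diagram.
transcripts = {
    "transcript_1": ["Exon1", "Exon2"],
    "transcript_2": ["Exon1", "Exon3"],
    "transcript_3": ["Exon2", "Exon3"],
    "transcript_4": ["Exon1", "Exon2", "Exon3"],
}


def splice(exon_names):
    """Concatenate exon sequences in order, leaving out the introns."""
    parts = []
    for name in exon_names:
        chrom, start, end = exons[name]
        parts.append(genome[chrom][start:end])
    return "".join(parts)


def to_multi_fasta(transcript_table):
    """Return one FASTA record per transcript as a single string."""
    lines = []
    for name, exon_names in transcript_table.items():
        lines.append(f">{name}")
        lines.append(splice(exon_names))
    return "\n".join(lines) + "\n"


print(to_multi_fasta(transcripts))
```

The output has four records, one per transcript, and none of them contains the intron bases. That is the kind of file eXpress expects, as opposed to a file with a single `>chrToy` record.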

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Sam ★ 4.8k

Thank you for your reply and explanation. It helps me a lot. By the way, I noticed that the gene annotation (GTF) in the UCSC Genome Browser can be downloaded as sequence in FASTA format. Is that also usable as eXpress input?

ADD REPLY • link 9.7 years ago by bharata1803 ▴ 560

You just have to make sure that the FASTA file looks something like

>transcript_1
ACTGATCG
>transcript_2
ACTAG

instead of something like

>chr1
ACTGACTG.........
>chr2
AATCACA............

I usually use the Ensembl reference, so I am not sure what your FASTA looks like. But as long as you know that the sequences are transcript sequences, it should be fine.
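If you want a quick sanity check before running eXpress, something like the following Python sketch lists the record names in a FASTA file and flags any that look like chromosomes. The helper names `fasta_headers` and `looks_like_genome` are hypothetical, and the check is only a rough heuristic based on common naming (`chr1`, `chrX`, or bare `1` to `22`, `X`, `Y`, `MT`); it cannot prove that a file really contains transcripts.

```python
import io


def fasta_headers(handle):
    """Yield the record name (first word after '>') of each FASTA entry."""
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            yield fields[0] if fields else ""


def looks_like_genome(names):
    """Heuristic only: return the names that resemble chromosome names."""
    bare = {str(i) for i in range(1, 23)} | {"X", "Y", "M", "MT"}
    return [n for n in names if n.lower().startswith("chr") or n in bare]


# In-memory example; for a real file use open("your_file.fa") instead.
example = io.StringIO(">transcript_1\nACTGATCG\n>transcript_2\nACTAG\n")
names = list(fasta_headers(example))
suspicious = looks_like_genome(names)
print(f"{len(names)} records, {len(suspicious)} look like chromosomes")
```

A human transcript FASTA should have a very large number of records, while a genome FASTA has only a few dozen main entries, so the record count alone is often a good hint.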

ADD REPLY • link 9.7 years ago by Sam ★ 4.8k