Hello,
I would like to ask what item 1 of this explanation from the eXpress website means.
eXpress requires two input files:
- A multi-FASTA file containing the transcript sequences. If the transcriptome of your organism is not annotated, you can generate this file from your sequencing reads using a de novo transcriptome assembler such as Trinity, Oases, or Trans-ABySS. If your organism has a reference genome you can assemble transcripts directly from mapped reads using Cufflinks. If your genome is already annotated (in GTF/GFF), you can generate a multi-FASTA file using the UCSC Genome Browser by uploading your annotation as a track and downloading the sequences under the "Tables" tab.
- Read alignments to the multi-FASTA file in SAM or BAM format. These can either be stored in a file or streamed directly from an aligner. It is important that you allow as many multi-mappings as possible. You can also allow many mismatches during mapping since eXpress builds an error model to probabilistically assign the reads, although this will increase mapping time. If you are combining reads from several library preparations or from sequencing runs using different read lengths, please see the Manual for important details on how the alignments should be input.
I want to use the human genome hg38. I downloaded the genome sequence in FASTA format and the gene annotation file in GTF format, but using these two files as they are does not seem to work with eXpress. Item 1 also says I can upload the annotation to the UCSC Genome Browser and download the sequences, but I don't know how to do that. So, which file should I use as the multi-FASTA file for item 1? Also, can someone explain the difference between a transcriptome sequence and a genome sequence? Thank you in advance.
You will need the FASTA file for the transcripts, not the genome. You can get the cDNA (transcript) FASTA from Ensembl.
The main difference is in what each entry represents. A genome FASTA contains the genome as is, with one entry per chromosome or contig. In a transcriptome FASTA, each entry is an individual transcript encoded in the genome, i.e. the exons joined together with the introns removed.
For example, if we have the following in the genome:
The transcript sequence file might look like
whereas the genome sequence file will just show the whole sequence, including the introns.
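To make the exon/intron idea concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the `chr1` sequence, the exon coordinates, and the helper names `splice` and `to_fasta` are assumptions, not real annotation and not anything eXpress provides. It only shows how the same stretch of genome gives one long entry in a genome FASTA but one short entry per transcript in a transcript FASTA.

```python
# Toy data: the sequence and exon coordinates are invented for illustration.
genome = {"chr1": "AAACCCGGGTTTACGTACGT"}

# Hypothetical annotation: each transcript is a list of (start, end) exon
# intervals on chr1. Coordinates here are 0-based and end-exclusive
# (a real GTF file uses 1-based, inclusive coordinates).
transcripts = {
    "transcript_1": [(0, 3), (6, 9)],    # exon AAA + exon GGG, intron CCC skipped
    "transcript_2": [(0, 3), (12, 16)],  # exon AAA + exon ACGT
}


def splice(chrom_seq, exons):
    """Join the exon slices in order, leaving out everything between them."""
    return "".join(chrom_seq[start:end] for start, end in exons)


def to_fasta(records):
    """Format a {name: sequence} dict as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records.items())


transcript_seqs = {
    name: splice(genome["chr1"], exons) for name, exons in transcripts.items()
}

genome_fasta = to_fasta(genome)
transcript_fasta = to_fasta(transcript_seqs)

print("Genome FASTA (one entry per chromosome, introns included):")
print(genome_fasta)
print("Transcript FASTA (one entry per transcript, introns removed):")
print(transcript_fasta)
```

For a real genome you would not write this by hand; the point is only that eXpress expects the second kind of file.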
Thank you for your reply and explanation. It helped me a lot. By the way, I noticed that the gene annotation (GTF) in the UCSC Genome Browser can be downloaded as sequence in FASTA format. Is that also usable as eXpress input?
You just have to make sure that the FASTA file looks something like
>transcript_1
ACTGATCG
>transcript_2
ACTAG
instead of something like
>chr1
ACTGACTG.........
>chr2
AATCACA............
I usually use the Ensembl reference, so I am not sure what your FASTA looks like. But as long as you know that the sequences are transcript sequences, it should be fine.
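If you want a quick sanity check, something like the following Python sketch could be used to look at the headers. It is only a heuristic under my own assumptions: the pattern `CHROM_LIKE` (names such as `chr1`, `chrX`, `1`, `MT`) and the function names are hypothetical, and a FASTA with unusual naming could fool it.

```python
import io
import re

# Assumption: chromosome entries are named like chr1, chrX, chrM, 1, X or MT.
# This is a naming heuristic, not a rule enforced by any tool.
CHROM_LIKE = re.compile(r"^(chr)?([0-9]{1,2}|[XYM]|MT)$", re.IGNORECASE)


def fasta_headers(handle):
    """Yield the first word of every '>' header line in a FASTA file handle."""
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            yield fields[0] if fields else ""


def looks_like_genome(handle):
    """Return True if any header looks like a chromosome name."""
    return any(CHROM_LIKE.match(header) for header in fasta_headers(handle))


# Demo on the two small examples from this thread, using in-memory text.
transcript_example = io.StringIO(">transcript_1\nACTGATCG\n>transcript_2\nACTAG\n")
genome_example = io.StringIO(">chr1\nACTGACTG\n>chr2\nAATCACA\n")

print(looks_like_genome(transcript_example))  # False
print(looks_like_genome(genome_example))      # True
```

On a real file you would pass an open file handle instead of the `io.StringIO` objects. If it reports chromosome-like headers, the file is most likely a genome FASTA and not what eXpress needs.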