Hi
I am having RNA-Seq data from TCGA. I want to separate lncRNA and mRNA from RNA-Seq data for differential expression. Please, guide me how to separate lncRNA and mRNA from RNA-Seq data?
Thank you
Hi
I am having RNA-Seq data from TCGA. I want to separate lncRNA and mRNA from RNA-Seq data for differential expression. Please, guide me how to separate lncRNA and mRNA from RNA-Seq data?
Thank you
Just adding an even easier way to identify biotype automatically - this will create an annotation data-frame that can be used for any purpose downstream:
require('biomaRt')
mart <- useMart('ENSEMBL_MART_ENSEMBL')
mart <- useDataset('hsapiens_gene_ensembl', mart)
annot <- getBM(
mart = mart,
attributes = c(
'ensembl_gene_id',
'hgnc_symbol',
'entrezgene_id',
'gene_biotype'),
uniqueRows = TRUE)
ensembl_gene_id hgnc_symbol entrezgene_id gene_biotype
1 ENSG00000210049 MT-TF NA Mt_tRNA
2 ENSG00000211459 MT-RNR1 NA Mt_rRNA
3 ENSG00000210077 MT-TV NA Mt_tRNA
4 ENSG00000210082 MT-RNR2 NA Mt_rRNA
5 ENSG00000209082 MT-TL1 NA Mt_tRNA
6 ENSG00000198888 MT-ND1 4535 protein_coding
unique(annot$gene_biotype)
[1] "Mt_tRNA" "Mt_rRNA"
[3] "protein_coding" "transcribed_unprocessed_pseudogene"
[5] "unprocessed_pseudogene" "miRNA"
[7] "lncRNA" "processed_pseudogene"
[9] "snRNA" "transcribed_processed_pseudogene"
[11] "misc_RNA" "TEC"
[13] "transcribed_unitary_pseudogene" "snoRNA"
[15] "scaRNA" "rRNA_pseudogene"
[17] "unitary_pseudogene" "polymorphic_pseudogene"
[19] "pseudogene" "rRNA"
[21] "IG_V_pseudogene" "ribozyme"
[23] "sRNA" "TR_V_gene"
[25] "TR_V_pseudogene" "TR_D_gene"
[27] "TR_J_gene" "TR_C_gene"
[29] "TR_J_pseudogene" "IG_C_gene"
[31] "IG_C_pseudogene" "IG_J_gene"
[33] "IG_J_pseudogene" "IG_D_gene"
[35] "IG_V_gene" "IG_pseudogene"
[37] "translated_processed_pseudogene" "scRNA"
[39] "vault_RNA" "translated_unprocessed_pseudogene"
Kevin
Check the GTF (genome annotation file) that matches your analysis and filter out those genes that have a CDS (coding sequence) and/or are annotated as protein-coding. Having the gene names, you can then scan your count matrix before or after differential expression analysis for the genes of interest.
Edit: See Kevin's comment below for details.
Indeed, was just doing it:
Obtain the relevant GTF from GENCODE (you will want the Comprehensive gene annotation) and then identify lincRNA
and protein_coding
mRNA via the gene_type
/ transcript_type
tag in the GTF.
grep -e "lincRNA" /Kev/CollegeWork/ReferenceMaterial/GENCODE/GRCh38.p12/gencode.v28.annotation.gtf | head -5
chr1 HAVANA gene 29554 31109 . + . gene_id "ENSG00000243485.5"; gene_type "lincRNA"; gene_name "RP11-34P13.3"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000000959.2";
chr1 HAVANA transcript 29554 31097 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_name "RP11-34P13.3-001"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1 HAVANA exon 29554 30039 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_name "RP11-34P13.3-001"; exon_number 1; exon_id "ENSE00001947070.1"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1 HAVANA exon 30564 30667 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_name "RP11-34P13.3-001"; exon_number 2; exon_id "ENSE00001922571.1"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1 HAVANA exon 30976 31097 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_name "RP11-34P13.3-001"; exon_number 3; exon_id "ENSE00001827679.1"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
Take a look at all biotypes, here: https://www.gencodegenes.org/pages/biotypes.html
This is useful. After I grep different transcript biotypes from the GTF using the applicable search terms, how can I get these transcript biotypes from my own FASTA file? So that I get RNA reads separated based on transcript biotype. I am starting with my FASTA file with the RNA-Seq reads and then the GTF. Thank you.
If such a FASTA is not already available on GENCODE at the site(s) to which we link you, then you could filter the 'comprehensive' FASTA using AWK or Python. If you do not feel too comfortable going that route, you could just derive counts over all transcripts and then filter this data when you aim to use it downstream, e.g., in DESeq2
Thank you for your reply and for the code for an automatic route. Actually, I am not the OP. But your post C: How to separate lncRNA and mRNA from RNA-Seq data for differential expression? looked a lot like what I had been wanting to do except that it is what I want to do for my RNA-Seq reads, if possible. Although what I want to do is related to what the OP wanted to do, mine is how do I split up my FASTA file which contains the RNA-Seq reads into the different transcript biotypes? This is only dealing with sequences. Thank you.
Hi, as I suggested, you could do this with awk or Python. If you have no coding skills, then it is obviously a problem.
As mentioned, also, if your aim is to perform read count abundance over these transcripts, then you could perform this [the read count abundance step] over all transcripts, and then filter for, e.g., lncRNA, after this step. It would be easier than filtering the FASTA file, I think.
I see, but, to do that, you still need to align these reads.
In the GENCODE FASTA header, the biotype is already encoded [in the header]. So, it is still feasible for you to align your reads to all transcripts and then check to see for which ones most reads have aligned.
Look, here are the FASTA headers in the current GENCODE release (see biotype at end of header):
zcat gencode.v35.transcripts.fa.gz | grep -e "^>" | head -15
>ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|
>ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-201|DDX11L1|632|transcribed_unprocessed_pseudogene|
>ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-201|WASH7P|1351|unprocessed_pseudogene|
>ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA|
>ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-2HG-202|MIR1302-2HG|712|lncRNA|
>ENST00000469289.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002841.2|MIR1302-2HG-201|MIR1302-2HG|535|lncRNA|
>ENST00000607096.1|ENSG00000284332.1|-|-|MIR1302-2-201|MIR1302-2|138|miRNA|
>ENST00000417324.1|ENSG00000237613.2|OTTHUMG00000000960.1|OTTHUMT00000002842.1|FAM138A-201|FAM138A|1187|lncRNA|
>ENST00000461467.1|ENSG00000237613.2|OTTHUMG00000000960.1|OTTHUMT00000002843.1|FAM138A-202|FAM138A|590|lncRNA|
>ENST00000606857.1|ENSG00000268020.3|OTTHUMG00000185779.1|OTTHUMT00000471235.1|OR4G4P-201|OR4G4P|840|unprocessed_pseudogene|
>ENST00000642116.1|ENSG00000240361.2|OTTHUMG00000001095.3|OTTHUMT00000492680.1|OR4G11P-202|OR4G11P|1414|processed_transcript|
>ENST00000492842.2|ENSG00000240361.2|OTTHUMG00000001095.3|OTTHUMT00000003224.3|OR4G11P-201|OR4G11P|939|transcribed_unprocessed_pseudogene|
>ENST00000641515.2|ENSG00000186092.6|OTTHUMG00000001094.4|OTTHUMT00000003223.4|OR4F5-202|OR4F5|2618|protein_coding|
>ENST00000335137.4|ENSG00000186092.6|OTTHUMG00000001094.4|-|OR4F5-201|OR4F5|1054|protein_coding|
>ENST00000466430.5|ENSG00000238009.6|OTTHUMG00000001096.2|OTTHUMT00000003225.1|AL627309.1-201|AL627309.1|2748|lncRNA|
Please, could this alignment be done using STAR? STAR requires reference genome sequences (FASTA files) and annotations (GTF file) that it uses to generate genome index files. Then in the second mapping step, STAR uses the genome index files along with the user's RNA sequence reads (FASTA files) to map the user's RNA sequence reads to the genome.
When giving examples of acceptable genome sequence files for GENCODE, STAR lists "files marked with PRI (primary)", which makes me think it wants me to use ....primary_assembly.genome.fa.gz but from this discussion, I want to use gencode.v35.transcripts.fa.gz
The STAR manual contains the information above in sections 1.2 and 2.21 on pages 4, 5, 6. https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf
Thank you.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Please explain how you performed the analysis? Alignment -> quantification (which annotation) -> Differential expression (which tool) ? And what kind of file you have now (paste sample output)?
When you provide more information, you get more specific response.
I downloaded raw RNA-Seq data of cancer in fastq format. Then I performed alignment using Tophat. Now, I am planning to do quantification and differential expression using Salmon and Deseq2. But I have to identify lncRNA and mRNA before differential expression.
few remarks:
TopHat is no longer advised to do read alignment (using Salmon you even don't need to do actual alignments anymore).
I'm puzzled why you are eager to filter gene types before doing DEG analysis? why not use all of them and filter after doing DEG?
Can you add a bit more details on this? For example, what kind of data do you have? which organism are you working with?
From 'raw' RNAseq data it will be impossible to filter out reads that are derived from either one of them. If you align them to a genome or transcriptome you might be able to distinguish them.