Using StringTie with DESeq2(StringTie geneID and corresponding Ensembl geneID)
1
1
Entering edit mode
7.4 years ago
Megan ▴ 50

Hi, I am doing transcript reconstruction and quantification using StringTie and DEA using DESeq2. I have generated the count matrices of genes and transcripts using the python script prepDE.py provided by https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual . However the gene id and transcript id are all StringTie id, how can I generate the matrices contain both StringTie id and corresponding Ensembl id? Thanks.

StringTie DESeq2 • 6.7k views
ADD COMMENT
1
Entering edit mode

could you please share the commands you used while running stringtie? and what does your output look like after using prepDE.py

ADD REPLY
2
Entering edit mode
7.4 years ago

I am new to both programming and bioinformatics and I struggled with this recently. I ended up writing a few different scripts to try and get something usable. They are hackjobs because of my inexperience.

First is to create a mapping file with three columns gene_id Gene_name Transcript_ID run this first code on each of the .gtf files created during the second pass of stringtie where you use the merged GTF from the first pass.

Usage is python script.py in.gtf out.csv

#!/usr/bin/env python2
import re
import sys
print("Loading in the GTF file")
gtf = open(sys.argv[1], "r")

gtf = gtf.read()

#this will vary depending on the command used to run stringtie Edit the script based on the top of your GTF files. 

StringtieCommand= 'check the top of your GTF files for the command you used. paste it in here'

gtf = gtf[(len(StringtieCommand)):]

gtf = gtf.split('\n')
print("Getting GeneID,GeneName and Transcript_ID information")
code=[]
for i in gtf:
    Name = re.findall(r'ref_gene_name "(.*?)";',i)
    ID= re.findall(r'gene_id "(.*?)";',i)
    TS= re.findall(r'transcript_id "(.*?)";',i)

    if len(ID) == 0:
        code.append("Error")
    else:
        code.append(str(ID[0]))

    if len(Name)==0:
        #only applies to my GTF
        if "X" in str(ID) or "L" in str(ID) or "l" in str(ID):
            code.append(ID[0])
        else:
            code.append("Novel")
    else:
        code.append(str(Name[0]))

    if len(TS) == 0:
        code.append("No transcript_id")
    else:
        code.append(str(TS[0]))

print("Matching GeneID to GeneName and Transcript_ID")
mapping= []
for i in range(0,(int(len(code))),3):
    mapping.append(code[i] + "\t" + code[i+1] + "\t" + str(code[i+2]) + ("\n"))

print("Removing duplicates")
mapping = list(set(mapping))

print("Writing mapping information to new file")
out= open(sys.argv[2],"w")
out.write("GeneID" + "\t" + "Gene_name" + "\t" + "Transcript_ID" + "\n")
for i in mapping:
    out.write(i)

print("Program finished")

run second script to concatonate the map files. Usage is python script.py FinalMapOut.csv map1.csv map2.csv.....n.csv

#!/usr/bin/env python2
import sys
import os
out= open(sys.argv[1], "a")

print("Reading in map.csv files." + "\n")
for i in range(2,(len(sys.argv))):
    print("file number " + str(i))
    Maps= open(sys.argv[i])
    Maps= Maps.readlines()
    print("Concatinating")
    for j in Maps:
        out.write(j)
print("New concatonated file created" +"\n")
out.close()
print("Loading in new concatonated file" + "\n")
infile = open(sys.argv[1],"r")

lines= infile.readlines()
infile.close()
mapping=[]

for t in lines:
    mapping.append(t)
print("Removing duplicates from concatonated file" + "\n")
mapping= list(set(mapping))

print("Writing unique GeneID:Gene_Name matches to Full Map file"+ "\n")
outfile= open(sys.argv[1],"w")
outfile.write("GeneID" + "\t" + "Gene_Name" + "\t" + "Transcript_id" + "\n")
for r in mapping:
    outfile.write(r)
outfile.close()

print("Deleting old map files")
for z in range(2,(len(sys.argv))):
    os.remove(sys.argv[z])

print("Program finished")

run DESeq2 first then run this script to add gene names to the rsults

library(dplyr)
#load in full map file
data =as.data.frame(read.csv("map.csv", sep= ","))
data = data %>% group_by(Gene_Name) %>% filter(!("Novel" %in% Gene_Name))
write.csv(data, file= "MapFileFiltered.csv", row.names= FALSE)
data= as.data.frame(read.csv("/home/stephen/MapFileFiltered.csv", stringsAsFactors= FALSE))

#load deseq2 results
res= as.data.frame(read.csv("res.csv" , stringsAsFactors= FALSE))
for(id in 1:nrow(res)){
  res$X[res$X %in% data$GeneID[id]] <- data$Gene_Name[id]
}
write.csv(res, file=("Stringtie_DESeq2_with_gene_names.csv"), row.names = FALSE)

You will now have your DESeq2 results with the gene names. Any that still have MSTRG as the gene name should be completely novel transcripts that were not in the gtf you used to run stringtie originally.

These scripts worked for me and they will probably need heavy editing for you. It all depends on the format of your original GTF that you used for the first stringtie run.

I just thought I'd post them to give an idea of what needs to be done between stringtie and DESeq2 and so maybe somebody else with actual coding experience can make something quicker and easier for us to use.

What would be perfect and what I tried to do was to edit the original gene count matrix from stringtie but when I did anything to it DESeq2 wouldnt accept it.

ADD COMMENT
0
Entering edit mode

Hi, Stephen. Why don't you use the merged GTF which includes gene_ID, transctript_id, gene_name and ref_gene_id from the first pass of Stringtie directly to extract these information?

ADD REPLY

Login before adding your answer.

Traffic: 1793 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6