Question

Using StringTie with DESeq2(StringTie geneID and corresponding Ensembl geneID)

1

Entering edit mode

7.4 years ago

Megan ▴ 50

Hi, I am doing transcript reconstruction and quantification using StringTie and DEA using DESeq2. I have generated the count matrices of genes and transcripts using the python script prepDE.py provided by https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual . However the gene id and transcript id are all StringTie id, how can I generate the matrices contain both StringTie id and corresponding Ensembl id? Thanks.

StringTie DESeq2 • 6.7k views

ADD COMMENT • link updated 7.4 years ago by stephensmithgalway ▴ 20 • written 7.4 years ago by Megan ▴ 50

1

Entering edit mode

could you please share the commands you used while running stringtie? and what does your output look like after using prepDE.py

ADD REPLY • link 7.4 years ago by badribio ▴ 290

score 2 · Answer 1 · 2017-07-06

I am new to both programming and bioinformatics and I struggled with this recently. I ended up writing a few different scripts to try and get something usable. They are hackjobs because of my inexperience.

First is to create a mapping file with three columns gene_id Gene_name Transcript_ID run this first code on each of the .gtf files created during the second pass of stringtie where you use the merged GTF from the first pass.

Usage is python script.py in.gtf out.csv

#!/usr/bin/env python2
import re
import sys
print("Loading in the GTF file")
gtf = open(sys.argv[1], "r")

gtf = gtf.read()

#this will vary depending on the command used to run stringtie Edit the script based on the top of your GTF files. 

StringtieCommand= 'check the top of your GTF files for the command you used. paste it in here'

gtf = gtf[(len(StringtieCommand)):]

gtf = gtf.split('\n')
print("Getting GeneID,GeneName and Transcript_ID information")
code=[]
for i in gtf:
    Name = re.findall(r'ref_gene_name "(.*?)";',i)
    ID= re.findall(r'gene_id "(.*?)";',i)
    TS= re.findall(r'transcript_id "(.*?)";',i)

    if len(ID) == 0:
        code.append("Error")
    else:
        code.append(str(ID[0]))

    if len(Name)==0:
        #only applies to my GTF
        if "X" in str(ID) or "L" in str(ID) or "l" in str(ID):
            code.append(ID[0])
        else:
            code.append("Novel")
    else:
        code.append(str(Name[0]))

    if len(TS) == 0:
        code.append("No transcript_id")
    else:
        code.append(str(TS[0]))

print("Matching GeneID to GeneName and Transcript_ID")
mapping= []
for i in range(0,(int(len(code))),3):
    mapping.append(code[i] + "\t" + code[i+1] + "\t" + str(code[i+2]) + ("\n"))

print("Removing duplicates")
mapping = list(set(mapping))

print("Writing mapping information to new file")
out= open(sys.argv[2],"w")
out.write("GeneID" + "\t" + "Gene_name" + "\t" + "Transcript_ID" + "\n")
for i in mapping:
    out.write(i)

print("Program finished")

run second script to concatonate the map files. Usage is python script.py FinalMapOut.csv map1.csv map2.csv.....n.csv

#!/usr/bin/env python2
import sys
import os
out= open(sys.argv[1], "a")

print("Reading in map.csv files." + "\n")
for i in range(2,(len(sys.argv))):
    print("file number " + str(i))
    Maps= open(sys.argv[i])
    Maps= Maps.readlines()
    print("Concatinating")
    for j in Maps:
        out.write(j)
print("New concatonated file created" +"\n")
out.close()
print("Loading in new concatonated file" + "\n")
infile = open(sys.argv[1],"r")

lines= infile.readlines()
infile.close()
mapping=[]

for t in lines:
    mapping.append(t)
print("Removing duplicates from concatonated file" + "\n")
mapping= list(set(mapping))

print("Writing unique GeneID:Gene_Name matches to Full Map file"+ "\n")
outfile= open(sys.argv[1],"w")
outfile.write("GeneID" + "\t" + "Gene_Name" + "\t" + "Transcript_id" + "\n")
for r in mapping:
    outfile.write(r)
outfile.close()

print("Deleting old map files")
for z in range(2,(len(sys.argv))):
    os.remove(sys.argv[z])

print("Program finished")

run DESeq2 first then run this script to add gene names to the rsults

library(dplyr)
#load in full map file
data =as.data.frame(read.csv("map.csv", sep= ","))
data = data %>% group_by(Gene_Name) %>% filter(!("Novel" %in% Gene_Name))
write.csv(data, file= "MapFileFiltered.csv", row.names= FALSE)
data= as.data.frame(read.csv("/home/stephen/MapFileFiltered.csv", stringsAsFactors= FALSE))

#load deseq2 results
res= as.data.frame(read.csv("res.csv" , stringsAsFactors= FALSE))
for(id in 1:nrow(res)){
  res$X[res$X %in% data$GeneID[id]] <- data$Gene_Name[id]
}
write.csv(res, file=("Stringtie_DESeq2_with_gene_names.csv"), row.names = FALSE)

You will now have your DESeq2 results with the gene names. Any that still have MSTRG as the gene name should be completely novel transcripts that were not in the gtf you used to run stringtie originally.

These scripts worked for me and they will probably need heavy editing for you. It all depends on the format of your original GTF that you used for the first stringtie run.

I just thought I'd post them to give an idea of what needs to be done between stringtie and DESeq2 and so maybe somebody else with actual coding experience can make something quicker and easier for us to use.

What would be perfect and what I tried to do was to edit the original gene count matrix from stringtie but when I did anything to it DESeq2 wouldnt accept it.