Entering edit mode
8.1 years ago
jmsyl.hong
•
0
Hey guys,
The core facility for RNA-Seq used cuffmerge to generated a merged.gtf file (i.e. a merged assembly file).
I wanted to know what the difference was using the merged.gtf file to get raw counts using subread from .bams vs. using the most recent ensembl .gtf for my species.
When I run the subread with the merged.gtf file, I only get cufflinks IDs for the gene names with no gene symbols. Does anyone know if there is gene symbol or shortname information in the merged.gtf?
Thanks in advance!
Can you show the first few lines of your gtf file, with unix head command?
I'm on a mac (am also new with command line) is there something i can run on terminal or perhaps i can try to open this .gtf file with a text editor?
Mac should have a bash terminal
Go to the folder with your file and type:
from this it actually seems like there is a ton of information in the merged.gtf, including the gene_names... I'm using RNASeqGUI (because from command-line dumb) to do the raw counts it probably only gets the first column (e.g. gene_id which is the XLOC prefix) instead of the gene name.
is there a way for me to convert my existing raw counts which is mapped to XLOC IDs to the gene_names through this file, perhaps I can open it up in Excel and do some matching?
No processing in excel! Is there a 1-on-1 relationship for gene_id "XLOC" and gene_name? There should be a better way. Either you change the raw counts, e.g. by a python script I could write or you change the gtf and repeat the counting with featurecounts and set the
-g
flag togene_name
.I would prefer if I could change the raw counts as I already did them for the entire dataset. That python script would be amazing. I have a tab delimited file that has XLOC in column 1, with each sample as headers thereafter, so essentially, it would have to match the XLOC from the first column of that file to the XLOC in the .gtf then make a new column on the tab delimited file with the gene name.
There is a 1:1 relationship between the XLOC and the gene name.
Do you have a bit of experience in programming? It's obviously more interesting for you if you code a bit yourself, and I don't mind guiding you in that for this script.
If not, could you show me a bit of data, first 10 lines for example, so I know what I'm working on?
Hey, it wouldn't let me post more than 5 posts per 6 hours... I would be very interested in learning how to get through the timecourse pipeline by deseq2 starting with the generation of raw counts. Is there a way we could communicate? I would be happy to sent you tons of coffees for your troubles!
I've never done a time course experiment, so I won't be of much use for that. I was mainly talking about converting the gene identifiers.
Sure, that would definitely help too.
I have two files:
merged.gtf
have a gene_name (gene names) column that is matched 1:1 to the gene_id column (XLOC)treatment1.txt
have a gene_id column (XLOC) column with subsequent columns that contain the sample ids with their respective reads for each of the XLOCs.I would like to match the XLOCs to their gene names from the
merged.gtf
file and have a column in thetreatment1.txt
that has the gene names.Thanks!!!
Can you show me a part of
treatment1.txt
? Apart from knowing that structure the script seems ready.I made some assumptions about the countfile but will update my code if those were not correct, see code below:
Gene ID 3.4 3.7 3.9 3.3 3.1 3.1S 3.6S 3.2S 1.4 1.8 1.1 1.1 1.11 1.5S 1.1S 1.4S 2.11 2.1 2.15 2.1 2.16 2.7S 2.5S 2.6S 8.16 8.5 8.12 8.13 8.15 8.3S 8.4S 8.5S WT.1 WT.2 WT3 XLOC_000001 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 XLOC_000002 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 XLOC_000003 1 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 XLOC_000004 5 3 1 1 6 5 1 1 7 3 8 4 4 2 7 3 4 7 2 0 5 0 2 0 2 3 4 10 3 1 4 1 0 1 2 XLOC_000005 1 0 0 0 2 0 0 0 2 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 XLOC_000006 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Script needs modification of one line because the header will be a problem
thanks so much! let me know how i can execute it after.
Instructions are at the top of the script, I edited it on my phone so I hope I didn't make a typo.
Got an error:
Odd. Hope to get near computer soon, maybe tomorrow, if not Sunday. Already seen a typo, gene instead of Gene, but doesn't explain the error you obtained. Files are tab separated right? Problem comes from your counts file.
Should be tab-delimited. I'll try a few things in the meanwhile.
Hi James.hong,
I added a few lines to make sure a warning is printed and script isn't halted when a KeyError is found (meaning that the identifier wasn't present in the dictionary generated from the gtf file and as such can't be converted). The offending line will be printed to your terminal and the identifier won't be converted, just kept as it was. This should help us to track down the error.
Let me know the result!
Yeah there is info about e.g. the gene name (or ensembl code), which is probably what you want. You can use AWK to extract that out of your file, but if you're not into command line, you probably want to ask your core facility for help with this.