Merged.gtf vs. Rnor_6.0.gtf for Raw Counts
0
0
Entering edit mode
8.1 years ago
jmsyl.hong • 0

Hey guys,

The core facility for RNA-Seq used cuffmerge to generated a merged.gtf file (i.e. a merged assembly file).

I wanted to know what the difference was using the merged.gtf file to get raw counts using subread from .bams vs. using the most recent ensembl .gtf for my species.

When I run the subread with the merged.gtf file, I only get cufflinks IDs for the gene names with no gene symbols. Does anyone know if there is gene symbol or shortname information in the merged.gtf?

Thanks in advance!

RNA-Seq cuffmerge raw counts • 3.5k views
ADD COMMENT
1
Entering edit mode

Can you show the first few lines of your gtf file, with unix head command?

ADD REPLY
0
Entering edit mode

I'm on a mac (am also new with command line) is there something i can run on terminal or perhaps i can try to open this .gtf file with a text editor?

ADD REPLY
1
Entering edit mode

Mac should have a bash terminal

Go to the folder with your file and type:

head merged.gtf
ADD REPLY
0
Entering edit mode
1   Cufflinks   exon    312155  312360  .   +   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "Vom2r3"; oId "ENSRNOT00000044669"; nearest_ref "ENSRNOT00000044669"; class_code "="; tss_id "TSS1"; p_id "P2";
1   Cufflinks   exon    315197  315479  .   +   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; gene_name "Vom2r3"; oId "ENSRNOT00000044669"; nearest_ref "ENSRNOT00000044669"; class_code "="; tss_id "TSS1"; p_id "P2";
1   Cufflinks   exon    315933  316737  .   +   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "3"; gene_name "Vom2r3"; oId "ENSRNOT00000044669"; nearest_ref "ENSRNOT00000044669"; class_code "="; tss_id "TSS1"; p_id "P2";
1   Cufflinks   exon    317591  317815  .   +   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "4"; gene_name "Vom2r3"; oId "ENSRNOT00000044669"; nearest_ref "ENSRNOT00000044669"; class_code "="; tss_id "TSS1"; p_id "P2";
1   Cufflinks   exon    321633  321756  .   +   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "5"; gene_name "Vom2r3"; oId "ENSRNOT00000044669"; nearest_ref "ENSRNOT00000044669"; class_code "="; tss_id "TSS1"; p_id "P2";
1   Cufflinks   exon    322916  323800  .   +   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "6"; gene_name "Vom2r3"; oId "ENSRNOT00000044669"; nearest_ref "ENSRNOT00000044669"; class_code "="; tss_id "TSS1"; p_id "P2";
1   Cufflinks   exon    312155  312383  .   +   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000005"; exon_number "1"; gene_name "Vom2r3"; oId "ENSRNOT00000046586"; nearest_ref "ENSRNOT00000046586"; class_code "="; tss_id "TSS1"; p_id "P1";
1   Cufflinks   exon    386141  386392  .   +   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000005"; exon_number "2"; gene_name "Vom2r3"; oId "ENSRNOT00000046586"; nearest_ref "ENSRNOT00000046586"; class_code "="; tss_id "TSS1"; p_id "P1";
1   Cufflinks   exon    391729  392532  .   +   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000005"; exon_number "3"; gene_name "Vom2r3"; oId "ENSRNOT00000046586"; nearest_ref "ENSRNOT00000046586"; class_code "="; tss_id "TSS1"; p_id "P1";
1   Cufflinks   exon    393385  393606  .   +   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000005"; exon_number "4"; gene_name "Vom2r3"; oId "ENSRNOT00000046586"; nearest_ref "ENSRNOT00000046586"; class_code "="; tss_id "TSS1"; p_id "P1";
ADD REPLY
0
Entering edit mode

from this it actually seems like there is a ton of information in the merged.gtf, including the gene_names... I'm using RNASeqGUI (because from command-line dumb) to do the raw counts it probably only gets the first column (e.g. gene_id which is the XLOC prefix) instead of the gene name.

is there a way for me to convert my existing raw counts which is mapped to XLOC IDs to the gene_names through this file, perhaps I can open it up in Excel and do some matching?

ADD REPLY
1
Entering edit mode

No processing in excel! Is there a 1-on-1 relationship for gene_id "XLOC" and gene_name? There should be a better way. Either you change the raw counts, e.g. by a python script I could write or you change the gtf and repeat the counting with featurecounts and set the -g flag to gene_name.

ADD REPLY
0
Entering edit mode

I would prefer if I could change the raw counts as I already did them for the entire dataset. That python script would be amazing. I have a tab delimited file that has XLOC in column 1, with each sample as headers thereafter, so essentially, it would have to match the XLOC from the first column of that file to the XLOC in the .gtf then make a new column on the tab delimited file with the gene name.

There is a 1:1 relationship between the XLOC and the gene name.

ADD REPLY
1
Entering edit mode

Do you have a bit of experience in programming? It's obviously more interesting for you if you code a bit yourself, and I don't mind guiding you in that for this script.

If not, could you show me a bit of data, first 10 lines for example, so I know what I'm working on?

ADD REPLY
0
Entering edit mode

Hey, it wouldn't let me post more than 5 posts per 6 hours... I would be very interested in learning how to get through the timecourse pipeline by deseq2 starting with the generation of raw counts. Is there a way we could communicate? I would be happy to sent you tons of coffees for your troubles!

ADD REPLY
0
Entering edit mode

I've never done a time course experiment, so I won't be of much use for that. I was mainly talking about converting the gene identifiers.

ADD REPLY
0
Entering edit mode

Sure, that would definitely help too.

I have two files:

  1. merged.gtf have a gene_name (gene names) column that is matched 1:1 to the gene_id column (XLOC)

  2. treatment1.txt have a gene_id column (XLOC) column with subsequent columns that contain the sample ids with their respective reads for each of the XLOCs.

I would like to match the XLOCs to their gene names from the merged.gtf file and have a column in the treatment1.txt that has the gene names.

Thanks!!!

ADD REPLY
1
Entering edit mode

Can you show me a part of treatment1.txt? Apart from knowing that structure the script seems ready.

I made some assumptions about the countfile but will update my code if those were not correct, see code below:

ADD REPLY
0
Entering edit mode

Gene ID 3.4 3.7 3.9 3.3 3.1 3.1S 3.6S 3.2S 1.4 1.8 1.1 1.1 1.11 1.5S 1.1S 1.4S 2.11 2.1 2.15 2.1 2.16 2.7S 2.5S 2.6S 8.16 8.5 8.12 8.13 8.15 8.3S 8.4S 8.5S WT.1 WT.2 WT3 XLOC_000001 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 XLOC_000002 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 XLOC_000003 1 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 XLOC_000004 5 3 1 1 6 5 1 1 7 3 8 4 4 2 7 3 4 7 2 0 5 0 2 0 2 3 4 10 3 1 4 1 0 1 2 XLOC_000005 1 0 0 0 2 0 0 0 2 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 XLOC_000006 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

ADD REPLY
1
Entering edit mode

Script needs modification of one line because the header will be a problem

ADD REPLY
0
Entering edit mode

thanks so much! let me know how i can execute it after.

ADD REPLY
1
Entering edit mode

Instructions are at the top of the script, I edited it on my phone so I hope I didn't make a typo.

ADD REPLY
0
Entering edit mode

Got an error:

python renameCountIdentifiers.py merged.gtf cervical1.txt
Traceback (most recent call last):
  File "renameCountIdentifiers.py", line 28, in <module>
    changeIdentifierCounts(sys.argv[1], sys.argv[2])
  File "renameCountIdentifiers.py", line 24, in changeIdentifierCounts
    print("{}\t{}".format(gtfdict[line.strip().split('\t')[0]], '\t'.join(line.strip().split('\t')[1:])))
KeyError: 'c'
ADD REPLY
0
Entering edit mode

Odd. Hope to get near computer soon, maybe tomorrow, if not Sunday. Already seen a typo, gene instead of Gene, but doesn't explain the error you obtained. Files are tab separated right? Problem comes from your counts file.

ADD REPLY
0
Entering edit mode

Should be tab-delimited. I'll try a few things in the meanwhile.

ADD REPLY
0
Entering edit mode

Hi James.hong,

I added a few lines to make sure a warning is printed and script isn't halted when a KeyError is found (meaning that the identifier wasn't present in the dictionary generated from the gtf file and as such can't be converted). The offending line will be printed to your terminal and the identifier won't be converted, just kept as it was. This should help us to track down the error.

Let me know the result!

ADD REPLY
0
Entering edit mode

Yeah there is info about e.g. the gene name (or ensembl code), which is probably what you want. You can use AWK to extract that out of your file, but if you're not into command line, you probably want to ask your core facility for help with this.

ADD REPLY

Login before adding your answer.

Traffic: 1944 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6