Is it acceptable to merge .gtf files from different cell lines?
0
0
Entering edit mode
7.1 years ago
moldach ▴ 130

Dear Biostars users,

We recently submitted data to a sequencing center and I have some concern about the Cufflinks/Cuffmerge/Cuffdiff pipeline that was run (PLEASE: no off-topic responses about how I shouldn't be using this pipeline and should use DESeq2, or some other pipeline - I already know this!).

This issue concerns whether it is appropriate to merge the transcripts.gtf files created by Cufflinks from multiple cell lines A549, HAP1, K562.

The sequencing facility used Cuffmerge to every transcripts.gtf from all cell lines into a merged.gtf file. Basically, I don’t know enough about the cufflinks/cuffmerge assembler, nor how different these cell lines are, to say whether it was appropriate to merge all of these together?

From what I understand, Cuffmerge is designed to merge full-length and partial transcript assemblies without ever merging transfrags that disagree on splicing structure - meaning Cuffmerge will merge two transfrags if they overlap and agree on splicing, and are in the same orientation.

So it seems to me that Cuffmerge will only merge features that are common between cell lines and discard those that are not. Am I just being paranoid?

UPDATE

Deanna M. Church @deannachurch suggested on Twitter that I should do some QC. Specifically, to start by looking for 1bp exons. I have to admit I'd never heard of micro-exons before. Although there are some well-documented examples of alternatively spliced micro-exons which are strongly conserved [PMID:25524026, PMID: 25525873, PMID: 26293963, PMID: 27565344] they are extremely rare.

The gene transfer format (GTF) 3rd field/column is feature (e.g. exon), while 4th is start and 5th is end so I looked for 1bp long micro-exons:

cat merged.gtf | awk '($3=="exon" && $5-$4+1 <2) {print}'

I wound up finding 36:

chr1    Cufflinks       exon    244367233       244367233       .       -       .       gene_id "XLOC_005396"; transcript_id "TCONS_00017260"; exon_number "1"; gene_name "ADSS"; oId "CUFF.3425.1"; nearest_ref "ENST00000366535"; class_code "j"; tss_id "TSS8564";
chr11   Cufflinks       exon    71580167        71580167        .       -       .       gene_id "XLOC_010505"; transcript_id "TCONS_00033349"; exon_number "2"; gene_name "KRTAP5-11"; oId "ENST00000616086"; nearest_ref "ENST00000616086"; class_code "="; tss_id "TSS16677"; p_id "P8666";
chr11   Cufflinks       exon    76191778        76191778        .       -       .       gene_id "XLOC_010591"; transcript_id "TCONS_00033697"; exon_number "3"; gene_name "WNT11"; oId "CUFF.6324.3"; nearest_ref "ENST00000621122"; class_code "="; tss_id "TSS16829"; p_id "P8751";
chr11   Cufflinks       exon    76191778        76191778        .       -       .       gene_id "XLOC_010591"; transcript_id "TCONS_00033701"; exon_number "3"; gene_name "WNT11"; oId "ENST00000621122"; nearest_ref "ENST00000621122"; class_code "="; tss_id "TSS16832"; p_id "P8751";
chr11   Cufflinks       exon    101050949       101050949       .       -       .       gene_id "XLOC_010791"; transcript_id "TCONS_00034359"; exon_number "4"; gene_name "PGR"; oId "CUFF.6606.8"; nearest_ref "ENST00000325455"; class_code "j"; tss_id "TSS17155";
chr11   Cufflinks       exon    101050949       101050949       .       -       .       gene_id "XLOC_010791"; transcript_id "TCONS_00034358"; exon_number "4"; gene_name "PGR"; oId "CUFF.6606.7"; nearest_ref "ENST00000325455"; class_code "j"; tss_id "TSS17155";
chr11   Cufflinks       exon    101050949       101050949       .       -       .       gene_id "XLOC_010791"; transcript_id "TCONS_00034351"; exon_number "4"; gene_name "PGR"; oId "CUFF.6606.10"; nearest_ref "ENST00000325455"; class_code "j"; tss_id "TSS17155";
chr11   Cufflinks       exon    101050949       101050949       .       -       .       gene_id "XLOC_010791"; transcript_id "TCONS_00034364"; exon_number "2"; gene_name "PGR"; oId "ENST00000617858"; contained_in "TCONS_00034358"; nearest_ref "ENST00000617858"; class_code "="; tss_id "TSS17156"; p_id "P8887";
chr14   Cufflinks       exon    23567597        23567597        .       +       .       gene_id "XLOC_015828"; transcript_id "TCONS_00049285"; exon_number "4"; gene_name "RP11-66N24.3"; oId "CUFF.9635.17"; nearest_ref "ENST00000555968"; class_code "j"; tss_id "TSS24869";
chr14   Cufflinks       exon    90402778        90402778        .       +       .       gene_id "XLOC_016406"; transcript_id "TCONS_00051142"; exon_number "1"; gene_name "CALM1"; oId "CUFF.10288.11"; nearest_ref "ENST00000447653"; class_code "o"; tss_id "TSS25799";
chr14   Cufflinks       exon    24632719        24632719        .       -       .       gene_id "XLOC_016882"; transcript_id "TCONS_00052676"; exon_number "3"; gene_name "GZMB"; oId "ENST00000382540"; nearest_ref "ENST00000382540"; class_code "="; tss_id "TSS26542"; p_id "P13537";
chr15   Cufflinks       exon    43401850        43401850        .       -       .       gene_id "XLOC_019280"; transcript_id "TCONS_00059371"; exon_number "1"; gene_name "TP53BP1"; oId "CUFF.10902.1"; nearest_ref "ENST00000382044"; class_code "j"; tss_id "TSS30129";
chr15   Cufflinks       exon    99970355        99970355        .       -       .       gene_id "XLOC_019985"; transcript_id "TCONS_00061773"; exon_number "1"; gene_name "ADAMTS17"; oId "CUFF.11729.1"; nearest_ref "ENST00000268070"; class_code "j"; tss_id "TSS31293";
chr16   Cufflinks       exon    2770352 2770352 .       +       .       gene_id "XLOC_020165"; transcript_id "TCONS_00062448"; exon_number "14"; gene_name "SRRM2"; oId "CUFF.11951.1"; nearest_ref "ENST00000301740"; class_code "j"; tss_id "TSS31606";
chr16   Cufflinks       exon    89553267        89553267        .       +       .       gene_id "XLOC_021282"; transcript_id "TCONS_00066046"; exon_number "16"; gene_name "SPG7"; oId "ENST00000620811"; nearest_ref "ENST00000620811"; class_code "="; tss_id "TSS33400"; p_id "P16770";
chr18   Cufflinks       exon    9887458 9887458 .       +       .       gene_id "XLOC_025718"; transcript_id "TCONS_00081021"; exon_number "3"; gene_name "TXNDC2"; oId "ENST00000611534"; nearest_ref "ENST00000611534"; class_code "="; tss_id "TSS40801"; p_id "P20539";
chr18   Cufflinks       exon    9887458 9887458 .       +       .       gene_id "XLOC_025718"; transcript_id "TCONS_00081023"; exon_number "4"; gene_name "TXNDC2"; oId "CUFF.15416.3"; nearest_ref "ENST00000611534"; class_code "j"; tss_id "TSS40801";
chr18   Cufflinks       exon    9887458 9887458 .       +       .       gene_id "XLOC_025718"; transcript_id "TCONS_00081022"; exon_number "3"; gene_name "TXNDC2"; oId "CUFF.15416.2"; nearest_ref "ENST00000611534"; class_code "j"; tss_id "TSS40801";
chr19   Cufflinks       exon    19895471        19895471        .       +       .       gene_id "XLOC_027317"; transcript_id "TCONS_00085993"; exon_number "3"; gene_name "ZNF253"; oId "CUFF.16659.4"; nearest_ref "ENST00000589717"; class_code "j"; tss_id "TSS43326";
chr19   Cufflinks       exon    4454794 4454794 .       -       .       gene_id "XLOC_028401"; transcript_id "TCONS_00090221"; exon_number "10"; gene_name "UBXN6"; oId "CUFF.16144.8"; nearest_ref "ENST00000301281"; class_code "j"; tss_id "TSS45254";
chr19   Cufflinks       exon    36658555        36658555        .       -       .       gene_id "XLOC_029002"; transcript_id "TCONS_00092356"; exon_number "5"; gene_name "ZNF461"; oId "CUFF.16940.4"; nearest_ref "ENST00000588268"; class_code "j"; tss_id "TSS46224";
chr2    Cufflinks       exon    96695297        96695297        .       +       .       gene_id "XLOC_030481"; transcript_id "TCONS_00097294"; exon_number "36"; gene_name "FER1L5"; oId "ENST00000623246"; nearest_ref "ENST00000623246"; class_code "="; tss_id "TSS48705"; p_id "P25170";
chr2    Cufflinks       exon    96695297        96695297        .       +       .       gene_id "XLOC_030481"; transcript_id "TCONS_00097291"; exon_number "37"; gene_name "FER1L5"; oId "CUFF.19029.6"; nearest_ref "ENST00000623246"; class_code "j"; tss_id "TSS48705";
chr2    Cufflinks       exon    40512349        40512349        .       -       .       gene_id "XLOC_032099"; transcript_id "TCONS_00102629"; exon_number "10"; gene_name "SLC8A1"; oId "CUFF.18487.9"; nearest_ref "ENST00000403092"; class_code "j"; tss_id "TSS51395";
chr21   Cufflinks       exon    38006828        38006828        .       -       .       gene_id "XLOC_036066"; transcript_id "TCONS_00114476"; exon_number "1"; gene_name "DSCR4"; oId "CUFF.21897.6"; nearest_ref "ENST00000398948"; class_code "j"; tss_id "TSS57549";
chr3    Cufflinks       exon    50617514        50617514        .       -       .       gene_id "XLOC_039610"; transcript_id "TCONS_00126194"; exon_number "4"; gene_name "CISH"; oId "CUFF.23701.1"; nearest_ref "ENST00000443053"; class_code "j"; tss_id "TSS63347";
chr4    Cufflinks       exon    1730388 1730388 .       +       .       gene_id "XLOC_040760"; transcript_id "TCONS_00129874"; exon_number "4"; gene_name "TACC3"; oId "ENST00000612220"; nearest_ref "ENST00000612220"; class_code "="; tss_id "TSS65159"; p_id "P32811";
chr4    Cufflinks       exon    109816009       109816009       .       +       .       gene_id "XLOC_041579"; transcript_id "TCONS_00132221"; exon_number "1"; gene_name "GAR1"; oId "CUFF.26055.9"; nearest_ref "ENST00000226796"; class_code "j"; tss_id "TSS66363";
chr4    Cufflinks       exon    109816009       109816009       .       +       .       gene_id "XLOC_041579"; transcript_id "TCONS_00132220"; exon_number "1"; gene_name "GAR1"; oId "CUFF.26055.10"; nearest_ref "ENST00000226796"; class_code "j"; tss_id "TSS66363";
chr4    Cufflinks       exon    169663114       169663114       .       +       .       gene_id "XLOC_041987"; transcript_id "TCONS_00133314"; exon_number "2"; gene_name "CLCN3"; oId "ENST00000504131"; nearest_ref "ENST00000504131"; class_code "="; tss_id "TSS66954"; p_id "P33590";
chr4    Cufflinks       exon    120083285       120083285       .       -       .       gene_id "XLOC_042957"; transcript_id "TCONS_00136111"; exon_number "6"; gene_name "MAD2L1"; oId "CUFF.26122.2"; nearest_ref "ENST00000296509"; class_code "j"; tss_id "TSS68414";
chr4    Cufflinks       exon    144120490       144120490       .       -       .       gene_id "XLOC_043118"; transcript_id "TCONS_00136504"; exon_number "3"; gene_name "GYPA"; oId "CUFF.26356.6"; nearest_ref "ENST00000535709"; class_code "j"; tss_id "TSS68639";
chr6    Cufflinks       exon    42031246        42031246        .       -       .       gene_id "XLOC_048613"; transcript_id "TCONS_00152503"; exon_number "5"; gene_name "CCND3"; oId "CUFF.29469.3"; nearest_ref "ENST00000511642"; class_code "j"; tss_id "TSS77022";
chr7    Cufflinks       exon    99396376        99396376        .       +       .       gene_id "XLOC_050386"; transcript_id "TCONS_00157909"; exon_number "11"; gene_name "ARPC1B"; oId "CUFF.31661.3"; nearest_ref "ENST00000451682"; class_code "j"; tss_id "TSS79796";
chr7    Cufflinks       exon    44122282        44122282        .       -       .       gene_id "XLOC_051487"; transcript_id "TCONS_00161237"; exon_number "11"; gene_name "POLD2"; oId "CUFF.31102.3"; nearest_ref "ENST00000406581"; class_code "j"; tss_id "TSS81506";
chrX    Cufflinks       exon    123697681       123697681       .       -       .       gene_id "XLOC_059655"; transcript_id "TCONS_00184665"; exon_number "35"; gene_name "THOC2"; oId "CUFF.36719.6"; nearest_ref "ENST00000245838"; class_code "j"; tss_id "TSS93904";

I assume I should filter these exons out but what about next steps for QC; what does this finding suggest about the quality of the merged.gtf file?

UPDATE 2

I compared these with hg38_e87.gtf which was used with hg38.fa by Cufflinks and noticed 15 are there. A brief inspection shows that many of the ones in hg38_e87 are repeated in merged.gtf. For example, in the example above Chromosome 18 (gene_name: TXNDC2) is repeated 3 times. The first entry matches hg38_e87.gtf exactly. The next two entries have subtle difference: transcript_id, exon_number, oId, and an entirely new entry p_id. I'm not sure why Cuffmerge adds duplicates or if this creates problems downstream - it will certainly slow down cuffdiffs as more data/repeated-isoforms means substantially more memory and running time.

6 of the micro-exons that were in hg38_e87.gtf are no longer in the merged.gtf (SCN7A, ZDHHC11, MEF2C, KRT17, MPP2, MED25), which isn't a problem.

However, a number of new micro-exons arose in the merged.gtf file (ADSS, RP11-66N24.3, CALM1, TP53BP1, ADAMTS17, SRRM2, ZNF253, UBXN6, ZNF461, SLC8A1, DSCR4, CISH, MAD2L1, GYPA, CCND3, ARPC1B, POLD2, THOC2). So do I just filter out these and leave the rest because they were originally in hg38?

Thank you

gtf cufflinks annotation cuffmerge • 2.2k views
ADD COMMENT

Login before adding your answer.

Traffic: 3002 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6