Dear Biostars users,
We recently submitted data to a sequencing center and I have some concern about the Cufflinks/Cuffmerge/Cuffdiff pipeline that was run (PLEASE: no off-topic responses about how I shouldn't be using this pipeline and should use DESeq2, or some other pipeline - I already know this!).
This issue concerns whether it is appropriate to merge the transcripts.gtf files created by Cufflinks from multiple cell lines A549, HAP1, K562.
The sequencing facility used Cuffmerge to every transcripts.gtf from all cell lines into a merged.gtf file. Basically, I don’t know enough about the cufflinks/cuffmerge assembler, nor how different these cell lines are, to say whether it was appropriate to merge all of these together?
From what I understand, Cuffmerge is designed to merge full-length and partial transcript assemblies without ever merging transfrags that disagree on splicing structure - meaning Cuffmerge will merge two transfrags if they overlap and agree on splicing, and are in the same orientation.
So it seems to me that Cuffmerge will only merge features that are common between cell lines and discard those that are not. Am I just being paranoid?
UPDATE
Deanna M. Church @deannachurch suggested on Twitter that I should do some QC. Specifically, to start by looking for 1bp exons. I have to admit I'd never heard of micro-exons before. Although there are some well-documented examples of alternatively spliced micro-exons which are strongly conserved [PMID:25524026, PMID: 25525873, PMID: 26293963, PMID: 27565344] they are extremely rare.
The gene transfer format (GTF) 3rd field/column is feature (e.g. exon), while 4th is start and 5th is end so I looked for 1bp long micro-exons:
cat merged.gtf | awk '($3=="exon" && $5-$4+1 <2) {print}'
I wound up finding 36:
chr1 Cufflinks exon 244367233 244367233 . - . gene_id "XLOC_005396"; transcript_id "TCONS_00017260"; exon_number "1"; gene_name "ADSS"; oId "CUFF.3425.1"; nearest_ref "ENST00000366535"; class_code "j"; tss_id "TSS8564";
chr11 Cufflinks exon 71580167 71580167 . - . gene_id "XLOC_010505"; transcript_id "TCONS_00033349"; exon_number "2"; gene_name "KRTAP5-11"; oId "ENST00000616086"; nearest_ref "ENST00000616086"; class_code "="; tss_id "TSS16677"; p_id "P8666";
chr11 Cufflinks exon 76191778 76191778 . - . gene_id "XLOC_010591"; transcript_id "TCONS_00033697"; exon_number "3"; gene_name "WNT11"; oId "CUFF.6324.3"; nearest_ref "ENST00000621122"; class_code "="; tss_id "TSS16829"; p_id "P8751";
chr11 Cufflinks exon 76191778 76191778 . - . gene_id "XLOC_010591"; transcript_id "TCONS_00033701"; exon_number "3"; gene_name "WNT11"; oId "ENST00000621122"; nearest_ref "ENST00000621122"; class_code "="; tss_id "TSS16832"; p_id "P8751";
chr11 Cufflinks exon 101050949 101050949 . - . gene_id "XLOC_010791"; transcript_id "TCONS_00034359"; exon_number "4"; gene_name "PGR"; oId "CUFF.6606.8"; nearest_ref "ENST00000325455"; class_code "j"; tss_id "TSS17155";
chr11 Cufflinks exon 101050949 101050949 . - . gene_id "XLOC_010791"; transcript_id "TCONS_00034358"; exon_number "4"; gene_name "PGR"; oId "CUFF.6606.7"; nearest_ref "ENST00000325455"; class_code "j"; tss_id "TSS17155";
chr11 Cufflinks exon 101050949 101050949 . - . gene_id "XLOC_010791"; transcript_id "TCONS_00034351"; exon_number "4"; gene_name "PGR"; oId "CUFF.6606.10"; nearest_ref "ENST00000325455"; class_code "j"; tss_id "TSS17155";
chr11 Cufflinks exon 101050949 101050949 . - . gene_id "XLOC_010791"; transcript_id "TCONS_00034364"; exon_number "2"; gene_name "PGR"; oId "ENST00000617858"; contained_in "TCONS_00034358"; nearest_ref "ENST00000617858"; class_code "="; tss_id "TSS17156"; p_id "P8887";
chr14 Cufflinks exon 23567597 23567597 . + . gene_id "XLOC_015828"; transcript_id "TCONS_00049285"; exon_number "4"; gene_name "RP11-66N24.3"; oId "CUFF.9635.17"; nearest_ref "ENST00000555968"; class_code "j"; tss_id "TSS24869";
chr14 Cufflinks exon 90402778 90402778 . + . gene_id "XLOC_016406"; transcript_id "TCONS_00051142"; exon_number "1"; gene_name "CALM1"; oId "CUFF.10288.11"; nearest_ref "ENST00000447653"; class_code "o"; tss_id "TSS25799";
chr14 Cufflinks exon 24632719 24632719 . - . gene_id "XLOC_016882"; transcript_id "TCONS_00052676"; exon_number "3"; gene_name "GZMB"; oId "ENST00000382540"; nearest_ref "ENST00000382540"; class_code "="; tss_id "TSS26542"; p_id "P13537";
chr15 Cufflinks exon 43401850 43401850 . - . gene_id "XLOC_019280"; transcript_id "TCONS_00059371"; exon_number "1"; gene_name "TP53BP1"; oId "CUFF.10902.1"; nearest_ref "ENST00000382044"; class_code "j"; tss_id "TSS30129";
chr15 Cufflinks exon 99970355 99970355 . - . gene_id "XLOC_019985"; transcript_id "TCONS_00061773"; exon_number "1"; gene_name "ADAMTS17"; oId "CUFF.11729.1"; nearest_ref "ENST00000268070"; class_code "j"; tss_id "TSS31293";
chr16 Cufflinks exon 2770352 2770352 . + . gene_id "XLOC_020165"; transcript_id "TCONS_00062448"; exon_number "14"; gene_name "SRRM2"; oId "CUFF.11951.1"; nearest_ref "ENST00000301740"; class_code "j"; tss_id "TSS31606";
chr16 Cufflinks exon 89553267 89553267 . + . gene_id "XLOC_021282"; transcript_id "TCONS_00066046"; exon_number "16"; gene_name "SPG7"; oId "ENST00000620811"; nearest_ref "ENST00000620811"; class_code "="; tss_id "TSS33400"; p_id "P16770";
chr18 Cufflinks exon 9887458 9887458 . + . gene_id "XLOC_025718"; transcript_id "TCONS_00081021"; exon_number "3"; gene_name "TXNDC2"; oId "ENST00000611534"; nearest_ref "ENST00000611534"; class_code "="; tss_id "TSS40801"; p_id "P20539";
chr18 Cufflinks exon 9887458 9887458 . + . gene_id "XLOC_025718"; transcript_id "TCONS_00081023"; exon_number "4"; gene_name "TXNDC2"; oId "CUFF.15416.3"; nearest_ref "ENST00000611534"; class_code "j"; tss_id "TSS40801";
chr18 Cufflinks exon 9887458 9887458 . + . gene_id "XLOC_025718"; transcript_id "TCONS_00081022"; exon_number "3"; gene_name "TXNDC2"; oId "CUFF.15416.2"; nearest_ref "ENST00000611534"; class_code "j"; tss_id "TSS40801";
chr19 Cufflinks exon 19895471 19895471 . + . gene_id "XLOC_027317"; transcript_id "TCONS_00085993"; exon_number "3"; gene_name "ZNF253"; oId "CUFF.16659.4"; nearest_ref "ENST00000589717"; class_code "j"; tss_id "TSS43326";
chr19 Cufflinks exon 4454794 4454794 . - . gene_id "XLOC_028401"; transcript_id "TCONS_00090221"; exon_number "10"; gene_name "UBXN6"; oId "CUFF.16144.8"; nearest_ref "ENST00000301281"; class_code "j"; tss_id "TSS45254";
chr19 Cufflinks exon 36658555 36658555 . - . gene_id "XLOC_029002"; transcript_id "TCONS_00092356"; exon_number "5"; gene_name "ZNF461"; oId "CUFF.16940.4"; nearest_ref "ENST00000588268"; class_code "j"; tss_id "TSS46224";
chr2 Cufflinks exon 96695297 96695297 . + . gene_id "XLOC_030481"; transcript_id "TCONS_00097294"; exon_number "36"; gene_name "FER1L5"; oId "ENST00000623246"; nearest_ref "ENST00000623246"; class_code "="; tss_id "TSS48705"; p_id "P25170";
chr2 Cufflinks exon 96695297 96695297 . + . gene_id "XLOC_030481"; transcript_id "TCONS_00097291"; exon_number "37"; gene_name "FER1L5"; oId "CUFF.19029.6"; nearest_ref "ENST00000623246"; class_code "j"; tss_id "TSS48705";
chr2 Cufflinks exon 40512349 40512349 . - . gene_id "XLOC_032099"; transcript_id "TCONS_00102629"; exon_number "10"; gene_name "SLC8A1"; oId "CUFF.18487.9"; nearest_ref "ENST00000403092"; class_code "j"; tss_id "TSS51395";
chr21 Cufflinks exon 38006828 38006828 . - . gene_id "XLOC_036066"; transcript_id "TCONS_00114476"; exon_number "1"; gene_name "DSCR4"; oId "CUFF.21897.6"; nearest_ref "ENST00000398948"; class_code "j"; tss_id "TSS57549";
chr3 Cufflinks exon 50617514 50617514 . - . gene_id "XLOC_039610"; transcript_id "TCONS_00126194"; exon_number "4"; gene_name "CISH"; oId "CUFF.23701.1"; nearest_ref "ENST00000443053"; class_code "j"; tss_id "TSS63347";
chr4 Cufflinks exon 1730388 1730388 . + . gene_id "XLOC_040760"; transcript_id "TCONS_00129874"; exon_number "4"; gene_name "TACC3"; oId "ENST00000612220"; nearest_ref "ENST00000612220"; class_code "="; tss_id "TSS65159"; p_id "P32811";
chr4 Cufflinks exon 109816009 109816009 . + . gene_id "XLOC_041579"; transcript_id "TCONS_00132221"; exon_number "1"; gene_name "GAR1"; oId "CUFF.26055.9"; nearest_ref "ENST00000226796"; class_code "j"; tss_id "TSS66363";
chr4 Cufflinks exon 109816009 109816009 . + . gene_id "XLOC_041579"; transcript_id "TCONS_00132220"; exon_number "1"; gene_name "GAR1"; oId "CUFF.26055.10"; nearest_ref "ENST00000226796"; class_code "j"; tss_id "TSS66363";
chr4 Cufflinks exon 169663114 169663114 . + . gene_id "XLOC_041987"; transcript_id "TCONS_00133314"; exon_number "2"; gene_name "CLCN3"; oId "ENST00000504131"; nearest_ref "ENST00000504131"; class_code "="; tss_id "TSS66954"; p_id "P33590";
chr4 Cufflinks exon 120083285 120083285 . - . gene_id "XLOC_042957"; transcript_id "TCONS_00136111"; exon_number "6"; gene_name "MAD2L1"; oId "CUFF.26122.2"; nearest_ref "ENST00000296509"; class_code "j"; tss_id "TSS68414";
chr4 Cufflinks exon 144120490 144120490 . - . gene_id "XLOC_043118"; transcript_id "TCONS_00136504"; exon_number "3"; gene_name "GYPA"; oId "CUFF.26356.6"; nearest_ref "ENST00000535709"; class_code "j"; tss_id "TSS68639";
chr6 Cufflinks exon 42031246 42031246 . - . gene_id "XLOC_048613"; transcript_id "TCONS_00152503"; exon_number "5"; gene_name "CCND3"; oId "CUFF.29469.3"; nearest_ref "ENST00000511642"; class_code "j"; tss_id "TSS77022";
chr7 Cufflinks exon 99396376 99396376 . + . gene_id "XLOC_050386"; transcript_id "TCONS_00157909"; exon_number "11"; gene_name "ARPC1B"; oId "CUFF.31661.3"; nearest_ref "ENST00000451682"; class_code "j"; tss_id "TSS79796";
chr7 Cufflinks exon 44122282 44122282 . - . gene_id "XLOC_051487"; transcript_id "TCONS_00161237"; exon_number "11"; gene_name "POLD2"; oId "CUFF.31102.3"; nearest_ref "ENST00000406581"; class_code "j"; tss_id "TSS81506";
chrX Cufflinks exon 123697681 123697681 . - . gene_id "XLOC_059655"; transcript_id "TCONS_00184665"; exon_number "35"; gene_name "THOC2"; oId "CUFF.36719.6"; nearest_ref "ENST00000245838"; class_code "j"; tss_id "TSS93904";
I assume I should filter these exons out but what about next steps for QC; what does this finding suggest about the quality of the merged.gtf file?
UPDATE 2
I compared these with hg38_e87.gtf which was used with hg38.fa by Cufflinks and noticed 15 are there. A brief inspection shows that many of the ones in hg38_e87 are repeated in merged.gtf. For example, in the example above Chromosome 18 (gene_name: TXNDC2) is repeated 3 times. The first entry matches hg38_e87.gtf exactly. The next two entries have subtle difference: transcript_id, exon_number, oId, and an entirely new entry p_id. I'm not sure why Cuffmerge adds duplicates or if this creates problems downstream - it will certainly slow down cuffdiffs as more data/repeated-isoforms means substantially more memory and running time.
6 of the micro-exons that were in hg38_e87.gtf are no longer in the merged.gtf (SCN7A, ZDHHC11, MEF2C, KRT17, MPP2, MED25), which isn't a problem.
However, a number of new micro-exons arose in the merged.gtf file (ADSS, RP11-66N24.3, CALM1, TP53BP1, ADAMTS17, SRRM2, ZNF253, UBXN6, ZNF461, SLC8A1, DSCR4, CISH, MAD2L1, GYPA, CCND3, ARPC1B, POLD2, THOC2). So do I just filter out these and leave the rest because they were originally in hg38?
Thank you