Hello!
I have a set of reference contigs assembled with SPAdes, and RNA-seq data quantified with salmon for a control and an experimental sample (call them c_ and 2_). For every gene there are several contigs of different lengths. Here are some tables:
Control:
~$ head ./salmon/c_Hu_quant/quant.sf
Name Length EffectiveLength TPM NumReads
GENE_A_NODE_1_length_24424_cov_435.483225_g0_i0 24424 572630.331 0.001707 106.950
GENE_A_NODE_2_length_2167_cov_448.144312_g0_i1 2167 238749.662 2.123121 55473.174
GENE_B_NODE_3_length_16211_cov_105.093072_g1_i0 16211 774366.222 0.044384 3761.324
GENE_B_NODE_4_length_1580_cov_123.258835_g2_i0 1580 100481.162 0.250976 2759.830
Experimental:
~$ head ./salmon/2_Hu_quant/quant.sf
Name Length EffectiveLength TPM NumReads
GENE_A_NODE_1_length_24424_cov_435.483225_g0_i0 24424 257379.238 0.001724 47.207
GENE_A_NODE_2_length_2167_cov_448.144312_g0_i1 2167 112914.001 2.275767 27341.409
GENE_B_NODE_3_length_16211_cov_105.093072_g1_i0 16211 360023.500 0.066320 2540.525
GENE_B_NODE_4_length_1580_cov_123.258835_g2_i0 1580 67844.237 0.004987 36.000
So the first two contigs belong to Gene A and the last two to Gene B. (These numbers are not real data, just an illustration.)
Here is my question: Should I combine salmon data for each gene (e.g. sum NumReads together) or just filter and keep the longest one (NODE_1 for gene A and NODE_3 for gene B)? My next step is differential expression analysis.
Thank you.
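To make the two options in the question concrete, here is a minimal Python sketch that applies both to the illustrative control rows above. It assumes the gene label is whatever precedes `_NODE_` in the contig name, which only holds for this example naming; the function names are made up for the illustration.

```python
# Rows copied from the illustrative control quant.sf above: (Name, Length, NumReads)
control_rows = [
    ("GENE_A_NODE_1_length_24424_cov_435.483225_g0_i0", 24424, 106.950),
    ("GENE_A_NODE_2_length_2167_cov_448.144312_g0_i1", 2167, 55473.174),
    ("GENE_B_NODE_3_length_16211_cov_105.093072_g1_i0", 16211, 3761.324),
    ("GENE_B_NODE_4_length_1580_cov_123.258835_g2_i0", 1580, 2759.830),
]


def gene_of(name):
    # Assumption: the gene label is the part of the name before "_NODE_".
    return name.split("_NODE_")[0]


def sum_per_gene(rows):
    """Option 1: add up NumReads over all contigs of a gene."""
    totals = {}
    for name, _length, num_reads in rows:
        gene = gene_of(name)
        totals[gene] = totals.get(gene, 0.0) + num_reads
    return totals


def longest_per_gene(rows):
    """Option 2: keep only the NumReads of the longest contig of a gene."""
    best = {}
    for name, length, num_reads in rows:
        gene = gene_of(name)
        if gene not in best or length > best[gene][0]:
            best[gene] = (length, num_reads)
    return {gene: reads for gene, (_length, reads) in best.items()}


summed = sum_per_gene(control_rows)
longest = longest_per_gene(control_rows)
for gene in summed:
    print(gene, "summed:", round(summed[gene], 3), "longest only:", longest[gene])
```

On these illustrative numbers, keeping only the longest contig retains 106.95 reads for Gene A, while the sum over both contigs is about 55580, so the two options can give very different counts.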
Is this a de novo transcriptome assembly? If so, what you are calling "contigs" are potentially assembled transcripts. Have you done the due diligence of making sure you have a reasonably non-redundant reference after the assembly? The two contigs in each case seem to differ in size by an order of magnitude. Are you sure the smaller of the two is not contained in the larger "contig"?
Yes, this is de novo. Completeness is 92% according to BUSCO.
The values here are examples, but yes, shorter contigs can be contained in longer ones from the same gene.
Can you clarify if this is a pro- or eukaryotic organism? Did you use rnaSPAdes since you have RNAseq data?
Yes, I used rnaSPAdes. I performed a hybrid assembly using 6 samples with Illumina short reads and one sample with Oxford Nanopore long reads.
Eukaryotic
Could you please have a look at part of the real data?
These 20 transcripts belong to the same domain according to the Interproscan annotation. Also, I have the correspondence table between transcript_ID and gene.
If I am going to use DESeq2, should I provide all the data without any filtering, even though several transcripts map to the same gene?
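For what it is worth, a transcript_ID-to-gene table like the one mentioned above is the usual input for gene-level summarization (in R, the tximport package takes salmon output plus a `tx2gene` table, and DESeq2 provides `DESeqDataSetFromTximport`). The Python sketch below only illustrates the summation idea and is not a replacement for that workflow. The sample labels, the in-memory stand-ins for the `quant.sf` files, and the helper names are assumptions for the example; the rows reuse the illustrative Gene B values from the question.

```python
import csv
import io


def read_numreads(handle):
    """Return {transcript name: NumReads} from a tab-separated salmon quant.sf."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Name"]: float(row["NumReads"]) for row in reader}


def gene_counts(samples, tx2gene):
    """Sum NumReads per gene: {sample: {tx: reads}} -> {gene: {sample: reads}}."""
    table = {}
    for sample, reads in samples.items():
        for tx, n in reads.items():
            gene = tx2gene[tx]
            row = table.setdefault(gene, {s: 0.0 for s in samples})
            row[sample] += n
    return table


# In-memory stand-ins for the two quant.sf files (illustrative Gene B rows only).
HEADER = "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
TX3 = "GENE_B_NODE_3_length_16211_cov_105.093072_g1_i0"
TX4 = "GENE_B_NODE_4_length_1580_cov_123.258835_g2_i0"
control_sf = HEADER + (
    f"{TX3}\t16211\t774366.222\t0.044384\t3761.324\n"
    f"{TX4}\t1580\t100481.162\t0.250976\t2759.830\n"
)
experimental_sf = HEADER + (
    f"{TX3}\t16211\t360023.500\t0.066320\t2540.525\n"
    f"{TX4}\t1580\t67844.237\t0.004987\t36.000\n"
)

# Hypothetical transcript-to-gene table; in practice this comes from your own file.
tx2gene = {TX3: "GENE_B", TX4: "GENE_B"}

samples = {
    "c_Hu": read_numreads(io.StringIO(control_sf)),
    "2_Hu": read_numreads(io.StringIO(experimental_sf)),
}
matrix = gene_counts(samples, tx2gene)

# A plain count matrix for DESeq2 must hold integers, so round the summed estimates.
rounded = {g: {s: round(v) for s, v in row.items()} for g, row in matrix.items()}
print(rounded)
```

With real files, the `io.StringIO` objects would be replaced by opened `quant.sf` files, one per sample.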