Salmon and SPAdes contigs filtering
1
1
Entering edit mode
13 months ago
Skeetul ▴ 10

Hello!

I have a bunch of reference contigs, obtained with SPAdes, and RNA-seq data analyzed with salmon for control and experimental sample (let it be c_ and 2_). For every gene, there are several contigs, differing by length. Here are some tables

Control:

~$ head ./salmon/c_Hu_quant/quant.sf

Name                                                   Length  EffectiveLength  TPM     NumReads 
GENE_A_NODE_1_length_24424_cov_435.483225_g0_i0        24424   572630.331   0.001707        106.950  
GENE_A_NODE_2_length_2167_cov_448.144312_g0_i1         2167    238749.662   2.123121        55473.174 
GENE_B_NODE_3_length_16211_cov_105.093072_g1_i0        16211   774366.222   0.044384        3761.324 
GENE_B_NODE_4_length_1580_cov_123.258835_g2_i0         1580    100481.162   0.250976        2759.830

Experimental:

~$ head ./salmon/2_Hu_quant/quant.sf
Name                                                  Length  EffectiveLength TPM     NumReads
GENE_A_NODE_1_length_24424_cov_435.483225_g0_i0       24424   257379.238      0.001724        47.207
GENE_A_NODE_2_length_2167_cov_448.144312_g0_i1        2167    112914.001      2.275767        27341.409
GENE_B_NODE_3_length_16211_cov_105.093072_g1_i0       16211   360023.500      0.066320        2540.525
GENE_B_NODE_4_length_1580_cov_123.258835_g2_i0        1580    67844.237       0.004987        36.000

So, the first two contigs are for Gene A, and the second two contigs are for Gene B. (The presented data are not actual, just for representation)

Here is my question: Should I combine salmon data for each gene (e.g. sum NumReads together) or just filter and keep the longest one (NODE_1 for gene A and NODE_3 for gene B)? My next step is differential expression analysis.

Thank you.

salmon SPAdes • 1.1k views
ADD COMMENT
0
Entering edit mode

Is this a de novo transcriptome assembly? Then potentially what you are referring to as "contigs" are assembled transcripts. Have you done the due diligence of making sure you have a reasonably non-redundant reference after the assembly?

The two contigs in each case seem to have size difference of an order of magnitude? Are you sure the smaller of the two is not present in the larger "contig"?

ADD REPLY
0
Entering edit mode

Yes, this is de novo. Completeness is 92% , according to BUSCO.

The values here are examples, but yes, shorter contigs could be found in bigger (in the same gene).

ADD REPLY
0
Entering edit mode

Can you clarify if this is a pro- or eukaryotic organism? Did you use rnaSPAdes since you have RNAseq data?

ADD REPLY
0
Entering edit mode

Yes, I used rnaSPAdes. I performed a hybrid assembly using 6 samples with Illumina short reads and one sample with Oxford Nanopore long reads.

Eukaryotic

ADD REPLY
0
Entering edit mode

Could you please have a look a part of the real data?

These 20 transcripts belong to the same domain according to the Interproscan annotation. Also, I have the correspondence table between transcript_ID and gene.

If I am going to use DESeq2, should I provide all the data (without any filtering) despite there being duplicates in genes?

Name    Length  EffectiveLength TPM     NumReads

NODE_456_length_8290_cov_370.090301_g98_i1      8290    5900.182        0.261104        38.744
NODE_1730_length_6276_cov_157.405771_g534_i0    6276    954.921 0.110088        2.644
NODE_4537_length_5053_cov_966.171084_g1083_i6   5053    1877.920        0.033603        1.587
NODE_6697_length_4585_cov_44.696809_g1440_i5    4585    3904.000        0.000000        0.000
NODE_7709_length_4427_cov_126.252641_g353_i10   4427    3746.000        0.000000        0.000
NODE_16608_length_3572_cov_86.100029_g1440_i7   3572    2891.000        0.000000        0.000
NODE_17811_length_3489_cov_661.277810_g5031_i2  3489    3402.003        0.000000        0.000
NODE_30883_length_2847_cov_428.000000_g618_i4   2847    681.000 6.502934        111.373
NODE_33037_length_2767_cov_654.084261_g3701_i3  2767    681.000 0.722930        12.381
NODE_43008_length_2454_cov_130.757245_g13642_i0 2454    17047.841       0.137288        58.861
NODE_65189_length_1923_cov_920.482162_g3701_i7  1923    681.000 1.387578        23.765
NODE_102881_length_1185_cov_127.660971_g1440_i8 1185    504.000 0.826592        10.477
NODE_127369_length_766_cov_2.112554_g46684_i0   766     172.000 0.000000        0.000
NODE_187653_length_312_cov_3.263598_g104799_i0  312     45.000  0.000000        0.000
NODE_190798_length_309_cov_1.957627_g107943_i0  309     43.000  0.000000        0.000
NODE_244353_length_241_cov_3.714286_g161491_i0  241     59.000  0.000000        0.000
ADD REPLY
1
Entering edit mode
13 months ago

The general answer is that you have to decide whether you want a gene level analysis (sum over all transcripts) or a transcript level analysis where you keep the transcripts as they are.

Keeping the longest transcript would not benefit you in any way.

Salmon has already assigned counts to each transcript with its internal algorithm; there is no reason to keep just one transcript per gene.

That being said, since these seem to be assembled transcripts, I would do a gene-level analysis, as the transcripts may be less accurate than typically expected.

ADD COMMENT
0
Entering edit mode

Thank you for your reply!

Gene level analysis is what we are interested in.

Could you please have a look a part of the real data?

These 20 transcripts belong to the same domain according to the Interproscan annotation. Also, I have the correspondence table between transcript_ID and gene.

If I am going to use DESeq2, should I provide all the data (without any filtering) despite there being duplicates in genes?

Name    Length  EffectiveLength TPM     NumReads

NODE_456_length_8290_cov_370.090301_g98_i1      8290    5900.182        0.261104        38.744
NODE_1730_length_6276_cov_157.405771_g534_i0    6276    954.921 0.110088        2.644
NODE_4537_length_5053_cov_966.171084_g1083_i6   5053    1877.920        0.033603        1.587
NODE_6697_length_4585_cov_44.696809_g1440_i5    4585    3904.000        0.000000        0.000
NODE_7709_length_4427_cov_126.252641_g353_i10   4427    3746.000        0.000000        0.000
NODE_16608_length_3572_cov_86.100029_g1440_i7   3572    2891.000        0.000000        0.000
NODE_17811_length_3489_cov_661.277810_g5031_i2  3489    3402.003        0.000000        0.000
NODE_30883_length_2847_cov_428.000000_g618_i4   2847    681.000 6.502934        111.373
NODE_33037_length_2767_cov_654.084261_g3701_i3  2767    681.000 0.722930        12.381
NODE_43008_length_2454_cov_130.757245_g13642_i0 2454    17047.841       0.137288        58.861
NODE_65189_length_1923_cov_920.482162_g3701_i7  1923    681.000 1.387578        23.765
NODE_102881_length_1185_cov_127.660971_g1440_i8 1185    504.000 0.826592        10.477
NODE_127369_length_766_cov_2.112554_g46684_i0   766     172.000 0.000000        0.000
NODE_187653_length_312_cov_3.263598_g104799_i0  312     45.000  0.000000        0.000
NODE_190798_length_309_cov_1.957627_g107943_i0  309     43.000  0.000000        0.000
NODE_244353_length_241_cov_3.714286_g161491_i0  241     59.000  0.000000        0.000
ADD REPLY

Login before adding your answer.

Traffic: 2007 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6