Hi all,
I routinely check my transcriptomes (from microscopic invertebrates) for exogenous contamination by blasting for COI and 18S and then looking to see if I get non-target hits and from what organism they came from. I'd like to establish a pipeline where I run salmon to get a quant.sf file for each assembly and then add the TPM and read count to the Trinity fasta header.
Here's what my Trinity fasta headers look like
TRINITY_DN8_c0_g1_i2
TRINITY_DN8_c0_g1_i1
TRINITY_DN8_c0_g1_i4
TRINITY_DN8_c0_g1_i3
TRINITY_DN8_c0_g1_i6
TRINITY_DN8_c0_g1_i7
TRINITY_DN18_c0_g1_i5
TRINITY_DN18_c0_g1_i2
TRINITY_DN18_c0_g1_i3
TRINITY_DN18_c0_g1_i6
Here's what the salmon quant.sf file looks like:
TRINITY_DN8_c0_g1_i2 510 285.832 7.037047 218.113
TRINITY_DN8_c0_g1_i1 1155 926.015 13.435710 1349.141
TRINITY_DN8_c0_g1_i4 1390 1161.015 1.133691 142.729
TRINITY_DN8_c0_g1_i3 698 469.444 12.193151 620.695
TRINITY_DN8_c0_g1_i6 563 336.582 6.744577 246.164
TRINITY_DN8_c0_g1_i7 1076 847.015 7.938752 729.159
TRINITY_DN18_c0_g1_i5 773 544.151 1.097282 64.747
TRINITY_DN18_c0_g1_i2 338 134.304 0.205994 3.000
TRINITY_DN18_c0_g1_i3 1452 1223.015 6.587060 873.579
I know there should be an easy script solution to add the last two fields from each line of the the quant.sf file to the corresponding fasta header but I haven't been able to figure this out. For what it's worth the is the exact same number of sequences in the fasta file as lines in the quant.sf file and they are in the same order.
Anyone who has figured this out or any scripting gurus who could lend some assistance would be greatly appreciated.
Thanks!
Kevin