Question

How to trim transcripts using information from NCBI contamination screen report

0

7 months ago

Lada ▴ 30

Hi guys,

I decided to upload my transcriptomes (non-model animals without reference genomes) to TSA, but evidently I didn't do a good job with Trimmomatic, so I have to trim some of my transcripts. For example, out of 139,196 sequences, 1,624 have to be trimmed.

The NCBI report for all these contaminated sequences gives information about sequence ID, length, span of contamination and source of contamination.

| Sequence name | Length | Span(s) | Apparent source |
|---|---|---|---|
| TRINITY_DN21678_c0_g1_i1 | 2529 | 2497..2529 | adaptor:multiple |
| TRINITY_DN21678_c0_g1_i4 | 1222 | 1190..1222 | adaptor:multiple |

etc...

Most of the bases to be trimmed are at the beginning or end of the transcript.

I am not very proficient in coding, so could someone help me with a script or program for this? The transcriptome should be searched for each sequence name (column A), and when that sequence is found, the bases in the designated span (column C) should be trimmed. I guess my inputs could be the assembly as a FASTA file and a TSV table with the transcript ID and the span to be trimmed.

I see this question was asked here, but that was more than 5 years ago, and I am still not sure exactly how to do it.

RNAseq assembly transcriptome contamination • 742 views


score 1 · Answer 1 · 2024-04-23

1

7 months ago

GenoMax 147k

> I didn't do a good job with Trimmomatic for some reason so I have to do some trimming of my transcripts.

If you had extraneous sequence in your data before running Trinity, can you trust the results? Ideally you would go back and clean the data before redoing the assembly.

If that is not an option, then consider dropping all 1624 sequences that contain the adapter. You seem to have an ample number of transcripts.

You could also use bbduk.sh (or perhaps fastp) against those 1624 sequences (or your entire set of transcripts) to remove any adapters.


0

Thank you, and yes, I agree that going a step back is the best option, but I am in a hurry at the moment, so I need to stick to the transcriptomes I already have.

ADD REPLY • link 7 months ago by Lada ▴ 30

0

I checked both of these programs (bbduk and fastp) and it seems they take reads as input (FASTQ format). How can I trim my transcripts (FASTA file)?

ADD REPLY • link 7 months ago by Lada ▴ 30

1

bbduk.sh can use the transcripts as input. There is an adapters.fa included in the resources directory of the distribution. You can do something like:

bbduk.sh -Xmx6g in=trinity.fa out=cleaned.fa ref=adapters.fa ktrim=rl k=23

Using ktrim=rl because you said that the adapters were at the beginning and end of the transcripts. Otherwise you can run ktrim=r and then ktrim=l individually.

ADD REPLY • link 7 months ago by GenoMax 147k

0

Thank you very much! I tried it out. This is a handy tool, useful for many different applications. As for my problem: although it "thinks" these are reads (it uses this term in its reports), it works perfectly fine with transcripts too. From what I see it is trimming too much, so I have to read a bit more of the documentation and figure out how to tell the program exactly which transcript to trim at which site, but this is a very good start!

ADD REPLY • link 7 months ago by Lada ▴ 30
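
For trimming exactly the spans listed in the report, rather than letting a k-mer trimmer decide, a small script may be enough. The following is only a rough sketch, not a tested pipeline. It assumes the report has been saved as a tab-separated file with the four columns shown in the question (sequence name, length, span(s), apparent source), that spans are 1-based and inclusive, that multiple spans for one sequence are comma-separated, and that the report's sequence name matches the first word of the FASTA header. The function names (`read_fasta`, `read_spans`, `trim_sequence`, `trim_fasta`) are made up for this example. Spans that touch neither end of a transcript are left untouched and returned for manual review, since how to handle internal contamination is a judgment call.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_spans(handle):
    """Parse the report (assumed tab-separated) into {name: [(start, end), ...]}."""
    spans = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        # Skip the header row and anything without a start..end span.
        if len(fields) < 3 or ".." not in fields[2]:
            continue
        for part in fields[2].split(","):
            start, end = part.strip().split("..")
            spans.setdefault(fields[0].strip(), []).append((int(start), int(end)))
    return spans


def trim_sequence(seq, spans):
    """Cut spans that touch either end; return (trimmed_seq, internal_spans).

    Internal spans are reported in the ORIGINAL coordinates and not removed.
    """
    left, right = 0, len(seq)
    not_at_start = []
    for start, end in sorted(spans):
        if start <= left + 1:          # span begins at the current 5' end
            left = max(left, end)
        else:
            not_at_start.append((start, end))
    internal = []
    for start, end in sorted(not_at_start, key=lambda s: s[1], reverse=True):
        if end >= right:               # span reaches the current 3' end
            right = min(right, start - 1)
        else:
            internal.append((start, end))
    if left >= right:                  # nothing left after trimming
        return "", []
    return seq[left:right], sorted(internal)


def trim_fasta(fasta_in, report_in, fasta_out):
    """Write trimmed records to fasta_out; return {name: internal_spans}."""
    spans = read_spans(report_in)
    flagged = {}
    for header, seq in read_fasta(fasta_in):
        name = (header.split() or [""])[0]
        trimmed, internal = trim_sequence(seq, spans.get(name, []))
        if internal:
            flagged[name] = internal
        if trimmed:
            # The header is copied as-is, so a Trinity "len=" field goes stale.
            fasta_out.write(f">{header}\n{trimmed}\n")
    return flagged
```

It could be called along these lines (the file names are placeholders):

```python
with open("trinity.fa") as fa, open("report.tsv") as rep, open("trimmed.fa", "w") as out:
    needs_review = trim_fasta(fa, rep, out)
print(needs_review)  # transcripts with spans away from both ends
```

Trimmed transcripts may fall below the minimum length the archive accepts, so it is probably worth checking lengths afterwards.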