Question

Trim Sequences And Quality In Fastq File

1

Entering edit mode

10.7 years ago

abn4 ▴ 10

I have bunch of fastq files in the directory and i want to trim the sequence by 2 nucleotides and quality(if the read has 51 base pairs and ends-with CTG or TTG).here is what i wrote as shell script but i am getting some errors,need help as i am new to shell scripting

Input:

@HWI-ST1072:187:C35YUACXX:7:1101:1609:1983 1:N:0:ACAGTG
NGGAGAAAGAGAGTGTGTTTTTAGGGGGAGATTTTTAAAATGGTTGTTTTG
+
#0<BFFFFFFFFF&lt;BFFFIIFFFFFIIIBFFFFFIIFIIIIIFFBFFFFFF
@HWI-ST1072:187:C35YUACXX:7:1101:1747:1995 1:N:0:ACAGTG
NGGTTGTGGTGGTGGGTATTTGTAGTTTTATTTATTCGGGAGGTTGAGCTG
+
#0<BFFFFFFFFFFIIBFFIIIIIIFIIIFFIIFIIIFIIFIIFFFFIIFF

output:

@HWI-ST1072:187:C35YUACXX:7:1101:1609:1983 1:N:0:ACAGTG
NGGAGAAAGAGAGTGTGTTTTTAGGGGGAGATTTTTAAAATGGTTGTTT
+
#0<BFFFFFFFFF<BFFFIIFFFFFIIIBFFFFFIIFIIIIIFFBFFFF
@HWI-ST1072:187:C35YUACXX:7:1101:1747:1995 1:N:0:ACAGTG
NGGTTGTGGTGGTGGGTATTTGTAGTTTTATTTATTCGGGAGGTTGAGC
+
#0<BFFFFFFFFFFIIBFFIIIIIIFIIIFFIIFIIIFIIFIIFFFFII

CODE:

for sample in *.fastq;
do
    name=$(echo ${sample} | sed 's/.fastq//')
    while read line;do
            if [ ${line:0:1} == "@" ] ; then
                    head="${line}"
                    $echo $head
            elif [ "${head}" ] &amp;&amp; [ "${line}" ] ; then
                    length=${#line}
                    if [ "${length}" = 51 -a "${line}" =~ *CTG|*TTG ] ; then
                            sequence= substr($line,0,49)
                            #echo $sequence
                    fi
            elif [ ${line:0:1} == "+" ] ; then
                    plus="${line}"
                    #echo $plus
            elif [ "${plus}" ] && [ "${line}" ] ; then
                    quality= substr($line,0,49)
                    #echo $quality
            fi
            print "${head}\n${sequence}\n${plus}\n${quality}" > ${name}_new.fq
    done < $sample
done

unix • 5.5k views

ADD COMMENT • link updated 6.7 years ago by Ram 44k • written 10.7 years ago by abn4 ▴ 10

0

Entering edit mode

seqtk is the best tool for this very fast and accurate

ADD REPLY • link 10.7 years ago by ancient_learner ▴ 680

Ram · Answer 1 · 2014-02-08

Indeed, there are tools to do that. But reinventing the wheel is necessary if one wants to learn how to code. Here is one way to translate your idea into shell bash code:

for sample in *.fastq ; do
    while read line ; do
        if [[ "${line:0:1}" == "@" ]] ; then
            head=${line}
        elif [[ "${line:0:1}" == "+" ]] ; then
            plus="${line}"
        elif [[ -n "${head}" && -z "${plus}" ]] ; then
            if [[ "${#line}" == 51 && "${line}" =~ TTG$|CTG$ ]] ; then
                sequence=${line:0:49}
            fi
        elif [[ -n "${head}" && -n "${plus}" ]] ; then
            if [[ "${#line}" == 51 ]] ; then
                quality=${line:0:49}
                echo -e "${head}\n${sequence}\n${plus}\n${quality}"
            fi
            head=""
            plus=""
        fi
    done < "${sample}" > ${sample/.fastq/_new.fq}
done

score 0 · Answer 2 · 2014-02-06

Are you familiar with FASTX-Toolkit?

http://hannonlab.cshl.edu/fastx_toolkit/

It can handle the 2 nt trim. It can trim based upon quality scores, but this is different than your "quality" definition. That said, you can use it to remove adapters, so you can try using defining your problematic sequences as short adapters.

The main advantage is that you won't have to worry about troubleshooting because lots of people use that program. Of course, I'm sure you can write the code in our language of preference (I personally would have done it in Perl), but you may just not be able to get external help with fixing your detailed code.

score 0 · Answer 3 · 2014-02-06

0

Entering edit mode

10.7 years ago

Chris Cole ▴ 800

There are many pre-existing trimming tools. I'd use one of them first to see if they can do what you want. From the top of my head I can think of: trimmomatic, trim galore and fastx toolkit.

Also, you don't say why you want to trim those nucleotides from your reads? Care to elaborate?

ADD COMMENT • link 10.7 years ago by Chris Cole ▴ 800