Trim Sequences And Quality In Fastq File
3
1
Entering edit mode
10.8 years ago
abn4 ▴ 10

I have bunch of fastq files in the directory and i want to trim the sequence by 2 nucleotides and quality(if the read has 51 base pairs and ends-with CTG or TTG).here is what i wrote as shell script but i am getting some errors,need help as i am new to shell scripting

Input:

@HWI-ST1072:187:C35YUACXX:7:1101:1609:1983 1:N:0:ACAGTG
NGGAGAAAGAGAGTGTGTTTTTAGGGGGAGATTTTTAAAATGGTTGTTTTG
+
#0<BFFFFFFFFF&lt;BFFFIIFFFFFIIIBFFFFFIIFIIIIIFFBFFFFFF
@HWI-ST1072:187:C35YUACXX:7:1101:1747:1995 1:N:0:ACAGTG
NGGTTGTGGTGGTGGGTATTTGTAGTTTTATTTATTCGGGAGGTTGAGCTG
+
#0<BFFFFFFFFFFIIBFFIIIIIIFIIIFFIIFIIIFIIFIIFFFFIIFF

output:

@HWI-ST1072:187:C35YUACXX:7:1101:1609:1983 1:N:0:ACAGTG
NGGAGAAAGAGAGTGTGTTTTTAGGGGGAGATTTTTAAAATGGTTGTTT
+
#0<BFFFFFFFFF<BFFFIIFFFFFIIIBFFFFFIIFIIIIIFFBFFFF
@HWI-ST1072:187:C35YUACXX:7:1101:1747:1995 1:N:0:ACAGTG
NGGTTGTGGTGGTGGGTATTTGTAGTTTTATTTATTCGGGAGGTTGAGC
+
#0<BFFFFFFFFFFIIBFFIIIIIIFIIIFFIIFIIIFIIFIIFFFFII

CODE:

for sample in *.fastq;
do
    name=$(echo ${sample} | sed 's/.fastq//')
    while read line;do
            if [ ${line:0:1} == "@" ] ; then
                    head="${line}"
                    $echo $head
            elif [ "${head}" ] &amp;&amp; [ "${line}" ] ; then
                    length=${#line}
                    if [ "${length}" = 51 -a "${line}" =~ *CTG|*TTG ] ; then
                            sequence= substr($line,0,49)
                            #echo $sequence
                    fi
            elif [ ${line:0:1} == "+" ] ; then
                    plus="${line}"
                    #echo $plus
            elif [ "${plus}" ] && [ "${line}" ] ; then
                    quality= substr($line,0,49)
                    #echo $quality
            fi
            print "${head}\n${sequence}\n${plus}\n${quality}" > ${name}_new.fq
    done < $sample
done
unix • 5.5k views
ADD COMMENT
0
Entering edit mode

seqtk is the best tool for this very fast and accurate

ADD REPLY
1
Entering edit mode
10.8 years ago

Indeed, there are tools to do that. But reinventing the wheel is necessary if one wants to learn how to code. Here is one way to translate your idea into shell bash code:

for sample in *.fastq ; do
    while read line ; do
        if [[ "${line:0:1}" == "@" ]] ; then
            head=${line}
        elif [[ "${line:0:1}" == "+" ]] ; then
            plus="${line}"
        elif [[ -n "${head}" && -z "${plus}" ]] ; then
            if [[ "${#line}" == 51 && "${line}" =~ TTG$|CTG$ ]] ; then
                sequence=${line:0:49}
            fi
        elif [[ -n "${head}" && -n "${plus}" ]] ; then
            if [[ "${#line}" == 51 ]] ; then
                quality=${line:0:49}
                echo -e "${head}\n${sequence}\n${plus}\n${quality}"
            fi
            head=""
            plus=""
        fi
    done < "${sample}" > ${sample/.fastq/_new.fq}
done
ADD COMMENT
0
Entering edit mode
10.8 years ago

Are you familiar with FASTX-Toolkit?

http://hannonlab.cshl.edu/fastx_toolkit/

It can handle the 2 nt trim. It can trim based upon quality scores, but this is different than your "quality" definition. That said, you can use it to remove adapters, so you can try using defining your problematic sequences as short adapters.

The main advantage is that you won't have to worry about troubleshooting because lots of people use that program. Of course, I'm sure you can write the code in our language of preference (I personally would have done it in Perl), but you may just not be able to get external help with fixing your detailed code.

ADD COMMENT
0
Entering edit mode

Thanks! I will check it

ADD REPLY
0
Entering edit mode
10.8 years ago
Chris Cole ▴ 800

There are many pre-existing trimming tools. I'd use one of them first to see if they can do what you want. From the top of my head I can think of: trimmomatic, trim galore and fastx toolkit.

Also, you don't say why you want to trim those nucleotides from your reads? Care to elaborate?

ADD COMMENT

Login before adding your answer.

Traffic: 2434 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6