Edit strings in FASTQ file
8.5 years ago
SOHAIL ▴ 410

I have a FASTQ file that contains

@SRR1101035.1.1 **1 length=100**
CCTTGCTCAGACCTTGCCTTGAACTCTTGGCTTCAAGTGATCCNNNNNNNNCGACCTCTCAAAGNGCTGAGGTNATAGGGATGAGCCACTGTGCCTGGCC
+SRR1101035.1.1 **1 length=100**
@@@FFEFDCFCFBHEHHHIIGC@EHHIIII>DEG*1@*?1?DG########(0(7<;FCGHC;=#--5A5?#############################

@SRR1101035.2.1 **2 length=100**
AGCCAATCTGAAAAGGTTACATACTNNANNNTNTNAANNNNNNNNNNNNNNNNAAAAGGCNNNNNTGNNNNNNNNGTGAAAAGATCAGTGGTTGCCAGGG
+SRR1101035.2.1 **2 length=100**
<@@FFDEFHHD>?FEGHIEI>HEHI##########################################################################

I am new to this. Could you please tell me how to remove the trailing characters from the header lines (the ones surrounded by **s)? The expected result would be:

@SRR1101035.1.1 
CCTTGCTCAGACCTTGCCTTGAACTCTTGGCTTCAAGTGATCCNNNNNNNNCGACCTCTCAAAGNGCTGAGGTNATAGGGATGAGCCACTGTGCCTGGCC
+SRR1101035.1.1 
@@@FFEFDCFCFBHEHHHIIGC@EHHIIII>DEG*1@*?1?DG########(0(7<;FCGHC;=#--5A5?#############################

@SRR1101035.2.1
AGCCAATCTGAAAAGGTTACATACTNNANNNTNTNAANNNNNNNNNNNNNNNNAAAAGGCNNNNNTGNNNNNNNNGTGAAAAGATCAGTGGTTGCCAGGG
+SRR1101035.2.1 
<@@FFDEFHHD>?FEGHIEI>HEHI##########################################################################
Linux • 3.6k views
8.5 years ago
Prakki Rama ★ 2.7k

The following command should remove the unwanted text from the header lines. Note that it edits test.fastq in place and only matches headers that end in length=100:

  sed -i 's/\([0-9]*\) length=100//g' test.fastq

Output:

$ sed -i 's/\([0-9]*\) length=100//g' test.fastq

@SRR1101035.1.1
CCTTGCTCAGACCTTGCCTTGAACTCTTGGCTTCAAGTGATCCNNNNNNNNCGACCTCTCAAAGNGCTGAGGTNATAGGGATGAGCCACTGTGCCTGGCC
+SRR1101035.1.1
@@@FFEFDCFCFBHEHHHIIGC@EHHIIII>DEG*1@*?1?DG########(0(7<;FCGHC;=#--5A5?#############################
@SRR1101035.2.1
AGCCAATCTGAAAAGGTTACATACTNNANNNTNTNAANNNNNNNNNNNNNNNNAAAAGGCNNNNNTGNNNNNNNNGTGAAAAGATCAGTGGTTGCCAGGG
+SRR1101035.2.1
<@@FFDEFHHD>?FEGHIEI>HEHI##########################################################################
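If the reads do not all have the same length, the hard-coded `length=100` will miss some headers. Below is a minimal sketch of a more general pattern, assuming every header ends in a space, a number, and `length=<number>` as in the question. The file names `sample.fastq` and `sample.clean.fastq`, and the shortened reads, are made up for illustration.

```shell
# Build a tiny two-record test file (hypothetical name and toy reads).
printf '%s\n' \
  '@SRR1101035.1.1 1 length=6' \
  'CCTTGC' \
  '+SRR1101035.1.1 1 length=6' \
  '@@@FFE' \
  '@SRR1101035.2.1 2 length=5' \
  'AGCCA' \
  '+SRR1101035.2.1 2 length=5' \
  '<@@FF' > sample.fastq

# Remove " <number> length=<number>" at the end of a line, including the
# leading space. Lines without that suffix are left unchanged.
# Writing to a new file keeps the original untouched.
sed 's/ [0-9][0-9]* length=[0-9][0-9]*$//' sample.fastq > sample.clean.fastq

cat sample.clean.fastq
```

Unlike `sed -i`, this leaves the input file as it was, which makes it easier to compare the result before replacing anything.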

Edited my post to be more specific to the original question.

8.5 years ago
piet ★ 1.9k

could you please tell me how to remove the last characters

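Please run fastq-dump with the command line argument '-F' (--origfmt), which writes only the original sequence name to the header line. This prevents the "bold characters" from being created in the first place. The '-Z' option sends the output to standard output so that it can be piped; without it, fastq-dump writes to a file.

fastq-dump -F -Z SRR1101035 | less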

see also this post

8.5 years ago

Try this command. It fixes your problem and also gets rid of the redundant information in the third line of each record, which does nothing except inflate your files:

cat your_original_fastq_file | paste - - - - | awk -v OFS="\t" '{print $1,$4,"+",$8}' | tr "\t" "\n" > new_fastq_file

Hi Antonio Franco, could you please explain your command?


Of course.

cat your_original_fastq_file

This prints the content of your file so that it can be fed through the pipes ("|") that follow.

Keep in mind that every record in a FASTQ file always consists of 4 separate lines.

paste - - - -

joins each group of 4 lines into a single tab-separated line. This makes the awk step that follows easier, because each record is now one line and awk no longer has to keep track of line breaks or line counts. It is a nice trick that I learned from this forum.

In awk you can define the output field separator. You do so with -v, which assigns a variable before the program runs. Here the variable is OFS, which stands for Output Field Separator, and "\t" sets it to a tab. You can also set OFS inside a BEGIN block instead of using -v.
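The two ways of setting OFS can be compared side by side. This is only a small sketch; the input string `a b c` and the variable names `with_v` and `with_begin` are made up for illustration.

```shell
# Set OFS with -v on the command line...
with_v=$(printf 'a b c\n' | awk -v OFS="\t" '{ print $1, $3 }')

# ...or inside a BEGIN block, which runs before any input is read.
with_begin=$(printf 'a b c\n' | awk 'BEGIN { OFS = "\t" } { print $1, $3 }')

# Both variants print the first and third field joined by a tab.
printf '%s\n' "$with_v" "$with_begin"
```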

At this point, picture the whole FASTQ record sitting on a single line. By default awk splits that line on any run of spaces or tabs, so each space-separated piece of the header lines becomes a field of its own.

Then you ask awk to print fields $1 (the read name), $4 (the sequence), the character "+", and finally $8 (the quality line). It is essential to write a comma "," between the items, because that is what makes awk insert the output separator (a tab) between them.

Fields $2 and $3 are the 1 or 2 and the length=??? items. The "+" line repeats the same three pieces as $5, $6 and $7, which is why the quality string ends up in $8.
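To see where each field number comes from, you can print all fields of one joined record. This is a sketch on a shortened, made-up record (the real reads are 100 bases long); `record` and `fields` are hypothetical variable names.

```shell
# One FASTQ record as "paste - - - -" delivers it: four lines joined by tabs.
record=$(printf '%s\n' \
  '@SRR1101035.1.1 1 length=6' \
  'CCTTGC' \
  '+SRR1101035.1.1 1 length=6' \
  '@@@FFE' | paste - - - -)

# Print every field together with its number. With the default field
# splitting, spaces and tabs both separate fields, so each of the two
# header-style lines contributes three fields.
fields=$(printf '%s\n' "$record" |
  awk '{ for (i = 1; i <= NF; i++) print "$" i " = " $i }')

printf '%s\n' "$fields"
```

The listing shows the read name in $1, the sequence in $4, and the quality string in $8, with the unwanted header pieces in between.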

The third line of a FASTQ record needs the "+" character, but nothing else. You can sometimes make a multi-GB FASTQ file substantially smaller by erasing the extra content of this third line.

The last step is "tr", a standard Unix tool, which here replaces each tab ("\t") with a newline ("\n"). This restores the original four-line FASTQ layout.

The final part, starting with >, is easy to understand: it writes the modified content to a new file.
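As an alternative to the paste/awk/tr pipeline explained above, the same trimming can probably be done in a single awk call that looks at the position of each line within its record. This is a different technique, sketched under the assumption that the file has strict four-line records with no blank lines in between; the file names and the shortened reads are made up.

```shell
# Build a tiny two-record test file (hypothetical name and toy reads).
printf '%s\n' \
  '@SRR1101035.1.1 1 length=6' \
  'CCTTGC' \
  '+SRR1101035.1.1 1 length=6' \
  '@@@FFE' \
  '@SRR1101035.2.1 2 length=5' \
  'AGCCA' \
  '+SRR1101035.2.1 2 length=5' \
  '<@@FF' > reads.fastq

awk '
  NR % 4 == 1 { print $1; next }   # header line: keep only the read name
  NR % 4 == 3 { print "+"; next }  # separator line: a bare plus sign is enough
  { print }                        # sequence and quality lines pass through
' reads.fastq > reads.trimmed.fastq

cat reads.trimmed.fastq
```

Because the sequence and quality lines are never split into fields here, this variant does not depend on how many space-separated items the header contains.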
