Question

How To Sort Two Mate Pair (Fastq) Files So That The Order Of The Identifiers Is The Same?

7

14.1 years ago

Steffi ▴ 590

If there are reads that are present in only one of the two files, I would like to remove them from that file and store their IDs somewhere.

fastq paired sort • 18k views

updated 14.1 years ago by Steffi ▴ 10 • written 14.1 years ago by Steffi ▴ 590

score 5 · Answer 1 · 2011-09-28

5

14.1 years ago

Pierre Lindenbaum 166k

Linearize your two FASTQ files with awk, add a new column holding a common "key" (here, the read name before the "/"), and sort on that key:

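# Put each 4-line record on one tab-separated line, prepend the read name
# (the text before "/") as a key column, then sort on that key.
# $'\t' is a literal tab (bash, ksh, zsh).
gunzip -c file1.fastq.gz |\
awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t");} }' |\
awk '{i=index($1,"/"); printf("%s\t%s\n",substr($1,1,i-1),$0);}' |\
sort -t $'\t' -k1,1 > sorted1.txt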

#same for  file2.fastq.gz
(..) > sorted2.txt

Join both files with Unix join:

# $'\t' is a literal tab (bash, ksh, zsh)
join -t $'\t' -1 1 -2 1 sorted1.txt sorted2.txt

Then recreate the two FASTQ files with cut and awk.

14.1 years ago by Pierre Lindenbaum 166k

0

Thanks a lot for your help! The code worked perfectly after changing the following options: I just used "sort -k1,1" and "join -1 1 -2 1 sorted1.txt sorted2.txt" (so both without the "-t" option). Could you maybe also give me a hint on how to recreate the two fastq files with cut and awk? Unfortunately, I am not yet very fluent with Linux shell commands. One additional question: what would happen if one file contains an ID that is not present in the other file?

14.1 years ago by Steffi ▴ 590
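
On the question of IDs present in only one file: by default `join` prints only keys found in both inputs, so unpaired reads are dropped from the joined output. The following is a minimal sketch of how the orphan IDs could be stored with `join -v`. It assumes tab-separated files with the key in column 1, as in the answer above. The toy records and the output names `joined.txt`, `unpaired1.ids` and `unpaired2.ids` are made up for illustration.

```shell
# Toy stand-ins for sorted1.txt / sorted2.txt: key, header, sequence, "+", quality.
printf '@r1\t@r1/1\tACGT\t+\tIIII\n@r2\t@r2/1\tGGCC\t+\tIIII\n' > sorted1.txt
printf '@r2\t@r2/2\tTTAA\t+\tIIII\n@r3\t@r3/2\tCCGG\t+\tIIII\n' > sorted2.txt

# A literal tab, obtained portably (no bash-only $'\t' needed).
TAB=$(printf '\t')

# Reads present in both files (this is all that a plain join keeps).
join -t "$TAB" -1 1 -2 1 sorted1.txt sorted2.txt > joined.txt

# -v N prints only the lines of file N that have no partner; keep just the key.
join -t "$TAB" -v 1 sorted1.txt sorted2.txt | cut -f 1 > unpaired1.ids
join -t "$TAB" -v 2 sorted1.txt sorted2.txt | cut -f 1 > unpaired2.ids
```

With these toy inputs, only `@r2` is paired, `@r1` ends up in `unpaired1.ids`, and `@r3` ends up in `unpaired2.ids`.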

score 1 · Answer 2 · 2011-09-28

1

14.1 years ago

Steffi ▴ 10

Following up on Pierre's post: once I had the joined, sorted file, I recreated my two mate-pair files in the following way:

# Fields 2-5 hold mate 1 (header, sequence, "+", quality); fields 6-9 hold mate 2.
awk '{print $2"\n"$3"\n"$4"\n"$5}' joined_sorted_file > mate1_sort

awk '{print $6"\n"$7"\n"$8"\n"$9}' joined_sorted_file > mate2_sort

I am not sure if this is the most efficient way to do so but at least it works :)

updated 12.4 years ago by Istvan Albert 103k • written 14.1 years ago by Steffi ▴ 10

0

Or you can replace awk with tr "\t" "\n"

14.1 years ago by Pierre Lindenbaum 166k

0

Or you can replace awk with cut -f 2,3,4,5 | tr "\t" "\n"

14.1 years ago by Pierre Lindenbaum 166k
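
The `cut`/`tr` variant can be sketched on a toy joined line. This assumes the joined file is tab-separated with exactly nine fields (the key, then four fields per mate) and no empty fields. The record itself is invented for illustration.

```shell
# One toy joined record: key, mate 1 (header, seq, +, qual), mate 2 (same four).
printf '@r2\t@r2/1\tGGCC\t+\tIIII\t@r2/2\tTTAA\t+\tJJJJ\n' > joined_sorted_file

# Fields 2-5 are mate 1 and fields 6-9 are mate 2; tr turns each tab into a line break.
cut -f 2-5 joined_sorted_file | tr '\t' '\n' > mate1_sort
cut -f 6-9 joined_sorted_file | tr '\t' '\n' > mate2_sort
```

Each output file then holds one four-line FASTQ record per joined line, and line N of `mate1_sort` belongs to the same read as line N of `mate2_sort`.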

score 0 · Answer 3 · 2011-09-28

0

14.1 years ago

Zhidkov ▴ 600

Hi,

Have a look at this thread: Selecting Random Pairs From Fastq?

Ilia

updated 6.2 years ago by Ram 45k • written 14.1 years ago by Zhidkov ▴ 600

1

Hi, the idea was to use the sorting approach from the thread above: convert each FASTQ record to a single tabular line, then remove the '#/1' suffix in the seq1 library and the '#/2' suffix in the seq2 library. Next, sort each file on the first column (the read ID). After sorting, append '#/1' and '#/2' back to the end of the first column and convert the tabular FASTQ back to the original four-lines-per-read format.

Ilia

14.1 years ago by Zhidkov ▴ 600
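
The steps described above could look roughly like this for the seq1 library. It is a sketch only: `paste - - - -` is used here as one possible way to make the records tabular (the linked thread may do it differently), the file names are placeholders, and the headers are assumed to end in exactly `#/1` with no further text after them.

```shell
# Toy mate-1 file with two records in arbitrary order (IDs end in "#/1").
printf '@b#/1\nGGCC\n+\nIIII\n@a#/1\nACGT\n+\nJJJJ\n' > seq1.fastq

TAB=$(printf '\t')

# 1. paste: four lines -> one tab-separated line per read
# 2. awk:   strip the "#/1" suffix from the ID column
# 3. sort:  order by the ID column
# 4. awk:   restore the suffix and unfold back to four lines per read
paste - - - - < seq1.fastq |
awk -F '\t' -v OFS='\t' '{ sub(/#\/1$/, "", $1); print }' |
sort -t "$TAB" -k1,1 |
awk -F '\t' '{ printf "%s#/1\n%s\n%s\n%s\n", $1, $2, $3, $4 }' > seq1.sorted.fastq
```

Running the same pipeline on the seq2 library with `#/2` in place of `#/1` gives two files in the same ID order, provided both contain the same set of reads.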

0

Actually, I do not want to select a random subset. I want to check whether the sequences are in the same order and, if not, reorder them so that the first read in the first file is the mate of the first read in the second file. Of course, I could do this within BioC, but if the fastq files are large, it takes quite a while to process them.

14.1 years ago by Steffi ▴ 590
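
A quick way to check whether two mate files are already in the same order, before doing any sorting, might look like the sketch below. It assumes headers of the form `@name/1` and `@name/2` with nothing after the suffix. The file names and toy reads are placeholders.

```shell
# Toy mate files: same reads, but mate 2 is stored in a different order.
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nGGCC\n+\nIIII\n' > mate1.fastq
printf '@r2/2\nTTAA\n+\nIIII\n@r1/2\nCCGG\n+\nIIII\n' > mate2.fastq

# Header lines are every 4th line starting at line 1; drop the /1 or /2 suffix.
awk 'NR % 4 == 1 { sub(/\/[12]$/, ""); print }' mate1.fastq > ids1.txt
awk 'NR % 4 == 1 { sub(/\/[12]$/, ""); print }' mate2.fastq > ids2.txt

# cmp -s exits 0 only if both ID lists are byte-identical (same reads, same order).
if cmp -s ids1.txt ids2.txt; then
    echo "mate order: same"
else
    echo "mate order: different"
fi
```

This reads each file once and never sorts, so it stays cheap on large files. Re-sorting is needed only when the check reports a difference.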