How To Sort Two Mate Pair (Fastq) Files So That The Order Of The Identifiers Is The Same?
13.3 years ago
Steffi ▴ 580

If a read is present in only one of the two files, I would like to remove it from that file and store its ID somewhere.

fastq paired sort • 17k views
13.3 years ago

Linearize your two fastq files with awk, add a new first column holding a common "key" (here the read name before the "/"), and sort on that key:

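# 1st awk: put each 4-line record on one tab-separated line
# 2nd awk: prepend the key (header text before "/") as the first column
gunzip -c file1.fastq.gz |\
awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t");} }' |\
awk '{i=index($1,"/"); printf("%s\t%s\n",substr($1,1,i-1),$0);}' |\
sort -t "$(printf '\t')" -k1,1 > sorted1.txt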

#same for  file2.fastq.gz
(..) > sorted2.txt

Join both files on the key column with the Unix join command (the field separator is a tab):

join -t "$(printf '\t')" -1 1 -2 1 sorted1.txt sorted2.txt

and recreate the two fastq files with cut and awk.


Thanks a lot for your help! The code worked perfectly after I changed the following options: I used just "sort -k1,1" and "join -1 1 -2 1 sorted1.txt sorted2.txt" (so both without the "-t" option). Could you also give me a hint on how to recreate the two fastq files with cut and awk? Unfortunately I am not yet very fluent with Linux shell commands. One additional question: what happens if an ID is present in one file but not in the other?
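On the last question: by default, join prints only the lines whose key occurs in both inputs, so a read without a mate is dropped from the joined output. Its ID can still be recovered with join's -v option, which prints the unpairable lines of one input instead. The following is a minimal sketch, assuming both linearized files are tab-separated, have the key in column 1, and are sorted on it the way join expects; the function name unpaired_ids is made up for illustration.

```shell
# Print the keys (column 1) of lines that have no partner in the other file.
# Assumption: both inputs are tab-separated and sorted on column 1.
unpaired_ids() {
    # $1: which input's orphans to report (1 or 2)
    # $2, $3: the two sorted, linearized files
    # -v N makes join print the unpairable lines of input N only.
    join -t "$(printf '\t')" -v "$1" "$2" "$3" | cut -f 1
}
```

For example, `unpaired_ids 1 sorted1.txt sorted2.txt > only_in_1.ids` would store the IDs found only in the first file, and `unpaired_ids 2 sorted1.txt sorted2.txt > only_in_2.ids` those found only in the second (the .ids file names are placeholders).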

13.3 years ago
Steffi ▴ 10

Following up on Pierre's post: once you have your joined and sorted file, the two mate-pair files can be recreated as follows (this is what I did):

awk '{print $2"\n"$3"\n"$4"\n"$5}' joined_sorted_file > mate1_sort

awk '{print $6"\n"$7"\n"$8"\n"$9}' joined_sorted_file > mate2_sort

I am not sure if this is the most efficient way to do so but at least it works :)


or you can replace awk by tr "\t" "\n"


or you can replace awk by cut -f 2,3,4,5 | tr "\t" "\n"
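Spelled out for both mates, the cut/tr variant might look like the sketch below. It assumes the joined file has exactly nine columns separated by single tabs (key, four columns for mate 1, four for mate 2); split_joined is a made-up helper name.

```shell
# Split a joined, tab-separated file back into two 4-line-per-read FASTQ files.
# Assumption: column 1 is the key, columns 2-5 are mate 1, columns 6-9 are mate 2.
split_joined() {
    # $1: joined file; $2, $3: output files for mate 1 and mate 2
    cut -f 2-5 "$1" | tr '\t' '\n' > "$2"
    cut -f 6-9 "$1" | tr '\t' '\n' > "$3"
}
```

Unlike awk's default whitespace splitting, cut only splits on tabs, so headers that contain spaces stay intact.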

13.3 years ago
Zhidkov ▴ 600

Hi,

go through this: Selecting Random Pairs From Fastq?

Ilia


Hi, the idea was to use the sorting approach from the thread above: convert the fastq to one line per read (tabular), then remove the '#/1' suffix in the seq1 library and the '#/2' suffix in the seq2 library. Then sort each file on the first column (the read ID), add the '#/1' and '#/2' suffixes back to the end of the first column, and convert the tabular fastq back to the original four-lines-per-read format.

Ilia
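The steps described above could be sketched roughly as follows. This is only an illustration under several assumptions that are not from the thread: every record has exactly four lines, no line contains a tab, and every header line ends with the given mate suffix. The function name sort_fastq_by_id is made up, and paste is used here for the linearization step. Note that this only sorts each file; it does not remove reads that lack a mate.

```shell
# Sort one FASTQ file by read ID: linearize, strip the mate suffix,
# sort on the ID, then restore the suffix and the 4-line layout.
# Assumptions: 4 lines per record, no tabs in the data,
# and every header ends with the suffix given as $2.
sort_fastq_by_id() {
    # $1: FASTQ file; $2: mate suffix, e.g. "#/1" or "/1"
    paste - - - - < "$1" |
    awk -F '\t' -v OFS='\t' -v suf="$2" '{
        # drop the suffix from the end of the header column
        $1 = substr($1, 1, length($1) - length(suf))
        print
    }' |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 |
    awk -F '\t' -v suf="$2" '{
        # put the suffix back and print one field per line
        print $1 suf; print $2; print $3; print $4
    }'
}
```

It would be called once per library, for example `sort_fastq_by_id seq1.fastq "#/1" > seq1.sorted.fastq` and `sort_fastq_by_id seq2.fastq "#/2" > seq2.sorted.fastq` (file names are placeholders).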


Actually, I do not want to select a random subset. I want to check whether the sequences are in the same order and, if they are not, reorder them so that the first read in the first file is the mate of the first read in the second file. Of course, I could do this within BioC, but if the fastq files are large, it takes quite a while to process them.
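A quick order check from the shell, before doing any sorting, might look like the sketch below. It assumes uncompressed files with four lines per record, no tabs inside header lines, and that the read ID is the part of the header before the first "/" or space; same_read_order is a made-up name.

```shell
# Exit status 0 if both FASTQ files list the same read IDs in the same order,
# non-zero otherwise (including when one file has more records).
# Assumption: the read ID is the header text before the first "/" or space.
same_read_order() {
    # $1, $2: the two FASTQ files
    paste "$1" "$2" |
    awk -F '\t' '
        NR % 4 == 1 {                 # header lines only
            split($1, a, "[ /]")      # a[1] = ID from file 1
            split($2, b, "[ /]")      # b[1] = ID from file 2
            if (a[1] != b[1]) { bad = 1; exit }
        }
        END { exit bad }'
}
```

It can be used as `same_read_order mate1.fastq mate2.fastq && echo "already in the same order"`, so the slower sort-and-join step is only needed when the check fails.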
