Question

How To Cluster Mate Pair, Paired End And Single End Reads From A Single File?

11.5 years ago

HG ★ 1.2k

Hi, I did Illumina mate-pair sequencing. When I checked the fastq file, I found that mate-pair, paired-end and single-end reads are all present. Can anyone tell me how to separate the mate-pair, paired-end and single-end reads from that single fastq file?

Thank you in advance.

illumina fastq • 4.6k views

updated 11.5 years ago by cts ★ 1.7k • written 11.5 years ago by HG ★ 1.2k

Is this mix of reads a result of trimming or did something go amiss during the library prep. or sequencing? Mate-pairs have a different orientation than paired-end, so you could just exploit that during mapping. If you have single-end reads in there, you'll need to explicitly remove them. There are threads elsewhere on this forum about syncing fastq files (see How to sort two mate pair (fastq) files so that the order of the identifiers is the same? and Combining the paired reads from Illumina run).

11.5 years ago by Devon Ryan 105k

It may have happened during sequencing or during library prep. I don't have much prior information, but while mapping I can see that some reads are paired-end, some are mate-pair and some are single-end. Now I want to separate all three types into three files from a single fastq file. Is that possible? Please also let me know how I can remove the single-end reads from a mixture of reads.

11.5 years ago by HG ★ 1.2k

What you're actually observing is that some of the reads align better in an orientation other than the one you expect and, for still others, one of the reads simply won't align, so the aligner just goes ahead and aligns the mate as a singleton. This doesn't mean that you have a mix of reads in the fastq file.
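
To see how the alignments actually break down, something along these lines could tally orientation classes from the FLAG, POS and PNEXT columns of a SAM stream. It is only a sketch: the function name `classify_orientation` and the class labels are made up for illustration, the bit tests use plain arithmetic so that any POSIX awk should work, and counts are per record, so each fully mapped pair is counted twice.

```shell
# Tally orientation classes for SAM records (hypothetical helper, not part of any tool).
classify_orientation() {
    awk -F'\t' '
    # returns 1 if the given FLAG bit (mask is a power of two) is set
    function bit(flag, mask) { return int(flag / mask) % 2 }
    /^@/ { next }                                   # skip header lines
    bit($2, 256) || bit($2, 2048) { next }          # skip secondary/supplementary records
    {
        if (!bit($2, 1))                          cls = "unpaired"
        else if (bit($2, 4) || bit($2, 8))        cls = "singleton_or_unmapped"
        else if ($7 != "=" && $7 != $3)           cls = "different_reference"
        else if (bit($2, 16) == bit($2, 32))      cls = "same_strand"
        else {
            # start of the forward-strand read versus start of the reverse-strand read
            fwd = (bit($2, 16) ? $8 : $4) + 0
            rev = (bit($2, 16) ? $4 : $8) + 0
            cls = (fwd <= rev) ? "FR_inward" : "RF_outward"
        }
        n[cls]++
    }
    END { for (c in n) print c "\t" n[c] }' "$@" | sort
}
```

Running `classify_orientation file.sam` prints one line per class with its record count. Inward-facing (FR) pairs are what a paired-end library normally gives and outward-facing (RF) pairs are what a mate-pair library normally gives, but treat the labels as a rough guide since overlapping or clipped reads can blur the distinction.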

ADD REPLY • link 11.5 years ago by Devon Ryan 105k

score 1 · Answer 1 · 2013-11-28

Unfortunately, Illumina mate-pair libraries always contain a mix of paired-end and mate-pair reads. You can perform some preprocessing steps to try to segregate them based on the presence of the adaptor sequence. I've used NextClip for this, but it is specific to the Nextera mate pair protocol. After that, to remove the remaining contaminants you will need to align against a reference genome and then look at the observed insert sizes. You can do this by looking at the TLEN column in a SAM file (the 9th column): if both reads of a pair are mapped, its absolute value should be roughly the insert size (the sign only tells you which read is leftmost), and if it's a single mapped read the column will be 0.
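
Before picking a cutoff it can help to look at how the insert sizes are distributed. A rough sketch, assuming a plain-text SAM file: the helper name `tlen_histogram` and the bin width argument are my own choices, and only records with a positive TLEN are counted so that each pair contributes once.

```shell
# Coarse histogram of template lengths (hypothetical helper).
# Usage: tlen_histogram BIN_WIDTH [file.sam ...]
tlen_histogram() {
    bin=$1
    shift
    awk -F'\t' -v bin="$bin" '
    /^@/ { next }                  # skip header lines
    $9 > 0 {                       # positive TLEN marks the leftmost read of a pair
        b = int($9 / bin) * bin    # lower edge of the bin this pair falls into
        n[b]++
    }
    END { for (b in n) print b "\t" n[b] }' "$@" | sort -n
}
```

For example, `tlen_histogram 500 file.sam` prints the lower edge of each 500 bp bin and the number of pairs in it. With a mate-pair library you would hope to see a small-insert peak from the paired-end contaminants and a separate large-insert peak, with the cutoff somewhere in between.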

Below is a sample awk script that will separate a SAM file into two based on the insert size of paired reads. It's not quite complete; you may need to add some extra logic to get it to work properly. I also haven't tested it, so I apologise in advance if there is a bug in it.

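awk -F'\t' '
# absolute value helper; TLEN is negative for the rightmost read of a pair
function abs(x) {
    return (x < 0) ? -x : x
}
# copy header lines to both outputs
/^@/ {
    print > "pe.sam"
    print > "mp.sam"
    next
}
{
    # change 500 to the cutoff between paired-end and mate-pair insert sizes;
    # records with TLEN 0 (singletons, unmapped reads) also land in pe.sam
    if (abs($9) < 500) {
        print > "pe.sam"
    } else {
        print > "mp.sam"
    }
}' file.sam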
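
One possible way to fill in the missing logic is to send records with a TLEN of 0 to a third file instead of letting them fall into the paired-end output. This is a sketch only: `split_by_tlen`, its argument order and the output file suffixes are assumptions of mine, and it still splits on size alone without checking orientation.

```shell
# Split a SAM file into paired-end, mate-pair and single outputs by TLEN (hypothetical helper).
# Usage: split_by_tlen CUTOFF OUTPUT_PREFIX [file.sam ...]
split_by_tlen() {
    cutoff=$1
    prefix=$2
    shift 2
    awk -F'\t' -v cutoff="$cutoff" -v prefix="$prefix" '
    function abs(x) { return (x < 0) ? -x : x }
    BEGIN {
        pe = prefix ".pe.sam"
        mp = prefix ".mp.sam"
        se = prefix ".single.sam"
    }
    /^@/ { print > pe; print > mp; print > se; next }   # copy the header to every output
    # TLEN 0: unpaired, mate unmapped, or mates on different references
    $9 == 0          { print > se; next }
    abs($9) < cutoff { print > pe; next }
                     { print > mp }' "$@"
}
```

Calling `split_by_tlen 500 out file.sam` would write `out.pe.sam`, `out.mp.sam` and `out.single.sam`. Note that an output file is only created if at least one line (header or record) is written to it.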