tools for unique reads generation from fastqs
genya35 ▴ 50 · 5.8 years ago

Hello,

I would like to generate a fasta file containing unique reads and count of each read occurrence, from two Illumina fastq files (forward and reverse). The next step is to blast the unique reads and group them together based on the results of the blast search. Could someone please suggest a tool that could accomplish this?

Thanks

next-gen • 2.0k views

Are you sure you know what you are asking? It sounds like you want to extract unique reads and then count them. Please explain what you want to do, and ask specific questions if you have them.

GenoMax 147k · 5.8 years ago

You need clumpify.sh from the BBMap suite to remove/count duplicate reads. See this: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files

Once you have your set of unique fastq reads, they can easily be converted to fasta format with reformat.sh from the BBMap suite. I am reasonably certain the count numbers in the read headers are retained through the conversion.

reformat.sh in=your.fq.gz out=your.fa

@genomax At what point do you recommend combining the two fastqs into one? Thanks


If these are paired-end reads, then you should process them together with the in1= and in2= directives and capture the results in out1= and out2=. Only those reads where both R1 and R2 are identical will be considered duplicates. Remember to use addcount=t subs=0. Depending on the size of your data files, clumpify.sh can need a significant amount of memory.
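The pair-aware duplicate criterion described above (R1 and R2 must both match exactly, i.e. subs=0) can be sketched in Python. This is a toy illustration of the counting logic, not what clumpify.sh does internally, and the sequences are made up:

```python
from collections import Counter

def count_pairs(r1_seqs, r2_seqs):
    """Count occurrences of each (R1, R2) sequence pair.

    Only reads whose R1 AND R2 sequences match exactly
    (the subs=0 criterion) are treated as duplicates.
    """
    return Counter(zip(r1_seqs, r2_seqs))

# Toy data: the first two pairs are identical duplicates;
# the third differs in R1, so it is a distinct read pair.
r1 = ["ACGT", "ACGT", "ACGG"]
r2 = ["TTAA", "TTAA", "TTAA"]
counts = count_pairs(r1, r2)
```

Note that a read pair whose R2 matches but whose R1 differs is kept as a separate unique pair, which is exactly why no manual combining of the two files is needed.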


My goal is to come up with a list of unique reads with counts for the sample. I will use igblast in the next step to assign V-J usage, and later group and count them. At what point should I combine the unique reads from the two files? Thanks


If you use clumpify.sh as intended, it will keep only one best copy of each set of duplicate reads and add a count to the header showing how many there were. So you do not need to do any combining.


Is there an easy way to sort the fasta output from the most common to the least common read? I ran fastp at default settings to post-process the fastqs before I used clumpify. Do you recommend any additional post-processing? Thanks


You should run clumpify.sh on the original, un-processed data. Trimming your data may result in the loss of some information about duplication.
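On the sorting question above: one way is a small script that orders fasta records by the count in their headers. This sketch assumes addcount=t wrote a "copies=N" token into each header (check your own headers for the exact format before relying on it); records without a recognizable count default to 1:

```python
import re

def sort_fasta_by_count(fasta_text):
    """Sort fasta records from most to least common, using the
    'copies=N' token that clumpify's addcount=t is assumed to
    have placed in each header."""
    records = []
    for chunk in fasta_text.strip().split(">"):
        if not chunk:
            continue
        header, _, seq = chunk.partition("\n")
        m = re.search(r"copies=(\d+)", header)
        n = int(m.group(1)) if m else 1  # default: seen once
        records.append((n, ">" + header + "\n" + seq.strip()))
    records.sort(key=lambda t: t[0], reverse=True)
    return "\n".join(rec for _, rec in records)

# Toy fasta with hypothetical clumpify-style headers.
fasta = ">r1 copies=2\nACGT\n>r2 copies=7\nGGTT\n>r3 copies=5\nTTAA\n"
sorted_fasta = sort_fasta_by_count(fasta)
```

This loads everything into memory, which is fine for a deduplicated unique-read set but would need a streaming approach for very large files.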
