Hi,
First of all, I am not a bioinformatician or computational person; I am a molecular biologist. I have some fastq.gz files, and the FastQC report I generated says that I have a lot of overrepresented sequences (> 50%!). This is probably due to the nature of the genome: there is no reference genome, and my guess is that the restriction enzyme used favoured the sequencing of repetitive elements (??).
What I am interested in is knowing how many unique sequences/reads have a certain number of copies (coverage), to see how many unique reads have a moderate number of copies (20-50 copies). Any idea how to get this information? Would it be possible with a command? Could I produce a txt file with two columns, one with the number of unique sequences and a second with the number of copies of those unique sequences? In other words, what I am trying to get is the distribution of copy numbers across unique sequences.
Thanks in advance,
Ángela
Please take a look at these blog posts from the authors of FastQC.
Duplication: https://sequencing.qcfail.com/articles/libraries-can-contain-technical-duplication/
Positional bias: https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/
You could de-duplicate these data if you want to count copies of reads with identical sequences (see: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files). You would use the program along these lines; the in=/out= file names below are placeholders for your own files:
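```
clumpify.sh in=your_reads.fastq.gz out=deduped.fastq.gz dedupe=t addcount=t
```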
Reads that absorbed duplicates will have "copies=X" appended to the end of their fastq headers to indicate how many reads they represent (including themselves, so the minimum you would see is 2).
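To turn those headers into the two-column txt file you described, one option is to tally the copies=X values afterwards, e.g. with a sketch like the one below. It assumes the count appears as copies=N in the header and treats untagged reads as singletons (copies = 1); the output file name is just an example.

```
zcat deduped.fastq.gz | awk '
    NR % 4 == 1 {                                    # fastq headers are every 4th line
        n = 1                                        # assume untagged reads are singletons
        if (match($0, /copies=[0-9]+/))              # look for the copies=N tag
            n = substr($0, RSTART + 7, RLENGTH - 7)  # extract N ("copies=" is 7 chars)
        hist[n]++                                    # count unique reads per copy number
    }
    END {
        for (n in hist) print hist[n] "\t" n         # unique sequences <tab> copy number
    }' | sort -k2,2n > copy_distribution.txt
```

Each line of copy_distribution.txt then gives the number of unique sequences (column 1) that occur with a given copy number (column 2), i.e. the distribution you are after.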