Dear All,
I'm analyzing ChIP-seq data, and I'm having some trouble filtering out the "good" reads. Briefly, I have a fastq file, from which I selected the reads that have the 5' barcode sequence with no mismatches. Because the barcode sequence was not unique enough, the reads aligned well even with the barcode still attached.
I'm trying to keep only the reads that carry the artificial barcode. So I aligned the barcoded reads and the barcode-trimmed reads separately to the hg19 genome, allowing only exact matches. To get the reads whose 5' barcode is not endogenous, I need to remove the reads that align exactly with the barcode from the reads that align exactly after the barcode is trimmed.
Is there an easy way to do this? I'm a bit confused.
Thanks,
Laszlo
I think you're not the only one confused... Can you make your question clearer? (an example maybe?)
So, the goal is to retain the reads in a fastq file that have a non-endogenous eight base pairs at the 5' end. The first step is to create a fastq file that contains only reads with the 5' barcode (BC). The next step is to align these reads both with the 5' barcode and with the barcode trimmed, allowing only perfect matches. If you take the trimmed reads that aligned and remove the IDs of the reads that also aligned with the BC, you get rid of the endogenous "barcode" sequences.
My problem is how to remove those reads. I have managed to gather all the read IDs that I want to keep.
I don't understand why some reads would have the BC and some wouldn't. Shouldn't they all contain the barcode?
If you have a list of IDs that you want to extract from a SAM file, you can do it with a simple script, or probably with Galaxy.
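For the script route, something along these lines might work. It is only a sketch in Python, assuming the IDs are stored one per line and match the first column (QNAME) of the SAM records exactly, so without the leading "@" or any trailing description from the fastq header. The function names read_ids and filter_sam are made up for illustration.

```python
def read_ids(lines):
    """Collect read IDs from an iterable of lines, one ID per line."""
    ids = set()
    for line in lines:
        name = line.strip()
        if name:  # skip blank lines
            ids.add(name)
    return ids


def filter_sam(sam_lines, keep_ids):
    """Yield SAM header lines plus the records whose QNAME is in keep_ids."""
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        qname = line.split("\t", 1)[0]  # QNAME is the first tab-separated field
        if qname in keep_ids:
            yield line


if __name__ == "__main__":
    # Tiny in-memory example; the read names and records are invented.
    ids = read_ids(["read_2\n", "\n"])
    sam = [
        "@HD\tVN:1.0\tSO:unsorted\n",
        "read_1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
        "read_2\t0\tchr1\t200\t255\t4M\t*\t0\t0\tTTGA\tIIII\n",
    ]
    for record in filter_sam(sam, ids):
        print(record, end="")
```

With real data you would pass open file handles instead of the lists, for example filter_sam(open("trimmed.sam"), read_ids(open("keep_ids.txt"))), where both file names are placeholders. Keeping the IDs in a set makes each lookup cheap, which matters when there are millions of reads.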