Question

Filter fastq/sam/bam for reads

0

10.4 years ago

hlsz.laszlo ▴ 50

Dear All,

I'm analyzing ChIP-seq data, and I'm having some trouble filtering out the "good" reads. Briefly, I have a fastq file, from which I selected the reads that have the 5' barcode sequence with no mismatch. Because the barcode sequence was not unique enough, some reads aligned well even with the barcode attached.

I'm trying to isolate the reads that carry an artificial barcode. So I aligned the barcoded reads and the barcode-trimmed reads separately to the hg19 genome, requiring an exact match. Then, to get the reads whose 5' barcode is not endogenous, I need to remove the reads that aligned exactly with the barcode from the reads that aligned exactly after trimming.

Is there an easy way to do this? I'm a bit confused.

Thanks,
Laszlo

reads ChIP-Seq filtering • 3.8k views

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by hlsz.laszlo ▴ 50

0

I think you're not the only one confused... Can you make your question clearer? (an example maybe?)

ADD REPLY • link 10.4 years ago by Asaf 10k

0

So, the goal is to retain the reads in a fastq file whose eight base pairs at the 5' end are not endogenous. The first step is to create a fastq file that contains only reads with the 5' barcode. The next step is to align that fastq both with and without the 5' barcode sequence (BC trimmed), allowing only perfect matches. If you take the trimmed reads that aligned and drop the IDs of the reads that also aligned with the BC, you get rid of the endogenous "barcode" sequences.

My problem is how to remove those reads... I managed to gather all read IDs that I want to keep.

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by hlsz.laszlo ▴ 50

0

I don't understand why some reads would have the BC and some wouldn't. Shouldn't they all contain the barcode?

If you have a list of IDs that you want to extract from a SAM file, you can do it with a simple script, or probably with Galaxy.
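
A minimal sketch of such a script, assuming a plain-text SAM input and a set of read IDs to keep. The function name `filter_sam_by_ids` and the toy records are made up for illustration:

```python
import io

def filter_sam_by_ids(sam_lines, keep_ids):
    """Yield SAM header lines plus alignments whose read name is in keep_ids."""
    keep = set(keep_ids)
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        qname = line.split("\t", 1)[0]  # column 1 of SAM is the read name
        if qname in keep:
            yield line

# Toy SAM content (hypothetical reads), standing in for a real file handle.
sam = io.StringIO(
    "@HD\tVN:1.6\tSO:unsorted\n"
    "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n"
    "read2\t0\tchr1\t200\t60\t10M\t*\t0\t0\tTTTTACGTAC\tIIIIIIIIII\n"
)
kept = list(filter_sam_by_ids(sam, {"read2"}))
print("".join(kept), end="")
```

For a real file, pass an open file handle instead of the `io.StringIO` object and write the yielded lines to a new SAM file.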

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by Asaf 10k

Ram · Answer 1 · 2014-07-08

1

10.4 years ago

Istvan Albert 102k

First I'll say that this really does not sound quite right.

It is very unlikely that you could fully align reads that contain a barcode. Even though, say, a six-base-long k-mer is not that unique on its own, when paired with an existing location in the genome it forms a very unique construct that is very unlikely to match exactly. If you are aligning partially (locally), then it is a different issue altogether, but those alignments will be more difficult to interpret correctly.

(IMHO, if you can fully align your reads, it means that they don't actually have a barcode there.)

In general, when splitting by barcode, you need to identify the barcodes in the reads and split by those, not by aligning with or without barcodes.
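
As a rough sketch of what splitting by barcode can look like: check the 5' end of each read for an exact barcode match and trim it off. The barcode `ACGTAC`, the function name, and the reads are all hypothetical, and reads are assumed to be `(name, sequence, quality)` tuples:

```python
def split_by_barcode(records, barcode):
    """Split reads into barcoded (barcode trimmed off) and non-barcoded lists."""
    with_bc, without_bc = [], []
    n = len(barcode)
    for name, seq, qual in records:
        if seq.startswith(barcode):
            # Trim the barcode from both the sequence and the quality string.
            with_bc.append((name, seq[n:], qual[n:]))
        else:
            without_bc.append((name, seq, qual))
    return with_bc, without_bc

BARCODE = "ACGTAC"  # hypothetical barcode, not taken from the question
reads = [
    ("r1", "ACGTACGGGTTTAA", "IIIIIIIIIIIIII"),
    ("r2", "TTGTACGGGTTTAA", "IIIIIIIIIIIIII"),
]
barcoded, unbarcoded = split_by_barcode(reads, BARCODE)
print(barcoded)
print(unbarcoded)
```

This only handles exact matches at the very start of the read; allowing mismatches would need a different comparison.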

As for the answer to your question: search for "extract fastq" on this site, and you'll get hits like this:

How To Extracting Fastq Sequence For Given Fastq Ids And Fastq File
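
For reference, one possible way to do this in plain Python, assuming a standard four-line-per-record FASTQ (no wrapped sequence lines) and read IDs without the leading `@`. The function names and toy reads are illustrative only:

```python
import io
from itertools import islice

def read_fastq(handle):
    """Yield (header, seq, plus, qual) tuples from a 4-line-per-record FASTQ."""
    while True:
        block = list(islice(handle, 4))
        if len(block) < 4:
            return
        yield tuple(line.rstrip("\n") for line in block)

def extract_fastq(handle, keep_ids):
    """Yield FASTQ records (as text) whose read ID is in keep_ids."""
    keep = set(keep_ids)
    for header, seq, plus, qual in read_fastq(handle):
        fields = header[1:].split()
        # The read ID is the header up to the first whitespace, minus '@'.
        read_id = fields[0] if fields else ""
        if read_id in keep:
            yield f"{header}\n{seq}\n{plus}\n{qual}\n"

# Toy FASTQ content (hypothetical reads), standing in for a real file handle.
fq = io.StringIO("@r1 1:N:0\nACGT\n+\nIIII\n@r2 1:N:0\nTTTT\n+\nIIII\n")
selected = list(extract_fastq(fq, ["r2"]))
print("".join(selected), end="")
```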

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by Istvan Albert 102k

1

Hi,

Sorry if I wasn't fully clear. The barcode ligation (this is not the Illumina adapter or index) was very inefficient. Of the reads in the raw fastq file (~20 M reads, 100 bp), only a minority (~3 M) contain the barcode. Moreover, this barcode does not seem to be unique: ~1 M reads aligned perfectly to hg19 with the barcode still attached, and this is the group I want to remove from my fastq. I know that the remaining read count is low, but it is worth trying.

I collected the IDs of the barcode-containing reads that matched perfectly and the IDs of the reads that aligned perfectly after I trimmed the barcode. Then I used Microsoft Access (not sure if it is the best tool) to print the trimmed IDs that have no matching ID in the BC ID group (to get the reads containing an artificial "barcode").
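
The same ID comparison can be done without a database as a set difference. A small sketch, assuming each ID list is stored one read ID per line; the helper name `load_ids` and the toy IDs are made up:

```python
import io

def load_ids(handle):
    """Read one read ID per line into a set, ignoring blank lines."""
    return {line.strip() for line in handle if line.strip()}

# Toy ID lists standing in for the two real ID files.
trimmed_ok = load_ids(io.StringIO("r1\nr2\nr3\n"))   # aligned after trimming
barcoded_ok = load_ids(io.StringIO("r2\n"))          # aligned with barcode on

# Reads that align only once the barcode is removed.
keep = trimmed_ok - barcoded_ok
print(sorted(keep))  # ['r1', 'r3']
```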

I'll try what you suggested.

Thanks,

Laszlo

ADD REPLY • link updated 5.5 years ago by Ram 44k • written 10.4 years ago by hlsz.laszlo ▴ 50

1

Well, like I said, it does not matter whether the barcode itself matches the genome perfectly.

The issue here is why a barcode+read would also match the genome exactly. There is no simple explanation I can come up with for how 5% of your reads could come from a genomic location that, after being extended artificially with a barcode, would still match perfectly.
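
A back-of-the-envelope sketch of why this looks implausible, assuming bases are uniform and independent (real genomes are not, so treat this as an order-of-magnitude estimate only). It uses the ~3 M barcoded reads quoted above; the barcode lengths 6 and 8 are illustrative assumptions:

```python
def expected_chance_matches(n_reads, barcode_len):
    """Expected reads whose barcode happens to equal the adjacent genomic bases.

    Assumes each base matches independently with probability 1/4.
    """
    return n_reads * 0.25 ** barcode_len

for k in (6, 8):
    print(k, round(expected_chance_matches(3_000_000, k)))
# 6 732
# 8 46
```

Under these assumptions one would expect hundreds of such reads at most, not around a million.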

I suspect that you think the matches are perfect when in fact they are not; they could contain mismatches or be soft-clipped.
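
One way to check this is to look at the CIGAR string and the `NM` (edit distance) tag of each alignment. A minimal sketch, assuming plain-text SAM lines and an aligner that writes the optional `NM:i:` tag; the function name and toy records are hypothetical:

```python
def is_perfect_match(sam_line):
    """True if the read is mapped end to end with zero edit distance."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, cigar, seq = int(fields[1]), fields[5], fields[9]
    if flag & 4:
        return False  # read is unmapped
    # A full-length match has a single M (or =) operation covering the read,
    # so any soft clip, insertion, or deletion fails this test.
    if cigar not in (f"{len(seq)}M", f"{len(seq)}="):
        return False
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            return int(tag[5:]) == 0
    return False  # no NM tag present: a perfect match cannot be confirmed

# Toy alignments (hypothetical): perfect, soft-clipped, and one mismatch.
perfect = "r1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNM:i:0"
clipped = "r2\t0\tchr1\t100\t60\t4S6M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNM:i:0"
mismatch = "r3\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNM:i:1"

for line in (perfect, clipped, mismatch):
    print(line.split("\t")[0], is_perfect_match(line))
```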

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by Istvan Albert 102k