Question

Do read pairs matter in alignment?

1

Entering edit mode

10.1 years ago

mmorenozambrano ▴ 10

Hello,

I have this bam file that originally contained paired sequences, after converting it into fastq sequences I got 2 different files. My idea is to make an alignment with a chromosome sequence, and from there obtain a VCF. My question is if the sequence that I will use will matter in my future results?, there is a way to know which sequence should I use?

I got the sequences using samtools:

bam2fastq -o blah_aligned_R#.fastq --no-unaligned Bamfilename.bam

EDIT:

Thank you very much for the comments, I will explain myself better (and sorry for any misunderstanding, but this is the first time that I am working in bioinformatics).

I am wanting to see the occurrence of KRAS-variant mutation (as described by Ratner et al, in http://www.nature.com/onc/journal/v31/n42/pdf/onc2011539a.pdf) in chromosome 12 in ovarian cancer cell lines. For this, I have downloaded some bam files from the nci60-project (http://watson.nci.nih.gov/projects/nci60/wes/BAMS/). These bam files are presented as paired sequences, so after extracting chromosome 12 information from them (using the bai files that are in the same webpage as index) by the code:

$ samtools view  IGR-OV1_reord_mdups_ralgn_fmate_recal.bam chr12>chr12.bam

Then, I sorted the bam file, and instead of using the original code that I posted at the beginning, I converted them to fastq using:

$ bedtools bamtofastq -i chr12.qsort.bam \
                      -fq chr12.end1.fq \
                      -fq2 chr12.end2.fq

So now, I have two sequences in fastq format per each original bam file.

My following steps, that I have in mind, is to align these fastq files with the human assembly 37, convert then again to bam files, and make the variant calling with mpileup so from there start with the annotation using annovar or something alike. So my questions in fact are:

Is my sort of pipeline acceptable?
Should I use both sequences for the variant calling?

Thank you in advance for your help and comments

next-gen alignment sequence • 3.4k views

ADD COMMENT • link updated 3.0 years ago by Ram 45k • written 10.1 years ago by mmorenozambrano ▴ 10

2

Entering edit mode

I assume you got a pair of fastq files and one file with unaligned reads. You should use both fastq files and align them in a paired end mode.

ADD REPLY • link updated 3.0 years ago by Ram 45k • written 10.1 years ago by Ashutosh Pandey 12k

1

Entering edit mode

I am not sure I can understand the question(s). Here's my guess.

Question in the title: Do read pairs matter in alignment? Answer: Yes. Follow the suggestion of Ashutosh Pandey
First question in the post "the sequence that I will use will matter in my future results?". I assume you are talking about the reference sequence. Yes. Everything you will obtain will be referred to the sequence you use. Is like when you have to point out where New York is, and you have to choose on which map you are going to work. Different maps might give different results, especially if they are very old!
Second question in the post: there is a way to know which sequence should I use? Quite often I use the rule: the newest!

You might consider clarifying the questions, so people can help you more efficiently!

ADD REPLY • link updated 3.0 years ago by Ram 45k • written 10.1 years ago by Fabio Marroni ★ 3.0k

1

Entering edit mode

Note to Ashutosh and Fabio: The post seems to have been updated.

To the original question, you can simply call variants with the original BAM file and restrict the caller to the region around the KRAS gene. That way you don't need to bother even remapping things.

Note that otherwise the ideal method is to extract reads as pairs, since aligners can use the pairing information to not mistakenly think that coverage is doubled where the mates overlap.

ADD REPLY • link updated 3.0 years ago by Ram 45k • written 10.1 years ago by Devon Ryan 105k

0

Entering edit mode

I agree 100% with Devon Ryan.

ADD REPLY • link updated 3.0 years ago by Ram 45k • written 10.1 years ago by Fabio Marroni ★ 3.0k

score 2 · Accepted Answer · 2015-03-20

2

Entering edit mode

10.1 years ago

Bioinfosm ▴ 620

The steps sound fine, and as others have suggested, using both read1 and read2 for alignment is more efficient. The paired mapping allows for better read placement, especially over repetitive or homologous regions.

Am not sure why you dont directly use the aligned BAM file to call variants on chr12 and then annotate them or intersect for any falling in KRAS?

If it is just a single position in KRAS, seems the paper you ref its a 3' UTR mutation, you can simply genotype that coordinate (chromosome-position) on all the BAM files and get a quick answer if the mutation is present/absent.

Hope this helps.

ADD COMMENT • link 10.1 years ago by Bioinfosm ▴ 620

0

Entering edit mode

Thank you very much for your answers. The reason why I am not calling variants directly from the BAM files, is because a suggestion of my promoter, who told me that I should demonstrate that I am able to remap the BAM files going first by their transformation to fastq files. But know that you have helped me, I will put in practice the different approaches to compare my results, and see (hopefully) that my remaps are going to be the same.

ADD REPLY • link 10.1 years ago by mmorenozambrano ▴ 10

0

Entering edit mode

Sure, or even better. If you are not sure what aligner was originally used or have a better one to try! You can use something like samtools flagstat and compare how your aligned BAM file is same/better than the original one.

ADD REPLY • link 10.1 years ago by Bioinfosm ▴ 620