Hello,
I have this bam file that originally contained paired sequences, after converting it into fastq sequences I got 2 different files. My idea is to make an alignment with a chromosome sequence, and from there obtain a VCF. My question is if the sequence that I will use will matter in my future results?, there is a way to know which sequence should I use?
I got the sequences using samtools:
bam2fastq -o blah_aligned_R#.fastq --no-unaligned Bamfilename.bam
EDIT:
Thank you very much for the comments, I will explain myself better (and sorry for any misunderstanding, but this is the first time that I am working in bioinformatics).
I am wanting to see the occurrence of KRAS-variant mutation (as described by Ratner et al, in http://www.nature.com/onc/journal/v31/n42/pdf/onc2011539a.pdf) in chromosome 12 in ovarian cancer cell lines. For this, I have downloaded some bam files from the nci60-project (http://watson.nci.nih.gov/projects/nci60/wes/BAMS/). These bam files are presented as paired sequences, so after extracting chromosome 12 information from them (using the bai files that are in the same webpage as index) by the code:
$ samtools view IGR-OV1_reord_mdups_ralgn_fmate_recal.bam chr12>chr12.bam
Then, I sorted the bam file, and instead of using the original code that I posted at the beginning, I converted them to fastq using:
$ bedtools bamtofastq -i chr12.qsort.bam \
-fq chr12.end1.fq \
-fq2 chr12.end2.fq
So now, I have two sequences in fastq format per each original bam file.
My following steps, that I have in mind, is to align these fastq files with the human assembly 37, convert then again to bam files, and make the variant calling with mpileup so from there start with the annotation using annovar or something alike. So my questions in fact are:
- Is my sort of pipeline acceptable?
- Should I use both sequences for the variant calling?
Thank you in advance for your help and comments
I assume you got a pair of fastq files and one file with unaligned reads. You should use both fastq files and align them in a paired end mode.
I am not sure I can understand the question(s). Here's my guess.
You might consider clarifying the questions, so people can help you more efficiently!
Note to Ashutosh and Fabio: The post seems to have been updated.
To the original question, you can simply call variants with the original BAM file and restrict the caller to the region around the KRAS gene. That way you don't need to bother even remapping things.
Note that otherwise the ideal method is to extract reads as pairs, since aligners can use the pairing information to not mistakenly think that coverage is doubled where the mates overlap.
I agree 100% with Devon Ryan.