I am working with multiple bam files which is aligned to my reference (using Bowtie2). I am planning to move forward to variant calling using picard for downstream GATK analysis. However, I realized that my bam files don't have @RG header info (used the following link: https://www.broadinstitute.org/gatk/guide/article?id=1317)
My situation: I have multiple bam files (1 bam file produce from 1 fasta file which was from 1 biological sample). So, if one of my raw fasta had following information in the header: @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
which I think is info in casava 1.8 format and has following meanings.
EAS139 the unique instrument name
136 the run id
FC706VJ the flowcell id
2 flowcell lane
2104 tile number within the flowcell lane
15343 'x'-coordinate of the cluster within the tile
197393 'y'-coordinate of the cluster within the tile
1 the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
Y Y if the read is filtered, N otherwise
18 0 when none of the control bits are on, otherwise it is an even number
ATCACG index sequence
How, should I assign names so the raw sequences from the same flowcell/lane can be used by GATK to account for any variablity? Also, I don't want to loose any header information that is useful for sample identification or analysis.
I used the following command and it seems to work, but I am not exactly sure which/what information from the fasta records needs to be added to ID, SM, PL, LB, PU, etc.
java -jar picard.jar AddOrReplaceReadGroups I=BowtieOut_Sp154-4g.sorted.1.bam ID=group1 SM=sample1 PL=ILLUMINA LB=Sp154 PU=unit1 O=bamforGATK.bam
Thanks in advance :)
I have been trying to assign appropriate
@RG
values to my bam files before proceedings to any further analyses on my bam files (I could also do realignment using BWA or Bowtie2 if necessary).Say, I have two populations and have sequenced gDNA from 6 biological samples (per population). I obtain raw fasta files which have following metadata information:
Population MA (6 samples) with following sample:metadata information
Population SP (6 samples) with following samples:metadata information
My understanding is that the metadata represents the following information in casava 1.8 format: https://en.wikipedia.org/wiki/FASTQ_format
I then followed information on this link https://www.broadinstitute.org/gatk/guide/article?id=1317 to assign appropriate @RG information to downstream GATK analysis
For sample SpNor33-16:
For sample SP21: shares the same ID =
83_ D0D0MACXX_5
(as it was sequenced in the same flowcell, lane name & number) but is a different biological sample (SpNor33) and also belongs to the same population (Sp).For sample Sp154:
@HWI-ST588:83:D0D0MACXX:4:1101:1431:2134 1:N:0:CGTACG
For sample MA625:
@HWI-ST588:82:D0CWRACXX:8:1101:1478:2200 1:N:0:ATGTCA
My situation: Am I assigning the
@RG
information appropriately? Why/why not? I am confused about the LB part (should it be same sample sequenced in different library? or something else). If assigning it as "population - SP vs. MA" is not correct what should I do? What if I want to provide a "population level-tag" to each biological sample?It's been a long question, but if you can help in any way. Thanks in advance!