Question

Fixing BAM file errors reported by Picard ValidateSamFile, and in featureCounts' outfile

1

Entering edit mode

5.2 years ago

Anand Rao ▴ 640

I am trying to generate gene counts using featureCounts v2.0.1. The syntax I used is a simple one, as shown below:

featureCounts -T 2 -a MtrunA17r5.0-ANR-EGN-r1.7.gtf -F GTF -o A17_T00_1_BASIC1_fC.txt A17_T00_1.bam

The summary file reads as follows:

cat A17_T00_1_BASIC1_fC.txt.summary 
Status  A17_T00_1.bam
Assigned    13377184
Unassigned_Unmapped 0
Unassigned_Read_Type    0
Unassigned_Singleton    0
Unassigned_MappingQuality   0
Unassigned_Chimera  0
Unassigned_FragmentLength   0
Unassigned_Duplicate    0
Unassigned_MultiMapping 1625691
Unassigned_Secondary    0
Unassigned_NonSplit 0
Unassigned_NoFeatures   334969
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity    2112828

When I peeked into the actual counts file, I noticed some weird entries in column 1 instead of gene_IDs, as shown below. Only 49 out of 52724 lines contain such strange entries.

gene:MtrunA17Chr5g0436531   0
63135360-95f3-44e9-8358-c7975fb56306    92
gene:MtrunA17Chr5g0436551   0
.
.
gene:MtrunA17Chr6g0458631   0
64238042-c1c7-40e1-b07d-e5050f9b567e    0
gene:MtrunA17Chr6g0458661   0
.
.
gene:MtrunA17Chr7g0236431   0
25191532-34dd-4669-9d7b-a20c59e60855    1
gene:MtrunA17Chr7g0236441   0

So I went back to the BAM input file and attempted it's validation with Picard's ValidateSamFile, on a SLURM-based HPCC, as shown below. As visible, it gave me 1 ERROR, and also 49 warnings. Only warning #1 and warning #49 are shown below, rest being similar.

 srun --partition=high --time=1:00:00 --mem=24000 java -jar picard.jar ValidateSamFile I=A17_T00_1.bam MODE=VERBOSE
srun: job 22816732 queued and waiting for resources
srun: job 22816732 has been allocated resources
[Thu Jun 11 20:16:35 PDT 2020] picard.sam.ValidateSamFile INPUT=A17_T00_1.bam MODE=VERBOSE    MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Thu Jun 11 20:16:35 PDT 2020] Executing as aksrao@c8-62 on Linux 4.15.0-99-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_20-b26; Picard version: 2.7.1-SNAPSHOT
ERROR: Read groups is empty

WARNING: Read name HWI-ST797:117:D091UACXX:4:2203:11139:169433, A record is missing a read group
WARNING: Record 1, Read name HWI-ST797:117:D091UACXX:4:2203:11139:169433, NM tag (nucleotide differences) is missing
.
.
WARNING: Record 49, Read name HWI-ST797:117:D091UACXX:4:1107:17611:90721, NM tag (nucleotide differences) is missing
WARNING: Read name HWI-ST797:117:D091UACXX:4:2102:6277:14310, A record is missing a read group
Maximum output of [100] errors reached.
[Thu Jun 11 20:16:35 PDT 2020] picard.sam.ValidateSamFile done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1012924416
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
srun: error: c8-62: task 0: Exited with exit code 1

Here are questions that I have for forum members. Thanks in advance!

1. Is the srun: error: c8-62: task 0: Exited with exit code 1 message indicative that PicardTools' ValidateSamFile execution is not proper? From this GATK discussion board post, and the corresponding explanation for return codes, it does not appear exit code 1 is for PicardTools execution per se, but I'm not sure?!

2. Regardless of any SLURM or installation problems, how do I fix my BAM file so error and warnings in ValidateSamFile do not recur, and the weird entries in featureCounts gene counts also do not recur. Looks like they are both related, i.e. those 49 warnings in ValidateSamFile for my BAM input, probably each correspond to the 49 weird entries in my featureCounts gene counts output ?

edit: I suppose it should suffice to follow instructions at this Broad GitHub Picard page?

3. If it is relevant, my BAM file was generated by STAR mapping, and I know my FASTQ input files were all OK, based on FastQValidator results. An additional detail is that I ran pass 1 to collect new Splice Junctions for 100+ libraries, concatenated them all, and used them again in pass 2 to obtain final BAM file to use with featureCounts.

edit - Should I look at the STDERR files captured for 1st and 2nd pass of my STAR mapping runs, for anything specific causing header errors in the BAM output files?

4. I've looked into only one of these 100+ BAM files, and only after pass 2 (as reported in this post). I've not looked at any other BAM files from pass 2, or any at all from pass 1. This is because I'm not sure if my picard.jar ValidateSamFile is even executing OK.

Please let me know if I should provide any more information to help you help me. Thanks again.

BAM featureCounts Picard ValidateSamFile • 3.4k views

ADD COMMENT • link 5.2 years ago by Anand Rao ▴ 640

1

Entering edit mode

When I peeked into the actual counts file, I noticed some weird entries in column 1 instead of gene_IDs, as shown below. Only 49 out of 52724 lines contain such strange entries

It looks like your GTF file may be malformed. I would start checking that first. AGAT is a good toolkit for that. Can you post a couple of lines of that file? featureCounts can count on various features and then summarize on others (e.g. count exons and summarize on gene_id or gene_name). Depending on what you have in your file you may need to be explicit about this in your featureCounts command line.

Using Picard can lead down a rabbit hole of read groups and other things. I doubt it is your alignment files that are the problem here.

ADD REPLY • link 5.2 years ago by GenoMax 153k

0

Entering edit mode

Thanks for your response @genomax.

Per your suggestion, I tried to install AGAT on my local MacOSX and also UBUNTU-based HPCC remote server - install from source OR using Conda - neither gave me a successful install...

So I had to try another validation tool = validate_gtf.pl included in the Eval package from the Brent Lab @ WUSTL. I have to take it's results with a pinch of salt because of this older BioStars post. On the flip side, this Perl script may indeed report certain errors in the GTF file, per some others' experiences (link). I did receive error messages from my validation run, as shown below:

perl validate_gtf.pl -f -l -m  MtrunA17r5.0-ANR-EGN-r1.7.gtf > MtrunA17r5.0-ANR-EGN-r1.7_validate.log
Bad terminator, "8", after <attributes> name-value pair, Note "seed:204DVAAXX:1:217:367:218, on line 39. Should be ";".
<attributes> field missing quotes around values on line 39.
<attributes> field missing quotes around values on line 39.
Bad terminator, "1", after <attributes> name-value pair, Note "seed:207G6AAXX:1:88:735:931, on line 56. Should be ";".
<attributes> field missing quotes around values on line 56.
<attributes> field missing quotes around values on line 56.
Bad terminator, "7", after <attributes> name-value pair, Note "seed:HELIUM_0001:7:31:868:277, on line 167. Should be ";".
<attributes> field missing quotes around values on line 167.
Bad terminator, "9", after <attributes> name-value pair, Note "seed:204DVAAXX:1:15:558:319, on line 337. Should be ";".
Bad terminator, "0", after <attributes> name-value pair, Note "seed:204DVAAXX:1:24:831:780, on line 801. Should be ";".
Missing start_codon for transcript "mRNA:MtrunA17Chr8g0354951".
Missing stop_codon for transcript "mRNA:MtrunA17Chr8g0354951".
Missing start_codon for transcript "mRNA:MtrunA17Chr7g0249141".
Missing stop_codon for transcript "mRNA:MtrunA17Chr7g0249141".
Missing start_codon for transcript "mRNA:MtrunA17Chr1g0171521".
Missing stop_codon for transcript "mRNA:MtrunA17Chr1g0171521".
Missing start_codon for transcript "mRNA:MtrunA17Chr1g0183081".
Missing stop_codon for transcript "mRNA:MtrunA17Chr1g0183081".
Missing start_codon for transcript "mRNA:MtrunA17Chr1g0167301".
Missing stop_codon for transcript "mRNA:MtrunA17Chr1g0167301".
Transcript MtrunA17Chr8g0392442.3.5p contains no CDS features.
Transcript MtrunA17Chr5g0425982.1.3p contains no CDS features.
Transcript MtrunA17Chr6g0486782.2.3p contains no CDS features.
Transcript MtrunA17Chr7g0215522.3.3p contains no CDS features.
Transcript ncRNA:MtrunA17Chr3g1013332 contains no CDS features.

Warnings encountered:
Count Description
1118 Missing ';' terminator after attribute in <attributes> field.
1853 Missing '"'s around attribute values in  <attributes> field.
10662 Transcript contains no CDS features.
44624 Transcript has no start codon.
44624 Transcript has no stop codon.

Statistics:
52723 genes with 55286 transcripts containing 174504 cds.
Bad Genes:

However, the author of this GTF file, in the annotation group says this GTF file works perfectly well for him, when used with his BAM files, to report gene counts, using featureCounts ! As a control experiment, I've requested a BAM file that works for him, with featureCounts, so I can use it as a positive control for my own featureCounts runs, using this GTF file. It can be downloaded from this public link. Would it be too much to request you to check if this file is valid? :)

Regardless of doubts about GTF file validity, I am now even more suspicious whether the 49 weird lines in the featureCounts counts output and the exact same number of 49 warnings in my BAM file reported by PicardTools' ValidateSamFile, already found in my 1st post of this thread, indicate that my 2 pass STAR mapping protocol, adapted from this BioStars post by user caggtaagtat , might have an error somewhere - especially where I concatenate all the *SJ.out.tab files from pass 1, to use for pass 2 genome indexing and genome mapping... I think I'll run PicardTools' ValidateSamFile on individual BAM files created during pass 1, while awaiting suggestion from you and re-validation by the GTF file author. Thanks again.

ADD REPLY • link 5.2 years ago by Anand Rao ▴ 640

0

Entering edit mode

Since you are a BBTools user I would like to point out that bbmap.sh/featureCounts results are concordant with STAR/salmon so if the two-pass alignment is causing you problems consider using bbmap.

What problems did you have with conda install of AGAT?

Can you post an example line of the annotation for one of the genes above and one that does not show any problems?

ADD REPLY • link 5.2 years ago by GenoMax 153k

score 2 · Accepted Answer · 2020-06-17

I've solved this problem with help from a scientist from the annotation group that created the GTF annotation file.

First I'll mention the attempted solutions that did NOT work:

AGAT-based correction of the GTF file, and using this corrected version for STAR genome indexing and mapping - still see those few gibberish lines in the featureCounts' counts file - so may be GTF is not the problem
validate_gtf.pl included in the Eval package from the Brent Lab @ WUSTL to correct the GTF, and use this version to index and map - doesn't fix things
Try to identify at what point in my pre-processing pipeline I start seeing this problem - right from FASTQ obtained from SRA using fastq-dump - so pre-processing, which commences with this as input, just cannot be the source of this problem
Add Read Groups to BAM file using PicardTools' AddOrReplaceReadGroups, before the counting step - didn't make a difference

This PROBLEM was created by featureCounts syntax being incompatible with a few lines in the GTF file, formatted according to expert-annotation software Apollo. The SOLUTION was to use different attributes in the gtf file with featureCounts, and to specify this attribute using the "-g" parameter.

This syntax solved ALL these problematic gibberish lines in the counts file from my featureCounts step:

featureCounts -a $GTF -g locus_tag --extraAttributes gene_name -t exon -o $OUTPUT -p $BAM_INPUT

Thanks @Genomax, for your suggestions. I did end up being an annotation file issue as you'd suggested, though I could not get AGAT to diagnose / fix it....but I did get it installed with much effort :)

I consider this issue resolved...but leaving this answer up here to save colleagues time with troubleshooting a similar problem in the future.