Comparing reads of multiple BAM files (using getopts for parsing)
2
0
Entering edit mode
6.3 years ago
m98 ▴ 420

I am writing a Bash script that has the following aims:

  • Compare multiple BAM files (over 100) using samtools to obtain the number of mapped reads
  • Find the BAM file with the smallest number of reads
  • Based on this smallest number, use seqtk to scale all other BAM files

My problem is to do with how to parse through all BAM files. I am very new to using getopts and I can't get it to do what I want: parse through every BAM file.

Here is my script so far:

#!/usr/bin/bash

# Definine the usage and ensure it shows up on screen if no arguments are given

USAGE() { echo "Usage: bash $0 [-b <in-bam-files-dir>] [-o <out-dir>] [-c <chromlen>]" 1>&2; exit 1; }

if (($# == 0)); then
        USAGE
fi

# Use getopts to accept each argument

while getopts ":b:o:c:h" opt
do
    case $opt in
       b ) BAMFILES=$OPTARG
        ;;
       o ) OUTDIR=$OPTARG
        ;;
       c ) CHROMLEN=$OPTARG
        ;;
       h ) USAGE
        ;;
       \? ) echo "Invalid option: -$OPTARG exiting" >&2
        exit
        ;;
       : ) echo "Option -$OPTARG requires an argument" >&2
        exit
        ;;
        esac
    done

# Start parsing through BAM files - here is where I get stuck

for i in $@
do
   echo $i
done

Obviously, my for loop will not just have echo in. That would be where I would sort the BAM file, count reads etc. But for now, i am jus trying to make sure I am parsing the BAM files properly.

My problem is: how do I get getopts to understand that the BAMFILES is a directory containing multiple BAM files. For the moment, when my script reaches the for loop and write out the command echo, I get the following output:

$ bash script.bash -b /path/to/files/*bam -o ../output/dir -c option

# Output:
-b
/path/to/files/file1.bam
/path/to/files/file2.bam
/path/to/files/file1.bam
-o
../output/dir
-c
option

I guess the output I want is just:

/path/to/files/file1.bam
/path/to/files/file2.bam
/path/to/files/file1.bam

Because I could then use a for loop to say for each bam file, do this etc. Now I know that using $@ is not the right thing in the for loop, but using $1 just prints -b, not the bam file names. I just can't seem to access the BAM file names I want and nothing else.

Thanks!

bam samtools seqtl getopts • 1.9k views
ADD COMMENT
1
Entering edit mode

No offense, but this question might be better placed on stackoverflow, I'm sure you'll find more replies there.

Bash's getopts is limited and confusing, especially with larger option lists. In case you want to have more complex functionality with easier to read code and without the pitfalls of bash syntax, you might want to look into Python, specifically Python's argparse. Nothing is free, however - naturally, you'd have to get more familiar with Python and the language has its own pitfalls, too.

ADD REPLY
5
Entering edit mode
6.3 years ago

how do I get getopts to understand that the BAMFILES is a directory containing multiple BAM files.

getopts(), as the names suggests, will only parse the options. Then it's your job to check the validity of the options.

I suggest that you pass only the name of the directory containing bam files in -b. you can also check if that directory exists

if [ -d "$BAMFILES" ]; then
  # If the directory exists, do further things..
fi

After checking that it exists, you may loop thru all the bam files by

for i in  $(ls $BAMFILES/*.bam);
ADD COMMENT
0
Entering edit mode

Would be better to do for i in $(find ...) - it’s considered bad practice to parse the output of ls.

ADD REPLY
0
Entering edit mode

May I know the specific concern? Unlike find, ls is more intuitive, easy and known to almost all.

ADD REPLY
1
Entering edit mode

ls provides no protection against malformed filenames, special characters etc.

There are countless stackoverflow rants about this exact topic:

https://unix.stackexchange.com/questions/128985/why-not-parse-ls https://mywiki.wooledge.org/ParsingLs

Even if you didn’t want to use find, you can still just do: for i in /path/to/dir/*.ext and bash magic does the rest.

ADD REPLY
0
Entering edit mode

Am aware of the parsing issues of ls, but in more familiar cases, the filenames will not be that bizarre. However, agree about the usage of glob as it is a shell built-in and should be faster.

ADD REPLY
1
Entering edit mode

Actually, it's with the novice users that one has to be careful, or at least mention the caveat (ensure file names have only [aA-zZ0-9_.] characters). Pro users would be able to attribute downstream errors to improper file names.

ADD REPLY
5
Entering edit mode
6.3 years ago
Joe 21k

I’m tempted to close this as it’s really just a programming question, unless you can rephrase it somehow, but I got started with the answer so here it is:

——————

Your getopt looks fine. You just need to use the variables assigned in the case switch downstream. What you will need to do however is pass an array of files. At present, getopt is only taking in one file at a time. This can get tricky.

$1 is -b because that is the first argument in your command. Once you’ve parsed with getopt, you shouldn’t touch the standard argv style variables any more (other than maybe $0 if you want to get the script name).

bash script.sh -b somefile.ext -c someoption
       ^        ^     ^         ^     ^
       $0       $1    $2       $3    $4

This is why you see all of your arguments when you do for i in $@.

Here’s one way of handling multiple inputs to the same argument:

while getopts "hi:m:o:" OPTION ; do
  case $OPTION in
   i) ID+=($OPTARG)   ;;  # appends an array rather than set a single variable
   m) mode=$OPTARG    ;;
   o) outdir=$OPTARG  ;;
   h) usage && exit 1 ;;
  esac
done
shift $((OPTIND -1))

(From my code at: https://github.com/jrjhealey/bioinfo-tools/blob/master/getpdb.sh#L47-L55 )

Notice the use of += instead of just = for the i variable as an incremented array of files which can be looped through.

This is slightly different, as I’m iterating a set of strings, and that list is unlikely ever to be very long, so (annoyingly) they all need the -i flag each time (until i can be bothered to rewrite it perhaps). This ensures they’re all added to the array which corresponds to the IDs in question.

With a wildcard expanded list of files, you will probably want to treat it as a positional argument to come last, which will mean you don’t have to worry about capturing any other optional arguments after the files. It also means you wont need an dash-option before each one.

In which case, you’ll need to iterate through all the commandline arguments, capturing any options (which is what getopt is doing). Once you’ve parsed all of those, everything left in $@ should be your filenames which you can loop over as you’re about to do. The secret to this likes in OPTIND, the index of the arguments on the commandline. You need only find the first index which matches a file name, then gather all of the arguments from there to the end of the command in to an array (or carry on using $@. Passing a prexisting directory of bam files and parsing the output of a find command or similar would be much easier, if a little less user friendly.

If you aren’t sure what your commands in the loop should be, figure it out for a single file first. I would write separate functions for calculating the number of reads, and basically compartmentalise each step.

ADD COMMENT

Login before adding your answer.

Traffic: 2161 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6