Hello,
I have two files, 1 containing sequencing reads in a SAM like format, and the other contains the exon positions for HG38 and I am trying to use the positions of the exons in the second file to say whether a read in my first file is exonic or not. For ease Im a only using reads for chromosome1 and exons located on chromosome1. The desired output is an annotated copy of FileA which contains the reads that are located in exons, and those that are not, alternatively (as you can see in the code) if the echo prior to the break is removed then a filtered EXON only reads file is generated.
The INPUT files look like this
FileA: 6 columns with read position given in column 4 ($4), tab delimited
NAME FLAG CHR POS MAPQ CIGAR
FileB: 4 columns with exon start position and end position given in col2 and col3 ($2, $3) respectively. tab delimited
CHR START END GENE
I originally tried to use a while loop within a while loop to ask if the position of the read on each line of fileA was within the boundary of an exon ($2 $3) in fileB. There is a break in the statement so that the loop stops when the exon START position is greater than the read position (both files are ordered by bp position) so there is no point continuing with the loop for this read.
This is the code:
while IFS= read READ
do
echo "$READ"
position=$(echo $READ | awk '{print $4}')
while IFS= read BED
do
echo "$BED"
x=$(echo $BED | awk '{print $2}')
y=$(echo $BED | awk '{print $3}')
if (($position < "$x"))
then
echo "$READ" | awk '{print $0"\t"}' >> FileA_exon
break
else
if (($position >= "$x" && $position <= "$y"));
then
echo "$READ" | awk '{print $0"\t EXON"}' >> FileA_exon
fi
fi
done < FileB
done < FileA
FileA_exon: 7 columns as such
NAME FLAG CHR POS MAPQ CIGAR STATUS
It has been brought to my attention however that whilst this works, it could be improved upon with some colleagues suggesting that a script could be written that doesnt require any loops, nested or otherwise, and for without the need to repeatedly open FileB for each line of FileA. Perhaps in AWK... but I cannot think of a way to use information split over two files to get my desired output...
I'd really appreciate some advice and explanations where possible.
Thanks
bedtools: convert your sam to bed with bamtobed and then call "bedtools interesect" with exon.bed