I have a the reads from a single cell RNA seq experiment and I want to get the relevant reference sequence for them.
I subset these reads using samtools to only include the regions that interest me.
samtools view -@ 7 -q 30 -L $regions possorted_genome_bam.bam > output_sam.sam
I then also remove all reads, that do not have the CB field.
awk -F '\t' '{print $22}' output_sam.sam | sort | uniq -c | awk '{print $2}' | grep "CB" > $relevant_cells
grep -f $relevant_cells output_sam.sam > output_sam_cells.sam
From my reference fasta file, I get a region that overlaps with all my regions of interest. This reference fasta is the same file, I used for cellranger count.
bedtools getfasta -fi $REFERENCE_FASTA -bed $MY_REGION -fo region.fasta
I then use R to load the output_sam_cells.sam file and the region.fasta file. From the position of each read, I then subtract the beginning position of my larger region. That should give me the position for the string.
input$V4 <- input$V4 - region.start + 1 # + 1 because it is 1-based
But when I then compare the 2 sequences, they are completely different. For example read 1. The reads have a length of 91, so I take the sub string from the start (V4) to start +90.
read <- input[1,]
ref.subset <- substr(LARGER_REGION, start = read$V4, stop = read$V4 + 90)
Output:
read "GCAGTGGTATCAACGCAGAGTACATGGGTAGGGTTTACGACCTCGATGTTGGATCAGGACATCCCAATGGTGTAGAAGCTATTAATGGTTC"
ref.subset "AGGGTTTACGACCTCGATGTTGGATCAGGACATCCTAATGGTGTAGCCGCTATTAAGGGTTCATTTATTCAATGATTAAAGTCCTATGTGA"
If i then take a sequence a bit upstream, I at least get a partial match.
ref.subset <- substr(LARGER_REGION, start = read$V4 -29, stop = read$V4 -29 + 90)
Output:
read "GCAGTGGTATCAACGCAGAGTACATGGGTAGGGTTTACGACCTCGATGTTGGATCAGGACATCCCAATGGTGTAGAAGCTATTAATGGTTC"
ref.subset "GTGCAATCCTATTCTATCATATCAATAATAGGGTTTACGACCTCGATGTTGGATCAGGACATCCTAATGGTGTAGCCGCTATTAAGGGTTC"
But for the second read, I would have to use a window 31 positions upstream.
read <- input[2,]
ref.subset <- substr(LARGER_REGION, start = read$V4-31, stop = read$V4-31 + 90)
Output:
read "AAGCAGTGGTATCAACGCAGAGTACATGGGAGGGTTTACGACCTCGATGTTGGATCAGGACATCCCAATGGTGTAGAAGCTATTAATGGTT"
ref.subset "AGTGCAATCCTATTCTATCATATCAATAATAGGGTTTACGACCTCGATGTTGGATCAGGACATCCTAATGGTGTAGCCGCTATTAAGGGTT"
I do not expect a perfect match, but it seems that I am missing something very basic here.
very dangerous if $22 isn't always the column for this tag , why
uniq -c
?grep -f
how about partial matching ? what if the tag appears elsewhere ?SOFT clip ?
Thank you for your answer.
Yes, not all lines have the same number of fields. But field 22 should be the cell barcode, if the read has been properly mapped. I will check this again.
Thanks for noticing. I copied that part from a different script. There is no particular reason for counting.
With this part, I just want to remove all the reads that are not correctly mapped. If I take a partial match, reads that are not correctly mapped could also be included. If the tag appears more than once, that would be okay. then it would just be a different read for the same cell.
What do you mean by this? I assume that I should remove the first few nucleotides and just retain the matching part. But that would not solve the issue, that the reference sequence is completely wrong for the start position of the read.