Entering edit mode
5.0 years ago
User000
▴
720
This is my FASTQ file,
@HISEQ1:130:C0G55ACXX:1:2104:6209:41976/2
ATAATATCCAACTTTCACCTATGTTTGAAATGTCTAAAATGTTGTTTGATTAAATTTTGCTCTATGTCATGCTAAGATGATTTAATTCATA
+
@CCFFFFFHHFHHJJIJJJJIJJJJJJIJIGIHIIIIHIJJHHHIIIJGJGGIJJJJJJIIIJIIIJJJJJJJGGIIJIJIJGEHIGHHHEH
@HISEQ1:128:C0HUMACXX:6:1312:1569:61058/1
CATGCATTTAATAACTAATTAACCGTGCATCGGATGTAAATAAATTATATATGAAACTTGCTCAGAATTTTGTGTAGATTAATAATATCCA
+
CCCFFFFFHHHHHJJJJIJJJJJJJHGIIJJJJJJJHIJIJIJIJJJJJJJIJJJIJIJJJJJIIEHGIJIJHGHEHEHHHHFFFEFFFED
I have a list of ID's which where extracted from BAM file using the following command line:
samtools view -b -f 4 input.bam > output.bam
And I get list ID's:
HISEQ1:128:C0HUMACXX:6:1312:1569:61058
HISEQ1:130:C0G55ACXX:1:2104:6209:41976
of course when I run this command line, I get empty output:
seqtk subseq in.fq name.lst > out.fq
Any suggestions? I apologise in advance if my question a duplicate o else.
Generally speaking whe I want to extract unmapped reads which of the following should I consider or all of them (source https://www.biostars.org/p/303752/):
# R1 unmapped, R2 mapped
samtools view -u -f 4 -F 264 lib_002.sorted.md.bam > lib_002_unmap_map.bam
# R1 mapped, R2 unmapped
samtools view -u -f 8 -F 260 lib_002.sorted.md.bam > lib_002_map_unmap.bam
# R1 & R2 unmapped
samtools view -u -f 12 -F 256 lib_002.sorted.md.bam > lib_002_unmap_unmap.bam
Any suggestions for what? You haven't actually told us what the question is.
My guess is your index file isn't working as its missing the read identifiers
/1
and/2
so it isn't a perfect match.I know, any suggestion how to solve this? thanks.
Not the nicest solution, bit you can use grep to get the fastq entries:
The
-f
determines the name.lst entries as things to be searched for,-A 3
leads to reporting the 3 lines after each hit. The--no-group-separator
is not in all grep versions available; it suppresses the output of a "---" line separator after each 4 lines. If your grep doesn't support it you could remove it by piping the grep into a second grep like... | grep -v ^---
.Denote you may have some false positives, if only one read of a pair wasn't aligned. Also, I would use only unique list items from the name.lst file.
thanks Michael, I will give this a try
So I tried to use this one, but it gives me an error
I have around 19M reads, shouldn't be so big,no?
with seqkit:
test.txt is a file with read IDs and test.fq input fastq (or fastq.gz). Btw, first read record in OP seems to have more quality values compared to bases in the read. Post seems to have two questions. Could you please break them down to two posts?
thanks a lot, could you please specify what does -f do?
-f for file as input @ User000
Use
filterbyname.sh
from BBMap suite.sorry, this isn't very clear form me, I will read the manual..
Are you just trying to get all unmapped reads from your data? Exactly what is your list of IDs representing?
yes, unmapped reads... none of the replies above resolve my problem because I have this /1 and /2 in my fastq file...
You have already identified the commands in the bottom of your post, having the read identifiers shouldnt be a problem for
samtools
. You can get the reads back from the BAM file you already have. You need only use something likesamtools view -b -f 4 bamfile.bam > unmapped.bam
.There's no need to go back to the FASTQs because the SAM/BAM files already contain all this information.
ok...and I nedd to extract the reads ID's I have above from the BAM file? I mean both R1 and R2
You want the reads as well as a file of their IDs?
having the list of my reads and bam files that were transformed in fastq (of course my list of reads has only some IDs,not all present in the fastq file), I want to extract the fastq's of the reads that I have in my list.... but since i have /1 and /2 It is empty... I am guessing I should transform my bam files into fq1 and fq2 then get my reads,.......
If you just want unmapped reads, that what the
samtools
command is giving you. This means you don't need your list of unmapped headers. The 2 should be equivalent already.If you want the headers and reads, you can take the
unmapped.bam
file from the first command, and usesamtools bam2fq
to get the unmapped FASTQs (all of them).Then you can use
grep '@' unmapped.fastq
to get the headers of those reads if you desperately need to.I thank you very much for trying to help me. I try to explain better: LET'S SAY, I have 2 BAM files each with unmapped reads. I extracted the ID's of the reads in those 2 BAM files and compared the list and got the ID list that is present in BAM 1 but not BAM 2. Now I want to extract the FASTQ of these ID's. When I say fastq I don't mean read ID's, but actually the 4 line FASTQ (isn't it the format of fastq?). To do this, I guess I need to transform BAM to FASTQ. What I did. But actually, in my fastq the ID's are like this (so R1 and R2, I guess):
and in my list it is like this:
That's why my out.fq is empty. This is what I was asking how to solve these problem?
moreover, if we do grep '@' it is going to grep also some quality values in the lines below...
Possibly. So
grep
for^@HISEQ1
instead.if it was that easy, not all of them are hiseq1.
You really should have made this clearer. There was no mention of a second file at any point up to now as far as I can tell. You are making your own question almost impossible to answer because you are moving the goalposts and confusing people by asking for different things each time.
You're right about
@
. Prefixing it with^
will help a lot, but you can also compose a regex such that the line also has to end with/1
or/2
to make sure no sequence is picked up. Grep or a proper parser are never the less appropriate tools for this.I have many bam files, it was just an example. The problem is that you guys don't try to read and UNDERSTAND the question. I guess it is clear I have ID's from BAM and trying to get them their FASTQ. Anyway grep o parse would not solve the problem from biological point pf view. I will get only one of the two reads right? And when I want to get both reads tht do not map? Should I get it's /1 or /2, the answer is both. Anyway thanks for this discussion.