Question

FASTQ exctract ID's

0

Entering edit mode

5.7 years ago

User000 ▴ 750

This is my FASTQ file,

@HISEQ1:130:C0G55ACXX:1:2104:6209:41976/2
ATAATATCCAACTTTCACCTATGTTTGAAATGTCTAAAATGTTGTTTGATTAAATTTTGCTCTATGTCATGCTAAGATGATTTAATTCATA
+
@CCFFFFFHHFHHJJIJJJJIJJJJJJIJIGIHIIIIHIJJHHHIIIJGJGGIJJJJJJIIIJIIIJJJJJJJGGIIJIJIJGEHIGHHHEH
@HISEQ1:128:C0HUMACXX:6:1312:1569:61058/1
CATGCATTTAATAACTAATTAACCGTGCATCGGATGTAAATAAATTATATATGAAACTTGCTCAGAATTTTGTGTAGATTAATAATATCCA
+
CCCFFFFFHHHHHJJJJIJJJJJJJHGIIJJJJJJJHIJIJIJIJJJJJJJIJJJIJIJJJJJIIEHGIJIJHGHEHEHHHHFFFEFFFED

I have a list of ID's which where extracted from BAM file using the following command line:

samtools view -b -f 4  input.bam > output.bam

And I get list ID's:

HISEQ1:128:C0HUMACXX:6:1312:1569:61058
HISEQ1:130:C0G55ACXX:1:2104:6209:41976

of course when I run this command line, I get empty output:

seqtk subseq in.fq name.lst > out.fq

Any suggestions? I apologise in advance if my question a duplicate o else.

Generally speaking whe I want to extract unmapped reads which of the following should I consider or all of them (source https://www.biostars.org/p/303752/):

# R1 unmapped, R2 mapped
samtools view -u -f 4 -F 264 lib_002.sorted.md.bam > lib_002_unmap_map.bam
# R1 mapped, R2 unmapped
samtools view -u -f 8 -F 260 lib_002.sorted.md.bam > lib_002_map_unmap.bam
# R1 & R2 unmapped
samtools view -u -f 12 -F 256 lib_002.sorted.md.bam > lib_002_unmap_unmap.bam

fastq • 3.4k views

ADD COMMENT • link 5.7 years ago by User000 ▴ 750

0

Entering edit mode

Any suggestions for what? You haven't actually told us what the question is.

My guess is your index file isn't working as its missing the read identifiers /1 and /2 so it isn't a perfect match.

ADD REPLY • link 5.7 years ago by Joe 22k

0

Entering edit mode

I know, any suggestion how to solve this? thanks.

ADD REPLY • link 5.7 years ago by User000 ▴ 750

0

Entering edit mode

Not the nicest solution, bit you can use grep to get the fastq entries:

grep -f name.lst -A 3 --no-group-separator in.fq > out.fq

The -f determines the name.lst entries as things to be searched for, -A 3 leads to reporting the 3 lines after each hit. The --no-group-separatoris not in all grep versions available; it suppresses the output of a "---" line separator after each 4 lines. If your grep doesn't support it you could remove it by piping the grep into a second grep like ... | grep -v ^---.

Denote you may have some false positives, if only one read of a pair wasn't aligned. Also, I would use only unique list items from the name.lst file.

ADD REPLY • link 5.7 years ago by michael.ante ★ 4.0k

0

Entering edit mode

thanks Michael, I will give this a try

ADD REPLY • link 5.7 years ago by User000 ▴ 750

0

Entering edit mode

So I tried to use this one, but it gives me an error

grep: memory exhausted

I have around 19M reads, shouldn't be so big,no?

ADD REPLY • link 5.7 years ago by User000 ▴ 750

0

Entering edit mode

with seqkit:

$ seqkit grep -f test.txt test.fq

test.txt is a file with read IDs and test.fq input fastq (or fastq.gz). Btw, first read record in OP seems to have more quality values compared to bases in the read. Post seems to have two questions. Could you please break them down to two posts?

ADD REPLY • link 5.7 years ago by cpad0112 21k

0

Entering edit mode

thanks a lot, could you please specify what does -f do?

ADD REPLY • link 5.7 years ago by User000 ▴ 750

0

Entering edit mode

-f for file as input @ User000

ADD REPLY • link 5.7 years ago by cpad0112 21k

0

Entering edit mode

Usefilterbyname.sh from BBMap suite.

filterbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string> include=<t/f>

names=          A list of strings or files.  The files can have one name per line, or
                be a standard read file (fasta, fastq, or sam)

Leading > and @ symbols are NOT part of sequence names;  they are part of
the fasta, fastq, and sam specifications.  Therefore, this is correct:
names=e.coli_K12
And these are incorrect:
names=>e.coli_K12
names=@e.coli_K12

ADD REPLY • link 5.7 years ago by GenoMax 153k

0

Entering edit mode

sorry, this isn't very clear form me, I will read the manual..

ADD REPLY • link 5.7 years ago by User000 ▴ 750

0

Entering edit mode

Are you just trying to get all unmapped reads from your data? Exactly what is your list of IDs representing?

ADD REPLY • link 5.7 years ago by Joe 22k

0

Entering edit mode

yes, unmapped reads... none of the replies above resolve my problem because I have this /1 and /2 in my fastq file...

ADD REPLY • link 5.7 years ago by User000 ▴ 750

0

Entering edit mode

You have already identified the commands in the bottom of your post, having the read identifiers shouldnt be a problem for samtools. You can get the reads back from the BAM file you already have. You need only use something like samtools view -b -f 4 bamfile.bam > unmapped.bam.

There's no need to go back to the FASTQs because the SAM/BAM files already contain all this information.

ADD REPLY • link 5.7 years ago by Joe 22k

0

Entering edit mode

ok...and I nedd to extract the reads ID's I have above from the BAM file? I mean both R1 and R2

ADD REPLY • link 5.7 years ago by User000 ▴ 750

0

Entering edit mode

You want the reads as well as a file of their IDs?

ADD REPLY • link 5.7 years ago by Joe 22k

0

Entering edit mode

having the list of my reads and bam files that were transformed in fastq (of course my list of reads has only some IDs,not all present in the fastq file), I want to extract the fastq's of the reads that I have in my list.... but since i have /1 and /2 It is empty... I am guessing I should transform my bam files into fq1 and fq2 then get my reads,.......

ADD REPLY • link 5.7 years ago by User000 ▴ 750

0

Entering edit mode

If you just want unmapped reads, that what the samtools command is giving you. This means you don't need your list of unmapped headers. The 2 should be equivalent already.

If you want the headers and reads, you can take the unmapped.bam file from the first command, and use samtools bam2fq to get the unmapped FASTQs (all of them).

Then you can use grep '@' unmapped.fastq to get the headers of those reads if you desperately need to.

ADD REPLY • link 5.7 years ago by Joe 22k

0

Entering edit mode

I thank you very much for trying to help me. I try to explain better: LET'S SAY, I have 2 BAM files each with unmapped reads. I extracted the ID's of the reads in those 2 BAM files and compared the list and got the ID list that is present in BAM 1 but not BAM 2. Now I want to extract the FASTQ of these ID's. When I say fastq I don't mean read ID's, but actually the 4 line FASTQ (isn't it the format of fastq?). To do this, I guess I need to transform BAM to FASTQ. What I did. But actually, in my fastq the ID's are like this (so R1 and R2, I guess):

@HISEQ1:130:C0G55ACXX:1:2104:6209:41976/2

and in my list it is like this:

HISEQ1:130:C0G55ACXX:1:2104:6209:41976

That's why my out.fq is empty. This is what I was asking how to solve these problem?

ADD REPLY • link 5.7 years ago by User000 ▴ 750

0

Entering edit mode

moreover, if we do grep '@' it is going to grep also some quality values in the lines below...

ADD REPLY • link 5.7 years ago by User000 ▴ 750

0

Entering edit mode

Possibly. So grep for ^@HISEQ1 instead.

ADD REPLY • link 5.7 years ago by GenoMax 153k

0

Entering edit mode

if it was that easy, not all of them are hiseq1.

ADD REPLY • link 5.7 years ago by User000 ▴ 750

1

Entering edit mode

I try to explain better: LET'S SAY, I have 2 BAM files each with unmapped reads. I extracted the ID's of the reads in those 2 BAM files and compared the list and got the ID list that is present in BAM 1 but not BAM 2. Now I want to extract the FASTQ of these ID's.

You really should have made this clearer. There was no mention of a second file at any point up to now as far as I can tell. You are making your own question almost impossible to answer because you are moving the goalposts and confusing people by asking for different things each time.

You're right about @. Prefixing it with ^ will help a lot, but you can also compose a regex such that the line also has to end with /1 or /2 to make sure no sequence is picked up. Grep or a proper parser are never the less appropriate tools for this.

ADD REPLY • link 5.7 years ago by Joe 22k

0

Entering edit mode

I have many bam files, it was just an example. The problem is that you guys don't try to read and UNDERSTAND the question. I guess it is clear I have ID's from BAM and trying to get them their FASTQ. Anyway grep o parse would not solve the problem from biological point pf view. I will get only one of the two reads right? And when I want to get both reads tht do not map? Should I get it's /1 or /2, the answer is both. Anyway thanks for this discussion.

ADD REPLY • link 5.7 years ago by User000 ▴ 750