Keep lines that don't match an awk pattern, edit those that do
2.7 years ago
mrmrwinter ▴ 30

Hi,

I'm trying to edit every line in a file with awk. I'm matching a pattern, and then splitting and removing the section of the line after and including the match.

This is the command I am using:

awk -F '_contigs' '{print $1}' before.txt > after.txt

My problem is that each line in the file that doesn't contain a string matching the given pattern gets passed through as an empty line, so the values in all of those rows are lost from the final output.

Is there something I can pass to awk to tell it to ignore these lines and pass them through unedited? Or is there a way to nest the awk in a loop and only process lines matching the pattern?

For example, I have the following list:

1_unscaffolded
2_unscaffolded
3_unscaffolded
scaffold1_contigs_1234
scaffold2_contigs_5678
scaffold3_contigs_9101112

I want the end result to be:

1_unscaffolded
2_unscaffolded
3_unscaffolded
scaffold1
scaffold2
scaffold3

But what I get with the above code is:

scaffold1
scaffold2
scaffold3

Thanks

*Edited to add example

bash awk sed • 1.9k views
ADD COMMENT
0
Entering edit mode

Without example data, it is difficult to understand your query. Please post some example data and the expected outcome. Since this is not a biology-specific problem, the post may be marked as not relevant to the forum unless you provide the context of the query and example data.

ADD REPLY
0
Entering edit mode

My bad. I have edited the post to add an example.

ADD REPLY
0
Entering edit mode
$ sed -re '/^[^0-9]/ s/_.*//g' test.txt

1_unscaffolded
2_unscaffolded
3_unscaffolded
scaffold1
scaffold2
scaffold3

$ awk '$0 ~ /^[^0-9]/ {gsub("_.*","")}1' test.txt

1_unscaffolded
2_unscaffolded
3_unscaffolded
scaffold1
scaffold2
scaffold3
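
For readers unfamiliar with the awk idioms in the one-liner above, here is a sketch of the same logic spelled out. It assumes, as the example data suggests, that the lines to leave alone all start with a digit; the file name test.txt is taken from the commands above, and the variable name result is only illustrative.

```shell
# Recreate the example input from the question.
printf '%s\n' 1_unscaffolded 2_unscaffolded 3_unscaffolded \
    scaffold1_contigs_1234 scaffold2_contigs_5678 scaffold3_contigs_9101112 > test.txt

# Same logic as the one-liner, written out:
#   /^[^0-9]/   selects lines whose first character is not a digit
#   sub(...)    deletes everything from the first underscore to the end of those lines
#   { print }   runs for every line; the trailing 1 in the one-liner is shorthand for this
result=$(awk '/^[^0-9]/ { sub(/_.*/, "") } { print }' test.txt)
printf '%s\n' "$result"
```

Because this keys on the leading digit rather than on "_contigs", it would also truncate any other non-numeric line that contains an underscore.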
ADD REPLY
1
Entering edit mode
2.7 years ago
# sed seems a bit more appropriate here
sed 's/_contig.*//' test
1_unscaffolded
2_unscaffolded
3_unscaffolded
scaffold1
scaffold2
scaffold3

given this input:

cat test
1_unscaffolded
2_unscaffolded
3_unscaffolded
scaffold1_contigs_1234
scaffold2_contigs_5678
scaffold3_contigs_9101112
ADD COMMENT
1
Entering edit mode

For completeness' sake, here's the awk version:

awk -F "_" '{ if ($2 ~ /contigs/) { print $1 } else { print $0 } }' test
1_unscaffolded
2_unscaffolded
3_unscaffolded
scaffold1
scaffold2
scaffold3
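
A variant that stays closer to the command in the question, as a minimal sketch: it keeps the '_contigs' field separator and adds an explicit guard, so only lines containing "_contigs" are shortened and every other line is printed unchanged. The file names before.txt and after.txt are the ones from the question. As far as I know, awk treats a line without the separator as a single field, so $1 should already be the whole line there; the guard mainly makes the intent explicit.

```shell
# Recreate the example input from the question.
printf '%s\n' 1_unscaffolded 2_unscaffolded 3_unscaffolded \
    scaffold1_contigs_1234 scaffold2_contigs_5678 scaffold3_contigs_9101112 > before.txt

# Lines containing "_contigs": print the part before it and skip to the next line.
# All other lines: fall through to the second block and are printed as they are.
awk -F '_contigs' '/_contigs/ { print $1; next } { print }' before.txt > after.txt

cat after.txt
```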
ADD REPLY
0
Entering edit mode

This is it! Thank you. I had initially tried this but without the full stop. Does the full stop enable the wildcard?

Thanks for the help

ADD REPLY
1
Entering edit mode

Just for the sake of future readers of the post that may not be familiar with the regular expression used here:

sed 's/_contig.*//'

  • s = the sed substitute command
  • the pattern that is searched for comes between the first two slashes /
  • the pattern here is "underscore, followed by 'contig', followed by any character (= .), where 'any character' can occur any number of times, including 0 times (= *)"
  • the replacement text comes between the second and third slashes /; here it is empty, because we want to remove "_contig" and everything that comes after it
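
To illustrate the question about the full stop, here is a small sketch comparing the two patterns on one line from the example. The variable names are only illustrative. Without the dot, the * applies to the preceding "g" rather than acting as a wildcard.

```shell
line='scaffold1_contigs_1234'

# Without the dot: "_conti" followed by zero or more "g", so only "_contig" is removed.
without_dot=$(printf '%s\n' "$line" | sed 's/_contig*//')

# With the dot: "_contig" followed by any characters up to the end of the line.
with_dot=$(printf '%s\n' "$line" | sed 's/_contig.*//')

printf '%s\n' "$without_dot" "$with_dot"
# prints:
# scaffold1s_1234
# scaffold1
```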
ADD REPLY
