I want to count the lines that contain both of the words/patterns "gene" and "+". Is it possible to do this with grep?
-
Will the word `gene` always occur before `+` on the lines that you are interested in? Would the basic regular expression `gene.*+` be enough? Do you need to filter out lines that contain words like `genes` or `thegene` (i.e. where `gene` is just a substring and not its own word)? Can you show some example data? – Kusalananda Oct 15 '20 at 14:41
-
Related: [grep with logic operators](https://unix.stackexchange.com/questions/177513/grep-with-logic-operators) – steeldriver Oct 15 '20 at 14:42
-
You can just look for the first word and pipe the matching lines to another grep for the second word: `grep gene | grep +`. That is a kind of AND operator. You also need to consider all the questions Kusalananda is asking. – nobody Oct 15 '20 at 15:20
-
@glennjackman I think the goal is to get the number of lines and not the line numbers. – nobody Oct 15 '20 at 15:26
-
I read the question more carefully after I commented: I agree. – glenn jackman Oct 15 '20 at 15:26
-
As glenn jackman pointed out, the question is about the number of lines. So `wc` should be used at the end to count the lines: `grep gene | grep + | wc -l`. – nobody Oct 15 '20 at 15:28
-
@nobody or just make the 2nd grep `grep -c +` to count matching lines – steeldriver Oct 15 '20 at 15:36
-
@nobody there's no need for `wc`, you can use `grep -c`. – terdon Oct 15 '20 at 15:48
-
@Kusalananda the word gene always appears as "gene" and is its own word and it always comes before '+'. – Parnian Oct 15 '20 at 15:48
-
Parnian, I gave you an answer assuming gff/gtf files. You might also be interested in our sister site, [bioinformatics.se]. – terdon Oct 15 '20 at 15:49
1 Answer
Yes, you can do this with grep:
grep -c 'gene.*+' file
That will look for lines where gene appears first and then, later on the same line, you also have a +. In a basic regular expression an unescaped + is just a literal plus sign, so it needs no escaping here. The -c flag tells grep to print the number of matching lines. If you also need to find cases where the + comes before gene, you can do:
grep -Ec '(gene.*\+)|(\+.*gene)' file
This, however, will also match things like Eugene+Mary came for dinner which is probably not what you want. Given the words you are looking for, I am guessing that you are looking at gff/gtf files, so you might want to do something more sophisticated and only look for gene in the third field of each line and + in the seventh, on lines that don't start with a # (the gff headers). If this is indeed what you need, you can do:
awk -F"\t" '!/^#/ && $3=="gene" && $7=="+"{c++}END{print c+0}' file
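To see why the field-based check can matter, here is a minimal sketch that runs both approaches on a tiny, made-up GFF-like sample. The rows, the `genefinder` source name, and the `ID`/`Parent` values are invented purely for illustration and are not taken from any real annotation file.

```shell
# Build a small tab-separated sample; every row here is hypothetical.
sample=$(mktemp)
printf '##gff-version 3\n' > "$sample"
printf 'chr1\tgenefinder\tgene\t100\t900\t.\t+\t.\tID=g1\n' >> "$sample"
printf 'chr1\tgenefinder\tmRNA\t100\t900\t.\t+\t.\tParent=g1\n' >> "$sample"
printf 'chr1\tgenefinder\tgene\t1500\t2000\t.\t-\t.\tID=g2\n' >> "$sample"
printf 'chr2\tgenefinder\tgene\t50\t700\t.\t+\t.\tID=g3\n' >> "$sample"

# Any line where "gene" is followed, somewhere later, by a literal "+".
# The mRNA row is counted too, because "genefinder" contains "gene".
grep_count=$(grep -c 'gene.*+' "$sample")

# Only non-header rows whose 3rd field is exactly "gene" and 7th is "+".
awk_count=$(awk -F'\t' '!/^#/ && $3=="gene" && $7=="+"{c++} END{print c+0}' "$sample")

echo "grep: $grep_count  awk: $awk_count"   # prints "grep: 3  awk: 2"
rm -f "$sample"
```

On this sample the grep count is one higher than the awk count, since the mRNA row has `gene` as a substring of the second field. If your files never contain `gene` outside the third column, the two approaches should agree.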
-
For the Eugene case and grep, we can use word boundary markers: `grep -Ec '(\<gene\>.*\+)|(\+.*\<gene\>)' file` – glenn jackman Oct 15 '20 at 17:55