Regex search in PDF reader

Question

I am using zathura, as I enjoy its minimalist approach, but I would also switch to mupdf or anything else if this would solve my problem.

I need to highlight every word (in PDF and epub documents) one by one from start to finish, because I can concentrate better on the text if there is some kind of motion in it. My approach would have been to perform a regex search that matches every word, but neither zathura nor mupdf supports regex in searches. Is there a way to do this?

I would try to fork zathura, but to be honest I don't really want to spend that amount of time if there is another minimal GNU/Linux-compatible document viewer that does what I need. If there is any way to use terminal tools like pdfgrep to highlight the results in zathura, that would also do the job.

`I need to highlight every word from start to finish` - may I ask what's the point of it? — Arkadiusz Drabczyk, Mar 29 '20 at 17:24
@ArkadiuszDrabczyk Pretty simple, I can concentrate better when a cursor moves through text — luca, Mar 29 '20 at 20:48
So you don't want to highlight all the words at the same time but just a single word at a time? If so that needs to be implemented in the PDF reader. — Arkadiusz Drabczyk, Mar 29 '20 at 20:55
Exactly, I see now that this isn't really clear from my question. — luca, Mar 29 '20 at 20:56
@vonbrand the word cursor was just the first word that came to mind to describe something that moves through text, but I wasn't talking about _the_ cursor. I just want something that highlights every single word in a document, one word at a time and in the correct order. — luca, Mar 30 '20 at 13:26
@luca If you really want to, you can use `pdftotext` on your PDF and pipe the contents of the generated file into [speedread](https://github.com/pasky/speedread). The obvious downside is that you lose the markup. — Devon, Apr 01 '20 at 16:28
@Devon Well I've already considered that but reading technical papers without proper formatting would be a real pain. So I guess I'll have to look for a better solution — luca, Apr 01 '20 at 20:31
"I see now that this isn't really clear from my question." -- did you know that you can [edit](https://unix.stackexchange.com/posts/576734/edit) your question to make it clearer? As it stands, I see nothing in the title or body of the question that is about what you want to do. — ctrl-alt-delor, Apr 05 '20 at 12:20
@ctrl-alt-delor I do not think it is that important what I need it for but I have edited it anyway. — luca, Apr 05 '20 at 12:27
Your goal is the most important thing of all. All solutions that miss the goal will not be of help for you. **All** solutions that hit the goal will be of use to you. — ctrl-alt-delor, Apr 05 '20 at 12:35

Adam Katz · Answer 1 · 2022-11-29T18:13:34.920

Basic text selection

According to the Zathura Wikipedia page:

Zathura can search for text and copy text to the primary X selection

This implies the ability to select text as you read is built in, though it likely requires your mouse (you'll be hard-pressed to find a solution for keyboard-controlled selection).

How minimalist do you need it to be? I use Atril, a slightly lighter-weight fork of Evince (the GNOME document viewer). Atril was made as part of the MATE Desktop (a continuation of GNOME 2). It's pretty light, though it does still have a GTK+ dependency.

Another option is the Xpdf application. See also Wikipedia's List of PDF Software § Linux and Unix.

Regex

The only (usable) regex search implementation I know of, aside from command-line tools like pdfgrep, is actually your web browser. It isn't very convenient, but here's a solution in Firefox: open a PDF in Firefox and open the Developer Tools JavaScript console (F12 or Ctrl+Shift+K), then run these commands:

» pdf = document.getElementById("viewer").innerText.replace(/[ \t]+/g, " ");
» function grep(what, context=100) { return pdf.match(RegExp(`[\\s\\S]{0,${context}}${what}[\\s\\S]{0,${context}}`, "img")); }
» grep("put your regex here")
» grep("get more context", 300)

Note that you'll have to escape your backslashes, since the pattern is passed as a JavaScript string (write \\d rather than \d). The grep command has an optional second argument, the number of characters of context to provide on each side (default=100).
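To try the idea outside a PDF viewer, here is a minimal sketch of the same helper that takes the text as a parameter instead of reading it from the page. The names `grepText` and `sampleText` are made up for this illustration and are not part of the answer's console commands. It also shows why the backslashes need doubling.

```javascript
// Sketch only: grepText and sampleText are hypothetical names for illustration.
// Same pattern as the console helper, but the text is passed in explicitly.
function grepText(text, what, context = 100) {
  // [\s\S] matches any character, including newlines.
  // Flags: g = return all matches, i = ignore case, m = ^/$ match at line breaks.
  const re = new RegExp(
    `[\\s\\S]{0,${context}}${what}[\\s\\S]{0,${context}}`,
    "gim"
  );
  // String.prototype.match returns null when nothing matches; normalise to [].
  return text.match(re) || [];
}

const sampleText = "Zathura is a document viewer.\nMuPDF is another viewer.";

// Doubled backslashes: the regex engine receives \bviewer\b (word boundaries).
console.log(grepText(sampleText, "\\bviewer\\b", 5));

// Single backslashes: "\b" in a string literal is a backspace character,
// so this pattern looks for literal backspaces and finds nothing.
console.log(grepText(sampleText, "\bviewer\b", 5));
```

Passing the text as an argument is only a convenience for experimenting; in the browser console the `pdf` variable from the commands above plays that role.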

Chrome and other browsers with built-in PDF viewers should be rather similar, but you'll have to figure out which HTML element holds the actual PDF content. It's the id="viewer" element in Firefox; I'm not sure about the others. In the worst case, just use document.body instead of document.getElementById("viewer"), in which case you may also match items in the table of contents.
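The question's original idea, a regex that matches every word in reading order, can be sketched in the same console. This is only an illustration under assumptions: `words` and `demo` are hypothetical names, a "word" is taken to be any run of non-whitespace characters, and the code only lists the words with their offsets. It does not highlight anything in the rendered page.

```javascript
// Sketch only: words and demo are hypothetical names for illustration.
// Yields each word together with its character offset, in document order.
function* words(text) {
  // \S+ matches one run of non-whitespace characters; the g flag is
  // required by String.prototype.matchAll.
  for (const m of text.matchAll(/\S+/g)) {
    yield { word: m[0], index: m.index };
  }
}

const demo = "one word\nat a time";

// Step through the words one at a time.
for (const w of words(demo)) {
  console.log(w.index, w.word);
}
```

In the Firefox console, `words(pdf)` could be used with the `pdf` variable defined earlier, assuming the viewer has already rendered the text of the pages of interest.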

About the basic text selection: Yes zathura can do this, I knew about it. But I need the regex, which brings me to the second part of your answer. It is a solution although not the one I was looking for because it is a complex one, but thanks anyway. — luca, Apr 06 '20 at 19:11
Anyways, I gave you the bounty because you obviously did some work and technically provided me a solution to my problem but I am still not marking this as solved because that is one hacky way to do it and I would honestly prefer an actual pdf reader instead of the browser. But thanks again! — luca, Apr 11 '20 at 11:05

score 1 · Answer 2 · answered Apr 10 '20 at 21:18

If I'm not mistaken, Adobe Acrobat Reader has a function called Read Out Loud, which selects each word from start to finish.

andromeda-1865

Thanks, but Adobe's Linux version is deprecated and thus contains countless security vulnerabilities, so not for me. – luca Apr 11 '20 at 05:52
