I am using zathura, as I enjoy its minimalist approach, but I would also switch to mupdf or anything else if this would solve my problem.

I need to highlight every word (in PDF and EPUB documents) one by one, from start to finish, because I can concentrate better on the text if there is some kind of motion in it. My approach would have been to perform a regex search that matches every word, but neither zathura nor mupdf supports regex in searches. Is there a way to do this?

I would try to fork zathura, but to be honest I don't really want to spend that amount of time if there is another minimal GNU/Linux-compatible document viewer that does what I need. If there is any way to use terminal tools like pdfgrep to highlight the results in zathura, that would also do the job.

luca
  • `I need to highlight every word from start to finish` - may I ask what's the point of it? – Arkadiusz Drabczyk Mar 29 '20 at 17:24
  • @ArkadiuszDrabczyk Pretty simple, I can concentrate better when a cursor moves through text – luca Mar 29 '20 at 20:48
  • So you don't want to highlight all the words at the same time but just a single word at a time? If so, that needs to be implemented in the PDF reader. – Arkadiusz Drabczyk Mar 29 '20 at 20:55
  • Exactly, I see now that this isn't really clear from my question. – luca Mar 29 '20 at 20:56
  • @luca you want to highlight the word the cursor sits on? – vonbrand Mar 30 '20 at 13:21
  • @vonbrand "cursor" was just the first word that came to my mind to describe something that moves through text, but I wasn't talking about _the_ cursor. I just want something that highlights every single word in a document, one word at a time and in the correct order. – luca Mar 30 '20 at 13:26
  • @luca If you really want to, you can use `pdftotext` on your PDF and pipe the contents of the generated file into [speedread](https://github.com/pasky/speedread). The obvious downside is that you lose the markup. – Devon Apr 01 '20 at 16:28
  • @Devon Well, I've already considered that, but reading technical papers without proper formatting would be a real pain. So I guess I'll have to look for a better solution – luca Apr 01 '20 at 20:31
  • "I see now that this isn't really clear from my question." -- did you know that you can [edit](https://unix.stackexchange.com/posts/576734/edit) your question to make it clearer? As it stands, I see nothing in the title or body of the question that is about what you want to do. – ctrl-alt-delor Apr 05 '20 at 12:20
  • @ctrl-alt-delor I do not think it is that important what I need it for, but I have edited it anyway. – luca Apr 05 '20 at 12:27
  • Your goal is the most important thing of all. All solutions that miss the goal will not be of help to you. **All** solutions that hit the goal will be of use to you. – ctrl-alt-delor Apr 05 '20 at 12:35

2 Answers

Basic text selection

According to the Zathura Wikipedia page:

Zathura can search for text and copy text to the primary X selection

This implies the ability to select text as you read is built in, though it likely requires your mouse (you'll be hard-pressed to find a solution for keyboard-controlled selection).

How minimalist does it need to be? I use Atril, a slightly lighter-weight fork of Evince (the GNOME document viewer). Atril was made as part of the MATE Desktop (a continuation of GNOME 2). It's pretty light, though it does still have a GTK+ dependency.

Another option is the Xpdf application. See also Wikipedia's List of PDF Software § Linux and Unix.

Regex

The only (usable) regex search implementation I know of, aside from command-line tools like pdfgrep, is actually your web browser. It isn't very convenient, but here's a solution in Firefox: open a PDF in Firefox and open the Developer Tools JavaScript Console (F12 or Ctrl+Shift+K). Run these commands:

» pdf = document.getElementById("viewer").innerText.replace(/[ \t]+/g, " ");
» function grep(what, context=100) { return pdf.match(RegExp(`[\\s\\S]{0,${context}}${what}[\\s\\S]{0,${context}}`, "img")); }
» grep("put your regex here")
» grep("get more context", 300)

Note that you'll have to escape your backslashes. The grep command has an optional second argument, the number of characters of context to provide on each side (default=100).
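
As a rough illustration of the escaping rule and the context argument, here is a standalone variant that takes the text as a parameter, so it can be tried outside the browser. The name `grepText` and the sample string are made up for this sketch and are not part of the commands above.

```javascript
// Hypothetical standalone variant of grep(): takes the text explicitly
// instead of reading it from the PDF viewer element.
function grepText(text, what, context = 100) {
  // [\s\S] matches any character, including newlines.
  // Flags: i = ignore case, m = multiline anchors, g = return every match.
  const re = new RegExp(
    `[\\s\\S]{0,${context}}${what}[\\s\\S]{0,${context}}`,
    "img"
  );
  return text.match(re) || [];
}

const sample = "Zathura is minimal.\nMuPDF is minimal too.";

// "\\b" in the string literal becomes \b (a word boundary) in the regex.
console.log(grepText(sample, "\\bmupdf\\b", 5));

// With a context of 0, only the matched text itself is returned.
console.log(grepText(sample, "minimal", 0));
```

A smaller context value keeps the output readable when a pattern matches many times; a larger one helps when the surrounding sentence matters.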

Chrome and other browsers with built-in PDF viewers should be rather similar, but you'll have to figure out which HTML element holds the actual PDF content. It's the id="viewer" element in Firefox; I'm not sure about the others. In the worst case, just use document.body instead of document.getElementById("viewer"), though you may then also match items in the table of contents.
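
Since the underlying goal is to step through a document one word at a time, the extracted text could in principle be walked with a regex that matches every word in order. This is only a sketch under assumptions: `eachWord` is a hypothetical helper, `\S+` is a crude definition of a word, and the code only yields the words and their offsets. It does not highlight anything in a viewer.

```javascript
// Hypothetical helper: yield every whitespace-delimited word in order,
// together with its character offset in the text.
function* eachWord(text) {
  for (const m of text.matchAll(/\S+/g)) {
    yield { word: m[0], index: m.index };
  }
}

const text = "Highlight every word,\none at a time.";

// Print each word with the offset at which it starts.
for (const { word, index } of eachWord(text)) {
  console.log(index, word);
}
```

In the Firefox console, calling `eachWord(pdf)` on the text captured earlier would iterate over the whole document the same way, subject to how well the viewer's text layer preserves reading order.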

Adam Katz
  • About the basic text selection: Yes zathura can do this, I knew about it. But I need the regex, which brings me to the second part of your answer. It is a solution although not the one I was looking for because it is a complex one, but thanks anyway. – luca Apr 06 '20 at 19:11
  • Anyways, I gave you the bounty because you obviously did some work and technically provided me a solution to my problem but I am still not marking this as solved because that is one hacky way to do it and I would honestly prefer an actual pdf reader instead of the browser. But thanks again! – luca Apr 11 '20 at 11:05

If I'm not mistaken, Adobe Acrobat Reader has a function called Read Out Loud, which selects each word from start to finish.

  • Thanks, but Adobe's Linux version is deprecated and thus contains countless security vulnerabilities, so not for me. – luca Apr 11 '20 at 05:52