
I have downloaded the KingBase Lite 2018 Update 3 file from here. I now want to extract data from a single event such as the "FIDE Candidates 2018": I want to get all the paragraphs containing this text and the paragraph below it, so I have the whole pgn for each game.

To first just get the paragraph that contains the text, I followed these recommendations.

However, when I try `awk -v RS='' -v ORS='\n\n' '/FIDE Candidates 2018/' KingBaseLite2018-03.pgn`, it just prints the whole file. When I search for a word that does not exist, it prints nothing. So I assume the search itself works, but the input is somehow not being split into paragraphs at blank lines. There might be something unusual about the newline characters in that file. When I try other suggestions from the above link, such as using perl, I get the same result.

What can I do to get the paragraph now? And how can I include one paragraph below as well?

maddingl
  • Might be due to your use of `RS`? Anyhow, you should post a sample input and desired output to help those who want to help you reproduce your issue and test the solutions. – simlev Jun 06 '18 at 10:29
  • The problem seems to be that the file has Windows-style CRLF line-endings – steeldriver Jun 06 '18 at 11:35
  • ... although it's not entirely equivalent, probably setting `RS='\r\n\r\n'` is sufficient in this case – steeldriver Jun 06 '18 at 11:54

1 Answer


I downloaded and unzipped the file, and its line endings are CRLF, so you need to account for that, either by converting the file with a tool like `fromdos`, or, if you don't want to modify the file, by telling Perl to do the translation with its `:crlf` PerlIO layer, which is what I'm doing below with the `PERLIO` environment variable. (There are other ways to change the layers, but this one was easiest for a one-liner.)
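As a side check of the CRLF diagnosis, here is a minimal sketch on a made-up two-game sample (the sample contents and the variable names `sample`, `crlf_records` and `lf_records` are invented for illustration and are not taken from the KingBase file). With the carriage returns present, a "blank" line actually contains `\r`, so awk's paragraph mode should never find a separator and should see a single record; once `tr` strips the CRs, each paragraph should become its own record.

```shell
# Build a tiny PGN-like sample with Windows-style CRLF line endings.
sample=$(mktemp)
printf '[Event "A"]\r\n\r\n1. e4 e5 1-0\r\n\r\n[Event "B"]\r\n\r\n1. d4 d5 0-1\r\n' > "$sample"

# Paragraph mode (RS='') splits on empty lines; a line holding only \r is not empty.
crlf_records=$(awk -v RS='' 'END { print NR }' "$sample")

# Strip the carriage returns first, then count the paragraphs again.
lf_records=$(tr -d '\r' < "$sample" | awk -v RS='' 'END { print NR }')

echo "with CRLF: $crlf_records record(s); after stripping CR: $lf_records record(s)"
rm -f "$sample"
```

If the first count is 1 on the real file as well, the line endings are the likely culprit rather than the search pattern.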

I'm using the flip-flop operator `...` to extract only the paragraph that matches the regex plus the following one that matches `/^1\./` (since all the paragraphs in the file start with either `[` or `1.`).

wget http://kingbase-chess.net/download/650 -O KingBaseLite2018-03.zip
unzip KingBaseLite2018-03.zip
PERLIO=:crlf perl -00ne 'print if /"FIDE Candidates 2018"/.../^1\./' KingBaseLite2018-03.pgn
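
For anyone who would rather stay with awk, as in the question, a rough equivalent might look like the sketch below. It is a different technique from the Perl flip-flop: it strips the carriage returns with `tr` and uses a flag (`want_moves`, a name made up here) to print the paragraph right after each match. The sample games are invented placeholders, and the sketch assumes, as above, that every header paragraph is immediately followed by its move paragraph.

```shell
# Invented three-game sample with CRLF line endings (placeholder data only).
sample=$(mktemp)
printf '[Event "Other Open"]\r\n[Site "?"]\r\n\r\n1. e4 e5 1-0\r\n\r\n[Event "FIDE Candidates 2018"]\r\n[Site "?"]\r\n\r\n1. d4 d5 1/2-1/2\r\n\r\n[Event "Another"]\r\n\r\n1. c4 0-1\r\n' > "$sample"

# Drop the CRs, then read paragraph by paragraph:
# print a matching header paragraph and the single paragraph that follows it.
extracted=$(tr -d '\r' < "$sample" |
  awk -v RS='' -v ORS='\n\n' '
    /"FIDE Candidates 2018"/ { print; want_moves = 1; next }
    want_moves               { print; want_moves = 0 }
  ')

printf '%s\n' "$extracted"
rm -f "$sample"
```

On the real file, the same awk program would read from `tr -d '\r' < KingBaseLite2018-03.pgn` instead of the sample, which leaves the original file unmodified.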
haukex