
I have a subtitle file that looks like this:

00:00:44:" Myślę, więc jestem".|Kartezjusz, 1596-1650
00:01:01:Trzynaste Pietro
00:01:06:Podobno niewiedza uszczęśliwia.
00:01:10:Po raz pierwszy w życiu|zgadzam się z tym.
00:01:13:Wolałbym...
00:01:15:nigdy nie odkryć|tej straszliwej prawdy.
00:01:19:Teraz już wiem...

I'm not sure what format this is, but I want to convert the subtitles to .srt. Unfortunately, neither gnome-subtitles nor subtitleeditor recognizes this format.

gnome-subtitles says:

Unable to detect the subtitle format. Please check that the file type is supported.

subtitleeditor says:

Please check that the file contains subtitles in a supported format.

file output:

UTF-8 Unicode text

Is there a way to convert this file to .srt format?

terdon
Mikhail Morfikov

2 Answers


This is very similar to @goldilocks' approach but, IMO, simpler. It can also deal with empty lines in the file, and it replaces | with a line break:

#!/usr/bin/env perl
my ($time, $text, $next_time, $next_text);
my ($c, $i) = (0, 0);
while (<>) {
    ## skip bad lines
    next unless /^\s*(\d+:\d\d:\d\d)\s*:(.+)/;
    ## If this is the first line. I could have used $. but this is
    ## safer in case the file contains an empty line at the beginning.
    if ($c == 0) {
      $time=$1; 
      $text=$2;
      $c++;
    }
    else {
      ## This is the counter for the subtitle index
      $i++;
      ## Save the current values
      $next_time=$1; 
      $next_text=$2;     
      ## I am assuming that the | should be interpreted
      ## as a newline, remove this if I'm wrong.
      $text=~s/\|/\n/g;     
      ## Print the previous subtitle
      print "$i\n$time,100 --> $next_time,000\n$text\n\n";        
      ## Save the current one for the next line
      $time=$next_time; $text=$next_text;
    }
}     
## Print the last subtitle. It will be displayed for a minute
## 'cause I'm lazy.
$i++;
$time=~/(\d+:)(\d+)(:\d+)/;
my $newtime=$1 . (sprintf "%02d", $2+1) . $3;
print "$i\n$time,100 --> $newtime,000\n$text\n\n";    

Save the script as a file and make it executable, then run:

./script.pl subfile > good_subs.srt

The output I got for your sample was:

1
00:00:44,100 --> 00:01:01,000
" Myślę, więc jestem".
Kartezjusz, 1596-1650

2
00:01:01,100 --> 00:01:06,000
Trzynaste Pietro

3
00:01:06,100 --> 00:01:10,000
Podobno niewiedza uszczęśliwia.

4
00:01:10,100 --> 00:01:13,000
Po raz pierwszy w życiu
zgadzam się z tym.

5
00:01:13,100 --> 00:01:15,000
Wolałbym...

6
00:01:15,100 --> 00:01:19,000
nigdy nie odkryć
tej straszliwej prawdy.

7
00:01:19,100 --> 00:02:19,000
Teraz już wiem...
terdon
  • The last subtitle in your output ends 0.1 seconds before it starts! – goldilocks Jan 26 '14 at 20:03
  • This works pretty well. I just need to adjust some entries because they're displayed for a little too long. Maybe there's a way to put that in the script, say 5-8 seconds max. If you want to experiment more with the subtitles, I uploaded the file to pastebin: http://pastebin.com/vZP419eG – Mikhail Morfikov Jan 26 '14 at 20:08
  • @goldilocks damn, sorry, forgot `use Time::Machine` :). Thanks, fixed. – terdon Jan 26 '14 at 20:36
  • @MikhailMorfikov it's possible, but it increases the complexity because we would need to manipulate times, so that `1:59 + 20 = 2:19`. That means either complex code or external modules, which seemed beyond the scope of the question. – terdon Jan 26 '14 at 20:38
  • +1 Nice job. For the time you could use an algorithm based on the length of the text string, say 1/2 second per character but not exceeding the start of the next title. – goldilocks Jan 27 '14 at 09:55
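
The time arithmetic discussed in the comments above can be done without external modules by converting each timestamp to seconds and back. What follows is only a sketch under stated assumptions: `to_seconds`, `to_hms` and `end_time` are made-up names, the half second per character comes from the suggestion above, and the 8 second cap is the upper end of the range the asker mentioned. Note that `length` counts bytes unless the input has been decoded, so accented text counts as slightly longer.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util qw(min);

# Convert "HH:MM:SS" to a number of seconds.
sub to_seconds {
    my ($h, $m, $s) = split /:/, shift;
    return $h * 3600 + $m * 60 + $s;
}

# Convert a whole number of seconds back to "HH:MM:SS".
sub to_hms {
    my $t = shift;
    return sprintf "%02d:%02d:%02d",
        int($t / 3600), int($t / 60) % 60, $t % 60;
}

# Hypothetical helper: choose an end time for a subtitle.
# Assumed defaults: 0.5 s per character, at most 8 s, and never
# later than the start of the next subtitle (if there is one).
sub end_time {
    my ($start, $text, $next_start, $per_char, $max) = @_;
    $per_char //= 0.5;
    $max      //= 8;
    # Round to whole seconds so the result maps cleanly to HH:MM:SS.
    my $duration = int(min($max, length($text) * $per_char) + 0.5);
    $duration = 1 if $duration < 1;
    my $end = to_seconds($start) + $duration;
    $end = min($end, to_seconds($next_start)) if defined $next_start;
    return to_hms($end);
}

print to_hms(to_seconds("00:01:59") + 20), "\n";
print end_time("00:00:44", "Trzynaste Pietro", "00:01:01"), "\n";
```

In either script, the end timestamp in the print statement could then be taken from `end_time` instead of reusing the next start time verbatim.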

What Thorsten meant is something like this:

#!/usr/bin/perl
use strict;
use warnings FATAL => qw(all);

my $END = '!!ZZ_END';
my $LastTitleDuration = 5;

my $count = 1;
my $line = <STDIN>;
chomp $line;
my $next = <STDIN>;
while ($line) {
    $next = lastSubtitle($line) if !$next;
    last if !$next;
    chomp $next;
    if ($next !~ m/^\d\d:\d\d:\d\d:.+/) {
        # Report the offending line ($. is its line number), then fetch
        # the next candidate. redo (unlike next) skips the continue
        # block, so the current $line is not thrown away.
        print STDERR "Skipping bad data at line $.:\n$next\n";
        $next = <STDIN>;
        redo;
    }
    printf STDOUT
        "%d\r\n%s,100 --> %s,000\r\n%s\r\n\r\n",
        $count++,
        substr($line, 0, 8),
        substr($next, 0, 8),
        substr($line, 9)
    ;
} continue {
    $line = $next;
    $next = <STDIN>;
}

sub lastSubtitle {
    my $line = shift;
    $line =~ /^(\d\d:\d\d:)(\d\d):(.+)/;
    return 0 if $3 eq $END;
    # Zero-pad the seconds so the result still looks like HH:MM:SS.
    return sprintf("$1%02d:$END", $2 + $LastTitleDuration);
} 

When I feed your sample data into this, I get:

1
00:00:44,100 --> 00:01:01,000
" Myślę, więc jestem".|Kartezjusz, 1596-1650

2
00:01:01,100 --> 00:01:06,000
Trzynaste Pietro

3
00:01:06,100 --> 00:01:10,000
Podobno niewiedza uszczęśliwia.

4
00:01:10,100 --> 00:01:13,000
Po raz pierwszy w życiu|zgadzam się z tym.

5
00:01:13,100 --> 00:01:15,000
Wolałbym...

6
00:01:15,100 --> 00:01:19,000
nigdy nie odkryć|tej straszliwej prawdy.

7
00:01:19,100 --> 00:01:24,000
Teraz już wiem...

A couple of points:

  • Each subtitle starts 1/10th of a second late so that consecutive titles do not overlap (and because I was too lazy to do any math on the second timestamp). Each one then stays on until 1/10th of a second before the next title starts.

  • The last title stays up for $LastTitleDuration (5 seconds).

  • I used CRLF line endings as per the SubRip Wikipedia article, although that may not be necessary.

  • It presumes the first line of input is not malformed. Beyond that, lines are checked, and errors are reported to stderr, so:

    readAlongToSRT.pl < readAlong.txt > whatever.srt
    

    Should create the file but still print errors to the screen.

  • A blank line is treated as bad data: it is reported on stderr and skipped. A blank first line, however, ends processing immediately.

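  • See terdon's comment below re: the possible significance of | in the subtitle content. You may want to insert $line =~ s/\|/\r\n/g; before the printf STDOUT line. Note that the | must be escaped, because an unescaped | is regex alternation and would match the empty string everywhere.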
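
The points above (the 1/10th second offset, the CRLF line endings, and | as a line break) can be pulled together in one small formatting routine. This is only a sketch: `srt_entry` is a made-up name that does not appear in the script, and treating | as a line break is the guess from the comments, not something the input format is known to guarantee.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build one SRT entry as a string.
# Assumes $start and $end are already "HH:MM:SS" strings.
sub srt_entry {
    my ($index, $start, $end, $text) = @_;
    # The pipe has to be escaped; a bare | means alternation in a regex.
    $text =~ s/\|/\r\n/g;
    return sprintf "%d\r\n%s,100 --> %s,000\r\n%s\r\n\r\n",
        $index, $start, $end, $text;
}

print srt_entry(4, '00:01:10', '00:01:13', 'first line|second line');
```

Working on a copy of the text inside the routine leaves `$line` itself untouched, so the `substr` offsets used for the timestamps are not affected.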

This took me 20 minutes and the only test data I had was those 7 lines, so don't count on it being perfect. If there are ever line breaks in the subtitles, that will cause a problem. I presumed there aren't; if that is the case I suggest you remove them from the input first rather than trying to deal with them here.

goldilocks
  • Damn, beat me to it and using the same approach! Nice one, +1. I _think_ that the `|` in the original format should be changed to `\n` but that's just a guess. – terdon Jan 26 '14 at 19:08
  • @terdon Hmmm, yeah that might make sense. – goldilocks Jan 26 '14 at 19:10