
I want a command line program that prints the title of a website. For example:

Alan:~ titlefetcher http://www.youtube.com/watch?v=Dd7dQh8u4Hc

should give:

Why Are Bad Words Bad? 

You give it the url and it prints out the Title.

Anthon
Ufoguy
  • When I download that title I get: "Why Are Bad Words Bad? - Youtube", do you want the "- Youtube" truncated too? – slm Dec 01 '13 at 17:28

12 Answers

56
wget -qO- 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
  perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si'

You can pipe it to GNU recode if there are things like &lt; in it:

wget -qO- 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
  perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si' |
  recode html..

To remove the - youtube part:

wget -qO- 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
 perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)(?: - youtube)?\s*<\/title/si'

To point out some of the limitations:

portability

There is no standard/portable command to do HTTP queries. A few decades ago, I would have recommended lynx -source instead here. But nowadays, wget is more portable as it can be found by default on most GNU systems (including most Linux-based desktop/laptop operating systems). Other fairly portable ones include the GET command that comes with perl's libwww, which is often installed, lynx -source, and to a lesser extent curl. Other common ones include links -source, elinks -source, w3m -dump_source, lftp -c cat...

HTTP protocol and redirection handling

wget may not get the same page as the one that for instance firefox would display. The reason being that HTTP servers may choose to send a different page based on the information provided in the request sent by the client.

The request sent by wget/w3m/GET... is going to be different from the one sent by firefox. If that's an issue, you can alter wget's behaviour with options to change the way it sends the request.

The most important ones here in this regard are:

  • Accept and Accept-language: those tell the server in which language and charset the client would like to get the response. wget doesn't send any by default, so the server will typically reply with its default settings. firefox, on the other hand, is likely configured to request your language.
  • User-Agent: that identifies the client application to the server. Some sites send different content based on the client (though that's mostly for differences between javascript language interpretations) and may refuse to serve you if you're using a robot-type user agent like wget.
  • Cookie: if you've visited this site before, your browser may have permanent cookies for it. wget will not.
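
For instance (a sketch; the header values below are placeholders, not recommendations), wget's --header, --user-agent and --load-cookies options can make the request look more like a browser's:

```shell
# Placeholders: the URL is the question's example; the header values are
# illustrative only.
url='http://www.youtube.com/watch?v=Dd7dQh8u4Hc'
ua='Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0'

# wget -qO- \
#   --header='Accept: text/html' \
#   --header='Accept-Language: en' \
#   --user-agent="$ua" \
#   --load-cookies cookies.txt \
#   "$url"
printf 'would fetch %s as "%s"\n' "$url" "$ua"
```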

wget will follow redirections when they are done at the HTTP protocol level, but since it doesn't look at the content of the page, it won't follow the ones done by javascript or by things like <meta http-equiv="refresh" content="0; url=http://example.com/">.
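
As a crude heuristic (a sketch, not a full solution), one can at least detect such an HTML-level redirect, even if following it is left to the caller:

```shell
# Flag (but don't follow) an HTML-level redirect that wget ignores.
# Prints the number of matching lines read from stdin.
has_meta_refresh() {
  grep -ic '<meta[^>]*http-equiv="refresh"'
}
printf '%s\n' '<meta http-equiv="refresh" content="0; url=http://example.com/">' |
  has_meta_refresh
```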

Performance/Efficiency

Here, out of laziness, we have perl read the whole content in memory before starting to look for the <title> tag. Given that the title is found in the <head> section that is in the first few bytes of the file, that's not optimal. A better approach, if GNU awk is available on your system could be:

wget -qO- 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
  gawk -v IGNORECASE=1 -v RS='</title' 'RT{gsub(/.*<title[^>]*>/,"");print;exit}'

That way, awk stops reading after the first </title, and by exiting, causes wget to stop downloading.

Parsing of the HTML

Here, wget writes the page as it downloads it. At the same time, perl slurps its output (-0777 -n) whole into memory and then prints the HTML code found between the first occurrences of <title...> and </title.

That will work for most HTML pages that have a <title> tag, but there are cases where it won't work.

By contrast, coffeeMug's solution will parse the HTML page as XML and return the corresponding value for title. It is more correct if the page is guaranteed to be valid XML. However, HTML is not required to be valid XML (older versions of the language were not), and because most browsers out there are lenient and will accept incorrect HTML code, there's a lot of incorrect HTML code out there.

Both my solution and coffeeMug's will fail for a variety of corner cases, sometimes the same, sometimes not.

For instance, mine will fail on:

<html><head foo="<title>"><title>blah</title></head></html>

or:

<!-- <title>old</title> --><title>new</title>

While his will fail on:

<TITLE>foo</TITLE>

(valid html, not xml) or:

<title>...</title>
...
<script>a='<title>'; b='</title>';</script>

(again, valid html, missing <![CDATA[ parts to make it valid XML), or:

<title>foo <<<bar>>> baz</title>

(incorrect html, but still found out there and supported by most browsers)

interpretation of the code inside the tags.

That solution outputs the raw text between <title> and </title>. Normally, there should not be any HTML tags in there; there may possibly be comments (though those are not handled by some browsers like firefox, so it's very unlikely). There may still be some HTML encoding:

$ wget -qO- 'http://www.youtube.com/watch?v=CJDhmlMQT60' |
  perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si'
Wallace &amp; Gromit - The Cheesesnatcher Part 1 (claymation) - YouTube

Which is taken care of by GNU recode:

$ wget -qO- 'http://www.youtube.com/watch?v=CJDhmlMQT60' |
  perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si' |
   recode html..
Wallace & Gromit - The Cheesesnatcher Part 1 (claymation) - YouTube

But a web client is also meant to do more transformations on that code when displaying the title (like condensing some of the blanks and removing the leading and trailing ones). However it's unlikely that there'd be a need for that. So, as in the other cases, it's up to you to decide whether it's worth the effort.
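
If you did want that, a rough approximation of the browser's whitespace handling can be sketched with standard tools:

```shell
# Roughly what a browser does to the title before displaying it:
# collapse runs of whitespace to single spaces and trim the ends.
normalize_ws() {
  tr -s '[:space:]' ' ' | sed 's/^ *//; s/ *$//'
}
title=$(printf '  Why  Are Bad\nWords Bad?  ' | normalize_ws)
printf '%s\n' "$title"
```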

Character set

Before UTF-8, iso8859-1 used to be the preferred charset on the web, though strictly speaking non-ASCII characters had to be written as entities like &eacute;. More recent versions of HTTP and the HTML language have added the possibility to specify the character set in the HTTP headers or in the HTML headers, and a client can specify the charsets it accepts. UTF-8 tends to be the default charset nowadays.

So, that means that out there, you'll find é written as &eacute;, as &#233;, as UTF-8 é (0xc3 0xa9), or as iso-8859-1 (0xe9); for the last two, sometimes there is information on the charset in the HTTP headers or the HTML headers (in different formats), sometimes not.

wget only gets the raw bytes, it doesn't care about their meaning as characters, and it doesn't tell the web server about the preferred charset.

recode html.. will take care to convert the &eacute; or &#233; into the proper sequence of bytes for the character set used on your system, but for the rest, that's trickier.
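
If GNU recode isn't available, one alternative (assuming python3 is installed) is Python's stdlib html module, which decodes both named and numeric entities:

```shell
# A stand-in for `recode html..`: decode HTML entities on stdin.
decode_entities() {
  python3 -c 'import html, sys; sys.stdout.write(html.unescape(sys.stdin.read()))'
}
decoded=$(printf 'Wallace &amp; Gromit &#233;' | decode_entities)
printf '%s\n' "$decoded"
```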

If your system charset is utf-8, chances are it's going to be alright most of the time as that tends to be the default charset used out there nowadays.

$ wget -qO- 'http://www.youtube.com/watch?v=if82MGPJEEQ' |
 perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si'
Noir Désir - L&#39;appartement - YouTube

That é above was a UTF-8 é.

But if you want to cover for other charsets, once again, it would have to be taken care of.

It should also be noted that this solution won't work at all for UTF-16 or UTF-32 encoded pages.

To sum up

Ideally, what you need here is a real web browser to give you the information. That is, you need something that does the HTTP request with the proper parameters, interprets the HTTP response correctly, fully interprets the HTML code as a browser would, and returns the title.

As I don't think that can be done on the command line with the browsers I know (though see now this trick with lynx), you have to resort to heuristics and approximations, and the one above is as good as any.

You may also want to take into consideration performance, security... For instance, to cover all the cases (for instance, a web page that has some javascript pulled from a 3rd party site that sets the title or redirect to another page in an onload hook), you may have to implement a real life browser with its dom and javascript engines that may have to do hundreds of queries for a single HTML page, some of which trying to exploit vulnerabilities...

While using regexps to parse HTML is often frowned upon, here is a typical case where it's good enough for the task (IMO).

Stéphane Chazelas
  • Does it download the images from the pages too? Also will it leave junk html files behind? – Ufoguy Dec 01 '13 at 13:25
  • I don't understand "You pipe to to recode html.. if there are things like < in it.", can you give examples please? Thanks. – That Brazilian Guy Dec 01 '13 at 13:51
  • Ufoguy: No, just the html – Michael Durrant Dec 01 '13 at 15:27
  • That Brazilian Guy: pipe is a basic unix concept, tl;dr too long to explain here. Basically you send the output to the next command instead of the screen. recode is a program to recode stuff – Michael Durrant Dec 01 '13 at 15:30
  • recode works on files so you get the output 'piped' (saved) in a file and then you use recode on it. I had to install recode with apt-get. – Michael Durrant Dec 01 '13 at 15:32
  • You probably want to terminate the title at the first instance of `<` since titles are not guaranteed to have end tags and any other tag should force its termination. You may also want to strip new lines. – Brian Nickel Dec 01 '13 at 16:53
  • This didn't return anything for me when I ran it. Is it working for others? wget ver: 1.12, Perl ver: v5.14.0. It literally returns nothing and drops to a new prompt. I think the issue is with `wget`. If I run it without the pipe I get binary data dumped to the console. – slm Dec 01 '13 at 17:24
  • Worked on Ubuntu 12.10, so something changed with `wget` at some point. wget ver: 1.13.4 – slm Dec 01 '13 at 18:07
  • It is not recommended to use regular expressions to parse HTML. Ever. Not even in this case. It's a bad habit. Use a real parser instead. There is a famous humorous Stackoverflow answer about this... – Robin Green Dec 01 '13 at 19:38
  • @RobinGreen That post was about using regex to parse a non-regular language. There are caveats, but this is a problem that is easily reduced to a regular language. I recommend using regex to parse HTML. Sometimes. In this case. – Brian Nickel Dec 01 '13 at 20:40
  • Does `wget` support Unicode? – Ivan Chau Dec 02 '13 at 06:39
  • @BrianNickel, even firefox doesn't support a title without the end tag, so I wouldn't bother covering for that. – Stéphane Chazelas Dec 02 '13 at 10:23
  • @RobinGreen, you need an HTML browser, here. An XML parser is going to be a worse heuristic than my regexp one. – Stéphane Chazelas Dec 02 '13 at 10:26
  • @StephaneChazelas HTML parsers exist, I did not say XML parser. There are about thousands of different languages that can be parsed, XML is only one of them! – Robin Green Dec 02 '13 at 10:38
  • @RobinGreen: and the number of HTML parsers that work for almost everything is approximately 3. – ninjalj Dec 02 '13 at 15:02
  • And the number of regular expressions that work for almost everything is approximately 0. – Robin Green Dec 02 '13 at 15:20
  • What does the `..` mean at the end of `| recode html..` ? – ruohola Apr 05 '21 at 22:23
  • Ok, so the `..` is just a more explicitly saying that `html` is the **input** encoding http://www.chiark.greenend.org.uk/doc/recode-doc/Tutorial.html#Tutorial – ruohola Apr 05 '21 at 22:31
28

You can also try hxselect (from HTML-XML-Utils) with wget as follows:

wget -qO- 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' | hxselect -s '\n' -c  'title' 2>/dev/null

You can install hxselect in Debian based distros using:
sudo apt-get install html-xml-utils.

STDERR redirection is to avoid the Input is not well-formed. (Maybe try normalize?) message.

In order to get rid of "- YouTube", pipe the output of the above command to awk '{print substr($0, 1, length($0)-10)}' (awk string indexing starts at 1, so starting at 0 would drop a character).

Vombat
  • "hxselect" doesn't seem to be installed on Ubuntu by default. I'm even unable to find it in my existing repositories. How do I install it? – Ufoguy Dec 01 '13 at 12:55
  • `sudo apt-get install html-xml-utils` – Vombat Dec 01 '13 at 14:06
  • I get this error on Ubuntu 12.10 "Input is not well-formed. (Maybe try normalize?)" – slm Dec 01 '13 at 18:00
  • You'll likely want to add `-i` to `hxselect` too for the same reasons that manatwork mentioned in my A, otherwise it won't match `<TITLE>`. – slm Dec 01 '13 at 18:09
  • I haven't found what to do with the msg. about normalizing the output. No such switch on `hxselect`. – slm Dec 01 '13 at 18:10
  • @slm Just direct the error to `/dev/null`. I guess it is a bug in pop function of `hxselect`. – Vombat Dec 01 '13 at 18:16
  • For the Mac OS X folks [Homebrew](http://brew.sh/) has a formula with hxselect in it. Install with `brew install html-xml-utils`. – Sukima Jul 02 '14 at 02:15
  • For those getting the "Maybe try normalize?" message try piping the HTML through `hxnormalize -l 3000 -x 2>/dev/null` before piping it through `hxselect`. – frederickjh Apr 04 '19 at 21:13
  • I was having issues with hxselect completely refusing to work on some poorly-structured HTML, so I had to pipe it through hxclean first (e.g. `wget -qO- 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' | hxclean | hxselect -s '\n' -c 'title' 2>/dev/null`). – Hayden Schiff Sep 22 '19 at 05:52
23

You can also use curl and grep to do this. You'll need to enlist the use of PCRE (Perl Compatible Regular Expressions) in grep to get the look behind and look ahead facilities so that we can find the <title>...</title> tags.

Example

$ curl 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' -so - | \
    grep -iPo '(?<=<title>)(.*)(?=</title>)'
Why Are Bad Words Bad? - YouTube

Details

The curl switches:

  • -s = silent
  • -o - = send output to STDOUT

The grep switches:

  • -i = case insensitivity
  • -o = Return only the portion that matches
  • -P = PCRE mode

The pattern to grep:

  • (?<=<title>) = a look-behind: the match must be immediately preceded by <title>
  • (?=</title>) = a look-ahead: the match must be immediately followed by </title>
  • (.*) = everything in between <title>..</title>.
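
The pattern can be exercised without network access by substituting a canned page for the curl output (this assumes a grep built with PCRE support, as -P requires):

```shell
# Same grep invocation as above, fed a canned page instead of curl output.
title=$(printf '<html><head><title>Why Are Bad Words Bad? - YouTube</title></head></html>' |
  grep -iPo '(?<=<title>)(.*)(?=</title>)')
printf '%s\n' "$title"
```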

More complex situations

If <title>...</title> spans multiple lines, then the above won't find it. You can mitigate this situation by using tr to delete any \n characters, i.e. tr -d '\n'.

Example

Sample file.

$ cat multi-line.html 
<html>
<title>
this is a \n title
</TITLE>
<body>
<p>this is a \n title</p>
</body>
</html>

And a sample run:

$ curl 'http://www.jake8us.org/~sam/multi-line.html' -so - | \
     tr -d '\n' | \
     grep -iPo '(?<=<title>)(.*)(?=</title>)'
this is a \n title

lang=...

If the <title> is set like this, <title lang="en"> then you'll need to remove this prior to greping it. The tool sed can be used to do this:

$ curl 'http://www.jake8us.org/~sam/multi-line.html' -so - | \
     tr -d '\n' | \
     sed 's/ lang="\w\+"//gi' | \
     grep -iPo '(?<=<title>)(.*)(?=</title>)'
this is a \n title

The above finds the case-insensitive string lang= followed by a quoted word sequence, and strips it out.

A real HTML/XML Parser - using Ruby

At some point regex will fail in solving this type of problem. If that occurs then you'll likely want to use a real HTML/XML parser. One such parser is Nokogiri. It's available in Ruby as a Gem and can be used like so:

$ curl 'http://www.jake8us.org/~sam/multi-line.html' -so - | \
    ruby -rnokogiri -e \
     'puts Nokogiri::HTML(readlines.join).xpath("//title").map { |e| e.content }'

this is a \n title

The above is parsing the data that comes via curl as HTML (Nokogiri::HTML). The method xpath then looks for nodes (tags), anywhere in the document (//), with the name title. For each one found, we want to return its content (e.content). puts then prints them out.

A real HTML/XML Parser - using Perl

You can also do something similar with Perl and the HTML::TreeBuilder::XPath module.

$ cat title_getter.pl
#!/usr/bin/perl

use HTML::TreeBuilder::XPath;

$tree = HTML::TreeBuilder::XPath->new_from_url($ARGV[0]); 
($title = $tree->findvalue('//title')) =~ s/^\s+//;
print $title . "\n";

You can then run this script like so:

$ ./title_getter.pl http://www.jake8us.org/~sam/multi-line.html
this is a \n title 
slm
  • Neat solution! :) – Vombat Dec 01 '13 at 17:44
  • @coffeMug - thanks, seemed more straight forward than the other 2 methods. – slm Dec 01 '13 at 17:47
  • Parsing HTML with regular expressions is not so simple. Tags written as “<TITLE>”, “<title lang="en">”, “<title\n>” will not be matched by your expression. Even bigger problem, neither “<title>\noops\n</title>” will be. – manatwork Dec 01 '13 at 18:00
  • @manatwork - thanks, true, the case sensitivity can be dealt w/ easily w/ the `-i` switch to `grep`. The other items you noted will definitely be trickier. They could be dealt with but you'll have to deal with them as you run into them. The `\n` could be incorporated into the regex. – slm Dec 01 '13 at 18:04
  • @manatwork - fixed the newline issue & the case sensitivity. – slm Dec 01 '13 at 18:39
  • Attempting to parse html using regex [tends to be frowned upon](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) round here. – user3490 Dec 01 '13 at 21:50
  • @user3490 - it's perfectly fine to do so, assuming you understand the risks! – slm Dec 01 '13 at 23:04
  • The regex could be: `(?<=<title[^>]*>\s*)([^<]*)` which assumes: 1) there can be anything inside the open `<title>` tag, including but not limited to a `lang` attribute. 2) no other tag should be inside the `<title>` pair. `[^<]*` might fail for malformed html in which there are tags inside the title, but, on the other hand, works better than `.*` if there are line breaks. – Carlos Eugenio Thompson Pinzón Dec 01 '13 at 23:40
  • Note that `-o` is GNU `grep` only. `-P` is only on GNU greps built with PCRE support. `\n` should be converted to space, not removed, `lang` is not the only attribute the `title` tag supports, and the proper syntax is `` anyway. – Stéphane Chazelas Dec 02 '13 at 12:26
  • @StephaneChazelas - thanks for the feedback, what's the harm in removing `\n`? – slm Dec 02 '13 at 13:06
  • @slm, `Unix\nLinux` is meant to be `Unix Linux`, not `UnixLinux`. – Stéphane Chazelas Dec 02 '13 at 13:10
7

Using a simple regex to parse HTML is naive: e.g. it breaks on newlines and ignores any special character encoding specified in the file. Do the right thing and really parse the page using one of the real parsers mentioned in the other answers, or use the following one-liner:

python -c "import bs4, urllib2; print bs4.BeautifulSoup(urllib2.urlopen('http://www.crummy.com/software/BeautifulSoup/bs4/doc/')).title.text"

(The above includes a Unicode character).

BeautifulSoup also handles a lot of incorrect HTML (e.g. missing closing tags) that would completely throw off simplistic regexing. You can install it into a standard Python using:

pip install beautifulsoup4

or if you don't have pip, with

easy_install beautifulsoup4

Some operating systems like Debian/Ubuntu also have it packaged (python-bs4 package on Debian/Ubuntu).
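
Note the one-liner above is Python 2 (urllib2 and the print statement); on Python 3, bs4 works the same way with urllib.request. As an offline, stdlib-only sketch of a similarly tolerant parser (all names here are illustrative, not from bs4):

```shell
title=$(python3 - <<'EOF'
# A tolerant title extractor using only the stdlib html.parser module.
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''
    def handle_starttag(self, tag, attrs):
        if tag == 'title':          # tag names arrive lowercased
            self.in_title = True
    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.title += data

p = TitleParser()
# Canned page; a real script would feed it the downloaded HTML instead.
# Note the upper-case tag and the missing closing tags are tolerated.
p.feed('<html><head><TITLE>Why Are Bad Words Bad? - YouTube</TITLE>')
print(p.title)
EOF
)
printf '%s\n' "$title"
```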

Stéphane Chazelas
Zelda
  • `bs4` is not in the python standard library. You have to install it using `easy_install beautifulsoup4` (not `easyinstall bs4`). – Anthon Dec 02 '13 at 08:07
  • @Anthon included your info – Zelda Dec 02 '13 at 08:20
  • Prints the `SyntaxError: Missing parentheses in call to 'print'. Did you mean print(...)?` error on Python 3.11.3 / beautifulsoup4-4.12.2-1 – user598527 Jun 30 '23 at 07:47
  • [A more recent answer](https://unix.stackexchange.com/a/563920/172800) may work, I tested with `python -c "import bs4, urllib2; print bs4.BeautifulSoup(urllib2.urlopen('https://example.com')).title.text"` – user598527 Jun 30 '23 at 07:49
6

Maybe it's "cheating" but one option is pup, a command line HTML parser.

Here are two ways to do it:

Using the meta field with the property="og:title" attribute

$ wget -q 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' -O - | \
> pup 'meta[property=og:title] attr{content}'
Why Are Bad Words Bad?

and another way using the title field directly (and then lopping off the - YouTube string at the end).

$ wget -q 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' -O - | \
> pup 'title text{}' | sed 's/ - YouTube$//'
Why Are Bad Words Bad?
abetusk
5

Simple way:

curl -s example.com | grep -o "<title>[^<]*" | tail -c+8

Few alternatives:

curl -s example.com | grep -o "<title>[^<]*" | cut -d'>' -f2-
wget -qO- example.com | grep -o "<title>[^<]*" | sed -e 's/<[^>]*>//g'
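
These can also be checked offline with a canned page; note that <title> is 7 bytes, so tail -c+8 prints from the 8th byte onwards:

```shell
# Same extraction as the "simple way" above, minus the network fetch.
title=$(printf '<html><title>Example Domain</title></html>' |
  grep -o "<title>[^<]*" | tail -c+8)
printf '%s\n' "$title"
```
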
kenorb
4

It seems to be possible with lynx using this trick:

lynx 3>&1 > /dev/null -nopause -noprint -accept_all_cookies \
  -cmd_script /dev/stdin<<'EOF' 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc'
set PRINTER=P:printf '%0s\\n' "$LYNX_PRINT_TITLE">&3:TRUE
key p
key Select key
key ^J
exit
EOF

Because that's a real life web browser, it doesn't suffer from many of the limitations I mention in my other answer.

Here, we're using the fact that lynx sets the $LYNX_PRINT_TITLE environment variable to the title of the current page when printing the page.

Above, we use lynx's scripting facility (with the script passed on stdin via a here-document) to:

  1. define a lynx "printer" called P that just outputs the content of that variable to file descriptor 3 (that file descriptor is redirected to lynx's stdout with 3>&1 while lynx stdout is itself redirected to /dev/null).
  2. simulate the user pressing p, and the End (aka Select), and Enter (^J).
Stéphane Chazelas
2

A python3 + beautifulsoup example might be

python3 -c "import bs4, requests; print(bs4.BeautifulSoup(requests.get('http://www.crummy.com/software/BeautifulSoup/bs4/doc/').content).title.text)"
Nik
1

I liked Stéphane Chazelas' idea of using Lynx and LYNX_PRINT_TITLE, but that script didn't work for me under Ubuntu 14.04.5.

I have made a simplified version of it by running Lynx with files pre-configured in advance.

Add the following line to /etc/lynx-cur/lynx.cfg (or wherever your lynx.cfg resides):

PRINTER:P:printenv LYNX_PRINT_TITLE>/home/account/title.txt:TRUE:1000

This line instructs Lynx to save the title, while printing, to "/home/account/title.txt" - you may choose any file name you wish. If you request VERY large pages, increase the above value from "1000" to any number of lines per page you want, otherwise Lynx will show an additional prompt "when printing document containing very large number of pages".

Then create the /home/account/lynx-script.txt file with the following contents:

key p
key Select key
key ^J
exit

Then run Lynx using the following command-line options:

lynx -term=vt100 -display_charset=utf-8 -nopause -noprint -accept_all_cookies -cmd_script=/home/account/lynx-script.txt "http://www.youtube.com/watch?v=Dd7dQh8u4Hc" >/dev/null

Upon completion of this command, the file /home/account/title.txt will be created with the title of your page.

Long story short, here is a PHP function that returns a page title based on the given URL, or false in case of error.

function GetUrlTitle($url)
{
  $title_file_name = "/home/account/title.txt";
  if (file_exists($title_file_name)) unlink($title_file_name); // delete the file if exists
  $cmd = '/usr/bin/lynx -cfg=/etc/lynx-cur/lynx.cfg -term=vt100 -display_charset=utf-8 -nopause -noprint -accept_all_cookies -cmd_script=/home/account/lynx-script.txt '.escapeshellarg($url); // escapeshellarg() guards against shell metacharacters in the URL
  exec($cmd, $output, $retval);
  if (file_exists($title_file_name))
  {
    $title = file_get_contents($title_file_name);
    unlink($title_file_name); // delete the file after reading
    return $title;
  } else
  {
    return false;
  }
}

print GetUrlTitle("http://www.youtube.com/watch?v=Dd7dQh8u4Hc");
1

Using nokogiri, one can use a simple CSS-based query to extract the inner text of the <title> tag:

 $ nokogiri -e 'puts $_.at_css("title").content' 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc'
 Why Are Bad Words Bad? - YouTube

Similarly, to extract the value of the "content" attribute of the <meta name="title"> tag:

$ nokogiri -e 'puts $_.at_css("meta[name=title]").attr("content")' 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc'
Why Are Bad Words Bad?
peak
0

Using xidel:

$ xidel -s http://www.youtube.com/watch?v=Dd7dQh8u4Hc --css title
Why Are Bad Words Bad? - YouTube

If necessary, apt install xidel or similar.

JJoao
0

Using htmlq:

curl --silent "https://www.youtube.com/watch?v=Dd7dQh8u4Hc" | htmlq --text title

If you do not have it installed, htmlq can be built with cargo:

cargo install htmlq
Edgar Magallon