
One major shortcoming of curl is that more and more webpages have their main content rendered by a JavaScript AJAX call that runs after the initial HTTP response. curl never picks up on this post-rendered content.

So to fetch these types of webpages from the command line, I've been reduced to writing Ruby scripts that drive Selenium RC to fire up a Firefox instance and return the source HTML after the AJAX calls have completed.
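The kind of script described above might look like the following minimal sketch. It uses the `selenium-webdriver` gem (the successor to the Selenium RC API mentioned); the wait condition and grace period are assumptions you would tune for the page at hand:

```ruby
# Sketch: print a page's HTML after its AJAX content has rendered.
# Assumes the selenium-webdriver gem and a local Firefox install.
begin
  require 'selenium-webdriver'
rescue LoadError
  warn 'install the selenium-webdriver gem first'
end

def rendered_source(url, wait_seconds = 10)
  driver = Selenium::WebDriver.for :firefox
  begin
    driver.get(url)
    # Crude heuristic: wait until the document reports it has loaded,
    # then allow a short grace period for AJAX calls to finish.
    Selenium::WebDriver::Wait.new(timeout: wait_seconds).until do
      driver.execute_script('return document.readyState') == 'complete'
    end
    sleep 2
    driver.page_source
  ensure
    driver.quit
  end
end

puts rendered_source(ARGV[0]) if __FILE__ == $0 && !ARGV.empty?
```

Run as `ruby fetch.rb http://example.com` to dump the post-AJAX HTML to stdout, curl-style.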

It would be much better to have a leaner command line solution for this type of problem. Does anyone know of any?

dan
  • No one's suggested anything else on [Does anybody here have experience in automating some tasks in web applications using curl?](http://unix.stackexchange.com/questions/11296/does-anybody-here-have-experience-in-automating-some-tasks-in-web-applications-us), but that question wasn't specifically asking about scraping Javascript. – Gilles 'SO- stop being evil' Apr 28 '11 at 21:54

2 Answers


Have you considered Watir?

http://watir.com/

Once you've added the gem, you can run it as a standalone script or from irb, line by line, after `require 'watir-webdriver'`. I've found it to be more responsive than selenium-webdriver, though it lacks a test-recording GUI to help work out complex test conditions.
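A minimal sketch of the same fetch with watir-webdriver (the element selector in the comment is a placeholder for whatever your page's AJAX-rendered content uses):

```ruby
# Sketch: fetch a page's post-AJAX HTML with watir-webdriver.
begin
  require 'watir-webdriver'
rescue LoadError
  warn 'install the watir-webdriver gem first'
end

def watir_source(url)
  browser = Watir::Browser.new :firefox
  begin
    browser.goto(url)
    # If you know which element the AJAX response fills in, wait for it
    # explicitly, e.g.:
    #   browser.div(id: 'content').wait_until_present
    browser.html
  ensure
    browser.close
  end
end
```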

Kevin

I just recently started using the WebDriver from Selenium 2 in Java. There is a driver called HtmlUnitDriver that fully supports JavaScript but does not fire up an actual browser.

It is not a light solution but it does get the job done.

I've designed the code to run from the command line and save the web data to files.
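The answer's own code is Java, but the same headless HtmlUnit setup can be reached from Ruby through a remote Selenium 2 server. This is a hedged sketch, assuming a `selenium-server` instance is already running on the default `localhost:4444` endpoint:

```ruby
# Sketch: drive HtmlUnit (JavaScript-capable, no real browser) from Ruby
# via a remote Selenium 2 server, then save the rendered HTML to a file.
begin
  require 'selenium-webdriver'
rescue LoadError
  warn 'install the selenium-webdriver gem first'
end

def headless_source(url)
  driver = Selenium::WebDriver.for(
    :remote,
    url: 'http://localhost:4444/wd/hub',
    desired_capabilities: :htmlunit  # full JavaScript support, no GUI
  )
  begin
    driver.get(url)
    driver.page_source
  ensure
    driver.quit
  end
end

if __FILE__ == $0 && !ARGV.empty?
  File.write('page.html', headless_source(ARGV[0]))
end
```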

Michael Gantz