scRUBYt! and WWW::Mechanize foiled (aka Sneaky Yahoo Scraping Prevention)

Screen scraping. It’s one of those techniques that can be loads of fun and also heaps of frustration. On a recent project, I was charged with scraping some pages from Yahoo! Shopping.

I Hate Screen Scraping

After coding up a quick skeleton, I was surprised to see that none of my initial tests were working – it was almost as if the data wasn’t even there. Checking out the rendered page in Firefox revealed the problem – none of the links I needed to scrape were present. At all. So, I hit up the Yahoo source code to verify that the links were in the original. Yup, right there. Weird!

I tried out a variety of my favorite scraping tools – scRUBYt!, WWW::Mechanize, and even good ol’ cURL. None of these tools could acquire HTML source from the Yahoo server with links intact, even when I provided a valid Firefox User-Agent string.

Next, I dropped down to an even lower level – packet sniffing with my favorite sniffer, Packetyzer, and sending HTTP/1.1 requests directly via telnet.

rich@redbuntu:~$ telnet shopping.yahoo.com 80
Trying 209.73.163.95...
Connected to pdb3.shop.yahoo.akadns.net.
Escape character is '^]'.
GET / HTTP/1.1
Host:shopping.yahoo.com
 
HTTP/1.1 200 OK
Date: Tue, 15 Apr 2008 17:48:16 GMT
P3P: policyref="http://p3p.yahoo.com/w3c/p3p.xml", CP="CAO DSP COR CUR ADM DEV TAI PSA PSD IVAi IVDi CONi TELo OTPi OUR DELi SAMi OTRi UNRi PUBi IND PHY ONL UNI PUR FIN COM NAV INT DEM CNT STA POL HEA PRE GOV"
Cache-Control: private
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=utf-8
 
a17e
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
etc...

Three interesting things jumped out at me when I executed this request. First, I didn’t recognize the P3P header – it turns out this is just a compact privacy policy, so no dice there. Second, I noticed the absence of the Server header, which usually identifies the server’s software, version, installed modules, and so on. Since Yahoo opted to hide this line, it seems increasingly likely that Yahoo is ‘gaming’ us. (And a quick check only shows that they’re running FreeBSD with an unidentified web server.) Third, I noticed that the response was sent uncompressed. This made sense, since my HTTP/1.1 request didn’t claim that my ‘client’ (telnet) supported compression. However, while sniffing packets earlier, I had noticed that all of the HTTP responses were compressed. (Edit: Another interesting feature – “a17e”. What in the world is this?)

Those Sneaky Geeks

It sounded crazy, but I realized that Yahoo could be using clients’ compression support to differentiate between bots and actual web browsers. So, I popped open another terminal and tried making a request as I did before with cURL, except with a Firefox User-Agent string and compression enabled.

curl --user-agent "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.5) Gecko/20070713 FireFox/2.0.0.5" --compressed "http://shopping.yahoo.com/path/to/resource" > result.html

Bingo! All of the links were intact – the source was identical to what a standard web browser would retrieve. So it seems Yahoo has written some sort of Apache module, or perhaps just engineered their application code, to vary its response according to whether or not the client supports compression. This is quite a sneaky way to deter search engine indexing and screen scraping in general, but it works wonderfully. In fact, if I were the engineer who devised this detection method, I’d be pretty proud of myself. Unfortunately for Yahoo, now that we know their secret, our scraping workflow just gains one more step: curl > result.html.
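And if you’d rather stay in Ruby than shell out to curl, the same trick works from the standard library: advertise gzip support in the request and decompress the body yourself before handing it to your scraper. A rough sketch – the URL is a placeholder, and `fetch_like_a_browser` is just a name I made up:

```ruby
require 'net/http'
require 'uri'
require 'zlib'
require 'stringio'

# Decompress a gzip-encoded HTTP response body.
def gunzip(data)
  Zlib::GzipReader.new(StringIO.new(data)).read
end

# Fetch a page while claiming gzip support, so the server treats us
# like a real browser instead of a bot.
def fetch_like_a_browser(url)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port) do |http|
    http.get(uri.request_uri,
             'User-Agent'      => 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.5) Gecko/20070713 Firefox/2.0.0.5',
             'Accept-Encoding' => 'gzip')
  end
  response['Content-Encoding'] == 'gzip' ? gunzip(response.body) : response.body
end
```

From there, the decompressed HTML can go straight back into scRUBYt! or whatever parser you were using before.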
