
28 Oct 09

CAPTCHA scraping – Opening urls with urllib2 over an https proxy

I’m currently trying to scrape CAPTCHA images over Tor (my own personal botnet ;) ) for certain sites that use aggressive rate limiting (e.g. Yahoo and eBay). Strangely enough, Google and Microsoft do not care as much. They probably spend more time solving problems like… having a good search engine?

Most CAPTCHAs appear on https:// protected pages, hence the need for a proxy that can handle https to use with Tor. I tried Polipo first and soon found out that it doesn’t support https proxying. Privoxy, however, does.

Once I had set up the proxy to connect to my Tor router, I found out that most current Linux packages of Python (and the Python shipped with Mac OS X Snow Leopard) do not contain the patch needed to open URLs over an https proxy. For Python 2.6 you need svn revision 72880 or later (any version released roughly after July 2009). The patch is also somewhere in 3.x. Linux package maintainers really need to push up-to-date versions with backported fixes more frequently and more quickly.

Details on the bug reporting and fix are here:

http://bugs.python.org/issue1424152

Note that the correct way of using the https proxy code is to build an opener like this (8118 is Privoxy’s default listening port):

import urllib2

proxy_support = urllib2.ProxyHandler({"https": "https://127.0.0.1:8118"})

opener = urllib2.build_opener(proxy_support)
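For anyone on Python 3, where urllib2 was folded into urllib.request, roughly the same setup might look like the sketch below. It is only an illustration under a few assumptions: the proxy address is the one used above, and the helper name build_https_proxy_opener is made up for this example rather than taken from any library. Nothing is fetched here; the code only builds and installs the opener.

```python
import urllib.request

# Assumption: the local proxy (Privoxy in this post) listens at this address.
PROXY_URL = "https://127.0.0.1:8118"


def build_https_proxy_opener(proxy_url=PROXY_URL):
    """Return an opener that routes https:// requests through proxy_url.

    Hypothetical helper for illustration; it just wraps the two calls above.
    """
    proxy_support = urllib.request.ProxyHandler({"https": proxy_url})
    return urllib.request.build_opener(proxy_support)


opener = build_https_proxy_opener()

# Either call opener.open(url) directly, or install the opener so that
# plain urllib.request.urlopen() calls go through the proxy as well.
urllib.request.install_opener(opener)
```

After this, something like opener.open("https://example.com/") would be sent through the proxy, provided the proxy is actually running at that address.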

Why is https proxying so poorly supported? It boggles my mind. I guess open source developers don’t run into this problem often.

This has been a public service announcement by the Foundation for Annoyed Programmers (FAP).




 
