Sunday, May 20, 2012

Crawl a Site after Login


Introduction

This post shows how to handle cookies and POST requests in order to log in to a site and crawl its private content.

Prerequisite

The LoginCrawler is based on SimpleCrawler, so please read that post first:
http://ben-bai.blogspot.com/2012/04/java-simple-web-crawler.html

About the form post

Assume a page contains a form like the following:
<form>
<input name="userName" />
<input name="passWord" />
</form>

Suppose your user name is 'someone' and your password is '123'. To log in, send a POST request
whose body is the parameter string "userName=someone&passWord=123".
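
As a rough illustration of the form post described above, here is a minimal sketch that builds the parameter string and sends it with HttpURLConnection. It is not the code from LoginCrawler; the class name LoginPostSketch, the methods buildFormBody and postForm, and whatever login URL you pass in are all assumptions made for this example.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;
import java.util.Map;

// Hypothetical helper class, for illustration only.
class LoginPostSketch {

    // Encode form fields as application/x-www-form-urlencoded,
    // e.g. userName=someone&passWord=123
    static String buildFormBody(Map<String, String> fields) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), "UTF-8"));
        }
        return sb.toString();
    }

    // Send the encoded fields as the body of a POST request and
    // return the HTTP status code. The URL is supplied by the caller.
    static int postForm(String url, Map<String, String> fields) throws IOException {
        byte[] body = buildFormBody(fields).getBytes("UTF-8");
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true); // we are going to write a request body
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        conn.setFixedLengthStreamingMode(body.length);
        OutputStream out = conn.getOutputStream();
        try {
            out.write(body);
        } finally {
            out.close();
        }
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

Use a LinkedHashMap for the fields if the parameter order matters. The values are URL-encoded, so a password containing characters such as '&' or a space will not break the parameter string.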

Please note that the login flow may differ from site to site; the LoginCrawler has only been tested with MediaWiki.
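
Whatever the login flow is, the cookies returned by the login response have to be sent back on later requests, otherwise the site treats the crawler as logged out. The sketch below shows one way to do that, assuming Java 6 or later, where java.net.CookieManager is available. LoginCrawler itself may handle cookies differently; the class CookieJarSketch and its methods are hypothetical names used only here.

```java
import java.io.IOException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, for illustration only.
class CookieJarSketch {

    // Install an in-memory cookie jar for the whole JVM. After this call,
    // HttpURLConnection stores cookies from Set-Cookie response headers and
    // sends matching cookies on later requests automatically.
    static CookieManager installCookieJar() {
        CookieManager manager = new CookieManager();
        CookieHandler.setDefault(manager);
        return manager;
    }

    // Return the cookie strings the jar would attach to a request for the
    // given URL. Handy for checking that the login cookie was stored.
    static List<String> cookiesFor(CookieManager manager, String url) throws IOException {
        Map<String, List<String>> headers =
                manager.get(URI.create(url), new HashMap<String, List<String>>());
        List<String> cookies = headers.get("Cookie");
        return cookies == null ? Collections.<String>emptyList() : cookies;
    }
}
```

Call installCookieJar() once before sending the login request. If cookiesFor returns an empty list for a private page after login, the login probably failed or the site uses a different flow.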

The Program

LoginCrawler
https://github.com/benbai123/JSP_Servlet_Practice/blob/master/Practice/JAVA/Net/src/test/LoginCrawler.java

Download
SimpleCrawler
https://github.com/benbai123/JSP_Servlet_Practice/blob/master/Practice/JAVA/Net/src/test/SimpleCrawler.java

LoginCrawler
https://github.com/benbai123/JSP_Servlet_Practice/blob/master/Practice/JAVA/Net/src/test/LoginCrawler.java

Reference
http://docs.oracle.com/javase/1.5.0/docs/guide/deployment/deployment-guide/cookie_support.html