How to write web crawler

Technology & Internet


  #1  
Old 30-01-2010
Member
 
Join Date: Dec 2009
Posts: 20
How to write web crawler

Hi,
I need your help. I am a web developer, and the site I am currently working on is almost finished. Now my client has asked me to add one more feature: a web crawler. I have no knowledge of web crawlers. Can anyone give me source code for one and, before that, explain how to write a web crawler?
  #2  
Old 30-01-2010
Jackson2's Avatar
Member
 
Join Date: Apr 2008
Posts: 2,268
Re: How to write web crawler

Let me explain what a web crawler is. A web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. Web crawlers are also known as ants, automatic indexers, bots, worms, or web spiders, and the process is called spidering. A web crawler is one type of bot: it starts with a list of URLs to visit, called the seeds. As it visits each page, it collects the hyperlinks it finds there and adds them to the list of URLs still to visit. I hope this helps you understand web crawlers.
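To make the idea of seeds concrete, here is a minimal sketch in Java. It is only an illustration, not a full crawler: the class name `FrontierSketch`, the method `crawl`, and the `links` map are all made up for this example, and the map stands in for the real work of downloading a page and extracting its links.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class FrontierSketch {
    // Visits pages breadth-first, starting from the seed URLs.
    // "links" is a hypothetical stand-in for fetching a page and parsing its links.
    static List<String> crawl(List<String> seeds, Map<String, List<String>> links, int limit) {
        Queue<String> frontier = new ArrayDeque<>(); // URLs waiting to be visited
        Set<String> seen = new HashSet<>();          // URLs already queued or visited
        List<String> visited = new ArrayList<>();    // visit order, for inspection

        for (String seed : seeds) {
            if (seen.add(seed)) {                    // add() returns false for duplicates
                frontier.add(seed);
            }
        }
        while (!frontier.isEmpty() && visited.size() < limit) {
            String url = frontier.remove();
            visited.add(url);
            for (String next : links.getOrDefault(url, List.of())) {
                if (seen.add(next)) {                // queue each URL only once
                    frontier.add(next);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "http://example.com/", List.of("http://example.com/a", "http://example.com/b"),
            "http://example.com/a", List.of("http://example.com/b", "http://example.com/"));
        System.out.println(crawl(List.of("http://example.com/"), links, 50));
    }
}
```

The `seen` set is what keeps the crawler from visiting the same page twice when pages link back to each other, and `limit` plays the same role as a search limit in a real crawler.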
  #3  
Old 30-01-2010
Trio's Avatar
Member
 
Join Date: May 2008
Posts: 2,754
Re: How to write web crawler

Hey, use this code to write a web crawler. I wrote it in Java; just copy and paste it for use in your web page. I think it will let you meet your client's requirement. I use it on my own website.

Quote:
import java.applet.Applet;
import java.text.*;
import java.awt.*;
import java.awt.event.*;
import java.util.*;
import java.net.*;
import java.io.*;

public class WebCrawler extends Applet implements ActionListener, Runnable {
public static final String SEARCH = "Search";
public static final String STOP = "Stop";
public static final String DISALLOW = "Disallow:";
public static final int SEARCH_LIMIT = 50;

Panel pnlmain;
List lstmtch;
Label lblstus;
Vector vctrsrch;
Vector vcteserched;
Vector vctrmatch;

Thread srchthrd;

TextField txturl;
Choice chtyp;

public void init() {
pnlmain = new Panel();
pnlmain.setLayout(new BorderLayout(5, 5));

Panel panelEntry = new Panel();
panelEntry.setLayout(new BorderLayout(5, 5));

Panel panelURL = new Panel();
panelURL.setLayout(new FlowLayout(FlowLayout.LEFT, 5, 5));
Label labelURL = new Label("Starting URL: ", Label.RIGHT);
panelURL.add(labelURL);
txturl = new TextField("", 40);
panelURL.add(txturl);
panelEntry.add("North", panelURL);

Panel panelType = new Panel();
panelType.setLayout(new FlowLayout(FlowLayout.LEFT, 5, 5));
Label labelType = new Label("Content type: ", Label.RIGHT);
panelType.add(labelType);
chtyp = new Choice();
chtyp.addItem("html");
chtyp.addItem("basic");
chtyp.addItem("au");
chtyp.addItem("aiff");
chtyp.addItem("wav");
chtyp.addItem("mpeg");
chtyp.addItem("x-avi");
panelType.add(chtyp);
panelEntry.add("South", panelType);

pnlmain.add("North", panelEntry);

Panel panelListButtons = new Panel();
panelListButtons.setLayout(new BorderLayout(5, 5));

Panel panelList = new Panel();
panelList.setLayout(new BorderLayout(5, 5));
Label labelResults = new Label("Search results");
panelList.add("North", labelResults);
Panel panelListCurrent = new Panel();
panelListCurrent.setLayout(new BorderLayout(5, 5));
lstmtch = new List(10);
panelListCurrent.add("North", lstmtch);
lblstus = new Label("");
panelListCurrent.add("South", lblstus);
panelList.add("South", panelListCurrent);

panelListButtons.add("North", panelList);

Panel pnlbutn = new Panel();
Button btnSearch = new Button(SEARCH);
btnSearch.addActionListener(this);
pnlbutn.add(btnSearch);
Button buttonStop = new Button(STOP);
buttonStop.addActionListener(this);
pnlbutn.add(buttonStop);

panelListButtons.add("South", pnlbutn);

pnlmain.add("South", panelListButtons);

add(pnlmain);
setVisible(true);

repaint();
vctrsrch = new Vector();
vcteserched = new Vector();
vctrmatch = new Vector();
URLConnection.setDefaultAllowUserInteraction(false);
}

public void start() {
}

public void stop() {
if (srchthrd != null) {
setStatus("stop");
srchthrd = null;
}
}

public void destroy() {
}

boolean rbtsafe(URL url) {
String strHost = url.getHost();
String strRobot = "http://" + strHost + "/robots.txt";
URL urlRobot;
try {
urlRobot = new URL(strRobot);
} catch (MalformedURLException e) {
return false;
}

String strCommands = "";
try {
InputStream urlRobotStream = urlRobot.openStream();

// Read robots.txt in chunks; read() returns -1 at end of stream,
// so check it before building a String from the buffer.
byte b[] = new byte[1000];
int numRead;
while ((numRead = urlRobotStream.read(b)) != -1) {
if (Thread.currentThread() != srchthrd)
break;
strCommands += new String(b, 0, numRead);
}
urlRobotStream.close();
} catch (IOException e) {
return true;
}

String strURL = url.getFile();
int index = 0;
while ((index = strCommands.indexOf(DISALLOW, index)) != -1) {
index += DISALLOW.length();
String strPath = strCommands.substring(index);
StringTokenizer st = new StringTokenizer(strPath);

if (!st.hasMoreTokens())
break;

String wrongpath = st.nextToken();
if (strURL.indexOf(wrongpath) == 0)
return false;
}

return true;
}

public void paint(Graphics g) {
g.drawRect(0, 0, getSize().width - 1, getSize().height - 1);

pnlmain.paint(g);
pnlmain.paintComponents(g);
}

public void run() {
String strURL = txturl.getText();
String strTargetType = chtyp.getSelectedItem();
int numberSearched = 0;
int numberFound = 0;

if (strURL.length() == 0) {
setStatus("ERROR: must enter a starting URL");
return;
}
vctrsrch.removeAllElements();
vcteserched.removeAllElements();
vctrmatch.removeAllElements();
lstmtch.removeAll();

vctrsrch.addElement(strURL);

while ((vctrsrch.size() > 0)
&& (Thread.currentThread() == srchthrd)) {
strURL = (String) vctrsrch.elementAt(0);

setStatus("searching " + strURL);

URL url;
try {
url = new URL(strURL);
} catch (MalformedURLException e) {
setStatus("ERROR: invalid URL " + strURL);
break;
}

vctrsrch.removeElementAt(0);
vcteserched.addElement(strURL);
if (url.getProtocol().compareTo("http") != 0)
break;
if (!rbtsafe(url))
break;

try {
URLConnection urlConnection = url.openConnection();

urlConnection.setAllowUserInteraction(false);

InputStream urlStream = url.openStream();
String type
= urlConnection.guessContentTypeFromStream(urlStream);
if (type == null)
break;
if (type.compareTo("text/html") != 0)
break;

// Read the page in chunks; read() returns -1 at end of stream,
// so check it before building a String from the buffer.
byte b[] = new byte[1000];
String content = "";
int numRead;
while ((numRead = urlStream.read(b)) != -1) {
if (Thread.currentThread() != srchthrd)
break;
content += new String(b, 0, numRead);
}
urlStream.close();

if (Thread.currentThread() != srchthrd)
break;

String lowerCaseContent = content.toLowerCase();

int index = 0;
while ((index = lowerCaseContent.indexOf("<a", index)) != -1)
{
if ((index = lowerCaseContent.indexOf("href", index)) == -1)
break;
if ((index = lowerCaseContent.indexOf("=", index)) == -1)
break;

if (Thread.currentThread() != srchthrd)
break;

index++;
String remaining = content.substring(index);

StringTokenizer st
= new StringTokenizer(remaining, "\t\n\r\">#");
String strLink = st.nextToken();

URL urlLink;
try {
urlLink = new URL(url, strLink);
strLink = urlLink.toString();
} catch (MalformedURLException e) {
setStatus("ERROR: bad URL " + strLink);
continue;
}

if (urlLink.getProtocol().compareTo("http") != 0)
break;

if (Thread.currentThread() != srchthrd)
break;

try {

URLConnection urlLinkConnection
= urlLink.openConnection();
urlLinkConnection.setAllowUserInteraction(false);
InputStream linkStream = urlLink.openStream();
String strType
= urlLinkConnection.guessContentTypeFromStream(linkStream);
linkStream.close();

if (strType == null)
break;
if (strType.compareTo("text/html") == 0) {
if ((!vcteserched.contains(strLink))
&& (!vctrsrch.contains(strLink))) {

if (rbtsafe(urlLink))
vctrsrch.addElement(strLink);
}
}
if (strType.compareTo(strTargetType) == 0) {
if (vctrmatch.contains(strLink) == false) {
lstmtch.add(strLink);
vctrmatch.addElement(strLink);
numberFound++;
if (numberFound >= SEARCH_LIMIT)
break;
}
}
} catch (IOException e) {
setStatus("ERROR: couldn't open URL " + strLink);
continue;
}
}
} catch (IOException e) {
setStatus("ERROR: couldn't open URL " + strURL);
break;
}

numberSearched++;
if (numberSearched >= SEARCH_LIMIT)
break;
}

if (numberSearched >= SEARCH_LIMIT || numberFound >= SEARCH_LIMIT)
setStatus("reached search limit of " + SEARCH_LIMIT);
else
setStatus("done");
srchthrd = null;
}

void setStatus(String status) {
lblstus.setText(status);
}

public void actionPerformed(ActionEvent event) {
String command = event.getActionCommand();

if (command.compareTo(SEARCH) == 0) {
setStatus("searching...");

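// Start a search thread only if none is running; calling start()
// twice on the same Thread throws IllegalThreadStateException.
if (srchthrd == null) {
srchthrd = new Thread(this);
srchthrd.start();
}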
}
else if (command.compareTo(STOP) == 0) {
stop();
}
}
public static void main (String argv[])
{
Frame f = new Frame("frme");
WebCrawler applet = new WebCrawler();
f.add("Center", applet);

Properties prpps= new Properties(System.getProperties());
prpps.put("http.proxySet", "true");
prpps.put("http.proxyHost", "webcache-cup");
prpps.put("http.proxyPort", "8080");

Properties newprpps = new Properties(prpps);
System.setProperties(newprpps);


applet.init();
applet.start();
f.pack();
f.setVisible(true);
}

}
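The robots.txt check in the code above can also be tried on its own, without the applet or the network. The following is a rough sketch under the same simplification the posted code makes: it ignores `User-agent` sections and treats every `Disallow:` rule as applying to the crawler. The names `RobotsSketch`, `disallowedPaths`, and `isAllowed` are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

class RobotsSketch {
    static final String DISALLOW = "Disallow:";

    // Collects the path given on every "Disallow:" line of a robots.txt body.
    static List<String> disallowedPaths(String robotsTxt) {
        List<String> paths = new ArrayList<>();
        for (String line : robotsTxt.split("\\r?\\n")) {
            int hash = line.indexOf('#');            // drop trailing comments
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            String trimmed = line.trim();
            // Case-insensitive match on the "Disallow:" prefix
            if (trimmed.regionMatches(true, 0, DISALLOW, 0, DISALLOW.length())) {
                String path = trimmed.substring(DISALLOW.length()).trim();
                if (!path.isEmpty()) {               // an empty rule disallows nothing
                    paths.add(path);
                }
            }
        }
        return paths;
    }

    // A URL path is allowed unless it starts with one of the disallowed paths.
    static boolean isAllowed(String urlPath, String robotsTxt) {
        for (String bad : disallowedPaths(robotsTxt)) {
            if (urlPath.startsWith(bad)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\n";
        System.out.println(isAllowed("/private/page.html", robots));
        System.out.println(isAllowed("/public/index.html", robots));
    }
}
```

Keeping the check as a pure function of the file contents makes it easy to test with sample robots.txt text before pointing the crawler at a real site.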
  #4  
Old 30-01-2010
deveritt's Avatar
Member
 
Join Date: Apr 2008
Posts: 2,525
Re: How to write web crawler

I wrote the same application in C#. Use it if it fits your website. In my experience, pages using it load faster in the browser, whereas the Java code above takes a long time to compile, so I suggest you use this code instead.
Attached Files
File Type: zip Crawlerweb.zip (11.9 KB, 35 views)
  #5  
Old 30-01-2010
deveritt's Avatar
Member
 
Join Date: Apr 2008
Posts: 2,525
Re: How to write web crawler

Hey, instead of using the Java code above, I think you should download the following attachment. It will be really helpful for you because it takes only a short time to upload and to load in the browser.
Attached Files
File Type: rar webcrawler.rar (13.3 KB, 41 views)
  #6  
Old 14-05-2011
Member
 
Join Date: May 2011
Posts: 1
Re: How to write web crawler // inside + outside IANA root zone

I see no line in the crawler code above that limits web crawling to IANA root zone extensions.

Why is it that, in practice, I can never find websites with alternative domain names (such as guitar.music, visible through sundialbrowser.com, which leads to an alternative DNS) in the 'normal' search engines?

Bad luck?

Would the web crawler above, integrated into a search engine, index websites such as guitar.music, provided there are enough hyperlinks to them and they contain unique and relevant text?

jpblankert at zonnet dot nl (if someone can program a search engine that indexes sites like guitar.music: please let me know, paid project!)
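For reference, the Java code posted earlier in this thread only checks that a link uses the http protocol and that robots.txt permits it; it never inspects the domain extension. Whether it could actually fetch a site such as guitar.music therefore depends on whether the DNS resolver of the machine running it can resolve that name. If someone did want to restrict a crawler to a set of extensions, a filter might look roughly like the sketch below. The names `TldSketch`, `topLevelLabel`, and `inAllowedZone`, and the allow-list itself, are hypothetical and not part of the posted code.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

class TldSketch {
    // Returns the last label of the host name, e.g. "music" for guitar.music.
    static String topLevelLabel(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return "";
        }
        return host.substring(host.lastIndexOf('.') + 1).toLowerCase(Locale.ROOT);
    }

    // Hypothetical filter: crawl only hosts whose last label is in the allow-list.
    static boolean inAllowedZone(String url, Set<String> allowedTlds) {
        return allowedTlds.contains(topLevelLabel(url));
    }

    public static void main(String[] args) {
        Set<String> allowed = Set.of("com", "org", "nl"); // assumed list, for illustration only
        System.out.println(inAllowedZone("http://guitar.music/", allowed));
        System.out.println(inAllowedZone("http://example.com/page", allowed));
    }
}
```

Note that parsing the URL does not validate the extension at all: the host name is just text until something tries to resolve it.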
All times are GMT +5.5. The time now is 04:37 AM.