
Thread: How to write web crawler

  1. #1
    Join Date
    Dec 2009
    Posts
    20

    How to write web crawler

    Hi,
    I need your help. I am a web developer, and the site I am currently working on is almost finished. Now my client has asked me to add one more feature: he wants a web crawler. I don't know anything about web crawlers. Can anyone first explain how to write a web crawler, and then share some source code for one?

  2. #2
    Join Date
    Apr 2008
    Posts
    2,277

    Re: How to write web crawler

    Let me explain what a web crawler is. A web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. Web crawlers are also known as ants, automatic indexers, bots, worms, or web spiders, and the process is called spidering. A web crawler is a type of bot: it starts with a list of URLs to visit, called the seeds. As it visits each page, it collects the links on that page and adds them to the list of URLs still to visit. I hope this information helps you get started with web crawlers.
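To make the seed-list idea concrete, here is a minimal sketch in Java. It is only an illustration under stated assumptions: the "web" is a hypothetical in-memory map from each URL to its outgoing links, so nothing is fetched over the network, and the names `SeedCrawlSketch` and `crawl` are made up for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration: 'links' stands in for the web (URL -> outgoing links).
final class SeedCrawlSketch {

    // Visits pages breadth-first starting from the seeds, stopping after 'limit' pages.
    static List<String> crawl(List<String> seeds, Map<String, List<String>> links, int limit) {
        Deque<String> frontier = new ArrayDeque<>(seeds);   // URLs still to visit
        Set<String> visited = new LinkedHashSet<>();        // URLs already visited, in order
        while (!frontier.isEmpty() && visited.size() < limit) {
            String url = frontier.removeFirst();
            if (!visited.add(url)) {
                continue;                                   // already visited, skip it
            }
            for (String next : links.getOrDefault(url, List.of())) {
                if (!visited.contains(next)) {
                    frontier.addLast(next);                 // queue newly discovered links
                }
            }
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "http://example.com/", List.of("http://example.com/a", "http://example.com/b"),
            "http://example.com/a", List.of("http://example.com/b", "http://example.com/"),
            "http://example.com/b", List.of("http://example.com/c"));
        System.out.println(crawl(List.of("http://example.com/"), links, 50));
    }
}
```

A real crawler would replace the map lookup with an HTTP fetch plus link extraction, but the queue of pending URLs and the set of visited URLs stay the same.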

  3. #3
    Join Date
    May 2008
    Posts
    2,792

    Re: How to write web crawler

    Hey, you can use this code to write a web crawler. I wrote it in Java; just copy and paste the code for your web page. I think it will let you fulfill your client's requirement. I use it on my own website.

    import java.applet.Applet;
    import java.text.*;
    import java.awt.*;
    import java.awt.event.*;
    import java.util.*;
    import java.net.*;
    import java.io.*;

    public class WebCrawler extends Applet implements ActionListener, Runnable {
    public static final String SEARCH = "Search";
    public static final String STOP = "Stop";
    public static final String DISALLOW = "Disallow:";
    public static final int SEARCH_LIMIT = 50;

    Panel pnlmain;
    List lstmtch;
    Label lblstus;
    Vector vctrsrch;
    Vector vcteserched;
    Vector vctrmatch;

    Thread srchthrd;

    TextField txturl;
    Choice chtyp;

    public void init() {
    pnlmain = new Panel();
    pnlmain.setLayout(new BorderLayout(5, 5));

    Panel panelEntry = new Panel();
    panelEntry.setLayout(new BorderLayout(5, 5));

    Panel panelURL = new Panel();
    panelURL.setLayout(new FlowLayout(FlowLayout.LEFT, 5, 5));
    Label labelURL = new Label("Starting URL: ", Label.RIGHT);
    panelURL.add(labelURL);
    txturl = new TextField("", 40);
    panelURL.add(txturl);
    panelEntry.add("North", panelURL);

    Panel panelType = new Panel();
    panelType.setLayout(new FlowLayout(FlowLayout.LEFT, 5, 5));
    Label labelType = new Label("Content type: ", Label.RIGHT);
    panelType.add(labelType);
    chtyp = new Choice();
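    // Full MIME types, so the selection can match the content type guessed for a link.
    chtyp.addItem("text/html");
    chtyp.addItem("audio/basic");
    chtyp.addItem("audio/au");
    chtyp.addItem("audio/aiff");
    chtyp.addItem("audio/wav");
    chtyp.addItem("video/mpeg");
    chtyp.addItem("video/x-avi");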
    panelType.add(chtyp);
    panelEntry.add("South", panelType);

    pnlmain.add("North", panelEntry);

    Panel panelListButtons = new Panel();
    panelListButtons.setLayout(new BorderLayout(5, 5));

    Panel panelList = new Panel();
    panelList.setLayout(new BorderLayout(5, 5));
    Label labelResults = new Label("Search results");
    panelList.add("North", labelResults);
    Panel panelListCurrent = new Panel();
    panelListCurrent.setLayout(new BorderLayout(5, 5));
    lstmtch = new List(10);
    panelListCurrent.add("North", lstmtch);
    lblstus = new Label("");
    panelListCurrent.add("South", lblstus);
    panelList.add("South", panelListCurrent);

    panelListButtons.add("North", panelList);

    Panel pnlbutn = new Panel();
    Button btnSearch = new Button(SEARCH);
    btnSearch.addActionListener(this);
    pnlbutn.add(btnSearch);
    Button buttonStop = new Button(STOP);
    buttonStop.addActionListener(this);
    pnlbutn.add(buttonStop);

    panelListButtons.add("South", pnlbutn);

    pnlmain.add("South", panelListButtons);

    add(pnlmain);
    setVisible(true);

    repaint();
    vctrsrch = new Vector();
    vcteserched = new Vector();
    vctrmatch = new Vector();
    URLConnection.setDefaultAllowUserInteraction(false);
    }

    public void start() {
    }

    public void stop() {
    if (srchthrd != null) {
    setStatus("stop");
    srchthrd = null;
    }
    }

    public void destroy() {
    }

    boolean rbtsafe(URL url) {
    String strHost = url.getHost();
    String strRobot = "http://" + strHost + "/robots.txt";
    URL urlRobot;
    try {
    urlRobot = new URL(strRobot);
    } catch (MalformedURLException e) {
    return false;
    }

    String strCommands;
    try {
    InputStream urlRobotStream = urlRobot.openStream();

    // Read the whole robots.txt; an empty file yields an empty string
    // instead of failing on a read count of -1.
    byte b[] = new byte[1000];
    int numRead;
    strCommands = "";
    while ((numRead = urlRobotStream.read(b)) != -1) {
    if (Thread.currentThread() != srchthrd)
    break;
    strCommands += new String(b, 0, numRead);
    }
    urlRobotStream.close();
    } catch (IOException e) {
    return true;
    }

    String strURL = url.getFile();
    int index = 0;
    while ((index = strCommands.indexOf(DISALLOW, index)) != -1) {
    index += DISALLOW.length();
    String strPath = strCommands.substring(index);
    StringTokenizer st = new StringTokenizer(strPath);

    if (!st.hasMoreTokens())
    break;

    String wrongpath = st.nextToken();
    if (strURL.indexOf(wrongpath) == 0)
    return false;
    }

    return true;
    }

    public void paint(Graphics g) {
    g.drawRect(0, 0, getSize().width - 1, getSize().height - 1);

    pnlmain.paint(g);
    pnlmain.paintComponents(g);
    }

    public void run() {
    String strURL = txturl.getText();
    String strTargetType = chtyp.getSelectedItem();
    int numberSearched = 0;
    int numberFound = 0;

    if (strURL.length() == 0) {
    setStatus("ERROR: must enter a starting URL");
    return;
    }
    vctrsrch.removeAllElements();
    vcteserched.removeAllElements();
    vctrmatch.removeAllElements();
    lstmtch.removeAll();

    vctrsrch.addElement(strURL);

    while ((vctrsrch.size() > 0)
    && (Thread.currentThread() == srchthrd)) {
    strURL = (String) vctrsrch.elementAt(0);

    setStatus("searching " + strURL);

    URL url;
    try {
    url = new URL(strURL);
    } catch (MalformedURLException e) {
    setStatus("ERROR: invalid URL " + strURL);
    break;
    }

    vctrsrch.removeElementAt(0);
    vcteserched.addElement(strURL);
    if (url.getProtocol().compareTo("http") != 0)
    break;
    if (!rbtsafe(url))
    break;

    try {
    URLConnection urlConnection = url.openConnection();

    urlConnection.setAllowUserInteraction(false);

    InputStream urlStream = url.openStream();
    String type
    = urlConnection.guessContentTypeFromStream(urlStream);
    if (type == null)
    break;
    if (type.compareTo("text/html") != 0)
    break;

    // Read the whole page; an empty response yields an empty string
    // instead of failing on a read count of -1.
    byte b[] = new byte[1000];
    int numRead;
    String content = "";
    while ((numRead = urlStream.read(b)) != -1) {
    if (Thread.currentThread() != srchthrd)
    break;
    content += new String(b, 0, numRead);
    }
    urlStream.close();

    if (Thread.currentThread() != srchthrd)
    break;

    String lowerCaseContent = content.toLowerCase();

    int index = 0;
    while ((index = lowerCaseContent.indexOf("<a", index)) != -1)
    {
    if ((index = lowerCaseContent.indexOf("href", index)) == -1)
    break;
    if ((index = lowerCaseContent.indexOf("=", index)) == -1)
    break;

    if (Thread.currentThread() != srchthrd)
    break;

    index++;
    String remaining = content.substring(index);

    StringTokenizer st
    = new StringTokenizer(remaining, "\t\n\r\">#");
    String strLink = st.nextToken();

    URL urlLink;
    try {
    urlLink = new URL(url, strLink);
    strLink = urlLink.toString();
    } catch (MalformedURLException e) {
    setStatus("ERROR: bad URL " + strLink);
    continue;
    }

    if (urlLink.getProtocol().compareTo("http") != 0)
    break;

    if (Thread.currentThread() != srchthrd)
    break;

    try {

    URLConnection urlLinkConnection
    = urlLink.openConnection();
    urlLinkConnection.setAllowUserInteraction(false);
    InputStream linkStream = urlLink.openStream();
    String strType
    = urlLinkConnection.guessContentTypeFromStream(linkStream);
    linkStream.close();

    if (strType == null)
    break;
    if (strType.compareTo("text/html") == 0) {
    if ((!vcteserched.contains(strLink))
    && (!vctrsrch.contains(strLink))) {

    if (rbtsafe(urlLink))
    vctrsrch.addElement(strLink);
    }
    }
    if (strType.compareTo(strTargetType) == 0) {
    if (vctrmatch.contains(strLink) == false) {
    lstmtch.add(strLink);
    vctrmatch.addElement(strLink);
    numberFound++;
    if (numberFound >= SEARCH_LIMIT)
    break;
    }
    }
    } catch (IOException e) {
    setStatus("ERROR: couldn't open URL " + strLink);
    continue;
    }
    }
    } catch (IOException e) {
    setStatus("ERROR: couldn't open URL " + strURL);
    break;
    }

    numberSearched++;
    if (numberSearched >= SEARCH_LIMIT)
    break;
    }

    if (numberSearched >= SEARCH_LIMIT || numberFound >= SEARCH_LIMIT)
    setStatus("reached search limit of " + SEARCH_LIMIT);
    else
    setStatus("done");
    srchthrd = null;
    }

    void setStatus(String status) {
    lblstus.setText(status);
    }

    public void actionPerformed(ActionEvent event) {
    String command = event.getActionCommand();

    if (command.compareTo(SEARCH) == 0) {
    setStatus("searching...");

    if (srchthrd == null) {
    srchthrd = new Thread(this);
    }
    srchthrd.start();
    }
    else if (command.compareTo(STOP) == 0) {
    stop();
    }
    }
    public static void main (String argv[])
    {
    Frame f = new Frame("WebCrawler");
    WebCrawler applet = new WebCrawler();
    // "Center" is the BorderLayout constraint name; an unknown name throws at runtime.
    f.add("Center", applet);

    Properties prpps= new Properties(System.getProperties());
    prpps.put("http.proxySet", "true");
    prpps.put("http.proxyHost", "webcache-cup");
    prpps.put("http.proxyPort", "8080");

    Properties newprpps = new Properties(prpps);
    System.setProperties(newprpps);


    applet.init();
    applet.start();
    f.pack();
    f.setVisible(true);
    }

    }
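The `rbtsafe` method above is the part that keeps the crawler polite: it downloads `/robots.txt` and refuses any URL whose path starts with a `Disallow:` entry. Below is a small standalone sketch of the same prefix check, working on a string so it can be tried without a network connection. The names `RobotsSketch`, `disallowedPrefixes` and `isAllowed` are made up for this example, and like the posted code it ignores `User-agent` sections, so treat it as a simplification rather than a full robots.txt parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: a simplified robots.txt check that ignores User-agent groups.
final class RobotsSketch {

    private static final String DISALLOW = "Disallow:";

    // Collects the path prefix from every non-empty "Disallow:" line.
    static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        for (String line : robotsTxt.split("\\r?\\n")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);             // drop trailing comments
            }
            line = line.trim();
            if (line.regionMatches(true, 0, DISALLOW, 0, DISALLOW.length())) {
                String path = line.substring(DISALLOW.length()).trim();
                if (!path.isEmpty()) {                      // an empty Disallow blocks nothing
                    prefixes.add(path);
                }
            }
        }
        return prefixes;
    }

    // Returns false when the path starts with any disallowed prefix.
    static boolean isAllowed(String robotsTxt, String path) {
        for (String prefix : disallowedPrefixes(robotsTxt)) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\n";
        System.out.println(isAllowed(robots, "/private/report.html"));
        System.out.println(isAllowed(robots, "/index.html"));
    }
}
```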

  4. #4
    Join Date
    Apr 2008
    Posts
    2,572

    Re: How to write web crawler

    I wrote the same application in C#. Use it if it suits your website. With it, your web page should load faster in the browser, whereas the Java code above takes a long time to compile, so I suggest using this code instead.
    Attached Files

  5. #5
    Join Date
    Apr 2008
    Posts
    2,572

    Re: How to write web crawler

    Hey, I think that instead of using the Java code above you should download the following attachment. It will be really helpful for you because it takes little time to upload and to load in the browser.
    Attached Files

  6. #6
    Join Date
    May 2011
    Posts
    1

    Re: How to write web crawler // inside + outside IANA root zone

    I see no line in the crawler above that limits crawling to IANA root zone extensions.

    Why is it that, in practice, I can never find websites with alternative domain names (like guitar.music, visible through sundialbrowser.com, which leads to an alternative DNS) in the 'normal' search engines?

    Bad luck?

    Would the web crawler above, integrated into a search engine, index websites such as guitar.music, provided there are enough hyperlinks to them and they contain unique and relevant text?

    jpblankert at zonnet dot nl (if someone can program a search engine that indexes sites like guitar.music: please let me know, paid project!)
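For anyone who wants to experiment with that question, here is a rough sketch of how a crawler could accept or reject a link based on its top-level domain. Everything in it is an assumption for illustration: the class `TldFilterSketch`, its method names, and the tiny allowed set, which is a sample and not the IANA root zone list. Note that the filter only decides whether a URL gets queued; whether the host can actually be fetched still depends on the DNS resolver the crawler's machine uses.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

// Hypothetical filter: decides from the host's last label whether a URL may be queued.
final class TldFilterSketch {

    // Returns the lower-cased last label of the host, or null if the URL has no host.
    // URI.create throws IllegalArgumentException for a malformed URL.
    static String topLevelDomain(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return null;
        }
        int dot = host.lastIndexOf('.');
        String tld = (dot < 0) ? host : host.substring(dot + 1);
        return tld.toLowerCase(Locale.ROOT);
    }

    // True when the URL's top-level domain is in the caller-supplied set.
    static boolean inAllowedZone(String url, Set<String> allowedTlds) {
        String tld = topLevelDomain(url);
        return tld != null && allowedTlds.contains(tld);
    }

    public static void main(String[] args) {
        Set<String> sample = Set.of("com", "org", "nl");    // sample set, not the IANA list
        System.out.println(inAllowedZone("http://www.example.com/page", sample));
        System.out.println(inAllowedZone("http://guitar.music/", sample));
    }
}
```

To crawl alternative-root names as well, the same check could be inverted or the set extended; the filter itself does not care where the list of extensions comes from.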
