Thread: Java program to retrieve html source

  1. #1
    Join Date
    Dec 2009
    Posts
    213

    Java program to retrieve html source

    Hello,
    I wrote a Java method that retrieves the source code of an HTML page for a given address. The problem is that it does not handle accented characters and the like properly. Do you have any idea why?
    URL loc = new URL(address);
    URLConnection urlcon = loc.openConnection();
    InputStream in = urlcon.getInputStream();
    int c = in.read();
    StringBuilder build = new StringBuilder();
    while (c != -1) {
        build.append((char) c);
        c = in.read();
    }
    toreturn = build.toString();
    Last edited by Aaliya Seth; 12-01-2010 at 10:49 AM.

  2. #2
    Join Date
    Jan 2008
    Posts
    1,521

    Re: Java program to retrieve html source

    Hello,
    You need to read the HTML specification:
    1. If a META tag defines the encoding, it must be taken into account.
    2. Otherwise, rely on the XML header, if there is one.
    3. Otherwise, rely on the HTTP header.
    4. Otherwise, take the default encoding of the machine.
    You have to change how you interpret the charset whenever one of these cases applies.
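
    As an illustration of step 3 only, here is a minimal sketch of pulling the charset out of an HTTP Content-Type value. The class name ContentTypeCharset and the helper charsetOf are hypothetical, and the fallback is passed in by the caller. The META and XML checks are not covered, since they require inspecting the first bytes of the body.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ContentTypeCharset {

    // Hypothetical helper: extracts the charset parameter from a Content-Type
    // value such as "text/html; charset=UTF-8". Returns the fallback when the
    // header is missing, has no charset parameter, or names a charset that
    // this JVM does not recognise.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for a "charset=" prefix.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
        System.out.println(charsetOf("text/html", StandardCharsets.ISO_8859_1));
    }
}
```

    With a URLConnection, the input would be the result of urlcon.getContentType() or urlcon.getHeaderField("Content-Type").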

  3. #3
    Join Date
    Dec 2009
    Posts
    213

    Re: Java program to retrieve html source

    Hello,
    Thank you for the answer. As a first step I will try a fixed encoding to see whether I get the right characters. How do I take the charset into account when reading the InputStream, given that I want UTF-8 as the default? Should I change the way I read the stream? The problem is that URLConnection offers too many options. Beyond that, regarding detecting the encoding dynamically: for the moment, with uc.getHeaderField("Content-Type") I can get the meta content. I can see it for XML too. However, I fall back to UTF-8 if the META tag and XML checks find nothing.

  4. #4
    Join Date
    Jan 2008
    Posts
    1,521

    Re: Java program to retrieve html source

    Hello,
    If you want the page to be treated as UTF-8, it must be sent as UTF-8. For this, your server must send the appropriate header (the header() function in PHP). In addition, the page itself must be saved in UTF-8 (Eclipse allows this, as do most text editors). However, I do not think you get the contents of the META tag with getHeaderField(). Check the documentation, but I am fairly sure you do not.

  5. #5
    Join Date
    Dec 2009
    Posts
    213

    Re: Java program to retrieve html source

    Hello,
    Code:
    System.out.println("Content-Type: " + uc.getHeaderField("Content-Type"));
    It returned:
    text/html; charset=UTF-8
    Perhaps the meta tags are malformed, because getContentEncoding() returns null. Otherwise, the HTML page I send really is encoded in UTF-8, so that is not the problem; the problem occurs earlier. As soon as I read the page content, I am already storing the wrong characters in the StringBuffer. To verify this I added:
    Code:
    System.out.println((char) c);
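
    A small sketch of why casting each byte to a char gives wrong characters: in UTF-8 an accented letter such as "é" occupies two bytes, so the cast produces two unrelated chars instead of one. The class and method names below are hypothetical. As a side note, URLConnection.getContentEncoding() reports the Content-Encoding header (for example gzip), not the charset, so a null result there says nothing about the META tags.

```java
import java.nio.charset.StandardCharsets;

class ByteCastDemo {

    // Mimics the loop from the first post: every byte becomes one char.
    static String castEachByte(byte[] bytes) {
        StringBuilder build = new StringBuilder();
        for (byte b : bytes) {
            // (b & 0xFF) is the 0..255 value that in.read() would return.
            build.append((char) (b & 0xFF));
        }
        return build.toString();
    }

    // Decodes the whole byte sequence as UTF-8 instead.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "é" (U+00E9) encoded as UTF-8 is the two bytes 0xC3 0xA9.
        byte[] utf8 = "\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(castEachByte(utf8).length()); // prints 2
        System.out.println(decode(utf8).length());       // prints 1
    }
}
```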

  6. #6
    Join Date
    May 2008
    Posts
    2,302

    Re: Java program to retrieve html source

    Hello,
    I think you are missing the charset for the InputStreamReader; see the code below.
    Code:
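    // "in" is the InputStream from urlcon.getInputStream();
    // "encoding" is the charset name, for example "UTF-8".
    try {
      InputStreamReader inputstr = new InputStreamReader(in, encoding);
      BufferedReader buffered = new BufferedReader(inputstr);
      String line;
      while ((line = buffered.readLine()) != null) {
        // readLine() strips the line terminator, so add one back.
        textHtml += line + "\n";
      }
    } finally {
      in.close();
    }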
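
    A possible variant of the same idea, assuming Java 7 or later: try-with-resources closes the stream automatically, and reading into a char buffer keeps the line breaks exactly as they were sent. StreamText and readAll are hypothetical names, and the ByteArrayInputStream only stands in for urlcon.getInputStream() so the sketch runs without a network connection.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StreamText {

    // Hypothetical helper: reads the whole stream as text in the given charset.
    // Closing the reader also closes the wrapped InputStream.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a downloaded page: a short UTF-8 encoded fragment.
        byte[] page = "<p>caf\u00e9</p>\n".getBytes(StandardCharsets.UTF_8);
        String html = readAll(new ByteArrayInputStream(page), StandardCharsets.UTF_8);
        System.out.println(html.length());
    }
}
```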
