Java program to retrieve html source

  #1  
Old 12-01-2010
Member
 
Join Date: Dec 2009
Posts: 213
Java program to retrieve html source
  

Hello,
I wrote a Java method that retrieves the source code of an HTML page for a given address. The problem is that it does not read accented characters and the like correctly. Does anyone have an idea?
Quote:
URL loc = new URL(address);
URLConnection urlcon = loc.openConnection();
InputStream in = urlcon.getInputStream();
int c = in.read();
StringBuilder build = new StringBuilder();
while (c != -1) {
build.append((char) c);
c = in.read();
}
toreturn = build.toString();


Last edited by Aaliya Seth : 12-01-2010 at 10:49 AM.
  #2  
Old 12-01-2010
Member
 
Join Date: Jan 2008
Posts: 1,515
Re: Java program to retrieve html source

Hello,
You need to read the HTML specification:
1. If a META tag defines the encoding, it must be taken into account.
2. Otherwise, rely on the XML declaration, if there is one.
3. Otherwise, rely on the HTTP header.
4. Otherwise, use the default encoding of the machine.
You have to change the charset used to decode the stream whenever one of these cases applies.
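
As a rough illustration of the fallback order listed above, here is a minimal sketch. `CharsetChooser` and its parameters are hypothetical names, and it assumes the three candidate charset names have already been extracted elsewhere (null when absent). Note that the current HTML standard gives the HTTP header priority over the META tag, so the order of the candidates may need adjusting.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

final class CharsetChooser {
    // Hypothetical helper: returns the first usable charset, trying the
    // candidates in the order given in the post (META, XML, HTTP header),
    // and falling back to the machine default when none is usable.
    static Charset choose(String metaCharset, String xmlCharset, String httpCharset) {
        String[] candidates = { metaCharset, xmlCharset, httpCharset };
        for (String name : candidates) {
            if (name == null || name.trim().isEmpty()) {
                continue; // this source did not declare a charset
            }
            try {
                if (Charset.isSupported(name.trim())) {
                    return Charset.forName(name.trim());
                }
            } catch (IllegalCharsetNameException e) {
                // Malformed charset name: ignore it and try the next source.
            }
        }
        return Charset.defaultCharset();
    }
}
```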
  #3  
Old 12-01-2010
Member
 
Join Date: Dec 2009
Posts: 213
Re: Java program to retrieve html source

Hello,
Thank you for the answer. As a first step I will try a fixed encoding, just to get the right characters. How do I take the charset into account when reading from the InputStream, given that I want UTF-8 as the default? Should I change the way I read the stream? The problem is that URLConnection offers too many options. After that, there is still the question of detecting the encoding dynamically. For the moment, with uc.getHeaderField("Content-Type") I can get the meta content, and I can see it for XML too. On the other hand, I fall back to UTF-8 by default if the META tag and XML checks find nothing.
  #4  
Old 12-01-2010
Member
 
Join Date: Jan 2008
Posts: 1,515
Re: Java program to retrieve html source

Hello,
If you want the page to be treated as UTF-8, it must be sent as UTF-8. For that, your server has to send a header (the header() function in PHP), and the page itself must also be saved in UTF-8 (Eclipse allows this, as do most text editors). On the other hand, I do not think you get the contents of the META tag with getHeaderField(). Check the documentation, but I am fairly sure you do not.
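
The META tag is part of the page body, while getHeaderField() only reads the HTTP response headers, so the tag has to be looked for in the bytes of the page itself. A minimal sketch of that idea follows. `MetaCharsetSniffer` is a hypothetical name, the 1024-byte limit is an arbitrary assumption, and the regular expression is only a crude approximation of real HTML parsing.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaCharsetSniffer {
    // Matches both <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the charset name declared in a META tag
    // near the start of the page, or null if none is found.
    static String sniff(byte[] head) {
        int length = Math.min(head.length, 1024); // assumed limit, not from the thread
        // ISO-8859-1 maps every byte to one char, so the ASCII markup survives
        // whatever the real encoding of the page is.
        String text = new String(head, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }
}
```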
  #5  
Old 12-01-2010
Member
 
Join Date: Dec 2009
Posts: 213
Re: Java program to retrieve html source

Hello,
Code:
System.out.println("Content-Type: " + uc.getHeaderField("Content-Type"));
It returned:
Quote:
text/html; charset=UTF-8
Perhaps the meta tags are malformed, because getContentEncoding() returns null. On my side, the HTML page I send does declare UTF-8 as its encoding, but that is not where the problem is; it happens earlier. As soon as I read the page content, the wrong characters are already being stored in the StringBuilder. To check, I added:
Code:
System.out.println((char) c);
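
The charset appears as a parameter of the Content-Type value shown above. getContentEncoding() returns the Content-Encoding header (compression such as gzip), not the charset, so a null there is not surprising. A minimal sketch of pulling the charset out of a Content-Type value follows; `ContentTypeCharset` and `charsetOf` are hypothetical names, and the parsing is deliberately simple.

```java
import java.util.Locale;

final class ContentTypeCharset {
    // Hypothetical helper: extracts the charset parameter from a Content-Type
    // header value such as "text/html; charset=UTF-8". Returns the fallback
    // when the header is missing or has no charset parameter.
    static String charsetOf(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Some servers quote the value: charset="utf-8"
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? fallback : value;
            }
        }
        return fallback;
    }
}
```

With the connection from the first post, this could be called as `ContentTypeCharset.charsetOf(urlcon.getContentType(), "UTF-8")`.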
  #6  
Old 12-01-2010
Member
 
Join Date: May 2008
Posts: 2,290
Re: Java program to retrieve html source

Hello,
I think you are missing the charset for the InputStreamReader. Have a look at the code below.
Code:
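try {
  // Decode the bytes with the page's charset instead of casting each byte to char
  InputStreamReader inputstr = new InputStreamReader( in, encoding );
  BufferedReader buffered = new BufferedReader( inputstr );
  String line;
  while ( ( line = buffered.readLine() ) != null )
    textHtml += line + "\n";  // readLine() drops the line terminator, so put it back
} finally {
  in.close();
}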
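
For reference, a self-contained variant of the same idea might look like the sketch below. `HtmlReader` and `readAll` are hypothetical names. It uses a StringBuilder rather than string concatenation, closes the stream with try-with-resources, and assumes the caller already knows which charset to use.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

final class HtmlReader {
    // Hypothetical helper: reads the whole stream as text, decoding the bytes
    // with the given charset. Casting each byte to char, as in the first post,
    // only works for single-byte characters; accented letters in UTF-8 take
    // two or more bytes and must go through a decoder.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder html = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // readLine() strips the line terminator, so add one back.
                html.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return html.toString();
    }
}
```

With the connection from the first post, a call could look like `HtmlReader.readAll(urlcon.getInputStream(), Charset.forName("UTF-8"))`.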