Java program to retrieve html source
Hello,
I wrote a Java method that retrieves the source code of an HTML page for a given address. The problem is that it does not handle accented characters and the like correctly. Does anyone have an idea?
Quote:
URL loc = new URL(address);
URLConnection urlcon = loc.openConnection();
InputStream in = urlcon.getInputStream();
int c = in.read();
StringBuilder build = new StringBuilder();
while (c != -1) {
    build.append((char) c);
    c = in.read();
}
toreturn = build.toString();
Re: Java program to retrieve html source
Hello,
You must read the HTML specification:
1. If a META tag defines the encoding, it must be taken into account.
2. Otherwise, rely on the XML declaration, if there is one.
3. Otherwise, rely on the HTTP header.
4. Otherwise, take the default encoding of the machine.
You have to change how you decode the bytes as soon as one of these cases applies.
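As an illustration of step 3 in the list above, here is a minimal sketch of pulling the charset out of a Content-Type header value. The class name HeaderCharset and the method fromContentType are hypothetical, and the fallback is whatever the caller passes in. Note that the current HTML standard gives the HTTP header priority over a META declaration, so the order in which you apply the steps is worth checking against the specification.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original posts.
class HeaderCharset {
    // Matches charset=UTF-8, charset = "utf-8", and similar spellings.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the header value, or the fallback
    // when the header is missing, has no charset, or names an unknown one.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
        System.out.println(fromContentType("text/html", StandardCharsets.ISO_8859_1));
    }
}
```

With a URLConnection, the header value would come from getHeaderField("Content-Type") or getContentType().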
Re: Java program to retrieve html source
Hello,
Thank you for the answer. As a first step I will try a fixed encoding to see whether I get the right characters. How do I take the charset into account when I read from the InputStream, given that I want UTF-8 by default? Should I change the way I read the stream? The problem is that URLConnection offers too many options. Beyond that, there is still the question of detecting the encoding dynamically. For the moment, with uc.getHeaderField("Content-Type") I can retrieve the meta content. I can see it for XML too. However, I fall back to UTF-8 if the checks on the meta tags and the XML declaration fail.
Re: Java program to retrieve html source
Hello,
If you want the page to be treated as UTF-8, the server must send it as UTF-8. For this, your server must send the appropriate header (the header() function in PHP). It also requires that the page itself is saved in UTF-8 (Eclipse allows this, as do most text editors). However, I do not think you get the contents of the META tag with getHeaderField(). Check the documentation, but I'm pretty sure you do not.
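Since getHeaderField() only sees HTTP headers, a META declaration has to be looked for in the page bytes themselves. Below is a rough sketch of one way to do that, assuming the declaration is ASCII-compatible and appears near the start of the page. The class name MetaCharsetSniffer, the method sniff and the 1024-byte window are assumptions made for the example, and a regular expression is only an approximation of real HTML parsing.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original posts.
class MetaCharsetSniffer {
    // Finds charset=... inside a <meta ...> tag, with or without quotes.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([^\\s\"'/>;]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name, or null if no declaration is
    // found in the first 1024 bytes (the window size is an assumption).
    static String sniff(byte[] pageBytes) {
        int len = Math.min(pageBytes.length, 1024);
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives
        // regardless of the page's real encoding.
        String head = new String(pageBytes, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] page = ("<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=UTF-8\"></head></html>")
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(sniff(page));
    }
}
```

The returned name still has to be validated, for example with Charset.isSupported(), before it is used to decode the page.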
Re: Java program to retrieve html source
Hello,
Code:
System.out.println("Content-Type: " + uc.getHeaderField("Content-Type"));
It returned:
Quote:
text/html; charset=UTF-8
Perhaps the meta tags are malformed, because getContentEncoding() returns null. Besides, the HTML page I send really is set to UTF-8, so that is not where the problem lies; it occurs earlier. As soon as I read the page content, the wrong characters are already stored in the StringBuffer. To verify, I added:
Code:
System.out.println((char) c);
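The symptom described here can be reproduced without any network access. The sketch below, with hypothetical names CastVersusDecode, castEachByte and decode, compares casting each byte to a char (as the read loop in the question does) with decoding the same bytes as UTF-8. It assumes the page really is UTF-8, where an accented letter such as é takes two bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration class; not part of the original posts.
class CastVersusDecode {
    // Mimics the loop from the question: every byte becomes one char.
    static String castEachByte(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            // InputStream.read() returns 0..255, hence the mask.
            sb.append((char) (b & 0xFF));
        }
        return sb.toString();
    }

    // Decodes the whole byte sequence with an explicit charset.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is the two bytes 0xC3 0xA9 in UTF-8.
        byte[] utf8 = "\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(castEachByte(utf8).length()); // 2 chars: U+00C3 U+00A9
        System.out.println(decode(utf8).length());       // 1 char:  U+00E9
    }
}
```

The cast produces two unrelated characters for one accented letter, which matches the bad characters seen in the output.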
Re: Java program to retrieve html source
Hello,
I think you are missing the charset for the InputStreamReader; see the code below.
Code:
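// 'in' is the InputStream from the URLConnection;
// 'encoding' is the charset name, for example "UTF-8".
try {
    InputStreamReader inputstr = new InputStreamReader(in, encoding);
    BufferedReader buffered = new BufferedReader(inputstr);
    String line;
    // Note: readLine() drops the line terminators.
    while ((line = buffered.readLine()) != null) {
        textHtml += line;
    }
} finally {
    in.close();
}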
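For completeness, here is a self-contained variant of the same idea, written as a sketch rather than a definitive version. The class name HtmlReader and the method readAll are hypothetical. It reads through a char buffer instead of readLine(), so line breaks are preserved, and uses try-with-resources to close the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the original posts.
class HtmlReader {
    // Decodes the whole stream with the given charset and closes it.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for urlcon.getInputStream().
        byte[] page = "<p>caf\u00e9</p>\n".getBytes(StandardCharsets.UTF_8);
        String html = readAll(new ByteArrayInputStream(page), StandardCharsets.UTF_8);
        System.out.println(html.length());
    }
}
```

In the original program, the first argument would be urlcon.getInputStream() and the second the charset chosen from the Content-Type header or the fallback.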