Tags: html, java, programming language, scripting languages, source code
#1
Java program to retrieve html source
I wrote a Java method that retrieves the source code of an HTML page for a given address. The problem is that it does not decode accented characters and similar symbols correctly. Does anyone have an idea? Quote:
Last edited by Aaliya Seth : 12-01-2010 at 10:49 AM.
#2
Re: Java program to retrieve html source
Hello, you need to follow the HTML specification when working out the encoding:
1. If a META tag defines the encoding, it must be taken into account.
2. Otherwise, rely on the XML declaration, if there is one.
3. Otherwise, rely on the HTTP header.
4. Otherwise, fall back to the default encoding of the machine.
You have to change how you interpret the charset as soon as one of these cases applies.
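The fallback chain above could be sketched as a small helper. This is only a sketch under assumptions: `CharsetResolver` and `firstSupported` are hypothetical names, and the candidate charset names (from META, the XML declaration and the HTTP header) are assumed to have been extracted elsewhere. Note that HTML 4.01 (section 5.2.2) ranks the HTTP charset parameter above a META declaration, so the helper leaves the order to the caller.

```java
import java.nio.charset.Charset;

final class CharsetResolver {
    /**
     * Returns the first candidate name that this JVM recognises as a charset,
     * or the fallback when every candidate is null, blank, or unsupported.
     */
    static Charset firstSupported(Charset fallback, String... candidates) {
        for (String name : candidates) {
            if (name == null || name.trim().isEmpty()) {
                continue; // this source did not declare an encoding
            }
            try {
                return Charset.forName(name.trim());
            } catch (IllegalArgumentException e) {
                // Covers both IllegalCharsetNameException and
                // UnsupportedCharsetException; try the next candidate.
            }
        }
        return fallback;
    }
}
```

For the order listed above, the call would look like `CharsetResolver.firstSupported(Charset.defaultCharset(), metaName, xmlName, httpName)`, where each name is whatever was found, or null.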
#3
Re: Java program to retrieve html source
Hello, thank you for the answer. As a first step I will try a fixed encoding to see whether I get the right characters. How do I take the charset into account when I retrieve the InputStream, given that I want UTF-8 as the default? Should I change the way I read the page? The problem is that URLConnection leaves me too many options. The same question applies to detecting the encoding dynamically. For the moment, with uc.getHeaderField("Content-Type") I can get the meta content, and I can see the XML one too. On the other hand, I fall back to UTF-8 if the META and XML checks find nothing.
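The value returned by uc.getHeaderField("Content-Type") is the HTTP header, typically something like `text/html; charset=UTF-8`. A minimal sketch of pulling the charset name out of such a value follows; `ContentTypeCharset` and `parse` are hypothetical names, and the parsing is deliberately simple rather than a full implementation of the header grammar.

```java
import java.util.Locale;

final class ContentTypeCharset {
    /**
     * Extracts the charset parameter from a Content-Type header value.
     * Returns null when the value is null or carries no charset parameter.
     */
    static String parse(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // The parameter value may be quoted: charset="utf-8"
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }
}
```

With a URLConnection named `uc`, as in the post, this would be called as `ContentTypeCharset.parse(uc.getHeaderField("Content-Type"))`, using UTF-8 when the result is null.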
#4
Re: Java program to retrieve html source
Hello, if you want the page to be treated as UTF-8, the server must send it as UTF-8. To do that, send a header from your server (the header() function in PHP). The page itself must also be encoded in UTF-8 (Eclipse allows this, as do most text editors). On the other hand, I do not think you get the contents of the META tag with getHeaderField(). Check the documentation, but I'm fairly sure you don't.
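Since getHeaderField() only reports HTTP headers, a META declaration would have to be read from the page body itself. A rough sketch, assuming the first bytes of the page are already in a byte array: `MetaCharsetSniffer` and `sniff` are hypothetical names, and the regular expression is a simplification, not a real HTML parser. Decoding the prefix as ISO-8859-1 is safe for this purpose because it maps every byte to a character and the META tag itself is ASCII.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaCharsetSniffer {
    // Matches both <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._:+-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset name declared in a META tag, or null if none is found. */
    static String sniff(byte[] head) {
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }
}
```

The name found this way can then be compared with the one from the HTTP header before the body is decoded for real.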
#5
Re: Java program to retrieve html source
Hello, Code: System.out.println("Content-Type: " + uc.getHeaderField("Content-Type")); Quote:
Code: System.out.println((char) c);
#6
Re: Java program to retrieve html source
Hello, I think you forgot to pass the charset to the InputStreamReader; see the code below.
Code:
try {
    // Decode the bytes with the page's encoding, not the platform default
    InputStreamReader inputstr = new InputStreamReader( in, encoding );
    BufferedReader buffered = new BufferedReader( inputstr );
    String line;
    while ( ( line = buffered.readLine() ) != null )
        textHtml += line + "\n"; // readLine() strips the line terminator
} finally {
    in.close();
}
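For reference, the same idea as a self-contained method. This is a sketch, not the original poster's code: `HtmlSourceReader` and `readAll` are hypothetical names, the stream is assumed to come from something like uc.getInputStream(), and a StringBuilder replaces the string concatenation.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

final class HtmlSourceReader {
    /**
     * Reads the whole stream and decodes it with the given charset.
     * The stream is closed when reading finishes or fails.
     */
    static String readAll(InputStream in, Charset encoding) throws IOException {
        StringBuilder textHtml = new StringBuilder();
        try (BufferedReader buffered =
                     new BufferedReader(new InputStreamReader(in, encoding))) {
            char[] buf = new char[4096];
            int n;
            // Reading into a char buffer keeps the original line terminators
            while ((n = buffered.read(buf)) != -1) {
                textHtml.append(buf, 0, n);
            }
        }
        return textHtml.toString();
    }
}
```

Passing the wrong charset here is what produces broken accents: the bytes are read correctly but mapped to the wrong characters.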