Chinese Webpage Analyzer
--------------------------------------------------------------------------------
Develop a webpage analyzer: the user inputs a URL, and the program retrieves the page content and processes it immediately.
Your system should be able to carry out:
1) Auto-detect the page's character encoding (Big5, GB, UTF-8, etc.)
2) To analyze the page content, the program needs to convert it to Unicode (UTF-8). Based on code ranges, the system can then identify the different types of characters and compute their statistics, including each character's frequency of occurrence and per-character details by character subset (the internal code and its form in the other encodings).
3) Users can also specify the analyzed result according to query options:
i. Number of characters to be displayed (most frequent or least frequent, e.g. the 10 most frequent characters or the 100 least frequent characters)
ii. Specific type: English, Chinese, or according to specified code ranges. The code ranges can be identified according to the subset definitions (which you can find in the Microsoft Word Symbol tool under the “Arial Unicode MS” font set).
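For the frequency statistics and query options above, one possible sketch in Python follows. The `SUBSETS` table and the function names are my own assumptions, not part of the assignment; the table lists only a few Unicode blocks and would need extending to cover every subset shown in the Word Symbol tool.

```python
from collections import Counter

# Hypothetical subset table: (name, first code point, last code point).
# Only a few Unicode blocks are listed here; extend as required.
SUBSETS = [
    ("Basic Latin", 0x0000, 0x007F),
    ("CJK Symbols and Punctuation", 0x3000, 0x303F),
    ("CJK Unified Ideographs", 0x4E00, 0x9FFF),
    ("Halfwidth and Fullwidth Forms", 0xFF00, 0xFFEF),
]

def subset_of(ch):
    """Return the name of the subset whose code range contains ch."""
    cp = ord(ch)
    for name, lo, hi in SUBSETS:
        if lo <= cp <= hi:
            return name
    return "Other"

def char_frequencies(text, subset=None):
    """Count non-whitespace characters, optionally limited to one subset."""
    chars = [c for c in text if not c.isspace()]
    if subset is not None:
        chars = [c for c in chars if subset_of(c) == subset]
    return Counter(chars)

def ranked(counter, n, least=False):
    """Return up to n (char, count, percent) rows, most or least frequent first."""
    rows = counter.most_common()  # sorted by count, descending
    if least:
        rows = rows[::-1]
    total = sum(counter.values())
    return [(ch, cnt, 100.0 * cnt / total) for ch, cnt in rows[:n]]
```

For example, `ranked(char_frequencies(page_text, "CJK Unified Ideographs"), 10)` would give the 10 most frequent Chinese characters, where `page_text` is the already decoded page content.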
Analyzer features: (Basic requirement)
1) Auto-detect content encoding: B5/GB/UTF-8
i) Check source content (HTML)
<meta charset="????">
ii) (Bonus) Check character internal code range
2) Source Encoding ---convert--> UTF-8
3) Analyze character frequency of occurrence
Result:
number of characters : XXX ,
? 504 55%
? 300 30%
...
...
4) Character Details:
?:
UTF code: EE FF
B5 code : CC DD
GB code : AA 33
?:
UTF code: EE F3
B5 code: DD CC
GB code : CC d3
(Bonus: B5 <-conversion-> GB)
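As a rough starting point for features 1, 2 and 4, here is a minimal Python sketch. It assumes the page has already been fetched as raw bytes (for instance with `urllib.request.urlopen(url).read()`), and the names `detect_encoding`, `hex_bytes` and `char_details` are hypothetical. The fallback is plain trial decoding rather than a true code-range analysis: many byte sequences are valid in both Big5 and GB2312, so the result is unreliable when the page declares no charset.

```python
import re

# Assumed candidate list, tried in this order when no charset is declared.
CANDIDATES = ["utf-8", "big5", "gb2312"]

def detect_encoding(raw):
    """Guess the encoding of raw HTML bytes."""
    # Step i: look for charset=... near the top of the HTML source.
    m = re.search(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)', raw[:4096], re.I)
    if m:
        declared = m.group(1).decode("ascii").lower()
        try:
            raw.decode(declared)
            return declared
        except (LookupError, UnicodeDecodeError):
            pass  # unknown or wrong declaration, fall through
    # Step ii (crude): accept the first candidate that decodes cleanly.
    for enc in CANDIDATES:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return "latin-1"  # decodes any byte sequence

def hex_bytes(ch, encoding):
    """Hex bytes of ch in the given encoding, or None if not representable."""
    try:
        return " ".join(f"{b:02X}" for b in ch.encode(encoding))
    except UnicodeEncodeError:
        return None

def char_details(ch):
    """Per-character details: code point plus UTF-8, Big5 and GB bytes."""
    return {
        "code point": f"U+{ord(ch):04X}",
        "UTF-8": hex_bytes(ch, "utf-8"),
        "Big5": hex_bytes(ch, "big5"),
        "GB": hex_bytes(ch, "gb2312"),
    }
```

Converting to Unicode is then `text = raw.decode(detect_encoding(raw))`. Note that re-encoding the text as Big5 or GB2312 only changes the byte representation; it does not map traditional characters to simplified ones or back, so the bonus conversion would need a separate character mapping table.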
======================================================
Can anyone give suggestions on some steps?