A frequent task in Information Retrieval (IR) is the calculation of term frequencies: for every term, count how often it occurs in a text. Here, a term is defined as the stem of a word. Examples:
|word||->||word stem (term)|
In this task, the document in question (the first scene of Shakespeare's Hamlet) is
in XML format. The file must therefore first be downloaded and
parsed. After that, only the content of the
<LINE> elements is to be used, meaning everything
this the term frequencies (after stemming) are to be
calculated. The output of the program is a list of all terms,
together with their occurrence frequencies within the
<LINE> elements. The output should look
It is recommended to implement term counting on plain text first and then extend the program to handle XML parsing.
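The two-step approach can be sketched in Python as follows. This is only a minimal illustration under several assumptions, not a reference solution: `crude_stem` is a hypothetical placeholder that strips a few suffixes and is not the Porter algorithm, the function names are made up for this example, and the sample XML is invented (only the `<LINE>` element name comes from the task; the other element names are assumptions about the file's structure).

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def crude_stem(word):
    """Hypothetical placeholder stemmer, NOT the Porter algorithm.

    Strips the first matching suffix if at least three letters remain.
    """
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_frequencies(text, stem=crude_stem):
    """Step 1: count stemmed terms in plain text.

    Words are runs of the letters a-z after lowercasing, so
    punctuation and apostrophes act as separators.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(stem(word) for word in words)


def line_text(xml_source):
    """Step 2: collect the text inside every <LINE> element of an XML string."""
    root = ET.fromstring(xml_source)
    return "\n".join("".join(line.itertext()) for line in root.iter("LINE"))


if __name__ == "__main__":
    # Invented sample data, not the Hamlet file used in the task.
    sample = (
        "<SCENE><TITLE>Sample scene</TITLE>"
        "<SPEECH><SPEAKER>A</SPEAKER>"
        "<LINE>Walking and talking</LINE>"
        "<LINE>He walked, she talks</LINE></SPEECH></SCENE>"
    )
    for term, count in sorted(term_frequencies(line_text(sample)).items()):
        print(term, count)
```

Keeping the stemmer as a parameter means a real Porter implementation can be passed in later without touching the counting or the XML code. On the invented sample, the title and speaker are ignored and the terms `walk` and `talk` are each counted twice.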
Feel free to complete this task in your favourite language. Further resources for solving the problem in the most important languages are listed below. Feel free to contact us if you have questions about this task. If you would like us to check your results, please send us the code and the output of your program.
XML can be parsed with Xerces. For stemming,
a variant of the well-known Porter Stemming Algorithm is
available. For counting the occurrence frequencies one can use