FernUniversität Hagen

Fakultät für Mathematik und Informatik

Chair of Multimedia and Internet Applications (Lehrgebiet Multimedia und Internetanwendungen)

Test your programming skills

The task

A frequent task in Information Retrieval (IR) is the calculation of term frequencies: for every term, count how often it occurs in a text. Here a term is defined as the stem of a word. Examples:

word -> word stem (term)
going -> go
apple -> appl
apples -> appl
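
The mapping above can be illustrated with a small sketch. Note that this is a toy suffix stripper, not the Porter algorithm: the suffix list `SUFFIXES` and the function name `toy_stem` are assumptions made for illustration, and the rules are only chosen so that the three examples come out as shown. A real solution would use a proper stemmer.

```python
# Toy suffix stripper: NOT the Porter algorithm, only an illustration
# of what "reducing a word to its stem" means.
SUFFIXES = ("ing", "es", "s", "e")  # hypothetical rule list, tried in order


def toy_stem(word):
    """Lowercase the word and strip the first matching suffix."""
    w = word.lower()
    for suffix in SUFFIXES:
        # Keep at least two characters so short words such as "is" survive.
        if w.endswith(suffix) and len(w) - len(suffix) >= 2:
            return w[: -len(suffix)]
    return w


if __name__ == "__main__":
    for word in ("going", "apple", "apples"):
        print(word, "->", toy_stem(word))
```

Because the rules are so crude, many English words will be stemmed differently than by the Porter algorithm; the sketch only shows the shape of the function that the counting step needs.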

In this task the document in question (the first scene of Shakespeare's Hamlet) is in XML format, so the file must first be downloaded and parsed. Only the contents of the <LINE> elements are to be used, meaning everything enclosed in <LINE>...</LINE>. From this text the term frequencies (after stemming) are to be calculated. The output of the program is a list of all terms, together with their occurrence frequencies within <LINE> elements. The output should look like this:

word count
go 7
appl 2
situat 5
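
A minimal sketch of the extraction and output steps, using Python's standard `xml.etree.ElementTree` module. The miniature document in `SAMPLE`, its surrounding element names (`PLAY`, `SPEECH`, `SPEAKER`, `STAGEDIR`) and the helper names `line_texts` and `format_counts` are assumptions for illustration; the real file is larger and may be structured differently. Downloading is left out because it depends on where the file is hosted.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature sample standing in for the real XML file.
SAMPLE = """<PLAY>
  <SPEECH>
    <SPEAKER>BERNARDO</SPEAKER>
    <LINE>Who's there?</LINE>
  </SPEECH>
  <SPEECH>
    <SPEAKER>FRANCISCO</SPEAKER>
    <LINE>Nay, answer me: stand, and unfold yourself.</LINE>
  </SPEECH>
</PLAY>"""


def line_texts(xml_source):
    """Return the text of every <LINE> element, in document order."""
    root = ET.fromstring(xml_source)
    # iter("LINE") walks the whole tree; itertext() also collects text
    # that sits inside child elements nested within a <LINE>.
    return ["".join(line.itertext()).strip() for line in root.iter("LINE")]


def format_counts(counts):
    """Render a term -> count mapping in the 'word count' layout."""
    rows = ["word count"]
    rows += [f"{term} {n}" for term, n in counts.items()]
    return "\n".join(rows)


if __name__ == "__main__":
    for text in line_texts(SAMPLE):
        print(text)
    print(format_counts({"go": 7, "appl": 2, "situat": 5}))
```

For a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring(...)`. Text in other elements, such as speaker names, is ignored because only `LINE` elements are visited.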

It is recommended to implement term counting on plain text first and then extend the program to XML parsing.
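
A possible starting point for the plain-text step, sketched with `re` and `collections.Counter` from the standard library. The function name `term_frequencies`, the `stem` parameter and the tokenization rule (a word is a run of the letters a to z after lowercasing) are assumptions; the task does not prescribe how words are to be split. The default `stem` leaves words unchanged, so a real stemmer can be plugged in later.

```python
import re
from collections import Counter


def term_frequencies(text, stem=lambda word: word):
    """Count how often each (stemmed) word occurs in plain text.

    `stem` is any function mapping a word to its term; the default
    is the identity, i.e. no stemming at all.
    """
    # Assumed tokenization: lowercase, then take runs of the letters a-z.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(stem(word) for word in words)


if __name__ == "__main__":
    counts = term_frequencies("Go, go! Stop and go.")
    for term, n in counts.items():
        print(term, n)
```

Keeping the stemmer as a parameter means the counting logic can be tested on its own before stemming and XML parsing are added.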

Feel free to complete this task in your favourite language. Further resources for solving the problem in the most important languages are listed below. Feel free to contact us if you have questions concerning this task. If you would like us to check your results, please send us the code and the output of your running program.


Hints for C++

Resources


Hints for Java

XML can be parsed with Xerces. For stemming, a variant of the well-known Porter Stemming Algorithm is available. To count the occurrence frequencies, one can use java.util.StringTokenizer and java.util.Hashtable.

Resources


Hints for Perl

Resources