Below is a list of final year projects for ODL students at Queen Mary,
University of London for the academic year 2005/2006. I'm also open to
your suggestions, as long as they are related to the projects listed
below. Please only choose a project with a high difficulty level if
you are very confident that you can complete it.
Natural Language Processing Projects
-
Project 1: Proper name translation from Chinese into English
Description: Translating proper names from Chinese into
English is a notoriously hard problem in Machine Translation, because
names are often not included in a translation
dictionary. Foreign (i.e. non-Chinese) names are especially
problematic because their Chinese forms are phonetic approximations.
One way to gather these translations is by crawling the web, especially
Wikipedia (see link below). For example, in this web page the
English proper names are followed by the actual Chinese names.
Comment: this project should focus on simplified Chinese.
Expected outcome:
- Implement a system that extracts Chinese proper names and their
English translations from the Wikipedia web site, or any other web
site you consider appropriate.
- The system has to detect English names and the corresponding Chinese
names using pattern matching (the patterns have to be implemented by you).
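As a starting point for the pattern matching, here is a minimal sketch
in Python. It assumes one possible layout only: a capitalised English
name followed by its Chinese form, optionally in brackets. The pattern,
the function name extract_name_pairs and the sample sentence are
illustrative assumptions, not taken from any particular web page; real
pages will need more patterns.

```python
import re

# Assumed layout: "English Name (中文名)" or "English Name 中文名".
NAME_PAIR = re.compile(
    r"([A-Z][a-z]+(?: [A-Z][a-z]+)*)"   # English name: capitalised words
    r"\s*[(\uff08]?\s*"                 # optional ASCII or full-width bracket
    # Chinese name: CJK characters, optionally joined by a middle dot
    r"([\u4e00-\u9fff]+(?:[\u00b7\u2022\u2027\u30fb][\u4e00-\u9fff]+)*)"
)


def extract_name_pairs(text):
    """Return a list of (english, chinese) pairs found in text."""
    return NAME_PAIR.findall(text)


sample = "George Washington (乔治·华盛顿) and Paris 巴黎 are listed."
for english, chinese in extract_name_pairs(sample):
    print(english, "->", chinese)
```

A single pattern like this will miss names with initials, hyphens or
lower-case particles, so expect to add further patterns as you inspect
real pages.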
Recommended prerequisites: Curiosity to work with real data,
and some programming skills.
Difficulty level: medium
-
Project 2: Inducing Comparable Corpora for Statistical
Machine Translation
Description: Statistical machine translation uses parallel
corpora to estimate translation probabilities between two
languages. A parallel corpus is a list of sentence pairs of the
form <sentence n in language A, sentence m in language B>,
where sentence n is a translation of sentence m, and vice versa.
Having a long list of parallel sentences allows one to generalize and
find proper word and phrase translations. Sometimes parallel corpora
are produced by nations that have more than one official language
(e.g. Canada), or by news agencies. However, producing such a
parallel corpus is expensive, and the result is often very
domain-specific, e.g. debates of the European Parliament. On the other
hand, many news agencies, like the BBC, publish their news in many
languages, and although the stories are often not translations of each
other, they share a lot of information that can be used for estimating
translations.
The task of this project is to develop a program that identifies
news stories from different languages that are on the same topic
and extracts sentences/passages that are corresponding translations.
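One simple way to approach the first step (finding stories on the same
topic) is sketched below in Python. It assumes you have a small seed
dictionary between the two languages, maps each source word to its
translation, and scores story pairs by Jaccard word overlap. The
dictionary entries, the example stories and the function names are all
made up for illustration; this is a baseline idea, not a prescribed
method.

```python
def translate_bag(words, dictionary):
    """Map source-language words to target-language words via a seed dictionary."""
    return {dictionary[w] for w in words if w in dictionary}


def jaccard(a, b):
    """Jaccard overlap of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(source_story, target_stories, dictionary):
    """Return (score, story) for the target story closest to the source story."""
    src = translate_bag(source_story.lower().split(), dictionary)
    scored = [(jaccard(src, set(t.lower().split())), t) for t in target_stories]
    return max(scored)


# Toy German-English seed dictionary (assumed, for illustration only).
seed = {"wahl": "election", "präsident": "president", "fußball": "football"}
english_stories = [
    "The president wins the election",
    "Football cup final tonight",
]
print(best_match("Präsident gewinnt Wahl", english_stories, seed))
```

The same overlap score can then be reused at sentence level to propose
candidate translation pairs within a matched pair of stories.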
Expected outcome:
- A program that identifies
news stories from different languages that are on the same topic
and extracts sentences/passages that are corresponding translations.
- Evaluation of the built comparable corpus with an existing
machine translation system.
Recommended prerequisites: Some very basic background in probability theory (or the
willingness to obtain this background); programming skills.
Difficulty level: high
Pointers:
-
Project 3: Speaker turn detection
Description: The problem is to detect when, in a dialog
situation, speaker A stops and speaker B starts speaking. Textual
clues help speaker identification systems and dialog analysis systems
to differentiate between speakers.
Expected outcome:
- Implement a system that learns keyphrases that indicate a
speaker turn. You can do that in a supervised fashion
using data that is already annotated with speaker turns,
like the transcripts from talk shows such as CNN's Larry
King Live, Crossfire, The Capital Gang, or many other CNN
shows. All transcripts provided by CNN can be accessed here. The
idea is to find clue words or phrases that indicate a
speaker turn.
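To make the supervised idea concrete, here is a minimal Python sketch.
It assumes the transcript has already been split into (speaker,
utterance) pairs, and it ranks words by how often they open a turn
relative to how often they occur at all. The function name, the
min_count threshold and the toy dialog are assumptions for
illustration; real transcripts need tokenisation and longer phrases.

```python
from collections import Counter


def turn_initial_clues(turns, min_count=2):
    """Rank words by the fraction of their occurrences that open a turn."""
    first = Counter()   # how often a word is the first word of a turn
    total = Counter()   # how often a word occurs anywhere
    for _speaker, utterance in turns:
        words = utterance.lower().split()
        if not words:
            continue
        first[words[0]] += 1
        total.update(words)
    scores = {w: first[w] / total[w] for w in first if first[w] >= min_count}
    # Highest score first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy dialog, invented for illustration.
toy_turns = [
    ("A", "well I think that is right"),
    ("B", "well no I disagree"),
    ("A", "thanks for that"),
    ("B", "well that is well said"),
]
print(turn_initial_clues(toy_turns))
```

Words with a high score are candidate clue words; the same counting
can be extended to two- or three-word phrases at the start and end of
each turn.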
Recommended prerequisites: Curiosity to work with real data,
and some programming skills.
Difficulty level: medium
-
Project 4: Anaphora resolution (locations in Chinese)
Description: It is common in Chinese to use the first character of a country name
as an abbreviation for the entire name. A country alias (or anaphora) system would take the
first character of the country name and then look for it elsewhere in the text.
Expected outcome:
- Implement a system that
resolves these anaphora to the most likely candidate (the full
name) in a new document, i.e. a document that was not used for
training the system. Given an input text, the system should output
the same text, but add to each occurrence of an anaphor the entity
(location) to which it refers.
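The expected input/output behaviour can be illustrated with a
rule-based baseline in Python. It assumes a given list of country
names (a hypothetical gazetteer), treats the first character of every
country mentioned in the text as its alias, and appends the full name
in square brackets after each standalone alias. The function name, the
bracket notation and the sample sentence are assumptions; the sketch
does no learning and does not handle two countries that share a first
character.

```python
def annotate_abbreviations(text, country_names):
    """Append the full country name after each one-character abbreviation."""
    # Only countries that are mentioned in full can serve as antecedents.
    mentioned = [name for name in country_names if name in text]
    alias = {name[0]: name for name in mentioned}
    out, i = [], 0
    while i < len(text):
        # Copy full names unchanged so their first character is not annotated.
        full = next((n for n in mentioned if text.startswith(n, i)), None)
        if full:
            out.append(full)
            i += len(full)
        elif text[i] in alias:
            out.append(f"{text[i]}[{alias[text[i]]}]")
            i += 1
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


names = ["中国", "日本", "美国"]  # assumed gazetteer
sample = "中国和日本举行会谈,中日关系改善。"
print(annotate_abbreviations(sample, names))
```

Note that characters such as 日 also occur in ordinary words, so this
baseline over-generates; deciding which occurrences are really country
aliases is where the trained part of the system comes in.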
Recommended prerequisites: Curiosity to work with real data,
and some programming skills.
Difficulty level: medium
Pointers:
- For this project you should use large collections of newswire articles,
like Xinhua