Computer Science 221: Assignment #03
Spring 2007
Department of Informatics
Donald Bren School of Information and Computer Sciences
University of California, Irvine
The goal of this project is to compare the effectiveness of a Markov-model-based classification technique with that of a latent semantic indexing (LSI) classification technique.
The training data are the scripts from Assignment #02.
The data set is located here.
- This is an individual project. Your work should be done on your own.
- For both the Markov model portion and the LSI portion of the assignment, use leave-one-out cross-validation (LOOCV).
- Markov Model
- Train a Markov model on each genre of script
- Ignore punctuation: delete apostrophes and treat all other punctuation as a space. Collapse multiple spaces into one space.
- Use a single-character, second-order Markov model with 1/N count smoothing (N is the number of states)
- Use the trained Markov models to classify the left-out script
- Use the sum of log-probabilities to avoid underflow
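The Markov model steps above can be sketched as follows. This is only one possible reading of the specification, not a reference solution. It assumes that text is lowercased, that the state set is the 37 characters a-z, 0-9, and space, that N is the size of that state set, and that "1/N count smoothing" means adding 1/N to every transition count. The function names (`normalize`, `train`, `smoothed_prob`, `log_prob`, `classify`, `loocv_accuracy`) are hypothetical.

```python
import math
import re
from collections import defaultdict

# Assumed state set (not given in the assignment): lowercase letters, digits, space.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 "
N = len(ALPHABET)

def normalize(text):
    """Delete apostrophes, turn other punctuation into spaces, collapse spaces."""
    text = text.lower()  # assumption: case is folded
    text = text.replace("'", "").replace("\u2019", "")
    text = re.sub(r"[^a-z0-9]+", " ", text)  # any run of other characters -> one space
    return text.strip()

def train(texts):
    """Count how often each character follows each two-character context."""
    counts = defaultdict(lambda: defaultdict(int))
    for text in texts:
        s = normalize(text)
        for i in range(2, len(s)):
            counts[s[i - 2:i]][s[i]] += 1
    return counts

def smoothed_prob(model, context, ch):
    """P(ch | context) with 1/N added to each of the N possible counts."""
    following = model.get(context, {})
    return (following.get(ch, 0) + 1.0 / N) / (sum(following.values()) + 1.0)

def log_prob(model, text):
    """Sum of log-probabilities of every transition in the text."""
    s = normalize(text)
    return sum(math.log(smoothed_prob(model, s[i - 2:i], s[i]))
               for i in range(2, len(s)))

def classify(models, text):
    """Return the genre whose model gives the text the highest log-probability."""
    return max(models, key=lambda genre: log_prob(models[genre], text))

def loocv_accuracy(scripts):
    """scripts is a list of (genre, text) pairs; each script is held out once."""
    correct = 0
    for i, (genre, text) in enumerate(scripts):
        by_genre = defaultdict(list)
        for j, (g, t) in enumerate(scripts):
            if j != i:
                by_genre[g].append(t)
        models = {g: train(ts) for g, ts in by_genre.items()}
        correct += classify(models, text) == genre
    return correct / len(scripts)
```

Because the N smoothing terms of 1/N add up to 1, the denominator is the context count plus 1, and the smoothed probabilities for any context sum to 1 over the assumed state set.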
- LSI
- Using all of the data, create a term-document matrix
- Use frequency counts for one run
- Use some other count for a second run (z-scores?)
- Plot two graphs showing the drop in singular values
- Pick two points on each graph, between 5 and 20, to use as the ranks of the reduced-rank approximations to the term-document matrix
- Use a bag-of-words model to create a term-document matrix for each training set and classify the held-out document according to angular distance from the other documents in concept space
- Use frequency counts and the two inflection points for the first run
- Use some other type of count and the two inflection points for the second run
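The LSI steps above can be sketched with NumPy as follows. This is a rough illustration under several assumptions that the assignment does not state: the held-out document is folded into concept space by projecting it onto the top-k left singular vectors, it receives the genre of the single training document at the smallest angle (largest cosine), and the optional z-score weighting standardizes each term row using the training documents only. All function names are hypothetical.

```python
import re
from collections import Counter

import numpy as np

def tokenize(text):
    """Bag-of-words tokens: lowercase, apostrophes deleted (assumed preprocessing)."""
    return re.findall(r"[a-z0-9]+", text.lower().replace("'", ""))

def term_document_matrix(docs, vocab=None):
    """Rows are terms, columns are documents, entries are frequency counts."""
    counts = [Counter(tokenize(d)) for d in docs]
    if vocab is None:
        vocab = sorted(set().union(*counts))
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, c in enumerate(counts):
        for w, n in c.items():
            if w in index:  # terms outside the vocabulary are ignored
                A[index[w], j] = n
    return A, vocab

def singular_values(A):
    """Singular values in descending order, e.g. for plotting their drop-off."""
    return np.linalg.svd(A, compute_uv=False)

def classify_lsi(train_docs, train_labels, held_out, k, use_zscore=False):
    A, vocab = term_document_matrix(train_docs)
    q = term_document_matrix([held_out], vocab)[0][:, 0]
    if use_zscore:
        # Assumed alternative weighting: per-term z-scores from the training set.
        mean = A.mean(axis=1)
        std = A.std(axis=1)
        std[std == 0] = 1.0
        A = (A - mean[:, None]) / std[:, None]
        q = (q - mean) / std
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))
    docs_k = (s[:k, None] * Vt[:k]).T  # training documents in concept space
    q_k = U[:, :k].T @ q               # held-out document in concept space
    norms = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k)
    cosines = docs_k @ q_k / np.maximum(norms, 1e-12)
    return train_labels[int(np.argmax(cosines))]
```

Under LOOCV, `classify_lsi` would be called once per script with that script as `held_out` and all the others as `train_docs`. A centroid-based rule per genre would be an equally defensible reading of "angular distance from the other documents".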
- What to turn in
- Table 1
- Overall classification accuracy for the Markov Model technique
- Overall classification accuracy for the four LSI techniques
- Frequency Count
- Other Count
- A confusion matrix based on genre for:
- The Markov model technique
- The four LSI techniques
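A confusion matrix and the overall accuracy can be tallied from the LOOCV predictions along these lines. This is a minimal sketch, assuming rows are true genres and columns are predicted genres; the names `confusion_matrix` and `accuracy` are hypothetical.

```python
from collections import Counter

def confusion_matrix(true_labels, predicted_labels):
    """Return (genres, matrix) where matrix[i][j] counts scripts whose true
    genre is genres[i] and whose predicted genre is genres[j]."""
    genres = sorted(set(true_labels) | set(predicted_labels))
    counts = Counter(zip(true_labels, predicted_labels))
    matrix = [[counts[(t, p)] for p in genres] for t in genres]
    return genres, matrix

def accuracy(true_labels, predicted_labels):
    """Fraction of scripts whose predicted genre matches the true genre."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)
```

The diagonal of the matrix holds the correctly classified scripts, so the overall accuracy equals the sum of the diagonal divided by the number of scripts.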
- For each of the four LSI techniques a description of the primary concept axis phrase in terms of the original terms:
- e.g., 5*"terminator" + 1.3*"kill" - 4.1*"laugh"
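A phrase of this form could be produced from the first left singular vector of the term-document matrix, as sketched below. The sketch assumes that `U` is the matrix of left singular vectors (for example, the first value returned by `np.linalg.svd`), that `vocab` lists the terms in row order, and that only the terms with the largest absolute weights are reported. The function name `describe_axis` and the `top` parameter are hypothetical.

```python
import numpy as np

def describe_axis(U, vocab, axis=0, top=3):
    """Write one concept axis as a signed, weighted sum of its heaviest terms."""
    weights = U[:, axis]
    order = np.argsort(-np.abs(weights))[:top]  # largest |weight| first
    text = ""
    for rank, i in enumerate(order):
        w = float(weights[i])
        if rank == 0:
            text += f'{w:.2f}*"{vocab[i]}"'
        else:
            sign = "-" if w < 0 else "+"
            text += f' {sign} {abs(w):.2f}*"{vocab[i]}"'
    return text
```

The sign of a singular vector is arbitrary, so the whole phrase may come out negated from one run or library to another; only the relative signs of the terms are meaningful.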
- Put your output into a single Word or .pdf document. Clearly label all of your output.
- Email the resulting document to me with the subject line "CS 221 Assignment 03" by 11:59 pm on the due date (see calendar)