The links below point to just a few of the many data sets for text analysis that you can find on the Web, and should help you find data sets to work on for your projects. These are only examples of the many publicly available text data sets; feel free to use others that you find (or create) beyond those listed below.
Text Classification and Sentiment Analysis
Multiple text classification datasets from NLP-progress
Multiple sentiment analysis datasets from NLP-progress
Yelp Data Set Challenge (8 million reviews of businesses from over 1 million users across 10 cities)
Kaggle Data Sets with text content (Kaggle is a company that hosts machine learning competitions)
Labeled Twitter data sets from (1) the SemEval 2018 Competition and (2) the Sentiment 140 project
Amazon Product Review Data from UCSD. This is a very large and rich data set with review text, ratings, votes, product metadata, etc. The full data set is extremely large, so some of the smaller subsets provided may be better for class projects.
IMDB Movie Review Data with 50,000 movie reviews and binary sentiment labels
Well-known Movie review data for sentiment analysis, from Pang and Lee, Cornell
Product review data from Johns Hopkins University (goal is to predict ratings on a scale of 1 to 5)
Dialog/Conversation/Chatbots
A repository of large datasets for models of conversational response
A survey paper on data sets available for building data-driven dialogue systems
Amazon Topical Chat Dataset with accompanying research paper and blog post from Amazon.
ConvAI2 Competition Dataset
Multiple labeled dialog/chatbot datasets from NLP-progress
Cornell Movie-Dialogs Corpus
Transcripts from the TV series "The Office" (formatted for the R language)
Language Models and Auto-complete Algorithms
Language modeling datasets from NLP-progress
Ngram data from Peter Norvig (Google), with an accompanying tutorial book chapter
Google ngrams, and Google syntactic ngrams over time, from Google Books
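As a rough illustration of how n-gram counts like those above can drive auto-complete, here is a minimal sketch in Python. It assumes simple whitespace tokenization and a tiny made-up corpus; the function names `build_bigram_counts` and `suggest_next` are hypothetical and do not come from any of the linked resources.

```python
from collections import Counter

# Tiny made-up corpus, for illustration only; a real project would load
# counts from one of the n-gram data sets listed above.
toy_corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]


def build_bigram_counts(sentences):
    """Count how often each word is followed by each other word."""
    counts = {}
    for sentence in sentences:
        tokens = sentence.lower().split()  # assumption: whitespace tokenizer
        for prev, nxt in zip(tokens, tokens[1:]):
            counts.setdefault(prev, Counter())[nxt] += 1
    return counts


def suggest_next(counts, word, k=3):
    """Return up to k of the most frequent continuations of `word`."""
    followers = counts.get(word.lower(), Counter())
    return [w for w, _ in followers.most_common(k)]


counts = build_bigram_counts(toy_corpus)
print(suggest_next(counts, "the", k=1))  # ['cat']
print(suggest_next(counts, "sat"))       # ['on']
```

With real n-gram data the same idea applies, but the counts would usually be smoothed and stored in a more compact structure than nested dictionaries.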
Question-Answering Datasets
Multiple question-answering datasets from NLP-progress
WikiQA, a data set for "open-domain" question answering, from Microsoft Research
Question-Answering Data Sets from TREC (funded by the National Institute of Standards and Technology, NIST)
Question Answering Corpus from DeepMind
The Allen AI Science Challenge on Kaggle (competition ended in 2016)
Summarization
Multiple summarization datasets from NLP-progress
Other Interesting Text Data Sets (could be used for multiple different types of projects)
Enron email data set, from CMU (note that other "cleaner" versions are available on the Web if you search)
CMU Movie Summary Corpus
Book Summaries Corpus
Full text of US patents from 1980 to 2015, from the USPTO (US Patent and Trademark Office), hosted by Google (could be used, for example, to detect trends and changes in patent language and concepts over time)
Very large data set of all Reddit submissions between 2006 and 2015
Data Sets on "learning to rank" (for Web search, from Microsoft Research)
All of Wikipedia (can be used to build classifiers using category labels or to provide additional information for other models such as n-gram statistics)
Various text and Web-related data sets from Yahoo! Labs (these data sets could be used for different tasks).
The DBpedia Data Set (an example of a large-scale ontology/knowledge-base)