The goal of this project is to investigate whether it is possible to identify the author of a text by comparing it to texts whose authors are already known. In the project we will use two collections of texts:
training_texts that we will use as reference texts whose authors we already know.
test_texts whose authors we will try to predict by comparing them to training_texts.
Each of these collections is given below as a dictionary. Dictionary keys identify authors and shortened titles of texts; values are URLs of text files.
base_url = "https://cdn.jsdelivr.net/gh/bbadzioch/mth337_site@main/projects/text_attribution/books/"
training_urls = {
"addams_democracy": base_url + "addams_democracy.txt",
"emerson_conduct": base_url + "emerson_conduct.txt",
"fuller_life": base_url + "fuller_life.txt",
"muir_boyhood": base_url + "muir_boyhood.txt",
"thoreau_journal": base_url + "thoreau_journal.txt"
}
test_urls = {
"addams_conscience": base_url + "addams_conscience.txt",
"addams_hull_house": base_url + "addams_hull_house.txt",
"addams_youth": base_url + "addams_youth.txt",
"emerson_english": base_url + "emerson_english.txt",
"emerson_nature": base_url + "emerson_nature.txt",
"emerson_representative_men": base_url + "emerson_representative_men.txt",
"fuller_europe": base_url + "fuller_europe.txt",
"fuller_summer": base_url + "fuller_summer.txt",
"fuller_woman": base_url + "fuller_woman.txt",
"muir_sierra": base_url + "muir_sierra.txt",
"muir_walk": base_url + "muir_walk.txt",
"muir_yosemite": base_url + "muir_yosemite.txt",
"thoreau_walden": base_url + "thoreau_walden.txt",
"thoreau_yankee": base_url + "thoreau_yankee.txt",
"thoreau_rivers": base_url + "thoreau_rivers.txt"
}
Project
Compare test texts to training texts using three different methods:
By looking at the distribution of lengths of words in texts.
By identifying the most common words in the training texts and looking at their frequencies.
By looking at frequencies of punctuation characters:
punctuation = ",-.;':!?()&"
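The three methods above each turn a text into a list of numbers that can be compared between texts. The following is a minimal sketch of one way to compute them, not a prescribed solution. The function names, the word-splitting rule (runs of letters, with inner apostrophes kept and counted in the word length), the cap max_len=15, and the default of 20 common words are all assumptions made for illustration.

```python
import re
from collections import Counter

punctuation = ",-.;':!?()&"  # punctuation characters listed in the project


def get_words(text):
    """Split text into lowercase words (assumed rule: runs of letters,
    optionally joined by apostrophes, as in "don't")."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def word_length_distribution(text, max_len=15):
    """Fraction of words of each length 1..max_len.
    Words longer than max_len are counted as having length max_len."""
    words = get_words(text)
    counts = Counter(min(len(w), max_len) for w in words)
    total = max(len(words), 1)  # avoid division by zero for empty texts
    return [counts[n] / total for n in range(1, max_len + 1)]


def most_common_words(texts, n=20):
    """The n most frequent words across a collection of texts."""
    counts = Counter()
    for text in texts:
        counts.update(get_words(text))
    return [w for w, _ in counts.most_common(n)]


def word_frequencies(text, vocabulary):
    """Relative frequency in text of each word in vocabulary."""
    words = get_words(text)
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[w] / total for w in vocabulary]


def punctuation_frequencies(text, chars=punctuation):
    """Share of each character in chars among all such characters in text."""
    counts = Counter(c for c in text if c in chars)
    total = max(sum(counts.values()), 1)
    return [counts[c] / total for c in chars]
```

Each function returns values in a fixed order (by word length, by vocabulary word, by punctuation character), so the results for two texts can be compared position by position or plotted side by side.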
Investigate how successful each of these methods is in identifying the correct authors of the test texts.
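One possible way to measure success is to predict, for each test text, the author of the closest training text and count how often the prediction is right. The sketch below assumes that each text has already been converted to a list of numbers (for example word-length fractions), that Euclidean distance is a reasonable measure of closeness, and that the author is the part of a dictionary key before the first underscore. The function names are hypothetical, and other distance measures may work better.

```python
import math


def distance(p, q):
    """Euclidean distance between two equal-length feature lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def author_of(key):
    """Author part of a key such as 'emerson_representative_men'."""
    return key.split("_")[0]


def predict_author(features, training_features):
    """Author of the training text whose features are closest to `features`.
    training_features maps training keys to feature lists."""
    closest = min(
        training_features,
        key=lambda k: distance(features, training_features[k]),
    )
    return author_of(closest)


def accuracy(test_features, training_features):
    """Fraction of test texts whose predicted author matches their key."""
    correct = sum(
        predict_author(vec, training_features) == author_of(key)
        for key, vec in test_features.items()
    )
    return correct / len(test_features)
```

Running accuracy once per method, with the same training and test keys, gives three numbers that can be compared directly.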
Here is a URL of one more text file. The file was obtained by taking a book by one of the authors of the training texts and rearranging its words in alphabetical order. Use the methods listed above to investigate who is the author of the scrambled text:
https://cdn.jsdelivr.net/gh/bbadzioch/mth337_site@main/projects/text_attribution/books/mystery.txt
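A short sketch of how the text files might be downloaded, using urllib.request from the Python standard library. The helper names fetch_text and load_texts are hypothetical, and the files are assumed to be UTF-8 encoded; bytes that do not decode are replaced rather than raising an error.

```python
from urllib.request import urlopen

mystery_url = (
    "https://cdn.jsdelivr.net/gh/bbadzioch/mth337_site@main/"
    "projects/text_attribution/books/mystery.txt"
)


def fetch_text(url, encoding="utf-8"):
    """Download the file at url and return its content as a string.
    The encoding is an assumption; undecodable bytes are replaced."""
    with urlopen(url) as response:
        return response.read().decode(encoding, errors="replace")


def load_texts(urls):
    """Map each key of a {key: url} dictionary to the downloaded text."""
    return {key: fetch_text(url) for key, url in urls.items()}


# Example use (requires a network connection):
# mystery = fetch_text(mystery_url)
```

Rearranging words alphabetically does not change which words occur or how long they are, so word lengths and word frequencies can be computed for the scrambled text as for any other. Whether punctuation frequencies are still informative depends on how the scrambled file handles punctuation, which is worth checking before relying on that method.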