This exercise brings together what we have learned about text analysis so far, and builds on some of the skills from Practice Exercise #3.  You will be creating your own notebook from scratch, calling data, manipulating that data, and analyzing it.  Please submit the notebook and the .csv files you created in parts 3 and 7 via this Lyceum link (section A) (section B) by the start of class on November 6th.


Set up

  1. Create a new notebook. Name it DSC_104_PE4 (do not include your name – I will grade anonymously and then return your notebooks)
  2. Title your notebook. Make your first cell into a markdown cell (hit escape and then m) and then enter one hashtag – this makes the cell into a title cell.  Write a placeholder title (you’ll come back at the end and change the title). Run the cell.
  3. Title your sections. Repeat the steps above in a new cell for each section named below, but use two hashtags (this creates smaller title text).  Name them: Analysis, Packages, Calling Data, Cleaning Data, Sentiment Analysis, and Topic Modeling

Installing Packages

  1. In the Packages section, use the install.packages() function to install each of the following packages and the library() function to load it.  Make sure that each install.packages() call and each library() call is in its own cell. (So you should have eight cells in total)
    1. tm
    2. tidyverse
    3. tidytext
    4. topicmodels

Calling Data

  1. In the Calling Data section, first create a code cell and run the code:
    sample(1:121, 3)

    The sample() function takes as inputs a range of numbers from which to randomly select values, and the number of values to select. We are using it here to randomly pick which oral histories you will work with in this assignment.

  2. Then, create a variable called file_names, which holds the CSV located at .  This file contains a list of oral history files in which each row is a question or answer. One column contains the number of the file, the other contains the name of the file.
  3. Determine the names of the columns in file_names
  4. Determine the names of the interviews associated with the random numbers you selected. (This means using the $ to identify a column and [] to identify a row)
  5. Call each interview and store it in a different variable.  The interview .csv files are in the folder  So if I were looking for a file called “Shrout_Anelise.csv”, I would run
    Shrout <- read.csv(, header = TRUE)
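
The indexing in steps 3 and 4 can be tried out on a small stand-in table before you touch the real data. This is only a sketch: the dataframe below and its column names (number, name) are made up for illustration, so check the real column names of file_names with names() rather than assuming they match.

```r
# Hypothetical stand-in for file_names; the real column names may differ.
file_names <- data.frame(
  number = 1:5,
  name = c("A.csv", "B.csv", "C.csv", "D.csv", "E.csv"),
  stringsAsFactors = FALSE
)

# Randomly pick 3 row numbers (the assignment itself uses 1:121).
picks <- sample(1:nrow(file_names), 3)

# List the column names of the dataframe.
names(file_names)

# $ selects a column; [] selects rows within that column.
picked_names <- file_names$name[picks]
picked_names
```

Because sample() is random, the three names printed will change each time the cell is run.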

Cleaning Data

  1. Consider your first interview.  Determine the names of the columns in the dataframe.
  2. Use summary() to look at only the column that references which person is speaking
  3. Create a new variable that holds only the answers (not the questions) from the interview, and use which() to remove all of the rows that are questions asked by the interviewer.  You might need to look at the data to do this.
  4. Check to make sure you have kept the answers.
  5. Adapt the code from the Cleaning the data section of the Intro to Topic Modeling notebook to clean each row of this new dataframe. This will require the creation of a variable that holds the “clean” version of your text (e.g. Shrout_clean), a variable that contains a new cleaned row of text, and the adaptation of a for-loop.
  6. Add a line in your for-loop that creates a new cell in a new column (call it something like answer), and fill that new cell with the number of the iteration of the loop. (This means that each row of cleaned text will have a number associated with it).
  7. Write the resulting cleaned interview to a .csv file with the same name as the variable.  Remember to set row.names to false.
  8. Repeat the data cleaning for each of your interviews.
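
The overall shape of steps 3 to 7 might look something like the sketch below. It is not the code from the Intro to Topic Modeling notebook: the cleaning lines use base R's tolower() and gsub() as a stand-in for whatever that notebook does, and the interview, its column names (speaker, text), and the speaker labels are all hypothetical.

```r
# Hypothetical interview; "speaker" and "text" are assumed column names.
Shrout <- data.frame(
  speaker = c("Interviewer", "Narrator", "Interviewer", "Narrator"),
  text = c("Where were you born?", "I was born in Maine, in 1950!",
           "And your family?", "We had a small farm..."),
  stringsAsFactors = FALSE
)

# Keep only the rows that are not questions asked by the interviewer.
Shrout_answers <- Shrout[which(Shrout$speaker != "Interviewer"), ]

# Start the "clean" copy, with an answer column to fill in the loop.
Shrout_clean <- Shrout_answers
Shrout_clean$answer <- 0

for (i in 1:nrow(Shrout_clean)) {
  clean_row <- tolower(Shrout_clean$text[i])         # lower-case the text
  clean_row <- gsub("[[:punct:]]", "", clean_row)    # remove punctuation
  clean_row <- gsub("[[:digit:]]", "", clean_row)    # remove digits
  clean_row <- trimws(gsub("\\s+", " ", clean_row))  # collapse extra spaces
  Shrout_clean$text[i] <- clean_row                  # store the cleaned row
  Shrout_clean$answer[i] <- i                        # iteration number
}

# Written to a temporary folder here; in your notebook you would write
# to "Shrout_clean.csv" instead.
out_file <- file.path(tempdir(), "Shrout_clean.csv")
write.csv(Shrout_clean, out_file, row.names = FALSE)
```

Checking Shrout_clean after the loop (step 4) is a quick way to confirm that only answers remain and that each has its own number.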

Sentiment Analysis

  1. Adapt the code from the From Text to Words section of the Intro to Text Analysis notebook to extract all of the words in the first row of your cleaned first interview
  2. Use the function to convert your list into a dataframe
  3. Create a variable containing all of the afinn sentiment words (the Intro to Text Analysis notebook will help you here)
  4. Merge the dataframe containing all of the words with the dataframe containing all of the sentiments.
  5. Enter the score column of the resulting dataframe in sum() to get the sum of sentiment scores. (You might get 0)
  6. Add that score to the data frame containing your first interview.  Remember that you can add a new column to a dataframe simply by using $ and the name of your new column.  Remember that you specify a row in a dataframe’s column by using []
  7. Write a for-loop that does steps 1-6 for each row of your cleaned first interview.
  8. Run a similar for-loop for your other two interviews.
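
The logic of steps 1 to 7 can be sketched in base R as follows. Several things here are assumptions made for illustration: the words are split with strsplit() rather than with the code from the Intro to Text Analysis notebook, data.frame() is used for the list-to-dataframe step, and toy_sentiments is a three-word toy lexicon standing in for the afinn words, not the real list.

```r
# Hypothetical cleaned interview with one numbered answer per row.
Shrout_clean <- data.frame(
  text = c("we had a good happy home", "the winter was bad", "we moved to town"),
  answer = 1:3,
  stringsAsFactors = FALSE
)

# Toy lexicon standing in for afinn; words and scores are illustrative only.
toy_sentiments <- data.frame(
  word = c("good", "happy", "bad"),
  score = c(3, 3, -3),
  stringsAsFactors = FALSE
)

# New column to hold the sentiment score of each answer.
Shrout_clean$sentiment <- 0

for (i in 1:nrow(Shrout_clean)) {
  # Split the row into words and store them in a one-column dataframe.
  words <- strsplit(Shrout_clean$text[i], " ")[[1]]
  words_df <- data.frame(word = words, stringsAsFactors = FALSE)
  # merge() keeps only the words that appear in both dataframes.
  scored <- merge(words_df, toy_sentiments, by = "word")
  # sum() of an empty column is 0, so rows with no matches score 0.
  Shrout_clean$sentiment[i] <- sum(scored$score)
}

Shrout_clean$sentiment
```

The third row contains no words from the toy lexicon, which is the situation in which step 5 gives you 0.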

Topic Modeling – Topics

  1. Here, we are going to replicate and build on the topic modeling notebook.  Start by creating a new variable to read the lines of your transcript into (if I were looking at an interview called “Shrout_Anelise.csv” I would call this variable Shrout_lines).  Use the function readLines(), and enter in the parentheses the name of the first .csv file that you created in the Cleaning Data section of this notebook.
  2. Create a corpus variable (e.g. Shrout_corpus) using the nested functions Corpus(VectorSource()).  Enter in the parentheses the lines variable you created above.
  3. Use that same corpus variable, and feed into it the function tm_map() in order to get the “stems” of words in your interview. Use the Calling and formatting data for topic modeling section of the Intro to Topic Modeling notebook as a template.
  4. Create a DTM (document term matrix) variable (e.g. Shrout_DTM) and use it to hold the output of the DocumentTermMatrix() function. As input, use the corpus variable you created and refined above.
  5. Create a topics variable (e.g. Shrout_topics) to hold the output of the LDA() function, using the same format as in the Calling and formatting data for topic modeling section of the Intro to Topic Modeling notebook.
  6. (This is new).  Use the function terms() to see the terms associated with each topic.  The function terms takes as inputs the topics variable you created above, and the number of terms you want to see.  Start with 5. For example, if I were looking at Shrout_topics, I would use:
    terms(Shrout_topics, 5)
  7. Create a new terms variable (e.g. Shrout_terms) to store that term list.
  8. Use as.matrix() to convert that term list to a matrix. For example, if I were converting Shrout_terms, I would use:
    Shrout_terms <- as.matrix(Shrout_terms)
  9. Use the t() function to transpose your newly created matrix.  This makes rows into columns and columns into rows.  We are doing this because the output of terms() is five columns, with each row of each column a word belonging to the topic.  We want each topic to be its own row.
  10. Then use to convert the term list to a dataframe, for example:
    Shrout_terms <-
  11. Create a new column in this terms dataframe called “topic”. Then fill this whole column with 0s.  (You can do this by specifying the dataframe$column, and using <- 0)
  12. Write a for-loop that writes 1 into the first row of this new column, 2 into the second row, 3 into the third row, 4 into the fourth row and 5 into the fifth row.
  13. Repeat these steps for each of your other two interviews. Look at each set of topic words when you finish.
  14. In the Analysis section you created at the start of this assignment, create a sub-section for each interview. In each sub-section, write a few sentences describing what you think the interview is about.  Then make a numbered list of each topic for each interview, and describe what you think each topic signifies.
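
Steps 8 to 12 reshape the term list, and that reshaping can be practiced without running LDA() at all. In the sketch below, the matrix is a hand-made stand-in for the output of terms(Shrout_topics, 5), with invented words, and as.data.frame() is an assumption about how you might do the conversion in step 10.

```r
# Stand-in for terms(Shrout_topics, 5): one column per topic, one row per
# term. The words are made up for illustration.
Shrout_terms <- matrix(
  c("farm", "cow", "hay", "barn", "milk",
    "school", "teacher", "book", "class", "read",
    "mill", "work", "shift", "pay", "boss",
    "church", "sunday", "choir", "sing", "pray",
    "war", "navy", "ship", "letter", "home"),
  nrow = 5, ncol = 5,
  dimnames = list(NULL, paste("Topic", 1:5))
)

Shrout_terms <- as.matrix(Shrout_terms)  # already a matrix here, so no change
Shrout_terms <- t(Shrout_terms)          # transpose: now one row per topic
Shrout_terms <- as.data.frame(Shrout_terms, stringsAsFactors = FALSE)

# Add a "topic" column of 0s, then number the rows with a for-loop.
Shrout_terms$topic <- 0
for (i in 1:nrow(Shrout_terms)) {
  Shrout_terms$topic[i] <- i
}

Shrout_terms
```

After the transpose, each row holds the five words for one topic, and the topic column gives that row a number you can merge on later.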

Topic Modeling – Merging topics, terms and interviews

  1. We have a list of words that goes with each topic.  Now we need to know which topic is most prominent in each row (answer) of each interview.  Start with your first interview. First we use the function topics() to list all of the topics associated with each row of our interview dataframe.  As an input, use the topics variable you created to hold the output of LDA(). If I were working with the Shrout interview, I would use:
  2. Store the output from the function above in an answers variable (e.g. Shrout_topics_answers). Then convert that variable into a dataframe.
  3. Create a new column in this topics dataframe called “answer”. Then fill this whole column with 0s.  (You can do this by specifying the dataframe$column, and using <- 0)
  4. Write a for-loop that writes 1 into the first row of this new column, 2 into the second row, 3 into the third row, 4 into the fourth row and 5 into the fifth row etc.
  5. Look at the names of the columns of the terms variable and the answers variable you just created. Then create a new variable that contains the merge of these two variables.  The information that these two dataframes share in common is the topic number.
  6. Check the resulting dataframe.
  7. Then, merge the dataframe you created with the “clean” interview dataframe you created in the Cleaning Data section of this assignment.  The information that these two dataframes share in common is the answer number.
  8. Write the resulting dataframe to a new .csv file (call it something new – e.g. Shrout_sentiment_topic.csv)
  9. Repeat the sentiment analysis and merging for your remaining two interviews.  You do not have to copy all of the code – just copy the cells that are doing new work and change the interview names.
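
The two merges in steps 5 and 7 can be sketched with small stand-in dataframes. Everything below is hypothetical: the vector stands in for the output of topics(), the terms and clean dataframes are cut-down versions of what the earlier sections produce, and the column names (topic, answer, V1, V2) are assumptions. With real output you may need to rename the column that holds the topic number so that it matches between the two dataframes.

```r
# Stand-in for the output of topics(): the most prominent topic per answer.
Shrout_topics_answers <- c(2, 1, 2)
Shrout_topics_answers <- data.frame(topic = Shrout_topics_answers)

# Add an "answer" column of 0s, then number the rows with a for-loop.
Shrout_topics_answers$answer <- 0
for (i in 1:nrow(Shrout_topics_answers)) {
  Shrout_topics_answers$answer[i] <- i
}

# Stand-in terms dataframe (one row per topic) with made-up words.
Shrout_terms <- data.frame(
  topic = 1:2,
  V1 = c("farm", "school"),
  V2 = c("cow", "teacher"),
  stringsAsFactors = FALSE
)

# Stand-in clean interview with sentiment scores.
Shrout_clean <- data.frame(
  answer = 1:3,
  text = c("we had a good happy home", "the winter was bad", "we moved to town"),
  sentiment = c(6, -3, 0),
  stringsAsFactors = FALSE
)

# First merge on the shared topic number, then on the shared answer number.
Shrout_merged <- merge(Shrout_topics_answers, Shrout_terms, by = "topic")
Shrout_sentiment_topic <- merge(Shrout_merged, Shrout_clean, by = "answer")

# Written to a temporary folder here; use your own file name in the notebook.
out_file <- file.path(tempdir(), "Shrout_sentiment_topic.csv")
write.csv(Shrout_sentiment_topic, out_file, row.names = FALSE)
```

Each row of the result now carries the answer number, its topic, that topic's words, the cleaned text, and the sentiment score.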


  1. Download each of the sentiment/topic .csv files you just created.
  2. Look over each, and see if any particular topics or clusters of topics are associated with high or low sentiment scores.
  3. In the Analysis section, write one paragraph for each interview describing your findings.  You must reference at least three readings in the Analysis section.
