Scaffolding # 5 – Final Project Idea

This is your opportunity to select, and set the parameters for, your final project.  Work through this in the following order:

  1. Review the comments you received from me on your Scaffolding # 4 assignment
  2. Look over the data that we have been working with (consult the README and also look at the data tables themselves)
  3. Think about the kind of work that will be required to complete the final project (full project description here)
  4. In light of the restrictions of the data, your time, and your interests, pick one of the ideas you submitted as part of Scaffolding # 4
  5. Revise your question, and your discussion of the data cleaning and additional research you will have to undertake.
  6. Submit your revised idea (section A) (section B)

Practice Exercise # 6 – Data Visualization

This exercise brings together what we have learned about data visualization.  You will be creating your own notebook from scratch, calling data, manipulating that data, and visualizing it.  Please submit the notebook you create via this Lyceum link (section A) (section B) by the start of class on November 27th.

Steps:

Review Best Practices. As a class, we came up with the following best practices for data visualization.  Read over them, and make sure that you have a good sense of what they would mean for your own visualizations:

  • Acknowledge the source of your data
  • Understand the story behind your data
  • Understand what you’re looking for
  • Make sure the visualization is clear
  • Consider your audience
  • Use good labels, axis titles, and colors
  • Use trend lines appropriately
  • Make sure that your data is sufficient and representative
  • Make sure your visualization is accessible
  • Make sure your visualization is interactive (where appropriate)
  • Make sure that you use frameworks that your audience is familiar with

Set up

  1. Create a new notebook.  Name it PE6 (do not include your name).  You will be downloading and uploading this notebook.
  2. Title your notebook. Make your first cell into a markdown cell (hit escape and then m) and then enter one hashtag – this makes the cell into a title cell.  Write a placeholder title (you’ll come back at the end and change the title). Run the cell.
  3. Title your sections. Repeat the steps above for five new cells but use two hashtags (this creates smaller title text).  Name them: Analysis, Packages, Calling Data, First Viz, Second Viz

Packages

  1. Load the package “ggplot2” by running library("ggplot2").  YOU DO NOT NEED TO RUN install.packages().

Calling Data

  1. Call in each of the tables we have been working with: person, work, family, factory.  Save each in a different variable.
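
For example, with all-caps placeholders standing in for the real URLs (use the paths given in the README):

    person <- read.csv("PERSON-TABLE-URL", header = TRUE)
    work <- read.csv("WORK-TABLE-URL", header = TRUE)
    family <- read.csv("FAMILY-TABLE-URL", header = TRUE)
    factory <- read.csv("FACTORY-TABLE-URL", header = TRUE)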

First Viz

  1. Take some time to re-familiarize yourself with this data.  Look at the README if you need to.
  2. Take some time to familiarize yourself with the kinds of visualizations we created in the Challenges of Data Viz notebook
  3. Come up with ONE question about the relationship between two numerical columns and ONE categorical column from TWO different tables.
  4. Merge those tables.
  5. Remove the NAs from the columns you are interested in.
  6. Consider the best practices for dataviz outlined above, and make a visualization, using one of the methods we used in the Challenges of Data Viz notebook.  You will be plotting the two numerical columns, and color-coding according to the categorical column (see the sketch after this list).
  7. Title your first viz section (something other than first viz)
  8. In your Analysis section, write a paragraph outlining what the visualization is meant to show, and how you designed it in keeping with best practices.
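
A minimal sketch of steps 4–6.  Every table and column name here (person, work, person_id, age, wage, country) is a placeholder – substitute the tables and columns that your own question uses:

    merged <- merge(person, work, by = "person_id")                        # step 4
    complete <- merged[which(!is.na(merged$age) & !is.na(merged$wage)), ]  # step 5
    ggplot(complete, aes(x = age, y = wage, color = country)) +            # step 6
      geom_point() +
      labs(title = "Wage by age and country of origin", x = "Age", y = "Wage")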

Second Viz

  1. Look at the additional types of visualizations available in ggplot2 using this ggplot2 cheat sheet
  2. Pick a visualization that we have not used already.
  3. Using different variables, repeat the steps above.

Analysis

  1. Write an introduction that explains any insights that you gained from your data, and any new questions that your visualizations raised.
  2. Illustrate your insights using your images.
  3. Create a witty and/or informative title for your notebook.

Final Project

Your final project will be a public-facing website.  You should build it on a subdomain different from the one you use for your class blog.  You should submit your website to me via e-mail by the end of finals week.  Your website should include the following components:

  • The questions you sought to answer about your data.
  • Discussion of how the data was produced (both the collaborative work done by the class, and the cleaning and additional research that you undertook) (~250 words + at least 3 citations)
  • Any context that you think your viewer will need to understand the data (~250 words + at least 1 citation)
  • Ethical concerns you grappled with and addressed while working with the data (~250 words + at least 3 citations)
  • Analysis – this should include the code that you wrote, in the form of an embedded notebook, and a discussion of why you chose the methods you did (Code + ~250 words + at least 3 citations)

Practice Exercise # 5 – Networks

This exercise brings together what we have learned about networks so far, and builds on some of the skills from Practice Exercise # 3.  You will be creating your own notebook from scratch, calling data, manipulating that data, and analyzing it.  Please submit the zipped folder containing the notebook and the 7 .png images you create via this Lyceum link (section A) (section B) by the start of class on November 13th.

Steps:

Set up

  1. Create a new folder.  Name it PE5 (do not include your name).  You will be downloading, zipping, and uploading this folder.
  2. Create a new notebook in this new folder. Name it DSC_104_PE5 (do not include your name – I will grade anonymously and then return your notebooks)
  3. Title your notebook. Make your first cell into a markdown cell (hit escape and then m) and then enter one hashtag – this makes the cell into a title cell.  Write a placeholder title (you’ll come back at the end and change the title). Run the cell.
  4. Title your sections. Repeat the steps above for seven new cells but use two hashtags (this creates smaller title text).  Name them: Analysis, Packages, Calling Data, Constructing Networks, Analyzing Networks, Visualizing Networks, Text networks

Installing Packages

  1. Install the package “igraph”
  2. Load the package “igraph”
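
In code, that is one cell for the one-time install and one for the load:

    install.packages("igraph")   # run once
    library(igraph)              # run every time you restart the notebook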

Calling Data

  1. Download the .csv file people_family.csv, and enter a 1 in each cell where the last name of the person in the row matches the last name in the column.  (This assumes that families are related by both marriage and birth)
  2. Save that file and upload it to your notebook environment
  3. Load people_family.csv as a matrix
  4. Load people_work.csv as a matrix (use the link from the Network Analysis notebook)
  5. Load the attributes data file
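
A minimal sketch of this section.  The file names in quotes are placeholders (use the people_work.csv link from the Network Analysis notebook, and whatever the attributes file is actually called), and row.names = 1 assumes the first column of each matrix file holds people's names:

    people_family <- as.matrix(read.csv("people_family.csv", header = TRUE, row.names = 1))
    people_work <- as.matrix(read.csv("PEOPLE-WORK-URL", header = TRUE, row.names = 1))
    attributes <- read.csv("ATTRIBUTES-FILE-URL", header = TRUE)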

Constructing Networks

  1. Convert each matrix into a matrix that shows connections between people (use the code from the Network Analysis notebook)
  2. Create a new variable, and add the matrices together (just use a +)
  3. Set the diagonals of this new matrix to NA
  4. Create a network graph object based on this new matrix
  5. Set the diagonals of your people_work matrix to NA
  6. Create a network graph object based on your people_work matrix
  7. Set the diagonals of your people_family matrix to NA
  8. Create a network graph object based on your people_family matrix
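
A minimal sketch of this section, with my own variable names.  Step 1's conversion should come from the Network Analysis notebook; multiplying each matrix by its transpose, as below, is the usual pattern:

    family_ties <- people_family %*% t(people_family)   # person-to-person family connections
    work_ties <- people_work %*% t(people_work)         # person-to-person work connections
    both_ties <- family_ties + work_ties                # step 2
    diag(both_ties) <- 0                                # step 3 asks for NA; 0 also removes
                                                        # self-connections and keeps the matrix
                                                        # safe to pass to igraph
    both_g <- graph_from_adjacency_matrix(both_ties, mode = "undirected",
                                          weighted = TRUE, diag = FALSE)
    # repeat the diagonal and graph steps for work_ties and family_ties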

Analyzing Networks

  1. Calculate the betweenness for each network. Make sure that you assign names to your statistics using V().
  2. Make note of differences among your three networks
  3. Calculate the centrality for each network.  Make sure that you assign names to your statistics using V().
  4. Make note of differences among your three networks.
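
A sketch for the combined network (repeat for the work and family networks).  I use degree() as the centrality measure in step 3, but use whichever measure the Network Analysis notebook used; the names come through only if your matrices carried people's names as row names:

    both_between <- betweenness(both_g)
    names(both_between) <- V(both_g)$name   # attach people's names to the scores
    both_degree <- degree(both_g)
    names(both_degree) <- V(both_g)$name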

Visualizing Networks

  1. Visualize each network (use the code from the Network Analysis notebook).  Save each into a different .png file
  2. Look at each network, make note of differences
  3. Add color according to country of origin. Instead of using names for colors, go to http://colorbrewer2.org/, set the number of data classes to 3 (one for each country – USA, Canada and Greece) and set the nature of your data to qualitative.  Pick one of the color schemes that you like, and enter the hex code (it will look something like #7fc97f) in place of “red”, “blue” and “green” in the Network Analysis notebook.
  4. We are going to add one more dimension, which is strength of connection.  Create a new variable that will capture the strength of the relationship between two nodes (in network parlance, this is known as edge.weight).  Use the code below, but put the name of the graph object that holds the network that shows both family and work connections:
    YOURNEWVARIABLE <- get.edge.attribute(GRAPH-REPLACE-THIS-TEXT, "weight")
  5. Now, run the code that makes your network visualization again, but replace
    edge.width = 0.25

    with

    edge.width = YOURNEWVARIABLE
  6. Make note of any changes.
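
Putting the coloring and edge-width steps together, a sketch.  The hex codes are one example scheme from colorbrewer2.org, the country column name is an assumption, and the code assumes the rows of your attributes table are in the same order as the network's vertices:

    colors <- c("#7fc97f", "#beaed4", "#fdc086")[as.integer(as.factor(attributes$country))]
    both_weight <- get.edge.attribute(both_g, "weight")
    png("both_network.png")   # everything plotted before dev.off() goes into the .png
    plot(both_g, vertex.color = colors, edge.width = both_weight)
    dev.off()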

Cleaning your notebook

  1. Before you submit your notebook, go back and delete cells with extraneous information or attempts that did not work.
  2. If you feel like you need to explain decisions or flag problem points, do so now.

Analysis 

  1. Write an introduction that explains any insights that you gained from your data, and any new questions that your visualizations raised.  Illustrate your insights using your images.  You can insert images into your notebook using the following syntax (replace image.png with the file name of the image you want to insert):
    ![the title for your image](image.png)
  2. Create a witty and/or informative title for your notebook.

Practice Exercise # 4 – Text Analysis

This exercise brings together what we have learned about text analysis so far, and builds on some of the skills from Practice Exercise # 3.  You will be creating your own notebook from scratch, calling data, manipulating that data, and analyzing it.  Please submit the notebook, and the .csv files you created in parts 3 and 7 via this Lyceum link (section A) (section B) by the start of class on November 6th.

Steps:

Set up

  1. Create a new notebook. Name it DSC_104_PE4 (do not include your name – I will grade anonymously and then return your notebooks)
  2. Title your notebook. Make your first cell into a markdown cell (hit escape and then m) and then enter one hashtag – this makes the cell into a title cell.  Write a placeholder title (you’ll come back at the end and change the title). Run the cell.
  3. Title your sections. Repeat the steps above for six new cells but use two hashtags (this creates smaller title text).  Name them: Analysis, Packages, Calling Data, Cleaning Data, Sentiment Analysis, and Topic Modeling

Installing Packages

  1. In the Packages section, use the install.packages() function and the library() function to install and load the following packages.  Make sure that each install.packages() function and each library() function are in different cells. (So you should have eight cells in total)
    1. tm
    2. tidyverse
    3. tidytext
    4. topicmodels
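
For tm, for example, the first two of those eight cells would be:

    install.packages("tm")
    library(tm)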

Calling Data

  1. In the Calling Data section, first create a code cell and run the code:
    sample(1:121, 3)

    The sample() function takes as inputs a range of numbers from which to randomly select values, and the number of values to select. We are using it here to randomly pick which oral histories you will work with in this assignment.

  2. Then, create a variable called file_names to hold the CSV located at http://shroutdocs.org/notebooks/DCS104_f2018/file_names.csv .  This file contains a list of oral history files in which each row is a question or answer. One column contains the number of the file, the other contains the name of the file.
  3. Determine the names of the columns in file_names
  4. Determine the names of the interviews associated with the random numbers you selected. (This means using the $ to identify a column and [] to identify a row)
  5. Call each interview and store it in a different variable.  The interview .csv files are in the folder http://shroutdocs.org/notebooks/DCS104_f2018/All_csv_cleaned/.  So if I were looking for a file called “Shrout_Anelise.csv”, I would run
    Shrout <- read.csv("http://shroutdocs.org/notebooks/DCS104_f2018/All_csv_cleaned/Shrout_Anelise.csv", header = TRUE)
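
For steps 3 and 4, a sketch; the column called name is an assumption, so check what step 3 actually returns:

    names(file_names)     # step 3: the column names
    file_names$name[42]   # step 4: the file name in row 42; repeat for each random number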

Cleaning Data

  1. Consider your first interview.  Determine the names of the columns in the dataframe.
  2. Use summary() to look at only the column that references which person is speaking
  3. Create a new variable that holds only the answers (not the questions) from the interview, using which() to remove all of the rows that are questions asked by the interviewer.  You might need to look at the data to do this (see the sketch after this list).
  4. Check to make sure you have kept the answers.
  5. Adapt the code from the Cleaning the data section of the Intro to Topic Modeling notebook to clean each row of this new dataframe. This will require the creation of a variable that holds the “clean” version of your text (e.g. Shrout_clean), a variable that contains a new cleaned row of text, and the adaptation of a for-loop.
  6. Add a line in your for-loop that creates a new cell in a new column (call it something like answer), and fills that new cell with the number of the iteration of the loop. (This means that each row of cleaned text will have a number associated with it).
  7. Write the resulting cleaned interview to a .csv file with the same name as the variable.  Remember to set row.names to false.
  8. Repeat the data cleaning for each of your interviews.
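
A minimal sketch of steps 3 through 7.  The speaker and text column names and the "Q" value are all assumptions – check your own data – and the two cleaning functions stand in for whatever the Intro to Topic Modeling notebook actually does:

    Shrout_answers <- Shrout[which(Shrout$speaker != "Q"), ]   # step 3: keep only answers
    Shrout_clean <- Shrout_answers
    Shrout_clean$text <- as.character(Shrout_clean$text)       # avoid factor trouble in older R
    Shrout_clean$answer <- 0
    for (i in 1:nrow(Shrout_clean)) {
      clean_row <- removePunctuation(tolower(Shrout_clean$text[i]))  # stand-in cleaning steps
      Shrout_clean$text[i] <- clean_row
      Shrout_clean$answer[i] <- i                              # step 6: number each answer
    }
    write.csv(Shrout_clean, "Shrout_clean.csv", row.names = FALSE)   # step 7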

Sentiment Analysis

  1. Adapt the code from the From Text to Words section of the Intro to Text Analysis notebook to extract all of the words in the first row of your cleaned first interview
  2. Use the function as.data.frame() to convert your list into a dataframe
  3. Create a variable containing all of the afinn sentiment words (the Intro to Text Analysis notebook will help you here)
  4. Merge the dataframe containing all of the words with the dataframe containing all of the sentiments.
  5. Enter the score column of the resulting dataframe in sum() to get the sum of sentiment scores. (You might get 0)
  6. Add that score to the data frame containing your first interview.  Remember that you can add a new column to a dataframe simply by using $ and the name of your new column.  Remember that you specify a row in a dataframe’s column by using []
  7. Write a for-loop that does steps 1-6 for each row of your cleaned first interview.
  8. Run a similar for-loop for your other two interviews.
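
A minimal sketch of the whole loop (steps 1–7).  It assumes afinn was loaded as in the Intro to Text Analysis notebook, with columns word and score (newer versions of the lexicon call the score column value), and that your cleaned interview has the text column assumed above:

    Shrout_clean$sentiment <- 0
    for (i in 1:nrow(Shrout_clean)) {
      words <- as.data.frame(strsplit(Shrout_clean$text[i], " ")[[1]])  # steps 1-2
      names(words) <- "word"
      scored <- merge(words, afinn, by = "word")        # step 4
      Shrout_clean$sentiment[i] <- sum(scored$score)    # steps 5-6: may well be 0
    }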

Topic Modeling – Topics

  1. Here, we are going to replicate and build on the topic modeling notebook.  Start by creating a new variable to read the lines of your transcript into (if I were looking at an interview called “Shrout_Anelise.csv” I would call this variable Shrout_lines).  Use the function readLines(), and enter in the parentheses the name of the first .csv file that you created in the Cleaning Data section of this notebook.
  2. Create a corpus variable (e.g. Shrout_corpus) using the nested functions Corpus(VectorSource()).  Enter in the inner parentheses the lines variable you created above.
  3. Use that same corpus variable, and feed into it the function tm_map() in order to get the “stems” of words in your interview. Use the Calling and formatting data for topic modeling section of the Intro to Topic Modeling notebook as a template.
  4. Create a DTM (document term matrix) variable (e.g. Shrout_DTM) and use it to hold the output of the DocumentTermMatrix() function. As input, use the corpus variable you created and refined above.
  5. Create a topics variable (e.g. Shrout_topics) to hold the output of the LDA() function, using the same format as in the Calling and formatting data for topic modeling section of the Intro to Topic Modeling notebook.
  6. (This is new).  Use the function terms() to see the terms associated with each topic.  The function terms takes as inputs the topics variable you created above, and the number of terms you want to see.  Start with 5. For example, if I were looking at Shrout_topics, I would use:
    terms(Shrout_topics, 5)
  7. Create a new terms variable (e.g. Shrout_terms) to store that term list.
  8. Use as.matrix() to convert that term list to a matrix. For example, if I were converting Shrout_terms, I would use:
    Shrout_terms <- as.matrix(Shrout_terms)
  9. Use the t() function to transpose your newly created matrix.  This makes rows into columns and columns into rows.  We are doing this because the output of terms() is five columns, with each row of each column a word belonging to the topic.  We want each topic to be its own row.
  10. Then use as.data.frame() to convert the term list to a dataframe, for example:
    Shrout_terms <- as.data.frame(Shrout_terms)
  11. Create a new column in this terms dataframe called “topic”. Then fill this whole column with 0s.  (You can do this by specifying the dataframe$column, and using <- 0)
  12. Write a for-loop that writes 1 into the first row of this new column, 2 into the second row, 3 into the third row, 4 into the fourth row and 5 into the fifth row.
  13. Repeat these steps for each of your other two interviews. Look at each set of topic words when you finish.
  14. In the Analysis section you created at the start of this assignment, create a sub-section for each interview. In each sub-section, write a few sentences describing what you think the interview is about.  Then make a numbered list of each topic for each interview, and describe what you think each topic signifies.
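
For steps 9 through 12, continuing the Shrout example:

    Shrout_terms <- t(Shrout_terms)               # step 9: topics become rows
    Shrout_terms <- as.data.frame(Shrout_terms)   # step 10
    Shrout_terms$topic <- 0                       # step 11
    for (i in 1:nrow(Shrout_terms)) {             # step 12
      Shrout_terms$topic[i] <- i
    }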

Topic Modeling – Merging topics, terms and interviews

  1. We have a list of words that goes with each topic.  Now we need to know which topic is most prominent in each row (answer) of each interview.  Start with your first interview. First we use the function topics() to list all of the topics associated with each row of our interview dataframe.  As an input, use the topics variable you created to hold the output of LDA(). If I were working with the Shrout interview, I would use:
    topics(Shrout_topics)
  2. Store the output from the function above in an answers variable (e.g. Shrout_topics_answers). Then convert that variable into a dataframe.
  3. Create a new column in this topics dataframe called “answer”. Then fill this whole column with 0s.  (You can do this by specifying the dataframe$column, and using <- 0)
  4. Write a for-loop that writes 1 into the first row of this new column, 2 into the second row, 3 into the third row, 4 into the fourth row and 5 into the fifth row etc.
  5. Look at the names of the columns of the terms variable and the answers variable you just created. Then create a new variable that contains the merge of these two variables.  The information that these two dataframes share in common is the topic number.
  6. Check the resulting dataframe.
  7. Then, merge the dataframe you created with the “clean” interview dataframe you created in the Cleaning Data section of this assignment.  The information that these two dataframes share in common is the answer number.
  8. Write the resulting dataframe to a new .csv file (call it something new – e.g. Shrout_sentiment_topic.csv)
  9. Repeat the sentiment analysis and merging for your remaining two interviews.  You do not have to copy all of the code – just copy the cells that are doing new work and change the interview names.
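
A sketch of the whole section, continuing the Shrout example.  The merge columns (topic and answer) assume the dataframes came out as in the earlier sketches, so check names() on each before merging:

    Shrout_topics_answers <- as.data.frame(topics(Shrout_topics))   # steps 1-2
    names(Shrout_topics_answers) <- "topic"
    Shrout_topics_answers$answer <- 0                               # step 3
    for (i in 1:nrow(Shrout_topics_answers)) {                      # step 4
      Shrout_topics_answers$answer[i] <- i
    }
    topics_terms <- merge(Shrout_topics_answers, Shrout_terms, by = "topic")   # step 5
    Shrout_full <- merge(topics_terms, Shrout_clean, by = "answer")            # step 7
    write.csv(Shrout_full, "Shrout_sentiment_topic.csv", row.names = FALSE)    # step 8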

Analysis

  1. Download each of the sentiment/topic .csv files you just created.
  2. Look over each, and see if any particular topics or clusters of topics are associated with high or low sentiment scores.
  3. In the Analysis section, write one paragraph for each interview describing your findings.  You must reference at least three readings in the Analysis section.

Scaffold #4 – Final Project Ideas

This assignment moves us towards the final project.

First, come up with three questions about the relationship between different aspects of our data.  For example: What was the relationship between an interviewee’s national background and the positivity with which they spoke about their work in Lewiston?


Then, for each question, identify:

  • At least two tools or methods that we have covered so far that will help you to explore this question
  • One additional kind of data that you would require to satisfactorily answer this question.
  • One aspect of the data that we already have that would need to be better cleaned or systematized to satisfactorily answer this question.

If you have identified questions that do not require additional data and cleaning, refine and expand your questions.


DO NOT include your name or identifying information in your submission. I will grade anonymously and then release the grades.

Three ideas are due by the start of class on October 30th.  Submit here: (Section A) (Section B)

Practice Exercise #3 – Quantification

This exercise brings together what we have learned about quantification so far.  You will be creating your own notebook from scratch, calling data, manipulating that data, and analyzing it.  Please submit the notebook via this Lyceum link (section A) (section B) by the start of class on October 25th.

Note: Since you all are cleaning the data, you will not be able to start this lab until 5 PM on Tuesday the 16th.

Steps:

  1. Set up
    1. Create a new notebook. Name it DSC_104_PE3 (do not include your name – I will grade anonymously and then return your notebooks)
    2. Title your notebook. Make your first cell into a markdown cell (hit escape and then m) and then enter one hashtag – this makes the cell into a title cell.  (For more on markdown syntax, look here) Write a placeholder title (you’ll come back at the end and change the title). Run the cell.
    3. Title your sections. Repeat the steps above for six new cells but use two hashtags (this creates smaller title text).  Name them: Introduction, Calling Data, Exploratory Statistics, Correlation, Categorical Variables, and Regression.
    4. Look over the README I created for our data.  It describes each table, gives the URL that holds that table, and describes each column in the table.
  2. Calling Data
    1. Use the function library("psych") to load the psych library of functions.
    2. In the Calling Data section, name variables and import each .csv mentioned in the README.  Use the read.csv() function, which takes the path to the file (in this case, the URL of the file) and header = TRUE (so that we have column titles)
  3. Exploratory Statistics
    1. In the Exploratory Statistics section, create a markdown cell, and write up two questions that you think this data can answer.  These should be about the relationship between different columns in the data (i.e. What is the relationship between date of birth and country of origin? – but don’t use that one, since we’ve already covered it in practice.)  The first one should ask a question about the relationship between columns from different tables. The second one should ask about how three or more columns relate to each other.
    2. Create another cell in the Exploratory Statistics section and use summary() and class() to gather information about each of the columns you mentioned in your question. Remember that you will want to identify a variable and a column in that variable using the syntax variable$column
    3. Create a final markdown cell in the Exploratory Statistics section and describe any insights.
  4. Correlation
    1. Look at your questions, and the tables they reference.  Make a plan for merging those tables.  Remember that the merge() function takes in the two data frames that are being merged, as well as the column in each that contains the information that bridges both.  Create a markdown cell in the Correlation section and write up your merging plan.
    2. In a cell below the one you just created, merge the tables that you identified.  Make sure to create new variables to hold those merged tables. Create a new cell to view your tables to make sure they are what you want them to be.
    3. Your next steps here will be determined by whether the questions you posed in step 3.1 require the comparison of columns with numerical data.  If so, use those columns, if not, pick two other numerical columns from your merged table.
    4. First use which() inside [] to create a new data frame in a new variable that DOES NOT have any NAs in the columns you wish to compare (see the sketch after these steps).
    5. Then, use as many cells as you need and the cor() function to explore correlations between the columns you identified.
    6. Add a markdown cell after each code cell to explain what the resulting correlation tells you about your questions.
  5. Categorical Variables
    1. Create a new code cell in the Categorical Variables section. If your questions in step 3.1 required you to explore relationships between categorical columns, use the table() function to first create variables that contain count tables. If your questions in step 3.1 do not require you to explore relationships between categorical columns, pick two categorical columns from your merged table.
    2. For each table, produce a pretty table using ftable() and a proportion table using prop.table()
    3. For each table, use chisq.test() to determine whether the distribution of data in that table is likely due to random chance or not.
    4. Add a markdown cell after each table to explain what the resulting test tells you about your questions.
  6. Regression
    1. Return to the question about the relationship between multiple columns that you posed in step 3.1.  In a new markdown cell in the Regression section, make a list of the columns you are interested in, along with whether they are categorical or numerical. Pick one of the numerical variables and identify it as your dependent variable.
    2. Subset your merged data so that each of the columns ONLY contains data that is NOT NA.
    3. Create a set of column titles and subset your data so that you have a new variable containing a new data frame that only has the columns you want.
    4. Create a markdown cell and speculate about how your independent variables might contribute to your dependent variable.
    5. Use as.data.frame(dummy.code()) to create a new variable and a new data frame that contains dummy codes for your categorical variable.  If you have more than one categorical variable, you will need to create more than one dummy code data frame.
    6. Look at the table(s) you just created.  Use the syntax data$new_column <- dummy_data$one_column to add all but one of the dummy code columns to your data frame.
    7. Create a linear regression model using the lm() function.  Refer to the regression example for syntax.
    8. Use summary() to assess which variables contribute to your model.
    9. Create the model again, dropping the variables that were not significant (i.e. those without stars)
    10. Create a markdown cell describing what this model tells you about the relationship between different columns.
  7. (Finally) Introduction and title
    1. Return to the top of your notebook and write up a paragraph summarizing your findings.  In this paragraph, reference AT LEAST three of the readings we have completed this semester.
    2. Add a catchy title.
    3. Download your notebook (file > download as > Notebook (.ipynb) ) and upload it.
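
A minimal sketch of the merge, NA-filter, correlation, and regression steps referenced above.  Every table, column, and dummy-code name here (person, work, person_id, age, wage, country, Canada, Greece) is a placeholder – substitute whatever your own questions call for:

    merged <- merge(person, work, by = "person_id")                         # step 4.2
    complete <- merged[which(!is.na(merged$age) & !is.na(merged$wage)), ]   # steps 4.4 / 6.2
    cor(complete$age, complete$wage)                                        # step 4.5
    country_dummies <- as.data.frame(dummy.code(complete$country))          # step 6.5
    complete$canada <- country_dummies$Canada       # step 6.6: all but one dummy column
    complete$greece <- country_dummies$Greece
    wage_model <- lm(wage ~ age + canada + greece, data = complete)         # step 6.7
    summary(wage_model)                                                     # step 6.8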

Practice Exercise # 2 – Structuring Dataset

Instructions: This exercise continues to develop the oral histories dataset. In your same groups from the data modeling assignment, complete the following. Submit your discussion and revised model via Lyceum by the start of class on October 9th (section A) (section B):


REMEMBER TO NOT INCLUDE IDENTIFYING INFORMATION IN YOUR SUBMISSION

  • A revised data model changed to take into account (a) the current state of the oral history data and (b) conversations with your colleagues in the data model share around. This model should include entities, attributes, relationships AND the kind of data for each. So, if you propose an attribute of something like “work related illness” you would specify that this is character/string data, and (probably) suggest that it make use of a controlled vocabulary (i.e. a pre-defined set of illnesses, or at least additional data clarifying how an illness described in the data relates to a diagnostic category)
  • A concrete and specific plan for realizing some part of your revision. (i.e. “read through all histories and tag x” or “strip all numbers” or “make first two rows of history a title column” – NOT “ID diseases” or “clean data” or “add titles”)
    • If your concrete plan involves creating systematic variables from unsystematic language (i.e. creating a variable that captures working conditions from different descriptions of working conditions), then you must include a controlled vocabulary (i.e. easy, moderate, difficult) and an explanation of how you would encode different personal descriptions of (in this example, working conditions) to that controlled vocabulary.
  • A short paragraph explaining how some of the decisions you have made so far have been influenced by (or taken in spite of) some of the big ideas in data cultures we have encountered so far.  Make sure to reference AT LEAST three of the starred readings.

You can download the “cleaned” data from the Lyceum assignment folder (there is still more cleaning to do) as well as the R scripts I used to clean it.