This exercise brings together what we have learned about quantification so far.  You will be creating your own notebook from scratch, calling data, manipulating that data, and analyzing it.  Please submit the notebook via this Lyceum link (section A) (section B) by the start of class on October 25th.

Note: Since you all are cleaning the data, you will not be able to start this lab until 5 PM on Tuesday the 16th.


  1. Set up
    1. Create a new notebook. Name it DSC_104_PE3 (do not include your name – I will grade anonymously and then return your notebooks)
    2. Title your notebook. Make your first cell a markdown cell (press Escape, then m) and begin it with one hashtag (#), which turns the line into a title.  (For more on markdown syntax, look here) Write a placeholder title (you’ll come back at the end and change the title). Run the cell.
    3. Title your sections. Repeat the steps above for six new cells, but use two hashtags (##), which creates smaller title text.  Name them: Introduction, Calling Data, Exploratory Statistics, Correlation, Categorical Variables, and Regression.
    4. Look over the README I created for our data.  It describes each table, gives the URL that holds that table, and describes each column in the table.
  2. Calling Data
    1. Use the function library("psych") to load the psych library of functions.
    2. In the Calling Data section, name variables and import each .csv mentioned in the README.  Use the read.csv() function, which takes the path to the file (in this case, the URL of the file) and header = TRUE (so that we have column titles)
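
As a rough sketch of the Calling Data steps, the cell below reads one table with read.csv(). The table name, column names, and values are invented for illustration, and a temporary file stands in for the README URL so the example runs without a network connection.

```r
# library("psych")  # step 2.1; uncomment once the psych package is installed

# Hypothetical stand-in for one of the README URLs: a tiny CSV in a temp file.
people_path <- tempfile(fileext = ".csv")
writeLines(c("person_id,birth_year,country",
             "1,1850,France",
             "2,1862,Italy",
             "3,NA,France"),
           people_path)

# In the lab, pass the URL string from the README instead of people_path.
# header = TRUE tells R that the first row holds the column titles.
people <- read.csv(people_path, header = TRUE)

head(people)
```

Repeat the read.csv() call once per table, giving each result its own variable name.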
  3. Exploratory Statistics
    1. In the Exploratory Statistics section, create a markdown cell, and write up two questions that you think this data can answer.  These should be about the relationship between different columns in the data (e.g. What is the relationship between date of birth and country of origin? – but don’t use that one, since we’ve already covered it in practice.)  The first one should ask a question about the relationship between columns from different tables. The second one should ask about how three or more columns relate to each other.
    2. Create another cell in the Exploratory Statistics section and use summary() and class() to gather information about each of the columns you mentioned in your question. Remember that you will want to identify a variable and a column in that variable using the syntax variable$column
    3. Create a final markdown cell in the Exploratory Statistics section and describe any insights.
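
The variable$column syntax from step 3.2 might look like the following. The data frame here is a made-up example, not one of the course tables.

```r
# Hypothetical table; in the lab this would come from read.csv().
people <- data.frame(
  person_id  = 1:4,
  birth_year = c(1850, 1862, NA, 1871),
  country    = c("France", "Italy", "France", "Spain"),
  stringsAsFactors = FALSE
)

# class() reports how R stores the column (numeric, character, factor, ...).
class(people$birth_year)
class(people$country)

# summary() gives quartiles, the mean, and an NA count for numeric columns,
# and only length and type information for character columns.
summary(people$birth_year)
summary(people$country)
```

Knowing the class of each column early helps you decide whether it belongs in the Correlation section (numerical) or the Categorical Variables section.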
  4. Correlation
    1. Look at your questions, and the tables they reference.  Make a plan for merging those tables.  Remember that the merge() function takes in the two data frames that are being merged, as well as the column in each that contains the information that bridges both.  Create a markdown cell in the Correlation section and write up your merging plan.
    2. In a cell below the one you just created, merge the tables that you identified.  Make sure to create new variables to hold those merged tables. Create a new cell to view your tables to make sure they are what you want them to be.
    3. Your next steps here will be determined by whether the questions you posed in step 3.1 require the comparison of columns with numerical data.  If so, use those columns; if not, pick two other numerical columns from your merged table.
    4. First use which() inside square brackets to create a new data frame, stored in a new variable, that DOES NOT have any NAs in the columns you wish to compare.
    5. Then, use as many cells as you need and the cor() function to explore correlations between the columns you identified.
    6. Add a markdown cell after each code cell to explain what the resulting correlation tells you about your questions.
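
The merge, NA-removal, and correlation steps above can be sketched roughly as follows. Both tables, their column names, and the bridging columns are hypothetical; substitute the ones from your merging plan.

```r
# Two hypothetical tables that share an identifier under different names.
people <- data.frame(person_id  = 1:5,
                     birth_year = c(1850, 1862, NA, 1871, 1880))
works  <- data.frame(author_id  = 1:5,
                     pages      = c(120, 200, 150, NA, 310))

# by.x and by.y name the bridging column in the first and second table.
merged <- merge(people, works, by.x = "person_id", by.y = "author_id")

# which() returns the row numbers where both columns are present;
# indexing with those rows gives a new data frame without NAs.
complete_rows <- which(!is.na(merged$birth_year) & !is.na(merged$pages))
merged_clean  <- merged[complete_rows, ]

# cor() returns a value between -1 and 1.
r <- cor(merged_clean$birth_year, merged_clean$pages)
r
```

If both tables use the same name for the bridging column, a single by = "column_name" argument does the same job.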
  5. Categorical Variables
    1. Create a new code cell in the Categorical Variables section. If your questions in step 3.1 required you to explore relationships between categorical columns, use the table() function to first create variables that contain count tables. If your questions in step 3.1 do not require you to explore relationships between categorical columns, pick two categorical columns from your merged table.
    2. For each table, produce a pretty table using ftable() and a proportion table using prop.table()
    3. For each table, use chisq.test() to determine whether the distribution of data in that table is likely due to random chance or not.
    4. Add a markdown cell after each table to explain what the resulting counts, proportions, and chi-squared test tell you about your questions.
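
One possible shape for the Categorical Variables cells is shown below. The two categorical columns and their counts are invented so that the example runs on its own.

```r
# Hypothetical merged table with two categorical columns (80 rows).
records <- data.frame(
  country = rep(c("France", "Italy"), each = 40),
  genre   = c(rep("novel", 30), rep("poetry", 10),
              rep("novel", 10), rep("poetry", 30)),
  stringsAsFactors = FALSE
)

# Count table: one cell per combination of country and genre.
counts <- table(records$country, records$genre)

# Flat display of the same counts.
ftable(counts)

# Each cell as a share of all rows; the cells sum to 1.
props <- prop.table(counts)
props

# Chi-squared test of independence between the two columns.
test <- chisq.test(counts)
test$p.value
```

A small p-value suggests the pattern in the table is unlikely to come from random chance alone; it does not by itself say how strong the relationship is.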
  6. Regression
    1. Return to the question about the relationship between multiple columns that you posed in step 3.1.  In a new markdown cell in the Regression section, make a list of the columns you are interested in, along with whether they are categorical or numerical. Pick one of the numerical variables and identify it as your dependent variable.
    2. Subset your merged data so that each of the columns ONLY contains data that is NOT NA.
    3. Create a set of column titles and subset your data so that you have a new variable containing a new data frame that only has the columns you want.
    4. Create a markdown cell and speculate about how your independent variables might contribute to your dependent variable.
    5. Use to create a new variable and a new data frame that contains dummy codes for your categorical variable.  If you have more than one categorical variable, you will need to create more than one dummy code data frame.
    6. Look at the table(s) you just created.  Use the syntax data$new_column <- dummy_data$one_column to add all but one of the dummy code columns to your data frame.
    7. Create a linear regression model using the lm() function.  Refer to the regression example for syntax.
    8. Use summary() to assess which variables contribute to your model.
    9. Create the model again, dropping the variables that were not significant (the ones without stars in the summary output).
    10. Create a markdown cell describing what this model tells you about the relationship between different columns.
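
The regression workflow above might be sketched like this. Everything in it is an assumption made for illustration: the data and column names are invented, and because step 6.5 does not name the dummy-coding function, base R's model.matrix() is used here as a stand-in. The lm() formula is likewise a guess at the syntax, not a copy of the regression example.

```r
# Hypothetical merged table: pages is the dependent variable.
merged <- data.frame(
  pages      = c(120, 200, 150, 310, 260, 180, 240, 290),
  birth_year = c(1850, 1862, 1855, 1880, 1875, NA, 1868, 1879),
  country    = c("France", "Italy", "France", "Spain",
                 "Italy", "Spain", "France", "Spain"),
  extra      = 1:8,
  stringsAsFactors = FALSE
)

# Steps 6.2 and 6.3: keep only complete rows and only the wanted columns.
keep_cols  <- c("pages", "birth_year", "country")
rows_ok    <- which(!is.na(merged$pages) &
                    !is.na(merged$birth_year) &
                    !is.na(merged$country))
model_data <- merged[rows_ok, keep_cols]

# Step 6.5 (stand-in): one 0/1 column per category of country.
dummy_data <- as.data.frame(model.matrix(~ country - 1, data = model_data))

# Step 6.6: add all but one dummy column; France is left out as the baseline.
model_data$italy <- dummy_data$countryItaly
model_data$spain <- dummy_data$countrySpain

# Steps 6.7 and 6.8: fit the model and inspect the coefficients.
model <- lm(pages ~ birth_year + italy + spain, data = model_data)
summary(model)
```

Leaving one dummy column out matters: with all of them included, the columns sum to 1 in every row and the model cannot separate them from the intercept.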
  7. (Finally) Introduction and title
    1. Return to the top of your notebook and write up a paragraph summarizing your findings.  In this paragraph, reference AT LEAST three of the readings we have completed this semester.
    2. Add a catchy title.
    3. Download your notebook (File > Download as > Notebook (.ipynb)) and upload it.
