PSC 4099 - Schedule and Assignments

Unit 2: Operationalizing your hypotheses and coding variables so you can test them (materials for the other course units are accessible from the course homepage).

Materials you will find useful for this unit:

The course's big deliverables:

The grading rubric for your thesis

The grading rubric for your final presentation of the thesis. Note: some aspects of this assignment will be changed due to social distancing and masking requirements. We likely will do the presentations using virtual meeting software.

Thesis assignments (these may be modified as their due dates approach, so don't print them out way ahead of time):

Thesis assignment 1: Identifying a thesis research question

Thesis assignment 2: Beginning to operationalize your research project with suitable data and variables

Thesis assignment 3: A draft of the "front end" of your thesis (i.e., a review of previous research to explain why your project is important, what we already know about it, and what new information about it is worth knowing)

Thesis assignment 4: A draft of the thesis section describing your hypotheses, data, variables, and methodology (which will require you to attach a codebook that you will create with this template).

Thesis assignment 5: A draft of the thesis's findings (including tables and figures) and conclusions

Professional Development Assignments (these may be modified as their due dates approach, so don't print them out way ahead of time):

Prof. assignment 1: Your mentor meetings (Specifics and grade rubric distributed by email)
Prof. assignment 2: Using SPSS and interpreting its output (The workshop materials will be added to the PPT/Assignments folder when we get to that point in the term).
Prof. assignment 3: Preliminary presentation

Week 6 (9/24, 9/26): Your in-class time this week will be spent reviewing how to transition from the front end of your thesis paper into the quantitative analysis sections, starting with how to operationalize theories with hypotheses and variables. This also is the week where we start reviewing how to create and recode variables in SPSS syntax. Outside of class, most of your time should be spent polishing up the "front end" of your thesis.

Looking ahead: Due this Sunday (September 29): electronically submit a copy of Thesis assignment # 3 (a polished draft of the "front end" of your thesis).

Before the start of Tuesday's class, please make sure you have done each of the following tasks, which should have been done several weeks ago (review the materials/screencasts from week 3 if you are not up to speed yet):

Downloaded to your computer the dataset/s you are going to use for your project and put them in folders and subfolders in your computer's "Documents" folder.
Set up Google Drive (preferably) or some other service to automatically mirror the files in your "Documents" folder to the cloud so that you have a local (i.e., on your computer) version of all thesis-related materials and a cloud backup that automatically syncs changes anytime you save a change to a file on your computer.
Created a backup copy of your thesis dataset/s so you don't accidentally harm the original data; don't ever work with the original data.
Created working versions of your dataset/s, removing any observations that are not relevant to the study; you also should consider saving a small version of the dataset that only has variables that have any chance of being relevant to your study. Important: if your dataset doesn't have more than a couple of dozen variables, it is ok to keep all of the variables for now, unless you already have a very good idea about all of the questions (including control variables) that are going to go into your study. The reason for making a version of the dataset with only some variables is to make it less complicated to find the variables you care about as you run statistical analyses. As we saw in class and the screencasts from week 3, when you save your dataset-prep syntax, it is very easy to go back and add additional variables from your large dataset_copy file to a revised dataset_small version.
If you need help doing any of the steps above, see the materials from the Unit 1 schedule. If you are having trouble preparing your dataset for analysis, this handout summarizes the necessary steps.
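
As a rough sketch of what dataset-prep syntax along these lines could look like (the folder path, the file names, and the variables age, gender, educ, and ideology are placeholders for illustration, not names from any particular dataset, and the "adults only" rule is just a hypothetical example of dropping irrelevant observations):

*Open the backup copy, never the original data file.
GET FILE='C:\Users\yourname\Documents\Thesis\Data\dataset_copy.sav'.

*Hypothetical rule for dropping observations that are not relevant to the study.
SELECT IF (age >= 18).

*Save a small working version that keeps only the variables you might use.
SAVE OUTFILE='C:\Users\yourname\Documents\Thesis\Data\dataset_small.sav'
  /KEEP=age gender educ ideology.

*Open the small file so that later commands work on it.
GET FILE='C:\Users\yourname\Documents\Thesis\Data\dataset_small.sav'.

If you later decide you need another variable, you would add its name to the /KEEP list and rerun the whole block.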

Also before class on Tuesday, reach out to your thesis mentor to set up an appointment to meet again next week before you leave for break or immediately afterward (i.e., you should have two meetings before we are a week back from break). At least two days ahead of that meeting, you will need to provide your mentor with a copy of Thesis assignment #3.

If your project is not as developed as you would like at this point in the course, aim to meet with your thesis mentor early in the week after break (it won't be helpful to meet twice before you have had time to make substantial improvements to your paper since you first met).

Also before class on Tuesday, download and print out "Measuring Bias against Female Political Leadership" (password is still: icecream). Please read carefully up to "Multi-variate Analyses," on page 16. The main reason this article is being assigned is that it provides a model of a literature review, hypothesis specification, and how to recode dependent, independent, and (if applicable to a project) control variables. These are three tasks you are going to be doing with your own project over the next two weeks. Keep a copy of this article, because we may also use it as one of the sample readings when we are reviewing the interpretation of regression again later in the term.

In class on Tuesday, we will talk about the article and how to effectively align the front end of your paper with the methodological sections, so that your theory (aka, the justification for a study and discussion about what is known and still needs to be known about the topic), hypotheses, and variable coding stay focused on the same set of variables.
Any extra time in Tuesday's class will be spent starting to think about your project's codebook (see this example), which you will be creating for your project with this template. A codebook is a document, typically available in a study's appendix, that identifies the specific survey questions you are using in your project and exactly how you will recode them into the variables you need. The codebook needs to address how you intend to deal with non-responses, refusals, and "missing" data. For our purposes, your codebook should indicate any variable that has been reverse coded and any continuous variables rescaled to range from zero to one.

At this point in the semester, you don't need to have a fully polished codebook before you start to code your variables; however, you will be asked to submit this document as part of a presentation that you will be making in week 10.

For Thursday, please be ready to make the most out of a class that will be spent reviewing how to create different types of variables in SPSS. You will need to bring your laptop to class.

After class, you will be asked to complete an SPSS assignment on BlackBoard that will give you practice creating each of the variable types described below, except for an additive index because senior theses rarely use those. The SPSS assignment will be due at the start of next Thursday's class.

In class, we will start to cover:

Using the RECODE INTO command to create a new variable. This command is typically used if you need to:

Deal with non-responses or "don't knows" in the original item. While some original datasets will have coded these responses as missing data, most survey organizations assign very high numbers for these response categories so that something will look wrong if you run descriptives on the variable. It is important for researchers to carefully think about how they want to deal with unprompted "don't knows" or refusals. For example, on a question about whether a person thinks they are an evangelical Christian, researchers typically would code individuals as not belonging in this group if they said, "I don't know." With other questions, coding the "don't knows" as missing data would be more appropriate. With RECODE INTO, you make this choice.
Collapse a variable's response categories into fewer groups or into a dummy (0/1) variable. For example, we may want to collapse an 8-level measure of education into three levels (less than high school, no 4-yr college degree, and college degree).
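
As a minimal sketch of both uses (the variable names and numeric codes are assumptions made up for illustration; check your own variable with a CODEBOOK command before copying anything):

*Hypothetical coding: Educ8 runs from 1 to 8, with 98 and 99 used for don't knows and refusals.
RECODE Educ8 (1 thru 2=1) (3 thru 5=2) (6 thru 8=3) (98,99=SYSMIS) (MISSING=SYSMIS) INTO Educ3.

*Hypothetical coding: Evangelical is 1 for yes, 2 for no, 8 for don't know, and 9 for refused.
RECODE Evangelical (1=1) (2,8=0) (9=SYSMIS) (MISSING=SYSMIS) INTO EvangelicalDummy.
EXECUTE.

In the second command, the "don't knows" (8) are deliberately coded zero, while refusals (9) become missing data. Moving the 8 into the same parentheses as the 9 would treat both as missing instead.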

Using a series of COMPUTE and IF commands to create a new variable. These commands need to be written in syntax and appear in a logically correct order. This is a good approach to:

Create a dummy variable when you don't want to use the RECODE command. For example:

COMPUTE White= $SYSMIS.
IF ethnicity= 1 White = 1.
IF (ethnicity >1) and (ethnicity < 98) White = 0.
*If we see from a frequency table for the original variable that there is "missing" data on the original variable, we need for that to carry over to the new variable.
IF MISSING(ethnicity) White = $SYSMIS.

With this code, every respondent would be initially coded as missing data. Then, all respondents originally coded as 1 would be coded 1 on the variable "White." Then, *every* respondent who was originally coded between 2 and 97 would be coded zero. Doing it this way, someone who hadn't identified their race and was coded 99 would be coded as missing data on the variable "White."

To create a dummy variable that includes information from multiple original items. For example:

There are lots of different logical approaches that can be used to code variables like this:

COMPUTE LatinoMale= 1.
*Here, assume Latinos were coded 3 on RaceOriginal, so anyone who is not both Latino and male gets a zero.
IF (NOT (RaceOriginal = 3 AND Male = 1)) LatinoMale = 0 .
*To deal with any missing data on the original variables, we have to add additional lines of code.
IF MISSING(RaceOriginal) LatinoMale = $SYSMIS .
IF MISSING(Male) LatinoMale = $SYSMIS .

Notice that you can use NE (not equal) in an IF statement to create the same variable:

COMPUTE LatinoMale = $SYSMIS.
IF (RaceOriginal NE 3 OR Male NE 1) LatinoMale = 0 .
IF (RaceOriginal = 3 AND Male= 1) LatinoMale = 1 .

IF MISSING(RaceOriginal) LatinoMale = $SYSMIS .
IF MISSING(Male) LatinoMale = $SYSMIS .

To reverse code a variable with lots of response categories. You can do this with a RECODE INTO command, but using COMPUTE is faster. If you have a 7-level Likert measure of how satisfied respondents are with the federal government, you can use a COMPUTE command to reverse code that variable by subtracting each respondent's score from 8 (i.e., the original variable's highest value plus one) and then add an IF command if necessary to deal with people who gave "don't know" responses or refusals on the original item. If we did this, someone who was coded 1 would become a 7 (8-1) and respondents previously coded 7 would now be 1 (8-7). For example:

COMPUTE NotSatisfiedWithFedGOV7 = (8 - SatisfiedWithGovOriginal).

*Here, assume don't knows and non-respondents were coded 99 on the original item.
IF SatisfiedWithGovOriginal = 99 NotSatisfiedWithFedGOV7 = $SYSMIS.

*And add a line to deal with any missing data on the original variable.
IF MISSING(SatisfiedWithGovOriginal) NotSatisfiedWithFedGOV7 = $SYSMIS.

To rescale a variable to range from 0-1, which can be useful when interpreting some types of statistical results. In this situation, COMPUTE is used to (1) subtract a value from each respondent so that the respondents with the lowest value will be coded zero and, then, (2) to divide by the new maximum value, so that each respondent's response is a fraction ranging from zero to one. Using the example above, if we subtract one from each respondent's NotSatisfiedWithFedGOV7 score, everyone (except for non-respondents and don't knows) will have a value ranging from zero to six, which we then divide by 6:

COMPUTE NotSatisfiedWithFedGOV_0to1= (NotSatisfiedWithFedGOV7 -1)/6.

*and deal with any missing data on the original variable:

IF MISSING(NotSatisfiedWithFedGOV7) NotSatisfiedWithFedGOV_0to1= $SYSMIS.

To create an additive index. You might do this if you had three questions asking respondents how often they engage in three different types of participation and you wanted to create a single measure that summed each respondent's values. Your code would look something like this:

COMPUTE PolPartic3to15 = ProtestMarches5 + InstagramPolitics5 + TalkPolitics5.

*and deal with any missing data on the original variables:

IF MISSING(ProtestMarches5) OR MISSING(InstagramPolitics5 ) OR MISSING(TalkPolitics5)
PolPartic3to15 =$SYSMIS.

*Incidentally, if I wanted the new variable to have a low value of 0 (recall that adding the three variables together has resulted in a minimum value of 3), I could have used this code for the compute command instead:

COMPUTE PolPartic12 = ProtestMarches5 + InstagramPolitics5 + TalkPolitics5 - 3.
IF MISSING(ProtestMarches5) OR MISSING(InstagramPolitics5 ) OR MISSING(TalkPolitics5)
PolPartic12 =$SYSMIS.

Some extra recorded mini lectures for weeks 6 and 7: If you need more guidance on variable coding, you can optionally watch one or more of the screencasts linked below. Before you invest the time to review any of these screencasts, keep in mind that the vast majority of HPU students working with datasets will need to know only how to (1) recode variables into new variables, (2) create dummy variables that combine information from two or more variables, and (3) reverse-code variables so that, for example, a 10-point variable's highest value on the original becomes the lowest value on the new variable.

If you feel like you need more guidance on the RECODE INTO command than what you have received during our practice work in class and the SPSS-Blackboard assignment, you can watch this screencast (15min, 58sec; 10 minutes or so if you watch at 1.5x speed). The screencast covers recoding and labeling variables as well as labeling response categories for a new variable.

Below is a summary and some more information for what is covered in the screencast so that you need to watch it only once (or perhaps not at all):

The screencast focuses on recoding an original variable into a new variable. (From this part of the course forward, I typically highlight the statistical and SPSS methods that almost every student will use in their thesis, because some of what we cover either applies only to a minority of students or is covered to teach you a method that will be required in the thesis.) The RECODE INTO command is your go-to method when you are working with only one original variable and want to collapse its categories (perhaps into a dummy variable, like making the 0-1 variable Latino out of the original variable Race5) or reverse code its values (e.g., using Conservative7 to make the new variable Liberal7).

(1) Let's say that we want to create the second of the examples mentioned above: Liberal7. First, run a CODEBOOK command on our original variable, Conservative7, in syntax so you can see how it is numbered and labeled:
CODEBOOK Conservative7.

(2) Next, if it's available, it would be helpful to have the original variable's exact question phrasing pasted into our syntax so you can easily recall how exactly the variable was phrased later on.

You can write or copy and paste annotations into the syntax right before the RECODE INTO command. If you put an asterisk in front of an annotation with question phrasing pasted from the questionnaire, it will be greyed out so we can see it, but SPSS won't stall when it gets to that part of the syntax. SPSS will keep greying out language as long as there is no blank line and you don't end a line with a period. So, when you are ready to end the annotation, type a period and press return.

(3) Start to recode the variable by using SPSS's point-and-click interface, going to Transform -> Recode into Different Variables. You want to use the RECODE INTO command, which preserves all aspects of the original variable in case you make any recoding mistakes.

In this example, you will need to tell SPSS that you are going to transform Conservative7 into a new variable named Liberal7. Label the new variable in a way that makes sense and is specific (e.g., don't name and label a variable "Gender" if you are creating a dummy variable for people who identify as male, because you may later forget who is coded one). Then click the "Change" button. Nothing will seem to happen, but the new variable will have that label when it is created later.

(4) Now, click the button for "Old and New values." This is the area where you do all of the actual recoding. Make sure that you think through how you need to recode. You need to recode every value in the original variable into some value (or system missing) in the new variable.

(5) It is best practice in recoding to start off by indicating that any value that was "system or user missing" in the old variable should be coded as "system missing" in the new variable. Choose those options and click on the add button.

(6) Now you need to recode the values. Enter 7 in the old variable = 1 in the new one and click on the add button. Then enter 6 = 2 and click "add," and so on until all seven of the values are reversed for the new variable.

(7) Once you have made sure to include instructions that will recode ALL of the original values into the new variable, click on continue and then paste. If you accidentally click on the OK button, just point and click your way back through the Transform menu, where all of the information you just entered will still be there, and this time click on paste.

(8) To create the new variable, you have to select and run the code you just pasted into syntax (run with the green arrow on the menu). Whenever you create a new variable, do what you can to make sure you did the recoding correctly. To do this, run a frequency for the original and new variables and make sure that everything looks right. If you didn't code the variable correctly, look at your syntax and see what went wrong. If you need to edit the syntax, you can run the whole RECODE INTO command over again, and it will recreate the variable (this is a cool improvement over older versions of SPSS, which made you delete a variable before you could recreate it).
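
For reference, the pasted syntax for a reverse-coded 7-point item would look roughly like the sketch below, followed by the frequency check. The variable names follow the Conservative7/Liberal7 example; the label text is whatever you typed into the dialog box, and the exact spacing SPSS pastes may differ:

RECODE Conservative7 (MISSING=SYSMIS) (7=1) (6=2) (5=3) (4=4) (3=5) (2=6) (1=7) INTO Liberal7.
VARIABLE LABELS Liberal7 'How liberal is the respondent?'.
EXECUTE.

*Check the work: the counts for 1 on the old variable should match the counts for 7 on the new one, and so on.
FREQUENCIES VARIABLES=Conservative7 Liberal7.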

(9) You need to make sure any new variables and their response categories are labeled. The fastest way to label variables is to do so in syntax. If you are using RECODE INTO, SPSS will have created the code to label a variable for you. However, you also will need to add value labels to any new variable you create. Here's what that code looks like if we want to label just the anchors on our 7-point measure of liberalism (if we were working with a dummy, multi-category, or ordinal variable, we would label each value):

VARIABLE LABELS Liberal7 "How liberal is the respondent?".
VALUE LABELS Liberal7
1 "Very Conservative"
7 "Very liberal".

Notice that there are two periods, one for each of the two label-related commands. There is only a single period at the very end of each command, and it goes outside of any quotation marks.

Note also: if you want to change the variable label or any of its value labels, you can make the edits and just rerun the commands. This is one of the big advantages to working in syntax.

Finally, notice that once you have created the recode+relabel syntax for a variable or two, you can copy and paste it and quickly swap out values and phrasing to very quickly create a whole bunch of similar variables.
If you need more guidance on using COMPUTE + IF statements to create a dummy variable from just one original variable, here is a screencast that goes over that technique (9 min or so if you listen at full speed).

Here's a sample of what that looks like, using code created in the screencast with one critical difference. After I recorded it, SPSS changed how you deal with missing data on the original item. See the highlighted code below:

*Create a new variable: Catholic, that has no data for any respondent.
COMPUTE Catholic = $SYSMIS.

*Now recode to zero everyone who was coded 1 or higher on the original religious denomination variable.
IF religion > 0    Catholic = 0.

*And keep flipping the values for respondents so that only the Catholics are coded 1.
IF religion = 1    Catholic = 1.
IF religion >1    Catholic = 0.

*Now flip the values so that people who refused to give their religion on the original variable are coded as missing responses on the new variable if that data was given a number (which you would see with a CODEBOOK command on the original variable).
IF religion > 97 Catholic = $SYSMIS.

*And make sure that any missing values on the original variable carry over into the new dummy variable.
IF MISSING(religion) Catholic = $SYSMIS.

*Now, add the required labels for the new variable.
VARIABLE LABELS Catholic "Catholic" .
VALUE LABELS Catholic
0 "Not Catholic"
1 "Catholic"
.
EXECUTE.

If you need more guidance on using COMPUTE + IF statements to create a dummy variable with information from two or more other variables, here is a screencast that goes over that technique (9 min or so if you listen at full speed).

Below is a summary of what is covered in the screencast, only it uses an example (a dog's breed and color) that may be easier to remember. After I recorded the screencast, SPSS changed how you deal with missing data on the original item. See the highlighted code below:

Let's say we have a sample of dog owners who all own one dog. They have taken a survey asking questions about their dog. We want to create a hypothetical new dummy variable identifying the owners of white poodles. Our new dummy variable, WhitePoodle, will be created with information from two original variables: dog_breed and dog_color.

A CODEBOOK command indicates the following:

*dog_breed = 12 if the dog is a poodle, while the other values from 1 through 55 refer to other dog breeds.

*dog_color = 1 if a dog is white, while values 2 through 10, refer to black, brown, grey, spotted brown, etc.

*If a respondent didn't know or refused to give information about their dog, the relevant variables were coded 99 in the dataset.

Here is the block of code we could use to create the dummy variable:

COMPUTE WhitePoodle = $SYSMIS.
IF (dog_breed < 99) AND (dog_color < 99) WhitePoodle =0.
IF (dog_breed = 12) AND (dog_color = 1) WhitePoodle =1.

IF MISSING(dog_breed) OR MISSING(dog_color) WhitePoodle =$SYSMIS.

VARIABLE LABELS WhitePoodle "Dog is a white poodle".
VALUE LABELS WhitePoodle
1 "White poodle"
0 "Not a white poodle" .

Note where the periods are.

Note that COMPUTE, IF, VARIABLE LABELS, and VALUE LABELS are all commands, so one--and only one--period goes at the end of each full command even if that command stretches over more than one line of code.

Here is the logic behind how the code was written, step by step:

Step 1
To see how the responses for dog_breed and dog_color are coded and labeled in the dataset, we run a codebook command:
    CODEBOOK dog_breed dog_color.

Step 2
Now, tell SPSS to create a new variable where all of the values are blank (i.e., are "system missing"):
    COMPUTE WhitePoodle = $SYSMIS.
If we were to look at the data set in the SPSS data viewer (Data View tab), we would see a new column for a variable named "WhitePoodle." All of its values would be blank.

Step 3
Tell SPSS to turn almost all of those blank values into a zero if the observation should not be system missing:
    IF (dog_breed < 99) AND (dog_color < 99) WhitePoodle =0.
Notice this step changes the blank to a zero (meaning not in our group) for every respondent who didn't say "don't know." Now, if we were to look at the data set in the SPSS data viewer (Data View tab), we would see that most of the values for "WhitePoodle" are zero.

Step 4
Now, tell SPSS to turn some of the respondents' values into a 1 if certain conditions are met:
    IF (dog_breed = 12) AND (dog_color = 1) WhitePoodle =1.
This step turned the zeroes into ones for people with white poodles.

Step 5
Now, tell SPSS to turn some of the respondents' values into missing data if they had missing data on either of the original items:
    IF MISSING(dog_breed) OR MISSING(dog_color) WhitePoodle =$SYSMIS.

Step 6
Now, tell SPSS to label the new variable and its response categories. For example:
    VARIABLE LABELS WhitePoodle "Dog is a white poodle".
    VALUE LABELS WhitePoodle
    1 "White poodle"
    0 "Not a white poodle" .

Step 7
Run a frequency on the old and new variables to verify that the general pattern and number of observations look right. If you made a mistake, read carefully through the code. If you make edits, you can run the block of code over again; SPSS will delete the variable and replace it with the edited version.
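
One way this check could be written in syntax is sketched below. The second part is an optional extra check, not something from the screencast; it uses TEMPORARY so that the SELECT IF applies only to the next command and no cases are permanently removed:

FREQUENCIES VARIABLES=dog_breed dog_color WhitePoodle.

*Optional second check: look only at the cases coded 1 on the new variable.
TEMPORARY.
SELECT IF (WhitePoodle = 1).
FREQUENCIES VARIABLES=dog_breed dog_color.

In the second set of tables, every case should show the poodle code on dog_breed and the white code on dog_color from the CODEBOOK output. If any other value appears, one of the IF conditions is wrong.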

Here is a screencast that reviews the use of COMPUTE + IF to reverse code an interval variable with 10 possible values. It runs a bit over 10 minutes long, but only the first 6min 30sec covers the reverse coding. I left the rest of the screencast because it shows your instructor making a very common mistake when labeling variables (initially, both the VARIABLE LABELS and VALUE LABELS commands neglected to tell SPSS what variable needed to be labeled).
If you need more guidance on using COMPUTE and IF statements for more complex situations (which depends on what your project's independent and dependent variables look like), you may want to review this screencast (12min 53sec) at some later point in the term. It covers creating variables like "interaction terms" and "indexes" that combine multiple variables into a single measure. If you just want to create an additive index that combines the results of several questions (e.g., you might want to combine four 5-point measures of different types of support for government intervention, previously reverse coded, into a single 0-20 point measure), you can start the screencast at 9min 05sec.

Week 7 (10/1, 10/3): This week will be devoted to lab time, where you will be recoding variables for your SPSS homework assignment. If you have extra in-class time, put it to use coding data from your dataset, creating your main dependent, independent, and control variables in SPSS syntax.

Sunday, 9/29, Submit Thesis Assignment #3, your study's "Front end" by 10pm unless you were given an extension (i.e., you must turn in the assignment by the later of this due date or one week after you were given feedback on your Thesis Assignment #2)

You also should have booked an appointment to meet with your thesis mentor for a second time at the end of this week, or have an appointment to do so the week back from break. When you meet will depend on their availability as well as how much progress you have made since your last meeting with your thesis mentor.
On Tuesday and Thursday, we will continue to work on variable recoding. I want the SPSS assignment done ahead of Thursday's class so you can work with your project's dataset to create your main dependent, independent, and perhaps some control variables in SPSS syntax.
Ahead of Tuesday's class, make sure you have booked an appointment to meet with your thesis mentor for a second time. This meeting should be at the end of this week or early in the week back from break. When you meet will depend on their availability as well as how much progress you have made since your last meeting with your thesis mentor.
Ahead of Thursday's class, make sure that you have completed the SPSS assignment in BlackBoard on variable coding.

Outside of class this week, you should be continuing to revise your project's front end (i.e., keep polishing it even after you have submitted it so that it better aligns with the variables you are creating). The rest of the week's "homework time" should be spent on your project's codebook (i.e., determining how your dependent, independent, and control variables will be coded) and coding data for your SPSS homework and thesis project.

Week 8: No class on 10/7 or 10/9: Enjoy the mid-term break!

Looking ahead:

In Week 9, we will be reviewing the basics of univariate and bivariate statistics as well as how to summarize bivariate relationships in SPSS and Excel charts.
In week 10, everyone will be presenting their thesis projects. At that time, the expectation is that you will have coded all of your project's variables correctly, have a draft of your codebook done, and be well into drafting the methods section of your thesis.

To make it easier to find things, I have broken up the assignments calendar into multiple units. The material for the next part of the course can be accessed by going to the course homepage and following the appropriate links.
