RESEARCH METHODS IN POLITICAL SCIENCE





Please note: At your instructor's discretion, there may be minor alterations to the reading assignments listed below. One of the major advantages of providing you with an online readings archive is that timely articles can be added or substituted when appropriate. Opening documents downloaded from this website will require that your computer have Acrobat Reader. You will also need the class-specific password to open individual files.


UNIT 3 ASSIGNMENT SCHEDULE



Week 12

Topic 13 (Monday, 11/4): Determining whether two or more groups' means or proportions are statistically different with t-tests.

In most social science studies, researchers want to explore whether variation in one or more independent variables corresponds to differences in a dependent variable. In the sample article you were asked to read this week, it was hypothesized that certain characteristics and views made a person more likely to vote for a controversial Brazilian candidate (measured with a dummy variable). In class, we asked whether folks who vary by gender, partisanship, or race have different average household incomes (an interval variable). Now we will learn how to see if differences observed in a sample are likely to hold for the larger population.

  • What is a t-test? If we just want to see how different groups of respondents varied in how they answered a survey question, we can split our data and then calculate descriptive statistics for a dependent variable we care about. And it can be helpful to display those differences in bar charts, which can be quickly created in a spreadsheet. However, how do we know if any differences we are seeing in our sample are large enough that we would expect that finding to hold for a larger, more general population? To test whether this is the case, we use t-tests. For example, we could use a t-test to see if what we saw in our sample--that Republicans make a little more money than Democrats--would hold up if we repeatedly surveyed representative samples of Americans.

  • Ahead of class, carefully read the first sections of Chapter 8, "Bivariate Analyses" in Carolyn Forestiere's textbook. Read just up to the section on correlation analysis (i.e., just the first six pages of the chapter). Correlation analysis will be covered after the Unit 2 test.

  • Before class, read about how to do t-tests in SPSS: This topic is covered in the how-to handout for statistical analysis, which includes summaries of how to use SPSS for both types of t-tests described in this week's materials.

  • Print out and keep handy this one-page handout of annotated SPSS output for T-tests.


When and how do we use independent sample t-tests?

  • Forestiere's chapter discusses the most commonly used t-test, the "independent samples test." This test assumes you are looking at whether two groups coded on the same independent variable have statistically different means for a second variable. Staying with the example I just gave, this test would only be appropriate if you have a categorical variable where respondents were coded something like: 1=Democrat, 2=Republican, 3=Independent, and 4=Other party.

  • Optional: Watch after class if you need more guidance on calculating independent samples T-tests: https://youtu.be/KADpYio2W3U (a little over five minutes). In the example video a T-test is used to determine if the mean level of support for torturing terrorism suspects is different for Democrats and Republicans. 

Some things to remember from the video:

(1) To run an independent samples t-test: Analyze -> Compare means -> Independent samples T-test. Then, select a variable whose mean you want to examine across two subgroups. (A syntax version of this procedure is sketched after these notes.)

(2) You next need to specify which values of the grouping variable will be compared (click on the button that says "Define Groups"). In the sample video, Republicans were coded 1, Democrats 2, and Independents 3 in the original dataset. To compare the means of Republicans versus Independents, the values 1 and 3 would be specified.

(3) Make sure that you are looking at the correct block of results and the correct column to determine if the difference in means is statistically significant. The significance test you want is in the bottom block of output (not the "Group Statistics" block, but rather the "Independent Samples Test" block). In that block, look at the top row of results ("Equal variances assumed") and find the column labeled Sig. (2-tailed). To repeat, the one you are looking for is in the row for "Equal variances assumed."

(4) Only if the two-tailed significance statistic is SMALLER than .05 can we say with any confidence that the mean values for the two groups are statistically different and that we would reach the same conclusion if we drew repeated samples from the same population; conversely, a significance statistic that is LARGER than .05 indicates that the two groups do not have statistically different means.
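
If you prefer syntax over the menus, here is a minimal sketch of the same procedure. The variable names are hypothetical stand-ins (income for the variable whose means are being compared; party for the grouping variable, with Republicans coded 1 and Independents 3):

* Hypothetical example: compare mean income for Republicans (1) and Independents (3).
T-TEST GROUPS=party(1 3)
  /VARIABLES=income
  /CRITERIA=CI(.95).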


When and how do we use single-sample t-tests?

  • Forestiere's chapter does not discuss a second type of t-test, the single sample t-test, which can do everything an independent-sample T-test does and more. Since she doesn't cover this second type of T-test, please read through the paragraphs below very carefully.

  • A single-sample t-test determines whether the mean value for a group on a dependent variable is different from some fixed comparison value, such as the mean for the full sample or the mean for another group.

    Your textbook refers to the "DataPrac" survey, which is a dataset that comes with your textbook so that you can practice the methods discussed in the book. We are not using that dataset this semester; however, if you were to analyze the DataPrac survey's variable D72, you would see that the typical American (i.e., the survey mean) had a response value of 7.17 on a 10-point scale that measures religiosity. For this indicator, respondents were asked to place themselves on a scale where 1 means that God is "not at all important" in their life and 10 indicates they believe God is "very important" in their lives. Is the sample's mean value for this variable different than the mean for individuals who say they plan to vote for the Democratic candidate in the next election? How about Republicans? Is the religiosity of men lower or higher than the national average for this item? How about women?

  • We can answer each of these questions and even build a table comparing them if we split our data on each of the relevant independent variables and then run a single-sample t-test (Analyze -> Compare Means -> One Sample T-test) for each group we care about. For each of the tests, we would enter the average for the sample as a whole, 7.17, into the place in SPSS that asks for the "Test value."

    So, if we split the data by the DataPrac variable D14 (partisanship) and run a one-sample t-test in SPSS with a test value of 7.17, we see that the mean score for Democrats on the importance of God in their life is 6.60, and the two-sided significance test reports that the difference between the test value and the mean for Democrats is significant at the .001 level. The same test shows that the average Republican score for the importance-of-God measure is 8.25, which is significant at the <.001 level. In other words, if we surveyed similar samples 1000 times, we would expect to find that the average value for Democrats on this variable was always lower than the national average. And a separate t-test (because the partisanship variable was split) shows us that the average for Republicans should always be higher than the national average. To see if men or women also are different from the national average for this religiosity variable, we would just need to go back to Data -> Split File -> Compare Groups and swap out the variable D14 for the gender variable (we would leave the test value at 7.17, the average for the full sample). A syntax summary of this split-and-test procedure appears after the video notes below.

  • Optional: Watch after class if you need more guidance on calculating a one-sample t-test: https://youtu.be/paUIJ3Eh7JI (a little over five minutes). In the video, a test is run to see whether the average for the variable male (coded 0/1) is different than the value of .50, which is about the percentage of men we would expect to find in a nationally representative sample (e.g., it would be due to something other than sampling error if we were to find that 60% of a 3000-person random sample was male).

Some things to remember from the video (so you don't need to watch it more than once... or maybe even at all):

(1) To run a one-sample t-test: Analyze -> Compare means -> One sample T-test. Then, select a variable whose mean you want to examine.

(2) You next need to specify a "test value" to which you want to compare the mean for your variable of interest. In the video, the mean for the variable male is .54, which is compared to an expected value of .50. Per the commentary above, the test value of .50 was used to see if there are more males in this sample than one would expect to find in a nationally representative survey.

(3) Only if the significance statistic for the "two-sided p" result is SMALLER than .05 can you say with any confidence that the mean value observed in the sample is truly DIFFERENT than the expected value; a significance statistic that is LARGER than .05 indicates that the observed and expected means are not statistically different. In this particular example, a value greater than .05 would mean that the sample's proportion of men is not larger than the 50% we would expect to find in a nationally representative sample.

(4) It is not covered in this screencast, but remember from the religiosity example above: if you want to test whether the mean for a subgroup (perhaps women, or Republicans, or Catholics) is different than some value (maybe 50%, or the sample average, or the mean of some other group you care about), you can split your data to isolate the subgroup and then run the t-test.
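
As promised above, here is a minimal syntax sketch of the split-and-test procedure, using the DataPrac variable names mentioned earlier (D14 for partisanship, D72 for the religiosity measure) and the full-sample mean of 7.17 as the test value:

* Split the data by partisanship (D14), then test each group's mean on D72.
SORT CASES BY D14.
SPLIT FILE LAYERED BY D14.
T-TEST
  /TESTVAL=7.17
  /VARIABLES=D72.
* Turn the split off when you are finished.
SPLIT FILE OFF.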


Wed, 11/6—Lab time to work on SPSS #3 in BlackBoard, which will be due by 6 pm on Friday.


Topic 14 (Friday, 11/8)—Determining whether and how much two variables are "associated" with one another

  • In class, we will focus on three of the many methods social scientists use to determine how two variables are associated: chi-square tests, correlation tests, and bivariate regression.

  • In class, most of our time will be spent continuing to look at bivariate correlation and--if there's time--regression with one independent and dependent variable.

  • After class, please finish Chapter 8 in your Forestiere textbook (you previously read up through the section on t-tests, so start on the section for correlation). Please read the political science examples carefully. It will be faster, easier-to-understand reading if you wait to complete this reading until after we have covered correlation in class.

  • After class, please read the first 15 pages or so of Chapter 9, "Regression," in your Forestiere textbook (up to page 199). Review the section on regression with one independent and dependent variable. It will be faster reading if you wait to complete this reading after we have begun to discuss regression in class.

For X2 (chi-square) and other association statistics, here are the key concepts you need to remember from class:

    • Before class, but after you have read about correlation tests in the textbook, read through the block of material below carefully and quickly read a very short reading on what a chi-square test is to give you a clearer idea of what SPSS is doing when it runs this kind of association test. Chi-square tests are not covered in your textbook, so you need to review this statistical measure in the material below and the assigned, short reading.

    • Before class, print out and have handy this annotated SPSS output for a chi-square test. In the sample, the researcher is trying to see if a Brazilian's race (a categorical variable) had anything to do with whether or not they voted for the politician who was elected president in 2018.

    • Because they are the most commonly used association tests in political science and international relations research, we are focusing mostly on correlation and regression. They are the only association measures covered in any detail in the Forestiere textbook. Other than correlation and regression, the only association statistic you need to be familiar with for this course is the chi-square (x2) test.

    • A chi-square test is what we use to see if any combination of two nominal (categorical) or ordinal variables is associated with one another. For example, you might wonder if a person's race or political party is associated with what major religious denomination they belong to.

    • To calculate a chi-square test, use SPSS's Analyze -> Descriptives -> Crosstabs. If you think that one variable is the cause of the other, the independent variable (the cause) typically goes into the rows, while the dependent variable should be listed in the columns window. To make the table useful, go to the "Cells" option and check the boxes for observed counts and row percentages (and only row percentages). Then, in the "Statistics" option, check the box for a chi-square test. If you need more guidance on this procedure, you can watch this screencast: https://youtu.be/7O3UTYL2A-I. (A syntax sketch of this procedure appears at the end of this chi-square section.)

    • To get a basic understanding of what a chi-square test is and how association measures work, read carefully just the first seven pages of this document (Read up to the section "residuals"). Here is a summary of what the reading says, with a simplified example:

The main point of the reading is that a chi-square test provides a statistical test to determine whether any association between two nominal/ordinal variables we see in our sample data is due to chance. In other words, it estimates the probability that we would see an association this large in our sample even if, in the broader population, a respondent's category on one variable had nothing to do with their category on a second variable.

An example can provide a basic idea of what a chi-square test looks at. Let's say that we have a 1000-person sample where exactly half of the individuals have identified as women and half as men. This being a sample from an odd, hypothetical US state, we also have a sample with exactly 50% Democrats, 50% Republicans, and no independents.

If gender has no association at all with partisanship in our sample, we would expect to see that 25% of our sample is made up of female Democrats, 25% female Republicans, 25% male Democrats, and the final 25% male Republicans.

However, a hypothetical analysis might reveal that 30% of our sample is made up of female Democrats and only 20% is made up of male Democrats. Thus, in our sample, it looks like there is an association between gender and partisanship (specifically, more women than expected are Democrats and fewer are Republican).

The chi-square test will tell us (and ONLY tell us) whether the association between gender and partisanship that we are seeing in our sample could be due to sampling-error chance. The p-value for the test tells us the probability that we would observe an association this strong in a sample of this size if gender and partisanship were actually unrelated in the larger population.

A p-value of .05 or smaller for the chi-square statistic tells us that there is only a 5% chance or less that the association we are seeing in our sample is due to chance (i.e., survey error, which is a function of sample size) and that we should expect repeated sampling to show a similar association at least 95 percent of the time. Given the magnitude of gender differences in the hypothetical sample above and its size (n=1000), the chi-square test would be significant in this case.

However, as is the case with statistical techniques generally, if you are using a very small sample or looking at a variable where you have very few individuals in some response categories, a chi-square test may not return a statistically significant result. This is why it is important to run frequencies on variables and think carefully about whether response categories should be combined (e.g., it is very common to see a multi-racial measure be recoded into a white/non-white dummy variable before analysis if the sample is under 600 or so respondents).

If you want to be a competent consumer of social science research, you should be aware that there are other statistical methods that can provide more accurate tests of association when you are looking at the relationship between any specific combination of two categorical, dummy, or ordinal variables. If you are curious, here is a summary of the association tests SPSS can quickly compute: https://www.ibm.com/docs/en/spss-statistics/25.0.0?topic=crosstabs-statistics. We are learning only about chi-square tests because they are widely reported in both academic and everyday publications.
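
As promised above, here is a minimal syntax sketch of the crosstabs-with-chi-square procedure (gender and party are hypothetical variable names; the independent variable is listed first so that it forms the rows):

* Hypothetical example: is gender associated with partisanship?
* Request observed counts, row percentages, and a chi-square test.
CROSSTABS
  /TABLES=gender BY party
  /CELLS=COUNT ROW
  /STATISTICS=CHISQ.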

For correlation statistics, here are the key concepts you need to remember from the chapter and class:

    • To run a correlation in SPSS, use Analyze -> Correlate -> Bivariate. If you enter more than two variables, you will get a "correlation matrix," showing you the relationship between each pair of variables. (A syntax sketch appears at the end of this list.)

    • The correlation statistic (aka Pearson's R) measures only how consistently the value of one interval variable corresponds with the value of a second variable (note that it is common to use correlation with dummy variables).

    • Correlation measures should be interpreted with close attention to whether or not they are statistically significant. If the p-value is higher than .05, there is no association regardless of how large the correlation statistic is. For correlation, the p-value tells you the probability that the association found in the sample could be zero or signed in the opposite direction in repeated sampling. To say that one variable is a statistically significant predictor of another, the p-value needs to be .05 or less. In SPSS, make sure to look at the p-value even if you see two asterisks. For some odd reason, the default setting in SPSS only adds two asterisks to coefficients even when the p-value is <=.001, which should be denoted with three asterisks.

    • Correlation does NOT tell you how much a change in one variable changes the other variable. It also cannot tell you which variable may be causing the other to move. For example, being more conservative is correlated with being more religious, but there are theories to suggest that causality could go either way.

    • Moreover, even if two variables are highly correlated, it could be that there is a third variable that is causing both x and y to change in predictable ways even though those two variables have no actual relationship. For example, in the US, violent crime goes up in the same months that ice cream consumption also goes up in a population, but they don't have anything to do with each other except that both are more prevalent on hot summer days. "Omitted variable bias" is one of the reasons we will be talking about multivariate regression models next week.

    • Most of the association statistics range from -1 to 1, and a negative correlation statistic means that increased values on one variable are associated with declines in the other variable. Typically, positive correlation statistics are not marked with a plus sign.

    • The square of the correlation coefficient (r-squared) is used to estimate how much of the variation in one variable is "explained" by the other one, with a key caveat noted above: a missing variable may be explaining some or all of the variation... which is why we typically look at relationships between two variables with multivariate regression that includes one or more "control variables" (more on that next week).
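
For reference, a minimal syntax sketch of the menu path described in the first bullet above (the three variable names are hypothetical placeholders):

* Hypothetical example: a correlation matrix for three variables.
CORRELATIONS
  /VARIABLES=religiosity conservatism income
  /PRINT=TWOTAIL SIG.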

As a general guideline for thinking about the association statistics, like correlation:
<.10 means that there is a very weak or no association between the variables;
.20 can be interpreted as a meaningful but modest association;
.30 is a moderate association, and
>.40 is a strong association,
But in every case, you need to put these findings into context (e.g., a .40 association between being Republican and being conservative would be a much weaker finding than you would anticipate, so it wouldn't make sense to refer to this scenario as being evidence of a very strong association).

For linear regression with one independent variable (i.e., bivariate regression), here are the key concepts you need to remember from the chapter and class:

    • You must specify a dependent variable when using regression, but this doesn't mean x actually causes y. Running a regression model will not tell you whether the independent variable actually causes the dependent variable. Use theory or logic to determine which variable should be treated as the dependent variable.

    • Linear (i.e., OLS--Ordinary Least Squares) regression models report an R-square statistic, which is interpreted as noted above in the section on correlation. An R-square statistic of .35 means that the independent variable in the regression model explains 35% of the variation in the dependent variable (and doesn't explain 65%). You will get some sample language on reporting R-square results in the section on regression with multiple variables below.

    • Regression estimates how much a one-unit increase in the independent variable will correspond to changes in the dependent variable. Specifically, regression output includes a slope measure for each independent variable. This statistic is called the unstandardized regression coefficient. In SPSS output, unstandardized coefficients are listed in the "B" column (make sure you are looking at the first column in the last block of output). This regression coefficient tells us how much "each one-unit increase in the independent variable corresponds with an x-unit increase (or decrease if the coefficient is negative) in the dependent variable." In plain English, we might say, "each one-unit increase in the 10-point measure of religiosity corresponds to a 1.34-point increase in an individual's measure on the 10-point ideological-conservativeness scale." (A syntax sketch of a bivariate model appears at the end of this topic's notes.)

    • With regression, there is a statistical test for each regression coefficient where the p-value tells you the probability that the relationship found in the sample could be zero or even run in the opposite direction in repeated sampling. To say that one variable is a statistically significant predictor of another, the p-value needs to be .05 or less. These models also include a value for the y-intercept (in the output, this is the unstandardized "Constant").

    • With regression results, you can predict the value of the dependent variable at selected values of an independent variable. The constant (aka, the y-axis intercept) can be used to predict the value of the dependent variable for a given scenario with a simple formula: DV value = Constant + (a specified value of the IV times the regression coefficient). If the IV is a dummy variable, the language used to interpret its regression coefficient is: "Compared to the reference category of [carefully describe anyone who is not in the group], individuals who are in the group had an x-point higher (or lower if the dummy variable coefficient is negative) value on the dependent variable." In plain English, this might sound like, "Compared to non-Republicans--that is, Democrats plus independents--Republicans' score on the 10-point measure of religiosity was 2.4 points higher."

    • As with correlation generally, regression models assume that every one-unit increase in the independent variable will have the same effect on the dependent variable. This is referred to as the assumption of linearity. Examples of how variables can be related to one another but not have a linear relationship include time and investments (over time, investment returns compound, so growth is exponential) and the curvilinear relationship between age and physical independence. Regression can handle these types of relationships in a few different ways, one of which is using dummy variables and interaction terms (this isn't the most common way, but it is the only way that fits neatly with concepts you already are going to learn in this class). If we suspected that age has a different effect on income, a series of dummy variables (say, reference group = under 35, with additional dummies for 35-50, 51-65, 66-75, and over 75 years old) likely would show that wage-earned income, on average, quickly increases as one moves through the younger age categories and then levels off or declines among the oldest groups.

    • Completely optional: if you have attended class and carefully read the textbook material on correlation but feel like you would like to go over the basics of this method one more time, you can watch this 25-minute (12.5 at x2 speed) screencast presentation covering the logic and main concepts of correlation: https://youtu.be/pjDDBrunB1A. Note: The screencast goes over the same conceptual material we will have reviewed in class, and doesn't cover the use of SPSS.

    • Completely optional: if you have attended class and carefully read the textbook chapter material on bivariate regression but still feel like you would like to better understand the basics of this method, you can watch this 19-minute (10 at x2 speed) screencast presentation covering the basics and logic of bivariate regression: https://youtu.be/K8A6xGIXPR8. Note: The screencast goes over the same conceptual material we will have reviewed in class, and doesn't cover the use of SPSS. 
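
To close out this topic, here is a minimal syntax sketch of the bivariate model used in the plain-English example above (conserv10 and relig10 are hypothetical names for the two 10-point measures):

* Hypothetical bivariate model: regress 10-point conservatism on 10-point religiosity.
REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT conserv10
  /METHOD=ENTER relig10.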


Week 13

Topic 15 (Monday, 11/11; Wednesday 11/13)—Linear regression with multiple independent variables, including dummy and interval variables

Most of the topics below will be covered in Monday's class. We will finish up any remaining material on Wednesday and spend the rest of that class working on a BlackBoard assignment that asks you to use SPSS to use and interpret linear regression models.


  • SPSS #4 (posted to Blackboard) is due by 6pm on Wednesday.  This is the assignment that gives you more practice coding dummy variables and running/interpreting T-tests. The assignment also covers the main concepts behind Chi-squared tests and correlations as well as how to run these analyses and interpret their results in SPSS. 

  • Ahead of Monday's class, starting with page 199, read to the end of Chapter 9, "Regression," in your Forestiere textbook (17pp). 

  • Print out this document ahead of class: a handout of SPSS linear regression output with annotations. The handout covers the same topic as the screencasts: using SPSS to predict the level of support for torturing terrorism suspects, as measured by a 7-point Likert scale (1 = "never justified"; 7 = "always justified") that is being treated here as an interval variable. You should retain your copy of this document because you will find it handy when you complete the BlackBoard assignment on regression.

  • Ahead of Wednesday's class, read the first 13 pages of this conference paper by Dr. Setzler and Dr. Yanus (up to the note: TABLE 3 ABOUT HERE). This is a first-draft, working-paper version of a study written for a conference section that is primarily interested in gender and politics (the tone and setup of the paper were targeted at that particular audience). A revised, more-focused version of the study adopted a more neutral tone and was published by one of the American Political Science Association's journals; the article has been cited in over 100 other published studies.

The reason you are being asked to read this particular paper is that much of it focuses on what factors predict whether a person values gender equality, as measured by variables that can be treated as linear in nature. In other words, it is a study that uses linear regression (the published version uses only "logistic" regression). The conference paper version also includes a few dummy-coded dependent variables, so you will be asked to reread the same paper when we get to the section of our course devoted to binary logistic regression.

Below are summary notes for the major concepts covered in class and your textbook to understand how multivariate linear regression works and is interpreted:

General concepts for multivariate linear regression (i.e., regression with more than one independent variable). Here are the key ideas you need to remember from the textbook chapter on regression and class:

  • All of the key concepts listed above for bivariate regression (i.e., one independent variable) apply to multivariate regression, too:

    The R-square statistic, the constant, and the statistical significance statistics are interpreted in the same way. All regression assumes that each independent variable has a linear relationship with the dependent variable (see what this means by reviewing the notes under correlation and bivariate linear regression that explain the "assumption of linearity").

Also, the individual unstandardized regression coefficients are all interpreted in a similar way as they are with bivariate regression, except that the result for each independent variable calculates how much a one-unit increase in that independent variable increases/decreases the value of the dependent variable when the influence of all other variables in the model is held constant at their mean values. For example, if you are looking at the effect of each additional level of education on income and the only other independent variable in the model is the dummy variable Male, the regression results for the education variable would be calculated with an equation that controls for the effect of gender by adding .5 times the effect of being male (i.e., the regression coefficient for the gender variable). Why .5? That is the mean value for Male.


The interpretation of multivariate regression is a bit more complicated when you have interaction variables (e.g., MaleXRepublican) or multiple dummy variables derived from the same original variable (e.g., race or political party dummies). These types of variables are discussed below.

  • As with bivariate regression, multivariate regression results allow you to predict the value of the dependent variable under different scenarios. The way this is done is to assign whatever values define the scenario you care about to the relevant independent variables and to use the independent variables' mean values otherwise. More details on this below, but this is the key idea for how scenarios work.

  • With multivariate regression, you determine which variables are most important by comparing their "standardized regression coefficients." These, also called "betas," are located in the SPSS output column labeled as such. Recall that we can compare standard deviations of different types of measures in useful ways. So, comparing a 34 score on the ACT to a 1500 on the SAT is not easy, but comparing how many standard deviations each of those scores is from its test mean would tell you that the ACT score represents relatively higher performance. A beta tells us how much each one standard deviation increase in an independent variable increases/decreases the dependent variable's value, measured in standard deviations. In other words, the farther away a given independent variable's beta is from 0 (betas can be positive or negative), the more important that particular independent variable is in predicting the dependent variable's value.

  • With multivariate regression, there is the added assumption that each of the independent variables is at least somewhat independent of the others. If two or more independent variables are very highly correlated (a problem called multicollinearity), the statistical results in the model may not be able to isolate how changes in those variables influence the dependent variable.

To use and interpret regression output with dummy variables, here are the key concepts you need to remember from class and your textbook:

  • Critical: to interpret any dummy variable in a regression model's results, you have to know what the variable's reference category is.

  • Whenever you interpret a dummy variable, your interpretation should explicitly identify the reference category. For example, if a regression model only includes the dummy variable Latino, the interpretation of that variable's regression results should start with phrasing like, "Compared to the typical non-Latino, the estimated household income of a typical Latino is $1,300 less a year after controlling for the other variables in the regression model."

  • If there is just one dummy variable in a model and it was coded from an original variable that had just two response categories, the reference category is easy to identify. For example, we might have data where respondents have been coded one if they believe freedom is more important than equality and zero if they think equality is more important than freedom. If the regression model includes the dummy independent variable FreedomIsMostImportant, that variable should be interpreted with phrasing like this: "Compared to respondents who prioritize equality, those who think freedom is more important had an x-point higher score on the 10-point dependent variable measuring y."

  • If a regression model includes a dummy variable derived from a multi-category original variable, think carefully about how many of those groups have dummy variables in the model and thus what the proper reference category is. Consider a regression model looking at a person's characteristics to predict how much they think NATO is important to international security, measured on a 5-point scale. Let's say this model includes "Democrat" as its only partisan dummy variable. If so, the reference group when interpreting the Democrat coefficient is non-Democrats (i.e., since both independents and Republicans are not in the model, both groups are the reference category). In this example, the variable Democrat should be interpreted with phrasing like this: "Compared to Republican and independent respondents, Democrats had an x-point higher score on the 5-point indicator for seeing NATO as important to international security."

  • If you are running a regression model to test a hypothesis comparing two groups, one of those groups needs to be the reference category. Consider the example of looking at how partisanship shapes voting for female candidates (something Dr. Setzler has written a lot about, incidentally): if we want to compare Democrats' likelihood of voting for a female candidate to Republicans' likelihood, we would need to add a second dummy variable to the regression model for the people who are independents. Once the regression model included dummy variables for both independents and Democrats, those variables' regression coefficients could be compared to Republicans, who would be the omitted reference group. (A syntax sketch of this kind of two-dummy setup appears just below.)
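
Here is a minimal syntax sketch of the two-dummy setup just described, applied to the hypothetical NATO example above (nato5, democrat, and independent are hypothetical variable names; Republicans are the omitted reference group):

* Hypothetical model: 5-point NATO-importance measure predicted by partisanship.
* Republicans are the omitted reference category.
REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT nato5
  /METHOD=ENTER democrat independent.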

To use and interpret interactive-term variables, here are the key concepts you need to remember from class:

  • We create and use an interaction term in a regression model when we think that the influence of one independent variable on the dependent variable depends on the value of a second independent variable.

    For example, we might want to analyze how being a political science major and the number of hours studied before an exam in general education political science classes influence the typical student's test score.

    Let's say we collected a year's worth of survey data on students' study habits and their test scores in these types of classes. If we were to calculate a regression model with just these two variables, we presumably would find that both factors are significant predictors of higher test grades but that there are lots of other factors, too (i.e., our model probably wouldn't have a high r-square statistic).

    If we saw that both being a major and studying improved test scores, we might wonder if the effect of each hour of additional study is different for PSC and non-PSC majors. Maybe, studying pays off more for non-majors because they have less of a background in the subject area and more to learn. On the other hand, maybe studying pays off more for PSC majors because they have more interest in the subject and are better able to retain information about it.

    To test either of these hypotheses, we need to create and add an interaction term to our regression model.

  • To create an interactive term variable, you just create a new variable that multiplies together each respondent's value for the two relevant variables. In the example above, we would use SPSS to create a new variable with coding that looks something like:

COMPUTE NewVariable = first_IV * second_IV.

So, in this case:

COMPUTE PSCxHrsOfStudyBeforeExam = PSC * HrsOfStudyBeforeExam.

The"interactive term" would then be added to the regression model along with the two variables from which it was formed (both of the original variables and the interactive term must stay in the regression model). And then we would rerun the regression model and look at our results.

  • If the unstandardized coefficient for the interactive term is positive and significant, we know that the combination of the two variables has more of a positive effect on the dependent variable than just the additive effect of each variable. In the example, this would mean that each hour of study is paying off more for PSC majors than for non-majors.

  • If an interactive term's coefficient is negative and significant, it means that the combination of the two variables has less than the full effect we would expect if we were to add their full effect together. In the example, this would mean that the effect of each additional hour of study is less for PSC majors than non-majors.

  • If an interactive term's coefficient is not statistically significant, the value of the second independent variable does not influence the relationship between the first variable and the dependent variable. In the example, this would mean that there is no difference in the grade improvement payoff of additional studying for PSC majors and non-majors.
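
Putting these steps together, here is a sketch of the full interaction model from the study-hours example (TestScore is a hypothetical name for the dependent variable; the other names come from the example above):

* Create the interactive term, then enter it alongside both original variables.
COMPUTE PSCxHrsOfStudyBeforeExam = PSC * HrsOfStudyBeforeExam.
EXECUTE.
REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT TestScore
  /METHOD=ENTER PSC HrsOfStudyBeforeExam PSCxHrsOfStudyBeforeExam.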


In Wednesday's class, we will practice calculating scenarios with regression output.
What is a scenario, and what key concepts do you need to remember from class? The standard way to talk about the effects of different variables in regression models is to use a regression results table to explain how a one-unit increase in a given independent variable changes the value of the dependent variable when the effect of all of the other variables is set at each variable's mean value. While that makes for nice, succinct tables, discussing situations that compare different types of hypothetical individuals who are otherwise similar can help to bring regression results alive and make them more useful. For example, if we were analyzing how much different kinds of people support the legalization of marijuana, we could calculate a regression model looking at the influence of gender, religious denomination (i.e., dummy variables for several of them), and age, with controls for a person's income and education. Using the model's results, we could compare the level of support for a 30-year-old secular male versus a 60-year-old evangelical Protestant female with comparable incomes and educational backgrounds.


  • How do you calculate a regression scenario? To create a regression results equation, first identify the value of the unstandardized coefficient for the "Constant" (use the value listed in the B column of the SPSS results output box--the last one--that lists each variable). Then, to that constant value add the effect of each independent variable (i.e., its unstandardized regression coefficient) multiplied by a value you specify. If you want to control for any variables, their specified values are their mean for the full sample.

  • Here's an example of a complex scenario (complex only because the number of variables is pretty high). The example uses the descriptive statistics and linear regression models reported in the Setzler and Yanus paper you were asked to read ahead of class. If you hadn't just read that study, I would have used a simpler example.

Let's use the article's regression tables and descriptives to create a scenario comparing two people's scores on the 7-point indifference-to-gender-equality measure (as measured by how unimportant a person thinks it is to fight for gender equality). What is the predicted score for an older Republican male without a college degree who is otherwise similar to other Americans (at least on the variables in the model)? What is the predicted score for a female non-Republican (i.e., independent or Democrat) who is otherwise identical to her male Republican peer?

To calculate the expected indifference-to-gender-equality score for the Republican male, we would use the following formula:
The model's unstandardized constant (1.046)
+ 1 times the unstandardized regression coefficient for Republican (i.e., the scenario)
+ 1 times the Male coefficient
+ 1 times the No college degree coefficient
+ 1 times the aged 45 and older coefficient
+ 0 times the age 30-44 year coefficient
And now we'll add in the controls, using their mean values in the scenario:
+ .792 times the White coefficient (i.e., 79.2% of this sample was white)
+ .502 times the Religiosity coefficient
+ .441 times the Blue collar coefficient
+ .399 times the Rural coefficient
+ .500 times the Authoritarianism coefficient
+ .294 times the Racial animus coefficient
= The individual's expected value on the 7-point indifference-to-gender-equality measure.

Formula in hand, you can plug the whole equation into the super-cool online computer at https://www.wolframalpha.com/:

1.046 + (1 * .912) + (.792 * .119) + (1 * .444) + (1 * -.097) + (1 * -.246) + (0 * -.078) + (.502 * -.082)+ (.441 * .190) + (.399 * .013) + (.500 * -.552) + (.294 * 2.69)

Wolfram Alpha tells us that, all other things being typical, an older Republican male with no college degree has an estimated gender-equality-indifference score of 2.71.

Using the same equation in WolframAlpha and changing just the scenario values for Republican and Male variables, we see that the expected value for the female non-Republican is 1.36 on the 7-point scale. That's just about half the value we found for Republican males.

  • Scenarios are helpful in understanding and visualizing the effects of interaction terms. In the example above, we might theorize that not having a college degree makes men particularly inclined toward sexist views. This is another way of saying that the effect of educational attainment on sexism is different by gender. To test this hypothesis, we would create an interactive term (here, the dummy MaleXNoCollege) and add it to our regression model along with the variables Male and NoCollege. In calculating our scenario, our equation would include: (1 x the coefficient for Male) + (1 x the coefficient for NoCollege) + (1 x the coefficient for MaleXNoCollege). The scenario value for MaleXNoCollege is equal to the scenario values for Male and NoCollege multiplied together.

Finally, I have compiled several screencasts on linear regression that cover the same topics we go over in class. If you feel like you need additional guidance, check out the optional resources below. Watching any or all of the screencasts should not be necessary if you attend class and put your best effort into the hands-on practice exercises:

  • Optional: After practicing in class, if you still feel like you need guidance on calculating and interpreting linear regression models in SPSS, review this 11-minute screencast: https://youtu.be/xzl8OxPsM8s.
  • Also optional: This 13-minute screencast follows up on the last one with a focus on how you use SPSS to analyze dummy variables and how output/tables with these variables are interpreted: https://youtu.be/I2BEi_CkzK0.

  • Also optional: https://youtu.be/3m66P8PaD3U. In about ten minutes, this presentation explains how we can use linear regression output to predict the value of the dependent variable with different scenarios involving the independent variables (something I just covered in the example above). The standard output in regression will tell us how a one-unit or one standard deviation increase in an independent variable will change the dependent variable when the effect of all of the other variables is set at each variable's mean value. While that makes for nice charts, scenarios can help to bring the data alive. For example, what is the level of support for torture on a seven-point scale for a 60-year-old Republican male who attends church a lot versus a 30-year-old secular female Democrat? Here is a handout with the output and calculations covered in the video.

  • One important thing not covered in the last screencast is how control variables work in regression scenarios. As noted above, including controls is straightforward. Let's say that we wanted to calculate the same regression model and scenarios as the screencast, but control for the effect of a person's education. If education were a five-point measure, we would calculate its mean for the full sample, and our scenario equation would include one more added component: the mean of Edu5 multiplied by this variable's unstandardized regression coefficient.


Friday, November 15. Class time will be used to finish up our overview of linear regression and, if time is available, for you to work on SPSS #5 in Blackboard.


Saturday, November 16, by 5pm: SPSS #5 (posted in BlackBoard) is due. This assignment focuses on linear regression, which is the last topic that will be covered in your practice and final SPSS tests.

Practice SPSS exam (Monday, November 18). This is the first of two mandatory, but ungraded, SPSS exams to prepare you for the graded exam that you will take later in the semester.

For the practice and final versions of the tests, you could be asked to do any of the following exercises (i.e., you may not be asked to do all of these things on any one test, but you will not be asked to do anything that is not listed below):

  • Create a new variable. You could be asked to create a new dummy variable that combines information from either one or two original variables. You should be able to label the new variable and its response categories. Important: You may bring a notecard (or 3x5" piece of paper) with you for the tests; the only thing that can be written on that card is sample syntax reminding you how to create and label a new variable (a sketch of this kind of syntax appears at the end of this list).

  • Split a variable into its subgroups and compare the frequency at which different subgroups for that variable have a particular opinion.

  • Make a bar chart in SPSS (not Excel, because of concerns about how long that might take). Your bar chart must show the percentage of one of a single variable's subgroups that has a particular opinion (e.g., what respondents whose households make $10K or less think about an issue).

  • Compare two or more variables' means and standard deviations, explaining what the standard deviations tell us about the distribution of each variable. For example, comparing the typical opinion and the range of opinions expressed by Republicans and Democrats about whether "God has given the US a special role in human history."

  • Statistically test whether the typical individuals in two subgroups that are coded on the same variable are statistically different from one another with respect to a specific attitude. For example, is the typical African American more likely than the typical Latino to think that "college is a gamble" (versus being a smart investment), and is any difference statistically significant? Hint: you need an independent-samples t-test to answer this type of question.

  • Statistically test whether the share of a specific group in the sample is statistically different than what it should be given a known parameter for the US as a whole. For example, African Americans make up 12.1% of the US population. What percentage of this sample is African American? Are African Americans underrepresented in the sample (when analyzed without the survey's weights turned on)? Is any difference between the share of African Americans in the sample and what the percentage should be statistically significant? Hint: To test this, you need to split the sample by race, use a one-sample t-test, use .121 (12.1%) as the test value, and look at the output for African Americans only.

  • Statistically test whether two subgroups that are coded on two different variables are statistically different from one another or from Americans in general with respect to a specific attitude. Hint: To answer this question, you again would need to split your dataset and use a one-sample test. The test value will be the mean for another group or for the sample as a whole, which would need to be calculated separately unless your instructor gives you that value.

  • Analyze and interpret chi-squared statistics for two variables. Recall that for this type of analysis, you are working with a pair of variables that are dummy, categorical, or ordinal measures.

  • Analyze and interpret the correlation statistics for a handful of variables. You could be asked to determine whether they could be combined or used together as independent variables in a regression model (i.e., would there likely be collinearity problems?). You will need to interpret the relationship between pairs of variables and their statistical significance.

  • Demonstrate your understanding of the limitations of correlation. You may be asked to speculate about whether we can determine if there is likely a causal relationship between pairs of variables, with one clearly causing the other (e.g., being more conservative and being Republican). You may be asked to identify whether any of several other variables may suggest a spurious correlation between two variables (for example, being male and disapproving of Joe Biden are modestly, but still statistically, associated). You might be asked why we see a very low or unexpected association between two variables that probably have a non-linear relationship (take a look at age and household income in the dataset).

  • To run and interpret a linear regression model with three independent variables. You will be asked to identify and interpret the adjusted R-square statistic. You also will be asked to identify the equations for three scenarios that will involve different levels of one of the independent variables (including its mean, which you will need to calculate with SPSS). The other two independent variables will be set at their mean level of influence in the equations.

  • To run and interpret the same linear regression model examined before, with up to 5 more independent variables added, including an interactive term and multiple dummy variables coded from the same original variable. You will discuss the model's r-square statistic, how the r-square statistics have changed as you have added additional variables to the model, and what that means.

  • For the same regression model, you will also be asked about the statistical significance of different independent variables, and the meaning of some unstandardized and standardized (beta) regression coefficients. You will be expected to identify the correct reference groups when interpreting results for one or more of the dummy variables. You will be asked to interpret the results of an interactive term.

  • For the same regression model, you will be asked what equations should be used to estimate the predicted value of a dependent variable for two or three individuals who are different in specific ways, controlling for their other differences (i.e., with some variables set at their mean level of influence). You will need to make sure you know how to create scenarios that involve dummy variables' reference categories and non-mean values for an interactive term. For at least the practice exams, you may be asked to use Google to solve one or more scenario models.
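
For the notecard mentioned in the first bullet above, syntax along these lines is the kind of thing you might write down. This is only a sketch; partyid and republican are hypothetical variable names:

* Create and label a dummy from a multi-category variable (here, 2 = Republican).
RECODE partyid (2=1) (MISSING=SYSMIS) (ELSE=0) INTO republican.
VARIABLE LABELS republican 'Republican dummy (1 = Republican)'.
VALUE LABELS republican 0 'Not Republican' 1 'Republican'.
EXECUTE.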


Topic 16 (Wednesday, 11/20)—Logistic regression and its interpretation

  • In class, you will be introduced to logistic regression, which is the type of regression used when the dependent variable is binary (i.e., when working with a dummy variable as the dependent variable).

  • Ahead of class, take another close look at this short conference paper by Dr. Setzler and Dr. Yanus. The reason you are being asked to read this particular study again is that it has several dummy-coded dependent variables, so we will be able to use the same example study when we look at logistic regression in the next block of course materials. If you want to take a look at another example, logistic regression is the type of regression used in the article you previously printed out on Brazil's 2018 election.

  • Ahead of class, please take the time to carefully read and print out: this handout of SPSS logistic regression output with annotations. The document is very similar to the one I posted above for linear regression, but this time the dependent variable--support for torturing terrorism suspects to obtain information--has been recoded into a dummy variable. Respondents who said torture was sometimes, often, or always justified were coded "1"; individuals who said torture is never justified were coded "0."  

  • In class, we will practice interpreting logistic regression tables in the Setzler and Yanus article. We also will practice interpreting SPSS logistic regression output, including pseudo R-square statistics, statistical significance p-values, odds ratios (Exp(B)), and Wald statistics.

  • Some key concepts to remember from class:

    • If your dependent variable is dichotomous, you need to use logistic regression rather than linear (OLS) regression. Dummy variables are used as independent variables in OLS/linear regression without any type of mathematical conversion, but when a dummy variable is the dependent variable, we need a different type of regression because each one-unit increase in a given independent variable typically does not have the same effect on a dummy dependent variable. For example, assume that the number of injuries a football team has will decrease its number of points in the typical game. OLS regression will work well because it will tell us how many points are lost with each additional injury. However, if we wanted to know how injuries will impact whether the team is likely to win, we'd see that the first injury or two probably have a modest effect; after a certain threshold, each additional injury will cause the team's odds of winning to go down substantially. And, after a certain point, we'd see that additional injuries won't harm the team's prospects any more because they already are going to lose. Logistic regression models address the fact that the effect of an increase in an independent variable on an outcome varies in predictable ways that linear OLS regression cannot capture.

    • Use binary (aka binomial) logistic regression only with dichotomous dependent variables that have been coded zero or one. This type of regression is often the best option when you are dealing with ordinal or categorical variables, but you need to convert those types of variables into dummy dependent variables for this type of regression. There are other types of logistic regression designed for ordinal and multi-category dependent variables, but they are less frequently used and beyond the scope of this course.

    • Interpreting logistic regression output is not very intuitive, so why do we need to learn this? You see logistic regression models' estimates all of the time even if you haven't realized it. Many things we want to know about--what factors are most important in determining who will win elections, whether countries go to war under certain circumstances, whether someone has been asked for a bribe--are yes/no variables that require logistic regression. And, as we have learned previously, it is very common to convert ordinal variables--especially Likert scales--into dummy dependent variables. When researchers do this, they use logistic regression.

    • With SPSS's logistic regression output, the "pseudo r-square statistic" (use the one labeled Nagelkerke), is interpreted just like the r-square statistic in OLS regression. So, if a regression model has a pseudo R-square of .0897, it means that the variables in that model collectively explain about 9 percent of what causes the outcome predicted by the model.

    • To identify which variables are most important in explaining the outcome, compare the Wald statistics. The variables with the largest Wald values explain more of the variation in the dependent variable. You may recall that there is a similar statistic for linear (OLS regression): standardized regression coefficients, also called betas. 

    • Remember, we only interpret the specifics of any regression coefficient--including odds ratios--if that variable is significant (i.e., the p-value for the odds ratio is equal to or less than .05). If the p-value is greater than .05, you interpret the variable by saying something like, "This independent variable is not a statistically significant predictor; we cannot be confident that repeated sampling would show that this variable is consistently associated with an increased (or, if the odds ratio is less than one, decreased) likelihood of the outcome."

    • In SPSS output, the odds ratios are listed in the right-most column of the last block of output, labeled Exp(B).

    • Independent variables work the same way in logistic and linear regression models. We interpret dummy variables and interactive terms the same way for both types of regression; it's the phrasing and explanation of how these variables influence the dependent variable that is different. Specifically:

    • If an independent variable's odds ratio is less than one (and statistically significant), there is a negative relationship between the variables: every one-unit increase in the independent variable corresponds to a decreased "likelihood" of the outcome occurring. One way to interpret an odds ratio is to think about starting out with one dollar. If you used to have a dollar and now have 70 cents, you could say that you have 30% less than you used to have, or you could say that you now have only 70 percent of what you once had. If we were predicting who voted in the last election, and an interval variable's odds ratio was .250, we would say that every one-unit increase in that independent variable "reduced the likelihood" of voting by 75%. For a dummy variable, the interpretation is similar, only it notes the reference group. For example, "Compared to eligible voters who are older than 25, individuals who are under 25 were only 25% as likely to vote." (The optional sketch after this list walks through the arithmetic for each of these cases.)

    • If an odds ratio is between 1 and 2, there is a positive relationship, and we can still convert that odds ratio into an easy-to-understand percentage. If I used to have a dollar and now I have $1.67, we could say that I now have about two-thirds (67%) more than what I had before. In a voting model, if a variable's odds ratio is 1.545, we would say that every one-unit increase in that independent variable "increased the likelihood of voting" by over 50%; alternatively, we could say that a one-unit increase made a person about one and a half times as likely to vote. If a variable's odds ratio was 1.256, we would say that every one-unit increase in that independent variable "increased the likelihood" of voting by just over 25%. Again, a dummy variable would need to note the reference group. So, if a model had both a Republican and a Democrat dummy variable, we could say: "Compared to independents (the omitted reference group), Democrats were over 25% more likely to vote."

    • Finally, if we have an odds ratio of 2 or more, we can say that every one-unit increase in the independent variable makes the outcome x times as likely. If a predictor in a voting model had an odds ratio of 2.434, we would say that each one-unit increase in this variable made a person nearly two and a half times as likely to vote. And if the independent variable is a dummy variable, we would say something like, "Compared to [everyone belonging to the group omitted from the regression model], [people in the dummy-variable group] were 2.4 times as likely to have voted."
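If it helps to see the arithmetic behind those three phrasings, here is a short, optional Python sketch (Python is not used in this course; the sketch simply automates the dollar-style conversions described above, using the hypothetical odds ratios from these bullets):

    # Convert hypothetical odds ratios into the phrasings used above.
    def describe_odds_ratio(odds_ratio):
        if odds_ratio < 1:
            # e.g., .250 --> "reduces the likelihood by 75%"
            return f"reduces the likelihood of the outcome by {(1 - odds_ratio) * 100:.0f}%"
        elif odds_ratio < 2:
            # e.g., 1.545 --> "increases the likelihood by about 55%"
            return f"increases the likelihood of the outcome by about {(odds_ratio - 1) * 100:.0f}%"
        else:
            # e.g., 2.434 --> "makes the outcome about 2.4 times as likely"
            return f"makes the outcome about {odds_ratio:.1f} times as likely"

    for example_ratio in (0.250, 1.545, 2.434):
        print(example_ratio, "->", describe_odds_ratio(example_ratio))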

  • After class, if you feel like you need more resources to understand logistic regression, you have the option of reading the first half of an undergraduate research methods textbook chapter on logistic regression. It will be a faster and somewhat easier read if you wait to complete it until after we have covered logistic regression in class. Do not get hung up on the more complex explanations. While this excerpt is from one of the most-assigned political science methods textbooks in the country (this class used it for years), the detailed mathematics in the chapter are no more essential to your competent use of logistic regression than understanding the inner workings of your car engine is necessary for you to be an excellent driver. I am assigning it because its explanation of why we need to use logistic rather than linear regression with a dichotomous variable is helpful. Carefully read through the extended example that models the relationship between the strength of a person's partisanship and vote choice. The other useful part of the chapter is its explanation of why odds ratios are the statistic we use to interpret the influence of each variable in a logistic regression model.

  • After class, if you were engaged in class, read the optional textbook chapter on logistic regression, and still feel like you would benefit from more information on the basics of this type of regression, you can optionally watch this 12 min. screencast (https://youtu.be/uUf3h8ifZxE). The screencast explains in detail what bivariate logistic regression is, how it works, and why it is often used with ordinal dependent variables after they have been recoded into 0/1 dummy variables. This screencast covers concepts rather than SPSS; there is one below that looks at how we use SPSS to run and interpret logistic regression models. Note that watching either of the screencasts on logistic regression is optional and should not be necessary if you attend class and put your best effort into the hands-on practice exercises.

Topic 16 (Friday, 11/22) Doing and interpreting bivariate logistic regression with SPSS and scenarios. This is the last new topic that will be covered this semester. While logistic regression will not be part of your end-of-term SPSS test, your final exam will ask you to interpret a logistic regression output table from SPSS, and you will complete a Blackboard exercise on this topic, starting on Wednesday.

  • By the end of Wednesday's class: If you have OARS accommodations that you intend to use for the final exam, you need to let your instructor know now! OARS has a deadline in place. If you do not request accommodations in advance, you will not be able to use them during the final period. If you intend to write your research proposal during the final period, you likely will require the full three hours to do A-level work on the test and proposal; students who submit the proposal ahead of time will take only a 70-minute test during the final exam period.  

  • Before Wednesday's class, carefully review the schedule's summary notes for Friday, 11/17, when we first discussed the logic behind logistic regression and practiced interpreting odds ratios in the logistic regression tables of the Setzler and Yanus article (which you were asked to read ahead of the classes on linear regression and again ahead of your introduction to logistic regression). If you have any concerns about your grasp of the basics of logistic regression, you should review the optional materials linked above, including a textbook chapter and two screencasts that cover the same ideas we discussed in class.

  • In class, we will continue to practice interpreting SPSS bivariate logistic regression output. If you have not done so already, you should take the time to carefully read and print out: this handout of SPSS logistic regression output with annotations. The document is very similar to the one I posted above for linear regression, but this time the dependent variable--support for torturing terrorism suspects to obtain information--has been recoded into a dummy variable. Respondents who said torture was sometimes, often, or always justified were coded "1"; individuals who said torture is never justified were coded "0."

  • In class, we will practice calculating logistic regression scenarios. As with linear regression, interpreting logistic regression analysis in an interesting way is best done with "predicted probability" scenarios. While you can quickly calculate linear regression scenarios with a calculator or Google, you will be asked to download and use an instructor-created Excel worksheet to estimate the predicted probabilities for logistic regression models. If you do not have Excel on your computer, you can download and use Office365 with your HPU credentials.

  • As I have emphasized in our classwork and discussions on logistic regression, you are NOT expected to be able to explain in any detail how logistic regression works, any of the statistical calculations SPSS uses to estimate "odds ratios," or the specific mathematical equations that are used to convert an odds ratio into a specific predicted probability. For this reason, you were not required to closely review the textbook chapter that was assigned as an optional reading when we first discussed logistic regression. Instead of assigning a detailed reading from a methods textbook, here is a summary of key ideas and concepts that will help you to better understand what you are looking at when you interpret an odds ratio or calculate a "predicted probability":

    • For linear regression, creating scenarios is a straightforward process because it is easy to write out a regression equation (see above for an explanation and examples) to calculate specific scenarios from raw SPSS output. Unfortunately, it is impossible to create scenarios with raw logistic regression output without doing additional mathematical transformations. The key issue with logistic regression is that every one-unit increase in an independent variable is not expected to have the same effect on the probability of the dependent variable. Overcoming this issue requires calculations that involve mathematically transformed odds that are then converted into probabilities. Below, an example looking at the impact of each additional injury on a football team's probability of winning will examine how odds can be converted into probabilities.

    • With logistic regression, scenarios involve calculating the predicted probability of doing or believing something at a given value of the independent variable, with the effect of all other variables held constant at their mean.

    • What is a predicted probability, and how is it different from an odds ratio, the latter of which is listed in SPSS's default output? The answer to this question is explained in the scanned textbook chapter that was optionally assigned when you were introduced to logistic regression. Below is a summary of its key ideas (with simplified examples):

Odds and probabilities are different ways of conveying the same information, and you can mathematically transform any odds statistic into a probability. Specifically, the odds of an outcome refer to the number of times the outcome is expected to occur compared to the number of times the outcome is expected to not occur. Probability is the number of times an outcome is expected to occur compared to the maximum number of times the outcome could possibly occur. For example, if a team has a 50% chance of winning a game, we expect it to win 1/2 of its games. Thus, the odds of the team winning are 1:1, indicating that for every win, the team should have one loss. And, if a team has a 75% chance of winning, we expect it to win 3/4 of its games, and its odds of winning are 3:1.
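To see that odds and probabilities really are interchangeable, here is a brief, optional Python sketch of the two conversions, using the win-probability examples from the paragraph above:

    # Converting between probabilities and odds (optional illustration).
    def probability_to_odds(p):
        # Odds: expected occurrences per expected non-occurrence.
        return p / (1 - p)

    def odds_to_probability(odds):
        # Probability: occurrences relative to all opportunities to occur.
        return odds / (1 + odds)

    print(probability_to_odds(0.50))   # 1.0  --> odds of 1:1
    print(probability_to_odds(0.75))   # 3.0  --> odds of 3:1
    print(odds_to_probability(3.0))    # 0.75 --> back to a 75% chance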

Logistic regression output is typically reported in tables that list "regression coefficients" or "odds ratios," the latter of which are transformed (specifically, exponentiated) versions of the raw coefficients. SPSS reports both values. Without transformation, raw logistic coefficients are not directly interpretable, and articles that report statistical results in this format typically discuss odds ratios or predicted probabilities in the body of the article.

An odds ratio tells us how much a one-unit increase in a given independent variable decreases or increases the odds of an outcome, which can then be converted into a predicted probability. An odds ratio of .600 says that each one-unit increase in the independent variable results in the odds of the outcome being .600 times what they were before the increase (i.e., 40% lower). An odds ratio of 1.20 says that each one-unit increase results in the odds of the outcome being 1.20 times what they were before the increase (i.e., 20% higher).
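In case you are curious, the conversion from a raw coefficient (SPSS's B column) to an odds ratio (the Exp(B) column) is simple exponentiation. An optional Python illustration, with made-up coefficient values chosen to match the two odds ratios just mentioned:

    import math

    # Hypothetical raw logistic coefficients (the "B" column in SPSS).
    b_negative = -0.511   # exponentiates to roughly .600
    b_positive = 0.182    # exponentiates to roughly 1.20

    print(math.exp(b_negative))  # ~0.600: each one-unit increase cuts the odds by 40%
    print(math.exp(b_positive))  # ~1.200: each one-unit increase raises the odds by 20%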

So, let's say you are planning to bet on a football team that has only a 20% chance of losing when it has no injuries. Another way of putting this is to say that the team has a 4-in-5 (80%) probability of winning. For every four games we expect this team to win, we expect it to lose once, which means its odds of winning are 4 to 1. When you bet on this team, you might place a four-dollar bet. If the team loses, you lose your 4 dollars; if it wins, you get a dollar, plus your original four dollars.

What happens if your team experiences injuries during the season? How will each additional injury change the team's odds and probability of winning?

We could run a binary logistic regression model, and its output might say that each additional injury a football team suffers will increase the likelihood (specifically, the odds) of losing by 1.5 times. Using that odds ratio of 1.5, here is how the team's odds of winning and probability of losing will change as each additional injury multiplies the odds of losing by 1.5:
0 injuries, odds = 4/1 (probability of losing = 20%)
1 injury, odds = 4/1.5 (pr. of losing = 27.3%)
2 injuries, odds = 4/2.25 (pr. of losing = 36%)
3 injuries, odds = 4/3.4 (pr. of losing = 45.8%)
4 injuries, odds = 4/5.1 (pr. of losing = 55.9%)
5 injuries, odds = 4/7.6 (pr. of losing = 65.5%)
6 injuries, odds = 4/11.4 (pr. of losing = 74%)
7 injuries, odds = 4/17.1 (pr. of losing = 81%)
8 injuries, odds = 4/25.6 (pr. of losing = 86.5%)
9 injuries, odds = 4/38.4 (pr. of losing = 90.6%)

The reason we go through the hassle of mathematically converting odds into predicted probabilities in logistic regression scenarios is that most people find probabilities a lot easier to understand. Per the table above, the odds of a football team with two injuries losing its next game are .56 to 1, while the odds for a team with six injuries are 2.85 to 1. In other words, a team with six injuries should, on average, lose 2.85 games for every game it wins.

For most people, it makes a lot more sense to convey the same information in probabilities, which can be expressed as percentages: A team with one injury has a 27.3% chance of losing, while a team with six injuries can be expected to lose almost 75% of the time.
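If you want to check the table above for yourself, this short, optional Python sketch reproduces it. The only inputs are the starting odds of losing (1 to 4, i.e., a 20% chance) and the hypothetical odds ratio of 1.5 per injury:

    # Reproduce the injuries table above (optional illustration).
    odds_of_losing = 1 / 4   # 0 injuries: a 20% chance of losing
    odds_ratio = 1.5         # each injury multiplies the odds of losing by 1.5

    for injuries in range(10):
        pr_losing = odds_of_losing / (1 + odds_of_losing)
        label = "injury" if injuries == 1 else "injuries"
        print(f"{injuries} {label}: pr. of losing = {pr_losing:.1%}")
        odds_of_losing *= odds_ratio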

    • And in order to calculate scenarios like those in the football example above, we first have to transform logistic regression's output for the relevant variables. While you can do these transformations in SPSS, it is very complicated to do so (other statistical programs make it much easier). Fortunately, you can do all of the mathematical work in an Excel worksheet if you know what formulas to use, and a spreadsheet is a good place to manipulate different variable values to create scenarios anyway. Your instructor has put together an Excel spreadsheet that calculates predicted probabilities for you, depending on what scenarios you choose.
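For the record, the arithmetic the spreadsheet performs is the standard logistic transformation: it sums the constant plus each coefficient multiplied by its scenario value, and then converts that sum (the log-odds) into a probability. Here is an optional Python sketch with made-up numbers; in practice, the constant and coefficients would come from the B column of your SPSS output:

    import math

    def predicted_probability(constant, coefficients, scenario_values):
        # Sum the constant plus each coefficient times its scenario value...
        log_odds = constant + sum(b * x for b, x in zip(coefficients, scenario_values))
        # ...then convert the log-odds into a probability.
        return 1 / (1 + math.exp(-log_odds))

    # Made-up example: a constant of -1.2 and two variables.
    print(predicted_probability(-1.2, [0.8, -0.3], [1, 2.5]))  # about .24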

What to remember from the screencast:

(1) First, point-click-and-then-paste a logistic regression command that includes all of your independent and any control variables (plus your dependent variable). Don't run the command yet.

(2) Then, point-click-and-then-paste a descriptives command into syntax, using any random variable. When the command is in your syntax, replace that random variable with all of the independent and any control variables from your regression model syntax (omit the dependent variable). The point here is to make sure that you can run a regression model and create a set of descriptive statistics that will list variable results in the exact same order.

Two things to note that aren't in the screencast. First, you will make things easier if you delete STDDEV from the descriptives syntax before you run it, because you will need output only for each independent and control variable's mean, minimum, and maximum values. Second, I ran the descriptives command and then the logistic regression in the screencast, but it will be easier to find the results you need if you run the logistic regression first, because we are interested only in the very last block of its output.

(3) Select and run both of the commands, and then open up the Excel spreadsheet that your instructor has created to assist you in creating logistic regression scenarios (for the record, you can compute scenarios in SPSS or even using Google or Wolfram Alpha, but it will be much easier to use the spreadsheet I have created for this purpose). That spreadsheet is in the PPT folder and in one of the subfolders in the workshop materials.

(4) Open the last part of the SPSS logistic regression output, specifically the block that lists the coefficients, by double-clicking on it. Copy and paste the unstandardized logistic regression coefficient (the one in the B column) for the Constant into the appropriate worksheet cell. Then, copy all of the other variable names together with their coefficients (the ones right next to the variable labels in the "B" column), and paste them into the worksheet in the columns that are labeled for this output.

(5) Now, it is time to work on the scenarios portion of the Excel worksheet. Go back to the SPSS output and double-click on the descriptives output. Copy the means for all of the variables, and paste them into the worksheet column labeled "Scenario." We are doing this because we want to be able to create scenarios that involve some variables being set to certain values, while the remaining variables are set to their mean values.

(6) Now, have some fun creating scenarios. There are two ways that scenarios are frequently used in research, and both of them appear in the paper you read earlier this term on what kinds of Brazilians voted for Bolsonaro in 2018, which is why I assigned this article. First, there are a couple of paragraphs in the article that compare hypothetical individuals who are similar in all ways except for a couple of characteristics in order to show which variables had the largest effects (partisanship and ideology) and which had a small effect (sharing Bolsonaro's illiberal views). Second, there are several bar charts showing how the probability that different kinds of Brazilians voted for Bolsonaro changed if an independent variable was at its lowest versus highest value. Those bar charts were created by using the minimum and maximum values as scenarios for each variable while all other variables were held constant at their average values.

(7) Important and not mentioned in the screencast: If you want to create a scenario for a characteristic that is represented by multiple dummy variables and a reference category, enter one for the dummy variable you are looking at and zero for the other dummy variables in that set. If the group that you want to look at is the reference category (i.e., it wasn't included in the regression model), then enter zeroes for all of the other groups' dummy variables. For the example in the screencast, to determine the probability that a typical independent was going to vote for Hillary Clinton in 2016, the scenario needed zeroes entered for the variables Democrat and Republican while leaving the mean values for all other variables. Because the only partisan dummy variables in the model were Democrat and Republican, using zero for each of them returned the predicted probability of voting for the typical independent, with the scenario values for all other variables left at their means. A short sketch of this kind of scenario appears below.
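To make step (7) concrete, here is an optional Python sketch of a scenario like the screencast's "typical independent" example. Every number below (the constant, the coefficients, and the mean age) is made up for illustration; the real values would come from your own SPSS regression output and descriptives:

    import math

    # Made-up coefficients standing in for SPSS's "B" column.
    constant = -0.90
    coefficients = {"democrat": 2.10, "republican": -1.40, "age": 0.02}

    # Typical independent: both partisan dummies set to zero, age at its (made-up) mean.
    scenario = {"democrat": 0, "republican": 0, "age": 47.5}

    log_odds = constant + sum(coefficients[v] * scenario[v] for v in coefficients)
    probability = 1 / (1 + math.exp(-log_odds))
    print(f"Predicted probability for a typical independent: {probability:.1%}")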


End of the term schedule. This year, Thanksgiving falls very late in the semester. Because of this, the week of the holiday break and the days afterward will be spent mostly reviewing or completing assessments intended to help you prepare for exams.

  • Monday, November 25. Second practice (ungraded), timed SPSS test on BB. Per above, logistic regression will not be part of the end-of-term SPSS practice or final test. The concepts that will be covered on this test will be drawn from the same list I used (see above) for your first practice test. Remember, you may bring a page of notes with you for the SPSS tests. This practice test and the final version will not have detailed reminders of how to use SPSS for different types of problems.

  • Tuesday, November 26, by 5 pm. SPSS assignment on logistic regression due.  

  • Wednesday, November 27, No classes.

  • Thursday, November 28. Thanksgiving.

  • Monday, December 2. SPSS test (10% of the course grade). This test will be very similar in format to the practice test you took before the holiday break. The concepts covered will be drawn from the same list I used (see above) for your first practice test.

  • Wednesday, December 4. Course wrap-up.

  • Your final exam for this class will be held during the exam period the University has scheduled for our section. Consult the University's calendar to verify your exact test time. All students will be required to take the third unit exam during the final exam period.