RESEARCH METHODS IN POLITICAL SCIENCE





Please note: At your instructor's discretion, there may be minor alterations to the reading assignments listed below. One of the major advantages of providing you with an online readings archive is that timely articles can be added or substituted when appropriate. Opening documents downloaded from this website will require that your computer have Acrobat Reader. You will also need the class-specific password to open individual files.


UNIT 3 ASSIGNMENT SCHEDULE


Links to helpful resources:

 

Week 12

For Monday (11/3), we will continue with Topic 12, using spreadsheet software to visually explore bivariate relationships between a variable and one or more others.

  • You were previously asked to print out and read this article on the election of Brazilian president Jair Bolsonaro. Ahead of class, take a look at Figure 2. Notice how a simple bar chart can help us to see whether there is a relationship between different independent variables and a dependent variable (e.g., ideology and voting for Bolsonaro). 
  • After class, if you need additional guidance on creating Excel charts to visually summarize SPSS results, you have the option of watching this screencast: https://youtu.be/T6kHpZ2oReQ. It shows you how splitting data and calculating the means for several different variables will allow you to make a nice-looking chart in Excel to show your results. Making this kind of chart is a task you will need to do for your next BlackBoard SPSS Assignment.

  • Here are some of the ideas covered in the optional screencast as well as a few pro-tips on creating and formatting Excel bar charts:

  • Before you can create a bar chart in Excel, you first need to calculate the statistics you need in SPSS:
    (1) Split your dataset by an independent variable:
    Data → Split File → Select the independent variable (e.g., gender) → check “Compare groups.”
    (2) Next, run a Frequencies command on your dependent variable (e.g., whether someone voted for a specific candidate). This will show you the distribution of the dependent variable for each independent variable group.
    (3) You now have the data you need to create a chart in Excel. Keep in mind that you will need to go back to:
    Data → Split File and select the option to "Analyze all cases" to turn off the data split (or you can change the independent variable and then run another frequency).

  • To save time, if you plan to create a bar chart that uses data from several different splits, consider doing most of your SPSS work in syntax. For example, if you want to show the percentage who voted for a candidate when split by gender, race, and then partisanship, use SPSS’s point-click-paste feature to generate a syntax template that splits your dataset by one of your independent variables, runs the frequency for your dependent variable, and then unsplits the dataset (as described above). Once you have a syntax template for the first independent variable, you can quickly copy, paste, and adjust the template to create the data you need for each of your other independent variables.

  • To create your Excel chart, enter (or paste) your SPSS results and bar chart labels into a blank spreadsheet:
    The name of each independent variable group goes at the top of a separate column. If there is more than one category for the dependent variable, list those categories in rows down the first column. For example, if you were comparing a five-level measure of household income for Democrats and Republicans, your first column would list the five income ranges starting in the second row. Columns two and three would have “Democrat” and “Republican” entered at the top, with the corresponding percentages of respondents in each income category listed below.

  • To create the Excel chart:
    Select all of your data and labels, and go to Insert → Charts → 2D Column (or Bar).

  • In Excel (this will be different if you choose to use Sheets), change the format of the numbers in your chart:
    Format the original data in the worksheet: Select all of the numerical data that needs to be formatted → right-click (PC) or control+click (Mac) → select Format Cells → Number → Decimal places = 0 (zero). This changes chart labels (e.g., from 10.00 to 10) but not the bar heights.

  • To change the numerical range of an axis (say, the auto-generated maximum is 90%, and you want it to top out at 100%):
    Double-click on the axis you want to change → right-click (PC) or control+click (Mac) → Format Axis → adjust the Minimum and Maximum bounds. If you don't see these settings by default, choose the Format Axis tab that looks like a little bar chart.

  • To add an axis label (e.g., add "Percentage of respondents" to a y-axis that only has numbers):
    Click on the center of the chart (a blank part of it so you select the full chart rather than an axis or a bar) → select the green “+” (or "Chart Elements") icon next to the upper right-hand corner of the chart → check Axis Titles → click inside the vertical axis box and type a label (e.g., “% Trump Vote” or “Percentage Supporting Trump”).

  • To add a space between sets of bars that you want grouped (e.g., between your gender and party identification data):
    Insert a blank column or row between groups in the spreadsheet where the data is.

  • To change bar colors:
    Select the bars → right-click (PC) or control+click (Mac) → Format Data Series → use the paint bucket or color options to choose new colors.

  • To add group subtitles below groups of bars (e.g., add “Gender,” "Partisanship," or "Education level"):
    Highlight the cells above the “Men” and “Women” columns that contain their data → use Excel's Merge & Center option (Home tab → Alignment) → type a label (here, “Gender”) in the merged cell (include these labels when you select your data and insert the bar chart).


    • After class, start on SPSS #3 in BlackBoard. This workshop will give you a little more coding practice and hands-on practice with interpreting descriptive statistics, making bivariate bar graphs, and interpreting t-tests. Most of this assignment can be completed after this class, but we will be covering t-tests on Wednesday, so leave that part of the assignment for later in the week. You will need to complete this SPSS assignment by 5 pm this coming Monday (the deadline was moved from Friday so we could go over calculating t-tests in SPSS on Friday; the assignment will not cover what we will be doing in class on Friday).


    Topic 13 (Wednesday, 11/5)
    Determining whether two groups' means are statistically different with t-tests.

    • Ahead of class, carefully read the first sections of Chapter 8, "Bivariate Analyses" in Carolyn Forestiere's textbook. Read just up to the section on correlation analysis (i.e., just the first six pages of the chapter). The first part of the chapter focuses on "statistical significance and t-tests."

    What is a statistical "test" of whether two variables are associated, and how do we know if the findings of that test are "significant"?

    • Why do we want statistical tests to go along with the SPSS work we have been doing so far? When we are working with samples (e.g., a dataset that summarizes a single survey of a sample drawn from a larger population), we need to know whether what we are seeing in our dataset is likely to hold up in repeated samples of the larger population.

    • Statistical significance matters because data analyses always include some random variation. For example, if you flip a coin 1500 times, are you always going to get 750 heads, assuming that the coin is legitimate and not weighted in any way?

    No. Even though the true probability of getting heads is 50%, with a sample of 1,500 flips, you will find that 19 out of 20 times you flip a coin 1,500 times, the number of heads you end up with will be somewhere between 712 and 788. The number of heads you get is a statistic that has a margin of error of 2.53% at a 95% level of confidence.

    If you want to be 99% certain of how many heads you will get when you flip a coin 1,500 times, your margin of error will be bigger. Statistically, 99 out of 100 times you flip a coin 1,500 times, you will get somewhere between 700 and 800 heads.
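    If you would like to see where those numbers come from, here is a short sketch of the margin-of-error arithmetic in Python (not course software and not required; the multipliers 1.96 and 2.576 are the standard z-values for 95% and 99% confidence):

```python
import math

n = 1500          # number of coin flips
p = 0.5           # true probability of heads for a fair coin

# Margin of error for a proportion: z * sqrt(p * (1 - p) / n)
moe_95 = 1.96 * math.sqrt(p * (1 - p) / n)    # 95% confidence
moe_99 = 2.576 * math.sqrt(p * (1 - p) / n)   # 99% confidence

print(round(moe_95 * 100, 2))   # 2.53 (percent), as in the text
# Convert the margins into counts of heads around the expected 750:
print(round(n * p - n * moe_95), round(n * p + n * moe_95))   # 712 788
print(round(n * p - n * moe_99), round(n * p + n * moe_99))   # 700 800
```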

    • There are many different tests of association, and you will be learning just a few of them: the ones that are most common in international relations and political science. Whenever we run a test—like a chi-square, correlation, t-test, or regression—we get a p-value statistic in our output, which tells us the probability of seeing our results (or something even stronger) if there were actually no real relationship in the larger population.

    • By convention, we typically consider a finding “statistically significant” only when a statistical test shows that there is a very small probability that we would fail to find a similar association in the same direction if we analyzed many additional samples.

    Barring unusual circumstances, the norm in political science and international relations is to highlight findings that have a p-value of .05 (5%), .01 (1%), or .001 (0.1%)—that is, a 5%, 1%, or 0.1% chance that the result would not be replicated in repeated sampling.

    When statistical results are reported in a table or paper, you’ll often see either the actual p-value or asterisks indicating the level of significance. Unless otherwise noted, one asterisk (*) means there is about a 5% chance that repeated sampling would not find a similar result, two asterisks mean less than a 1% chance, and three asterisks mean less than a 0.1% chance that the finding is due to random variation.

    • One important thing to always keep in mind is that "statistical significance" is not the same thing as substantive significance. When you are working with large data samples, even very small differences among groups may be statistically significant. It is not uncommon to find instances where researchers are stressing that a finding is "highly significant" when the finding is not substantively important.


    What is a t-test?

    This is a test that we use to see if two different groups have different means for a dependent variable. If we just want to see how different groups of respondents answered a survey question, we can split our data and then calculate descriptive statistics for a dependent variable we care about. And it can be helpful to display those differences in bar charts, which can be quickly created in a spreadsheet.

    However, how do we know if any differences we are seeing in our sample are large enough that we would expect the finding to hold for a larger, more general population? To test whether this is the case, we use t-tests. For example, we could use a t-test to see if what we saw in our sample--that Republicans make a little more money than Democrats--would hold up if we repeatedly surveyed representative samples of Americans.

    When and how do we use independent sample t-tests?

    • Forestiere's chapter talks about the most commonly used t-test, an "independent samples test." This test assumes you are looking at whether two groups coded on the same independent variable have statistically different means for a second variable. Staying with the example I just gave, this test would only be appropriate if you have a categorical variable where respondents were coded something like: 1=Democrat, 2=Republican, 3=Independent, and 4=Other party.

    • Optional: Watch after class if you need more guidance on calculating independent samples T-tests: https://youtu.be/KADpYio2W3U (a little over five minutes). In the example video a T-test is used to determine if the mean level of support for torturing terrorism suspects is different for Democrats and Republicans. 

    Some things to remember from the video:

    (1) To run an independent samples t-test: Analyze → Compare Means → Independent-Samples T Test. Then, select a variable whose mean you want to compare between two subgroups. 

    (2) You next need to specify which values of the grouping variable will be compared (click on the button that says "Define Groups"). In the sample video, Republicans were coded 1, Democrats 2, and Independents 3 in the original dataset. To compare the means of Republicans versus Independents, the values 1 and 3 would be specified.

    (3) Make sure that you are looking at the correct block of results and the correct column to determine if the difference in means is statistically significant. The significance test you want is in the bottom block of output (not the "Group Statistics," but rather the block labeled "Independent Samples Test"). In that block, look at the top row of results ("Equal variances assumed") and find the column labeled Sig. (2-tailed). To repeat, the one you are looking for is in the row for "Equal variances assumed."

    (4) Only if the two-tailed significance statistic is SMALLER than .05 can we say with any confidence that the mean values for the two groups are statistically different and that we would reach the same conclusion if we drew repeated samples from the same population; conversely, a significance statistic that is LARGER than .05 indicates that the two groups do not have statistically different means.
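    For the curious, here is a rough Python sketch (not course software) of the arithmetic behind the "Equal variances assumed" row of SPSS's output; the two groups' support scores are invented for illustration:

```python
import math
import statistics

group1 = [4, 5, 6, 5, 4, 6, 5, 5]   # e.g., hypothetical Republican respondents
group2 = [2, 3, 2, 3, 4, 3, 2, 3]   # e.g., hypothetical Democratic respondents

n1, n2 = len(group1), len(group2)
mean1, mean2 = statistics.mean(group1), statistics.mean(group2)

# "Equal variances assumed" means the two groups' variances are pooled:
pooled_var = ((n1 - 1) * statistics.variance(group1) +
              (n2 - 1) * statistics.variance(group2)) / (n1 + n2 - 2)
se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# The t statistic is the difference in means divided by its standard error;
# SPSS converts this t (with n1 + n2 - 2 degrees of freedom) into Sig. (2-tailed).
t = (mean1 - mean2) / se
print(round(t, 2))   # here t is far above ~2, so the two-tailed p would be < .05
```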


    When and how do we use single-sample t-tests?

    • Forestiere's chapter does not discuss a second type of t-test, the single-sample t-test, which can do everything an independent-samples t-test does and more. Since she doesn't cover this second type of t-test, please read through the paragraphs below very carefully.

    • A single-sample t-test determines whether the mean value for a group on a dependent variable is different from some specified value, such as the mean value for another group (or for the sample as a whole).  

      Your textbook refers to the "DataPrac" survey, which is a dataset that comes with your textbook so that you can practice the methods discussed in the book. We are not using that dataset this semester; however, if you were to analyze the DataPrac survey's variable D72, you would see that the typical American (i.e., the survey mean) had a response value of 7.17 on a 10-point scale that measures religiosity. For this indicator, respondents were asked to place themselves on a scale where 1 means that God is "not at all important" in their life and 10 indicates they believe God is "very important" in their lives.

    Is the sample's mean value for this religiosity measure different than the mean for individuals who say they plan to vote for the Democratic candidate in the next election? How about Republicans? Is the religiosity of men lower or higher than the national average for this item? How about women? And what if we didn't want to compare the average religiosity of women with that for the nation as a whole, but instead wanted to compare their average with the mean for Democrats?

    We can answer each of these questions, and even build a table or bar chart comparing them, if we split our data on each of the relevant independent variables and then run a single-sample t-test (Analyze → Compare Means → One-Sample T Test) for each group we care about. For most of the tests, we would enter the average for the sample as a whole, 7.17, into the box in SPSS that asks for the "Test Value." To compare women to Democrats, we would separately calculate the mean religiosity for women and use that as our test value.

    So, if we split the data by the DataPrac variable D14 (partisanship) and run a one-sample t-test in SPSS with a test value of 7.17, we see that the mean score for Democrats on the importance of God in their life is 6.60, and the two-sided significance test reports that the difference between the test value and the mean for Democrats is significant at the .001 level. The same test shows that the average Republican score for the importance-of-God measure is 8.25, which is significant at the <.001 level. In other words, if we surveyed similar samples 1,000 times, we would expect to find that the average value for Democrats on this variable was always lower than the national average. And a separate t-test (because the partisanship variable was split) shows us that the average for Republicans should always be higher than the national average.

    To see if men or women are also different from the national average for this religiosity variable, we would just need to go back to Data → Split File → Compare Groups and swap out the variable D14 for the gender variable (we would leave the test value at 7.17, the average for the full sample).

    • Optional: Watch after class, if you need more guidance on calculating a one-sample t-test: https://youtu.be/paUIJ3Eh7JI (a little over five minutes). In the video, a test is run to see whether the average for the variable male (coded 0/1) is different than the value of .50, which is about the percentage of men we would expect to find in a nationally representative sample (e.g., it would be due to something other than sampling error if we were to find that 60% of a 3,000-person random sample was male).

    Some things to remember from the video (so you don't need to watch it more than once... or maybe even at all):

    (1) To run a one-sample t-test: Analyze → Compare Means → One-Sample T Test. Then, select a variable whose mean you want to examine.

    (2) You next need to specify a "test value" to which you want to compare the mean for your variable of interest. In the video, the mean for the variable male is .54, which is compared to an expected value of .50. Per the commentary above, the test value of .50 was used to see if there are more males in this sample than one would expect to find in a nationally representative survey.

    (3) Only if the significance statistic for the "two-sided p" result is SMALLER than .05 can you say with any confidence that the mean value observed in the sample is truly DIFFERENT than the expected value; a significance statistic that is LARGER than .05 indicates that the observed and expected means are not statistically different. In this particular example, a value greater than .05 would mean that the sample's portion of men is not larger than the 50% we would expect to find in a nationally representative sample.

    (4) It is not covered in this screencast, but remember from the religiosity example above: if you want to test whether the mean for a subgroup (perhaps women, or Republicans, or Catholics) is different than some value (maybe 50%, or the sample average, or the mean of some other group you care about), you can split your data to isolate the subgroup and then run the t-test.
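    If it helps to see the arithmetic, here is a hedged Python sketch (not course software) of a one-sample t-test; the subgroup scores are invented, and the 7.17 test value is borrowed from the religiosity example above:

```python
import math
import statistics

scores = [6, 7, 6, 5, 7, 8, 6, 7, 6, 7]   # hypothetical subgroup responses (1-10)
test_value = 7.17                          # full-sample mean from the example above

n = len(scores)
xbar = statistics.mean(scores)             # subgroup mean (6.5 here)
se = statistics.stdev(scores) / math.sqrt(n)

# t = (observed mean - test value) / standard error of the mean;
# SPSS converts this t (with n - 1 degrees of freedom) into the two-sided p.
t = (xbar - test_value) / se
print(round(t, 2))   # |t| around 2.5 here, so the two-sided p would fall below .05
```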


    Topic 14 (Friday, 11/7)—Determining whether and how much two variables are "associated" with one another 

    • SPSS 3 assignment due by 5 pm this coming Monday (the deadline was moved from Friday so we could go over calculating t-tests in SPSS on Friday; the assignment will not cover what we will be doing in class on Friday).

    • Before class, please finish Chapter 8 in your Forestiere textbook (you previously read up through the section on t-tests, so start on the section for correlation). Please read the political science examples carefully. 

    • Before class, but after you have read about correlation tests in the textbook, read through the block of material below carefully, and quickly read a very short reading on what a chi-square test is, to give you a clearer idea of what SPSS is doing when it runs this kind of association test. Chi-square tests are not covered in your textbook, so you need to review this statistical measure in the material below and the assigned short reading.

    • Before class, print out and have handy this annotated SPSS output for a chi-square test. In the sample, the researcher is trying to see if a Brazilian's race (a categorical variable) had anything to do with whether or not they voted for the politician who was elected president in 2018.

    • In class, we will continue to focus on three of the many methods social scientists use to determine how two variables are associated: chi-square tests, correlation tests, and bivariate regression.

    • After class, please read the first 15 pages or so of Chapter 9, "Regression," in your Forestiere textbook (up to page 199). Review the section on regression with one independent and one dependent variable. It will be faster reading if you wait to complete this reading until after we have begun to discuss regression in class. We may well not begin to cover regression in Friday's class, but I want to separate out your reading so that you are not being asked to read an excessive amount for next week.


    For X2 (chi-square) and other association statistics, here are the key concepts you need to remember from class:

      • Because they are the most-used association tests in political science and international relations research, we will spend quite a lot of time focusing on correlation and regression. They are the only association measures covered in any detail in the Forestiere textbook. 

      • Other than correlation and regression, the only association statistic you need to be familiar with for this course is the chi-square (x2) test.

      • A chi-square test is what we use to see if any combination of two nominal (categorical) or ordinal variables is associated with one another. For example, you might wonder if a person's race or political party is associated with what major religious denomination they belong to.

      • To calculate a chi-square test, use SPSS's Analyze → Descriptive Statistics → Crosstabs. If you think that one variable is the cause of the other, the independent variable (the cause) typically goes into the Rows window, while the dependent variable should be listed in the Columns window. To make the table useful, go to the "Cells" option, keep observed counts checked, and check only the box for row percentages. Then, in the "Statistics" option, check the box for Chi-square. If you need more guidance on this procedure, you can watch this screencast: https://youtu.be/7O3UTYL2A-I

      • To get a basic understanding of what a chi-square test is and how association measures work, read carefully just the first seven pages of this document (Read up to the section "residuals"). Here is a summary of what the reading says, with a simplified example:

    The main point of the reading is that a chi-square test provides a statistical test to determine whether any association between two nominal/ordinal variables we see in our sample data is due to chance. In other words, it estimates the probability that we would see an association this strong in our sample even if a respondent's category on one variable had nothing to do with their category on the second variable in the larger population.

    An example can provide a basic idea of what a chi-square test looks at. Let’s say that we have a 1,000-person sample where exactly half of the individuals have identified as women and half as men. This being a sample from an odd, hypothetical US state, we also have a sample with exactly 50% Democrats, 50% Republicans, and no independents.

    If gender has no association all with partisanship in our sample, we would expect to see that 25% of our sample is made up of female Democrats, 25% female Republicans, 25% male Democrats, and the final 25% male Republicans.

    However, a hypothetical analysis of our sample might reveal that 30% of the respondents are female Democrats and 30% are male Republicans. Thus, in our sample, it looks like there is an association between gender and partisanship (specifically, more women than expected are Democrats and fewer are Republican).

    The chi-square test will tell us (and ONLY tell us) whether the association we see in the sample between gender and partisanship is plausibly due to chance (sampling error). Specifically, the p-value for the chi-square tells us the probability of finding an association this strong in our sample if gender and partisanship actually had nothing to do with each other in the larger population.  

    As suggested above in the explanation of "statistical significance," a p-value of .05 or smaller for a chi-square statistic tells us that there is only a 5% chance or less that the association we are seeing in our sample is due to chance (i.e., survey error, which is a function of sample size) and that we should expect repeated sampling to show a similar association at least 95 percent of the time.

    And, given the magnitude of gender differences in the hypothetical sample above and its size (n=1000), the chi-square test would be significant in this case.
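    Here is the arithmetic behind that claim, sketched in Python (not course software); the cell counts come directly from the hypothetical percentages above:

```python
# Observed counts in the hypothetical 1,000-person sample:
# 30% female Democrats, 20% female Republicans,
# 20% male Democrats, 30% male Republicans.
observed = [300, 200, 200, 300]

# With 500 women, 500 men, 500 Democrats, and 500 Republicans, the expected
# count in every cell under "no association" is (500 * 500) / 1000 = 250.
expected = 250

# Chi-square = sum over cells of (observed - expected)^2 / expected.
chi_square = sum((obs - expected) ** 2 / expected for obs in observed)
print(chi_square)   # 40.0 -- far above the .05 critical value of 3.84 for 1 df
```

    A 2x2 table has one degree of freedom, and a chi-square of 40 is far past the 3.84 cutoff for p = .05, which is why the text above can say the test would be significant for this sample.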

    However, as is the case with statistical techniques generally, if you are using a very small sample or looking at a variable where you have very few individuals in some of the response categories, a chi-square test may not return a statistically significant result. This is why it is important to run frequencies on variables and think carefully about whether response categories should be combined (e.g., it is very common to see a multi-racial measure recoded into a white/non-white dummy variable before analysis if the sample is under 600 or so respondents).

    If you want to be a competent consumer of social science research, you should be aware that there are other statistical methods that can provide more accurate tests of association when you are looking at the relationship between any specific combination of two categorical, dummy, or ordinal variables. If you are curious, here is a summary of the association tests SPSS can quickly compute: https://www.ibm.com/docs/en/spss-statistics/25.0.0?topic=crosstabs-statistics. We are learning only about chi-square tests because they are widely reported in both academic and everyday publications.

    Bivariate Correlation

    • While chi-squared tests are common in the social sciences, most political science and international relations research instead examines associations between categorical variables by creating dummy variables and running correlation tests among them. For example, rather than calculating a chi-squared test to see whether a three-category measure of party identification (Democrat, Independent, Republican) is associated with a five-category measure of religious affiliation (e.g., Catholic, evangelical Protestant, mainline Protestant, secular, other), many social scientists create a “correlation matrix” using party and denominational dummy variables. This approach allows researchers to see how consistently adherents of each denomination predict a person’s political identity and whether those relationships are positive or negative. For instance, identifying as secular would likely be positively correlated with being a Democrat, while identifying as an evangelical Protestant would likely be negatively correlated with being a Democrat.

    • Correlation tests can be used to test associations between any combination of interval (aka continuous) and dummy variables. If one of your variables is an ordinal measure, it can be analyzed with correlation if it is treated as if it were an interval variable or recoded into a dummy variable (a choice that should be based on either logic or the distribution of responses across the ordinal measure).

    For correlation statistics, here are the key concepts you need to remember from the chapter and class:

      • To run a correlation in SPSS, use Analyze → Correlate → Bivariate. If you enter more than two variables, you will get a "correlation matrix," showing you the relationship between each pair of variables.

      • The correlation statistic (aka Pearson's r) measures only how consistently the value of one interval variable corresponds with the value of a second variable (note that it is common to use correlation with dummy variables).

      • Correlation measures should be interpreted with close attention to whether or not they are statistically significant. If the p-value is higher than .05, there is no association regardless of how large the correlation statistic is. For correlation, the p-value tells you the probability that the association found in the sample could be zero or signed in the opposite direction in repeated sampling. To say that one variable is a statistically significant predictor of another, the p-value needs to be .05 or less. In SPSS, make sure to look at the p-value even if you see two asterisks. For some odd reason, the default setting in SPSS only adds two asterisks to coefficients even when the p-value is <=.001, which should be denoted with three asterisks. 

      • Correlation does NOT tell you how much a change in one variable changes the other variable. It also cannot tell you which variable may be causing the other to move. For example, being more conservative is correlated with being more religious, but there are theories to suggest that causality could go either way.

      • Moreover, even if two variables are highly correlated, it could be that there is a third variable that is causing both x and y to change in predictable ways even though those two variables have no actual relationship. Social scientists often use the term spurious to describe a situation where the mathematical correlation between two variables is due entirely to a third variable. For example, in the US, violent crime goes up in the same months that ice cream consumption also goes up in a population, but they don't have anything to do with each other except that both are more prevalent on hot summer days. "Omitted variable bias" is one of the reasons we will be talking about multivariate regression models next week.

      • Most of the association statistics range from -1 to 1, and a negative correlation statistic means that increased values on one variable are associated with declines in the other variable. Typically, positive correlation statistics are not marked with a plus sign.

      • The square of the correlation coefficient (r-squared) is used to estimate how much of the variation in one variable is "explained" by the other, with a key caveat noted above: a missing variable may be explaining some or all of the variation... which is why we typically look at relationships between two variables with multivariate regression that includes one or more "control variables" (more on that next week).

    As a general guideline for thinking about association statistics like correlation:
    <.10 means that there is a very weak or no association between the variables;
    .20 can be interpreted as a meaningful but modest association;
    .30 is a moderate association; and
    >.40 is a strong association.
    But in every case, you need to put these findings into context (e.g., a .40 association between being Republican and being conservative would be a much weaker finding than you would anticipate, so it wouldn't make sense to refer to this scenario as evidence of a very strong association).

      • Correlation assumes that the association between two variables is linear and the same for different values of the variables.

    There can be a close relationship between two variables, but correlation won't measure it if the relationship is not linear. Think about a person's age and their physical independence. For a while, each year means more physical independence, but at a certain point, the relationship becomes negative; there is a curvilinear relationship between age and physical independence. Calculating a correlation measure for this type of relationship will miss the connection.

    With income and happiness, increases in income initially push happiness up consistently. However, at a certain point (about $100K annually in today's dollars), more income doesn't seem to improve or decrease happiness. (If you are interested, the full article is archived here.) So, the relationship looks sort of like a backwards 7. Economists refer to this particular relationship as a diminishing returns curve.

    And then, there's the relationship between time and the growth of invested money, AKA, the miracle of compound interest, which looks kind of like going along a road with a small climb and then starting up a mountain.

    There are advanced ways to test these kinds of non-linear but clearly meaningful relationships, but those are usually covered in graduate-level statistics courses. Fortunately, we can use tools we already have learned to explore non-linear bivariate associations. For example, to see whether there’s a diminishing-returns pattern in income’s effect on happiness, we could divide a sample of Americans into seven household income categories and calculate each group’s mean on a five-point happiness scale. A bar chart of these means would likely show that happiness increases as income rises, but by smaller amounts at higher income levels. We could then run a few t-tests to see where the differences in mean happiness between income groups stop being statistically significant.  
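    If you are curious how this group-means approach works outside of SPSS, here is a rough Python sketch of the same idea. The income brackets, variable names, and data points are all invented for illustration:

```python
# Hypothetical sketch: binning a sample into income groups and comparing
# mean happiness per group to look for a diminishing-returns pattern.
# All data below are fabricated for illustration.
from statistics import mean

# (household income in $1000s, happiness on a 5-point scale) -- made-up pairs
respondents = [(15, 2.0), (22, 2.4), (38, 3.0), (55, 3.5),
               (72, 3.9), (95, 4.2), (110, 4.3), (150, 4.3)]

# Income brackets (upper bounds, in $1000s); the last catches everyone else
brackets = [25, 50, 75, 100, 125, 150, float("inf")]

def bracket_index(income):
    """Return which income category a respondent falls into."""
    for i, upper in enumerate(brackets):
        if income <= upper:
            return i

groups = {}
for income, happy in respondents:
    groups.setdefault(bracket_index(income), []).append(happy)

# Mean happiness per income group; gains should shrink at higher incomes
group_means = {i: round(mean(vals), 2) for i, vals in sorted(groups.items())}
print(group_means)
```

    A bar chart of these group means is exactly the kind of chart described in the Week 12 notes above.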

    For linear regression with one independent variable (i.e., bivariate regression), here are the key concepts you need to remember from the chapter and class:

      • You must specify a dependent variable when using regression, but this doesn't mean x actually causes y. Running a regression model will not tell you whether the independent variable is actually the cause of the dependent variable. Use theory or logic to determine which variable is the dependent variable.

      • Linear (i.e., OLS--Ordinary Least Squares) regression models report an R-square statistic, which is interpreted as noted above in the section on correlation. An R-square statistic of .35 means that the independent variable in the regression model explains 35% of the variation in the dependent variable (and leaves the other 65% unexplained). You will get some sample language on reporting R-square results in the section on regression with multiple variables below.

      • Regression estimates how much a one-unit increase in the independent variable corresponds to a change in the dependent variable. Specifically, regression output includes a slope measure for each independent variable. This statistic is called the unstandardized regression coefficient. In SPSS output, unstandardized coefficients are listed in the "B" column (make sure you are looking at the first column in the last block of output). This regression coefficient tells us how much "each one-unit increase in the independent variable corresponds with an x-unit increase (or decrease if the coefficient is negative) in the dependent variable." In plain English, we might say, "Each one-unit increase in the 10-point measure of religiosity corresponds to a 1.34-point increase in an individual's score on the 10-point ideological-conservativeness scale."

      • With regression, there is a statistical test for each regression coefficient, where the p-value tells you the probability that the relationship found in the sample could be zero or even run in the opposite direction in repeated sampling. To say that one variable is a statistically significant predictor of another, the p-value needs to be .05 or less. These models also include a value for the y-intercept (in the output, this is the unstandardized "Constant").

      • With regression results, you can predict the value of the dependent variable at selected values of an independent variable. The constant (aka, the y-axis intercept) can be used to predict the value of the dependent variable for a given scenario with a simple formula: DV value = Constant + (a specified value of the IV times the regression coefficient). If the IV is a dummy variable, the language used to interpret its regression coefficient is: "Compared to the reference category of (carefully describe anyone who is not in the group), individuals who are in the group had an x-point higher (or lower if the dummy variable coefficient is negative) value on the dependent variable." In plain English, this might sound like, "Compared to non-Republicans--that is, Democrats plus independents--Republicans' score on the 10-point measure of religiosity was 2.4 points higher."
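    The prediction formula above can be sketched in a few lines of Python. The constant and coefficient here are invented numbers, not real SPSS output:

```python
# Minimal sketch of the bivariate prediction formula:
# DV value = Constant + (coefficient * specified IV value).
# The constant and coefficient are made up for illustration.
def predict_dv(constant, coefficient, iv_value):
    """Predicted DV = constant + (coefficient * specified IV value)."""
    return constant + coefficient * iv_value

# E.g., an invented constant of 1.5 and religiosity coefficient of 1.34,
# for a respondent scoring 6 on the 10-point religiosity measure:
print(round(predict_dv(1.5, 1.34, 6), 2))  # 1.5 + (1.34 * 6) = 9.54
```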

      • As with correlation generally, regression models assume that every one-unit increase in the independent variable will have the same effect on the dependent variable. This is referred to as the assumption of linearity. Examples of how variables can be related to one another but not have a linear relationship include time and investments (over time, investment returns compound so growth is exponential) and the curvilinear relationship between age and physical independence. Regression can handle these types of relationships in a few different ways, one of which is using dummy variables and interaction terms (this isn't the most common way, but it is the only way that fits neatly with concepts you already are going to learn in this class). If we suspected that age has a different effect on income at different points in the life course, a series of dummy variables (say, reference group = under 35, with additional dummies for 35-50, 51-65, 66-75, and over 75 years old) likely would show that wage-earned income, on average, quickly increases as one moves from the youngest group into middle age and then declines in the retirement-age groups.

      • Completely optional: if you have attended class and carefully read the textbook material on correlation but feel like you would like to go over the basics of this method one more time, you can watch this 25-minute (12.5 at x2 speed) screencast presentation covering the logic and main concepts of correlation: https://youtu.be/pjDDBrunB1A. Note: The screencast goes over the same conceptual material we will have reviewed in class, and doesn't cover the use of SPSS.

      • Completely optional: if you have attended class and carefully read the textbook chapter material on bivariate regression but still feel like you would like to better understand the basics of this method, you can watch this 19-minute (10 at x2 speed) screencast presentation covering the basics and logic of bivariate regression: https://youtu.be/K8A6xGIXPR8. Note: The screencast goes over the same conceptual material we will have reviewed in class, and doesn't cover the use of SPSS. 


    Week 13

    Topic 15 (Monday, 11/10; Wednesday, 11/12; and Friday, 11/14)—Linear regression with multiple independent variables, including dummy and interval variables

    Monday's class will be spent reviewing correlation and linear regression. On Wednesday, we will spend more time on regression, leaving Friday so that we can practice calculating scenarios (a term that will make more sense by mid-week).

    Important: The concepts and SPSS work we cover this week will be the last material that will appear on your SPSS test at the end of the term (logistic regression--which we will be learning about next week--will be a topic covered on your last BlackBoard workshop and on the final exam).

    • SPSS 3 assignment due by 5 pm Monday (moved so we could go over calculating T-tests in SPSS last Friday).

    • SPSS #4 (posted to Blackboard) is due by 5pm on Wednesday.  This is the assignment that gives you more practice coding dummy variables and running/interpreting T-tests. The assignment also covers the main concepts behind Chi-squared tests and correlations as well as how to run these analyses and interpret their results in SPSS. 

    • Ahead of Monday's class, starting with page 199, read to the end of Chapter 9, "Regression," in your Forestiere textbook (17pp). 

    • Ahead of Monday's class, print out this document: a handout of SPSS linear regression output with annotations. The handout covers the same topic as the screencasts: using SPSS to predict the level of support for torturing terrorism suspects, as measured by a 7-point Likert scale (1 = "never justified"; 7 = "always justified") that is being treated here as an interval variable. You should retain your copy of this document because you will find it handy when you complete the BlackBoard assignment on regression.

    • Ahead of class, take another quick look at this article on the 2018 Brazilian Election. Read the abstract and intro quickly and then go straight to the methods and findings section. Read through the interpretation of the logistic regression results.

      The reason you are being asked to quickly read this particular study again is that it has several dummy dependent variables, so we will be able to use the same example study when we look at logistic regression in the next block of course materials. For both linear and logistic regression, the examples I show with SPSS in class will be based on this article and its dataset.


    Below are summary notes for the major concepts covered in class and your textbook to understand how multivariate linear regression works and is interpreted:

    General concepts for multivariate linear regression (i.e., more than one independent variable). Here are the key ideas you need to remember from the textbook chapter on regression and class:

    • All of the key concepts listed above for bivariate regression (i.e., one independent variable) apply to multivariate regression, too:

      The R-square statistic, constant, and statistical significance statistics are interpreted in the same way. All regression assumes that each independent variable has a linear effect on the dependent variable (see what this means by reviewing the notes under correlation and bivariate linear regression that explain the "assumption of linearity").

    Also, the individual unstandardized regression coefficients are all interpreted in a similar way as they are with bivariate regression, except that the result for each independent variable estimates how much a one-unit increase in that independent variable increases/decreases the value of the dependent variable when the influence of all other variables in the model is held constant at their mean values. For example, if you are looking at the effect of each additional level of education on income and the only other independent variable in the model is the dummy variable Male, the regression results for the education variable would be calculated with an equation that controls for the effect of gender by calculating .5 times the positive effect of being male (i.e., the regression coefficient for the gender variable). Why .5? That is the mean value of the Male dummy when half the sample is male.
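    Here is a small, hypothetical Python illustration of what "holding the other variable at its mean" does to a prediction; all of the coefficients below are made up:

```python
# A hypothetical two-predictor model: income regressed on education (Edu5)
# and the dummy Male. All numbers are invented for illustration.
CONSTANT = 20.0   # invented constant (income in $1000s)
B_EDU = 5.0       # invented coefficient for each education level
B_MALE = 8.0      # invented coefficient for the Male dummy
MALE_MEAN = 0.5   # mean of the Male dummy when half the sample is male

def predicted_income(edu_level):
    """Predicted income at a given education level, holding Male at its mean."""
    return CONSTANT + B_EDU * edu_level + B_MALE * MALE_MEAN

# Moving from education level 2 to level 3 changes the prediction by exactly
# B_EDU, because the gender term is fixed at its mean in both predictions.
print(predicted_income(3) - predicted_income(2))  # 5.0
```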

    The interpretation of multivariate regression is a bit more complicated when you have interaction variables (e.g., MaleXRepublican) or multiple dummy variables coded from the same original variable (e.g., race or political party dummies). These types of variables are discussed below.

    • As with bivariate regression, multivariate regression results allow you to predict the value of the dependent variable under different scenarios. The way this is done is to assign whatever values define your scenario to the variables of interest and use the independent variables' mean values otherwise. More details on this below, but this is the key idea for how scenarios work.

    • With multivariate regression, you determine which variables are most important by comparing their "standardized regression coefficients." These are also called "betas" and are located in the SPSS output column labeled as such. Recall that we can compare standard deviations of different types of measures in useful ways. So, comparing a 34 on the ACT to a 1500 on the SAT is not easy, but comparing how many standard deviations each score is from its test's mean would tell you that the ACT score represents relatively higher performance. A beta tells us how many standard deviations the dependent variable's value increases/decreases for each one-standard-deviation increase in an independent variable. In other words, the farther away a given independent variable's beta is from 0 (betas can be positive or negative), the more important that particular independent variable is in predicting the dependent variable's value.
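    The standardization logic behind betas is the same z-score idea as the ACT/SAT comparison. A quick Python sketch, using assumed (approximate) test means and standard deviations:

```python
# Sketch of the standardization logic behind betas, using the ACT/SAT example.
# The test means and standard deviations below are rough, assumed values.
def z_score(score, mean, sd):
    """How many standard deviations a score sits from its test's mean."""
    return (score - mean) / sd

act_z = z_score(34, 21, 5)        # ACT: assumed mean 21, SD 5
sat_z = z_score(1500, 1050, 200)  # SAT: assumed mean 1050, SD 200

# On a common standard-deviation scale, the ACT score sits farther above
# its test's mean, so it represents relatively higher performance.
print(act_z, sat_z)  # 2.6 2.25
```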

    • With multivariate regression, there is the added assumption that each of the independent variables is at least somewhat independent of the others. If two or more independent variables are very highly correlated, the statistical results in the model may not be able to determine how changes in those variables influence the dependent variable. See the note above on multicollinearity.

    To use and interpret regression output with dummy variables, here are the key concepts you need to remember from class and your textbook:

    • Critical: to interpret any dummy variable in a regression model's results, you have to know what the variable's reference category is.

    • Whenever you interpret a dummy variable, your interpretation should explicitly identify the reference category. For example, if a regression model includes only the dummy variable Latino, the interpretation of that variable's regression results should start with phrasing like, "Compared to the typical non-Latino, the estimated household income of a typical Latino is $1,300 less a year after controlling for the other variables in the regression model."

    • If there is just one dummy variable in a model and it was coded from an original variable that had just two response categories, the reference category is easy to identify. For example, we might have data where respondents have been coded one if they believe freedom is more important than equality and zero if they think equality is more important than freedom. If the regression model includes the dummy independent variable FreedomIsMostImportant, that variable should be interpreted with phrasing like this: "Compared to respondents who prioritize equality, those who think freedom is more important had an x-point higher score on the 10-point dependent variable measuring y."

    • If a regression model includes a dummy variable derived from a multi-category original variable, think carefully about how many of those groups have dummy variables in the model and thus what the proper reference category is. Consider a regression model looking at a person's characteristics to predict how much they think NATO is important to international security, measured on a 5-point scale. Let's say this model includes "Democrat" as its only partisan dummy variable. If so, the reference group when interpreting the Democrat coefficient is non-Democrats (i.e., since neither independents nor Republicans are in the model, both groups together form the reference category). In this example, the variable Democrat should be interpreted with phrasing like this: "Compared to Republican and independent respondents, Democrats had an x-point higher score on the 5-point indicator of seeing NATO as important to international security."

    • If you are running a regression model to test a hypothesis comparing two groups, one of those groups needs to be the reference category. Take the example of looking at how partisanship shapes voting for female candidates (something Dr. Setzler has written a lot about, incidentally): if we want to compare Democrats' likelihood of voting for a female candidate to Republicans' likelihood, we would need to add a second dummy variable to the regression model for the people who are independents. Once the regression model included dummy variables for both independents and Democrats, those variables' regression coefficients could be compared to Republicans, who would be the omitted reference group.
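    If it helps to see the coding step, here is a minimal Python sketch of turning a three-category party variable into dummies with Republicans as the omitted reference group (the respondents are invented):

```python
# Invented party labels for five hypothetical respondents.
parties = ["Democrat", "Republican", "Independent", "Democrat", "Independent"]

# One dummy per non-reference group. Republicans get a zero on both dummies,
# which is exactly what makes them the omitted reference category.
rows = [{"Democrat": int(p == "Democrat"),
         "Independent": int(p == "Independent")} for p in parties]

print(rows[1])  # the Republican respondent: {'Democrat': 0, 'Independent': 0}
```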

    To use and interpret interactive-term variables, here are the key concepts you need to remember from class:

    • We create and use an interaction term in a regression model when we think that the influence of one independent variable on the dependent variable depends on the value of a second independent variable.

      For example, we might want to analyze how being a political science major and the number of hours studied before an exam in general education political science classes influence the typical student's test score.

      Let's say we collected a year's worth of survey data on students' study habits and their test scores in these types of classes. If we were to calculate a regression model with just these two variables, we presumably would find that both factors are significant predictors of higher test grades but that there are lots of other factors, too (i.e., our model probably wouldn't have a high r-square statistic).

      If we saw that both being a major and studying improved test scores, we might wonder if the effect of each hour of additional study is different for PSC and non-PSC majors. Maybe, studying pays off more for non-majors because they have less of a background in the subject area and more to learn. On the other hand, maybe studying pays off more for PSC majors because they have more interest in the subject and are better able to retain information about it.

      To test either of these hypotheses, we need to create and add an interaction term to our regression model.

    • To create an interactive term variable, you just create a new variable that multiplies together each respondent's value for the two relevant variables. In the example above, we would use SPSS to create a new variable with coding that looks something like:

    COMPUTE NewVariable = First_IV * Second_IV.

    So, in this case:

    COMPUTE PSCxHrsOfStudyBeforeExam = PSC * HrsOfStudyBeforeExam.

    The "interactive term" would then be added to the regression model along with the two variables from which it was formed (both of the original variables and the interactive term must stay in the regression model). Then we would rerun the regression model and look at our results.

    • If the unstandardized coefficient for the interactive term is positive and significant, we know that the combination of the two variables has more of a positive effect on the dependent variable than just the additive effect of each variable. In the example, this would mean that each hour of study is paying off more for PSC majors than non-majors.

    • If an interactive term's coefficient is negative and significant, it means that the combination of the two variables has less of an effect than we would expect if we simply added their separate effects together. In the example, this would mean that the effect of each additional hour of study is less for PSC majors than non-majors.

    • If an interactive term's coefficient is not statistically significant, the value of the second independent variable does not influence the relationship between the first variable and the dependent variable. In the example, this would mean that there is no difference in the grade improvement payoff of additional studying for PSC majors and non-majors.
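    To make the interaction arithmetic concrete, here is a hypothetical Python sketch for the study-hours example. Every coefficient is invented:

```python
# Invented coefficients illustrating how an interaction term changes the
# per-hour payoff of studying for majors vs. non-majors.
CONSTANT = 70.0      # invented baseline score
B_PSC = 4.0          # invented: majors start 4 points higher
B_HOURS = 2.0        # invented: each hour adds 2 points for non-majors
B_INTERACTION = 1.5  # invented: extra per-hour payoff for majors

def predicted_score(psc_major, hours):
    """Score = constant + main effects + interaction (PSC * hours)."""
    return (CONSTANT + B_PSC * psc_major + B_HOURS * hours
            + B_INTERACTION * psc_major * hours)

# Per-hour payoff: 2.0 for non-majors, but 2.0 + 1.5 = 3.5 for majors,
# because the positive interaction coefficient adds to the hours effect.
nonmajor_payoff = predicted_score(0, 3) - predicted_score(0, 2)
major_payoff = predicted_score(1, 3) - predicted_score(1, 2)
print(nonmajor_payoff, major_payoff)  # 2.0 3.5
```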


    In Wednesday's class (or Friday's if we are running behind), we will practice calculating scenarios with regression output.
    What is a scenario, and what key concepts do you need to remember from class? The standard way to talk about the effects of different variables in regression models is to use a regression results table to explain how a one-unit increase in a given independent variable changes the value of the dependent variable when the effect of all of the other variables is set at each variable's mean value. While that makes for nice, succinct tables, discussing situations that compare different types of hypothetical individuals who are otherwise similar can help to bring regression results alive and make them more useful. For example, if we were analyzing how much different kinds of people support the legalization of marijuana, we could calculate a regression model looking at the influence of gender, religious denomination (i.e., dummy variables for several of them), and age with controls for a person's income and education. Using the model's results, we could compare the level of support for a 30-year-old secular male versus a 60-year-old evangelical Protestant female with comparable incomes and educational backgrounds.


    • How do you calculate a regression scenario? To create a regression results equation, first identify the value of the unstandardized coefficient for the "Constant" (use the value listed in the B column of the SPSS results output box--the last one--that lists each variable). Then, to that constant value, add the effect of each independent variable (i.e., its unstandardized regression coefficient) multiplied by a value you specify. If you want to control for any variables, their specified values are their means for the full sample.

    • Here's an example of a complex scenario (complex only because the number of variables is pretty high). The example uses the descriptive statistics and linear regression models reported in the Setzler and Yanus paper you were asked to read ahead of class. If you hadn't just read that study, I would have used a simpler example.

    Let's use the article's regression tables and descriptives to create a scenario comparing two people's scores on the 7-point indifference-to-gender-equality measure (as measured by how unimportant a person thinks it is to fight for gender equality). What is the predicted score for an older Republican male without a college degree who is otherwise similar to other Americans (at least on the variables in the model)? What is the predicted score for a female non-Republican (i.e., independent or Democrat) who is otherwise identical to her male Republican peer?

    To calculate the expected indifference-to-gender-equality score for the Republican male, we would use the following formula:
    The model's unstandardized constant (1.046)
    + 1 times the unstandardized regression coefficient for Republican (i.e., the scenario)
    + 1 times the Male coefficient
    + 1 times the No college degree coefficient
    + 1 times the aged 45 and older coefficient
    + 0 times the age 30-44 year coefficient
    And now we'll add in the controls, using their mean values in the scenario:
    + .792 times the White coefficient (i.e., 79.2% of this sample was white)
    + .502 times the Religiosity coefficient
    + .441 times the Blue collar coefficient
    + .399 times the Rural coefficient
    + .500 times the Authoritarianism coefficient
    + .294 times the Racial animus coefficient
    = The individual's expected value on the 7-point indifference to gender inequality measure.

    Formula in hand, you can plug the whole equation into the super-cool online computer at https://www.wolframalpha.com/:

    1.046 + (1 * .912) + (.792 * .119) + (1 * .444) + (1 * -.097) + (1 * -.246) + (0 * -.078) + (.502 * -.082)+ (.441 * .190) + (.399 * .013) + (.500 * -.552) + (.294 * 2.69)

    WolframAlpha tells us that, all other things being typical, an otherwise typical older Republican male with no college degree has an estimated gender-equality-indifference score of 2.71.

    Using the same equation in WolframAlpha and changing just the scenario values for Republican and Male variables, we see that the expected value for the female non-Republican is 1.36 on the 7-point scale. That's just about half the value we found for Republican males.
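    If you'd rather check this arithmetic in Python than in WolframAlpha, here is the same equation as code. (With the rounded coefficients printed above, the sum works out to roughly 2.72; any small difference from the figure reported in the text comes down to rounding.)

```python
# Replicating the scenario equation above, term by term, using the
# rounded coefficients printed in the equation.
constant = 1.046
terms = [
    (1,     0.912),   # Republican (scenario value 1)
    (0.792, 0.119),   # White, held at its sample mean
    (1,     0.444),   # Male
    (1,    -0.097),   # No college degree
    (1,    -0.246),   # Aged 45 and older
    (0,    -0.078),   # Aged 30-44
    (0.502, -0.082),  # Religiosity, at its mean
    (0.441,  0.190),  # Blue collar, at its mean
    (0.399,  0.013),  # Rural, at its mean
    (0.500, -0.552),  # Authoritarianism, at its mean
    (0.294,  2.690),  # Racial animus, at its mean
]
male_republican = constant + sum(value * coef for value, coef in terms)

# The female non-Republican scenario zeroes out the Republican and Male
# terms, so we can just subtract those two coefficients.
female_nonrepublican = male_republican - 0.912 - 0.444

print(round(male_republican, 2), round(female_nonrepublican, 2))
```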
    • Scenarios are helpful in understanding and visualizing the effects of interaction terms. In the example above, we might theorize that not having a college degree makes men particularly inclined towards sexist views. This is another way of saying that the effect of educational attainment on sexism differs by gender. To test this hypothesis, we would create an interactive term (here, the dummy MaleXNoCollege) and add it to our regression model along with the variables Male and NoCollege. In calculating our scenario, our equation would include: (1 x the coefficient for Male) + (1 x the coefficient for NoCollege) + (1 x the coefficient for MaleXNoCollege). The scenario value for MaleXNoCollege is equal to the scenario values for Male and NoCollege multiplied together.

    Finally, I have compiled several screencasts on linear regression that cover the same topics we will go over in class. If you feel like you need additional guidance, check out the optional resources below. Watching any or all of the screencasts should not be necessary if you have attended class and put your best effort into the hands-on practice exercises:

    • Optional: After practicing in class, if you still feel like you need guidance on calculating and interpreting linear regression models in SPSS, review this 11-minute screencast: https://youtu.be/xzl8OxPsM8s.
    • Also optional: This 13-minute screencast follows up on the last one with a focus on how you use SPSS to analyze dummy variables and how output/tables with these variables are interpreted: https://youtu.be/I2BEi_CkzK0.

    • Also optional: https://youtu.be/3m66P8PaD3U. In about ten minutes, this presentation explains how we can use linear regression output to predict the value of the dependent variable with different scenarios involving the independent variables (something I just covered in the example above). The standard output in regression will tell us how a one-unit or one-standard-deviation increase in an independent variable will change the dependent variable when the effect of all of the other variables is set at each variable's mean value. While that makes for nice charts, scenarios can help to bring the data alive. For example, what is the level of support for torture on a seven-point scale for a 60-year-old Republican male who attends church a lot versus a 30-year-old secular female Democrat? Here is a handout with the output and calculations covered in the video.

    • One important thing not covered in the last screencast is how control variables work in regression scenarios. As noted above, including controls is straightforward. Let's say that we wanted to calculate the same regression model and scenarios as the screencast, but control for the effect of a person's education. If education were a five-point measure, we would calculate its mean for the full sample, and our scenario equation would include one more added component: the mean of Edu5 multiplied by this variable's unstandardized regression coefficient.


    Friday, November 14. Class time will be used to finish up our overview of linear regression and--if time is available--for you to work on SPSS #5 in Blackboard.


    Sunday, November 16, by 5pm: SPSS #5 in BlackBoard is due. This assignment focuses on linear regression, which is the last topic that will be covered in your practice and final SPSS tests.

    Monday, November 17: First practice SPSS exam. This is the first of the two mandatory, but ungraded SPSS exams to prepare you for the graded exam that you will take later in the semester. Remember that logistic regression will not be part of the end-of term SPSS practice or final SPSS tests.  You may bring a page of notes with you for the practice and final SPSS tests.

    For the practice and final versions of the test, you could be asked to do any of the following exercises (i.e., you may not be asked to do all of these things on any one test, but you will not be asked to do anything that is not listed below):

    • Create a new variable. You could be asked to create a new dummy variable that combines information from either one or two original variables. You should be able to label the new variable and its response categories.

    • Split a variable into its subgroups and compare the frequency at which different subgroups for that variable have a particular opinion.

    • Make a bar chart in SPSS (not Excel, because of concerns about how long that might take). Your bar chart must show the percentage of a single variable's subgroups that have a particular opinion (e.g., what respondents in households that make $10K or less think about an issue).

    • Compare two or more variables' means and standard deviations, explaining what the standard deviations tell us about the distribution of each variable. For example, comparing the typical opinion and range of opinions expressed by Republicans and Democrats about whether "God has given the US a special role in human history."

    • Statistically test whether the typical individuals in two subgroups that are coded on the same variable are statistically different from one another with respect to a specific attitude. (For example, is the typical African American more likely than the typical Latino to think that "college is a gamble" versus being a smart investment, and is any difference statistically significant?) Hint: you need an independent-samples t-test to answer this type of question.

    • Statistically test whether the share of a specific group in the sample is statistically different than what it should be given a known parameter for the US as a whole. For example, African Americans make up 12.1% of the US population. What percentage of this sample is African American? Are African Americans underrepresented in the sample (when analyzed without the survey's weights turned on)? Is any difference between the share of African Americans in the sample and what the percentage should be statistically significant? Hint: To test this, you need to split the sample by race, use a one-sample t-test, use .121 (12.1%) as the test value, and look at the output for African Americans only.
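    SPSS runs the one-sample t-test for you, but the underlying arithmetic is simple enough to sketch in Python. The dummy-coded sample below is fabricated for illustration:

```python
# A hand-rolled sketch of the one-sample t-test logic (SPSS does this for you).
# The sample is fabricated: 1 = African American, 0 = not.
from statistics import mean, stdev
from math import sqrt

sample = [1] * 9 + [0] * 91  # invented sample: 9% African American
test_value = 0.121           # known population share (12.1%)

n = len(sample)
# t = (sample mean - test value) / (sample SD / sqrt(n))
t_statistic = (mean(sample) - test_value) / (stdev(sample) / sqrt(n))
print(round(t_statistic, 2))
```

    A negative t-statistic here would mean the group's sample share falls below the known population share; you would still need the p-value (which SPSS reports) to judge whether the gap is statistically significant.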

    • Statistically test whether two subgroups that are coded on two different variables are statistically different from one another or Americans in general with respect to a specific attitude. Hint: To answer this question, you again would need to split your dataset and use a one-sample test. The test value will be the mean for another group or for the sample as a whole, which would need to be calculated separately unless your instructor gives you that value.

    • Analyze and interpret chi-squared statistics for two variables. Recall that for this type of analysis, you are working with a pair of variables that are dummy, categorical, or ordinal measures. Typically, this test will be used with variables that have no more than 5 response categories each.

    • Analyze and interpret the correlation statistics for a handful of variables. You could be asked to determine whether they could be combined or used together as independent variables in a regression model (i.e., would there likely be collinearity problems?). You will need to interpret the correlation between pairs of variables and their statistical significance.

    • Demonstrate your understanding of the limitations of correlation. You may be asked to speculate about whether we can determine if there likely is a causal relationship between pairs of variables, with one clearly causing the other (e.g., being more conservative and being Republican). You may be asked to identify whether any of several other variables may suggest a spurious correlation between two variables (for example, being male and disapproving of Joe Biden are modestly, but still statistically significantly, associated). You might be asked why we see a very low or unexpected association between two variables that probably have a non-linear relationship (take a look at age and household income in the dataset).

    • Run and interpret a linear regression model with three independent variables. You will be asked to identify and interpret the adjusted R-square statistic. You also will be asked to identify the equations for three scenarios that will involve different levels of one of the independent variables (including its mean, which you will need to calculate with SPSS). The other two independent variables will be set at their mean level of influence in the equations.

    • Run and interpret the same linear regression model examined before with up to 5 more independent variables, including an interactive term and multiple dummy variables coded from the same original variable. You will discuss the model's R-square statistic, how the R-square statistic has changed as you added additional variables to the model, and what that means.

    • For the same regression model, you will also be asked about the statistical significance of different independent variables, and the meaning of some unstandardized and standardized (beta) regression coefficients. You will be expected to identify the correct reference groups when interpreting results for one or more of the dummy variables. You will be asked to interpret the results of an interactive term.

    • For the same regression model, you will be asked what equations should be used to estimate the predicted value of a dependent variable for two or three individuals who are different in specific ways, controlling for their other differences (i.e., with some variables set at their mean level of influence). You will need to make sure you know how to create scenarios that involve dummy variables' reference categories and non-mean values for an interactive term. For at least the practice exams, you may be asked to use Google to solve one or more scenario models.


    Topic 16 (Wednesday, 11/19) — Logistic regression and its interpretation

    • In class, you will be introduced to logistic regression, which is the type of regression used when the dependent variable is binary (i.e., when working with a dummy variable as the dependent variable).

    • Ahead of class, take another quick look at this article on the 2018 Brazilian Election. Read the abstract and intro quickly and then go straight to the methods and findings section. Read through the interpretation of the logistic regression results.

    • Ahead of class, please take the time to carefully read and print out: this handout of SPSS logistic regression output with annotations. The document is very similar to the one I posted above for linear regression, but this time the dependent variable--support for torturing terrorism suspects to obtain information--has been recoded into a dummy variable. Respondents who said torture was sometimes, often, or always justified were coded "1"; individuals who said torture is never justified were coded "0."  

    • In class, we will start to practice interpreting logistic regression tables in the Setzler and Yanus article. We also will practice interpreting SPSS logistic regression output, including pseudo R-square statistics, p-values for statistical significance, odds ratios (Exp(B)), and Wald statistics.


    Some key concepts about logistic regression to remember:

    • If your dependent variable is dichotomous, you need to use logistic regression rather than linear (OLS) regression.

    Here's why: Dummy variables work fine as independent variables in OLS regression, but when a dummy variable is the dependent variable, we need a different approach. The problem is that each one-unit increase in an independent variable typically does not have the same effect on a binary outcome. For example, assume that each injury a football team has will decrease its number of points in the typical game. OLS regression works well here because it tells us how many points are lost with each additional injury. However, if we wanted to know how injuries affect whether the team is likely to win (a yes/no outcome), the relationship is more complicated. The first injury or two probably has a modest effect on winning. But after a certain threshold, each additional injury causes the team's odds of winning to drop substantially. Eventually, additional injuries don't matter anymore because the team is already overwhelmed and going to lose. Logistic regression is designed to capture these non-linear patterns that OLS regression cannot handle.

    • Use bivariate (aka binary) logistic regression only with dichotomous dependent variables that have been coded zero or one. This type of regression is often the best option when you are dealing with ordinal or categorical variables, but you need to convert those types of variables into dummy dependent variables to use it. There are other types of logistic regression designed for ordinal and multi-category dependent variables, but they are less frequently used and beyond the scope of this course.

    • Interpreting logistic regression output is not very intuitive, so why do we even need to learn this?: You see logistic regression models' estimates all of the time even if you haven't realized it. Many things we want to know about--what factors are most important in determining who will win elections, whether countries go to war under certain circumstances, whether someone has been asked for a bribe--are yes/no variables that require logistic regression. And, as we have learned previously, it is very common to convert ordinal variables--especially Likert scales--into dummy dependent variables. When researchers do this, they use logistic regression.

    • Logistic regression provides information that is very similar to what we get when we use linear regression; however, the key statistics have different names and are in a different place in the SPSS output:

      • With SPSS's logistic regression output, the "pseudo R-square statistic" (use the one labeled Nagelkerke) is interpreted just like the R-square statistic in OLS regression. So, if a regression model has a pseudo R-square of .0897, it means that the variables in that model collectively explain about 9 percent of what causes the outcome predicted by the model.

      • To identify which variables are most important in explaining the outcome, compare the independent variables' "Wald" statistics. The variables with the largest Wald values explain more of the variation in the dependent variable. You may recall that there is a similar statistic for linear (OLS regression): standardized regression coefficients, also called betas. 

      • With logistic regression, we do NOT directly interpret the unstandardized regression coefficients in the first column of the variables table. Instead, we interpret how much a change in each independent variable affects the dependent variable by looking at the odds ratios.

      • Independent variables work the same way in logistic and linear regression models. We interpret dummy variables and interactive terms the same way for both types of regression; it's the phrasing and explanation of how these variables influence the dependent variable that is different.
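    The relationship between the two columns just described can be sketched in a couple of lines of Python. This is only an illustration; the coefficient value is made up, not taken from any model in the course:

```python
import math

# SPSS's logistic regression output reports the raw coefficient in the
# B column and the odds ratio in the Exp(B) column. The odds ratio is
# simply e raised to the power of the coefficient.
b = 0.435  # hypothetical unstandardized logistic coefficient
odds_ratio = math.exp(b)

print(round(odds_ratio, 3))  # → 1.545
```

    This is why we skip the B column when interpreting effect sizes: B is on a log-odds scale, while Exp(B) can be read as a multiplier on the odds.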

    • How do you interpret odds ratios?

      • First you have to find the odds ratios. They are NOT in the first column of regression output. Instead, in SPSS's default output, the odds ratios are listed in the furthest-right column, under the heading Exp(B).

      • Remember, we only interpret the specifics of any regression coefficient--including odds ratios--if that variable is significant (i.e., the p-value for the odds ratio is equal to or less than .05). If the p-value is greater than .05, you interpret the variable by saying something like, "This independent variable is not a statistically significant predictor; we cannot be confident that repeated sampling would show that this variable is consistently associated with an increased (or, if the odds ratio is less than one, decreased) likelihood of the outcome."

      • If an independent variable's odds-ratio is less than one (and statistically significant), there is a negative relationship between the variables. Every one unit increase in the independent variable corresponds to a decreased likelihood of the dependent variable happening.

    To interpret this, think about starting with one dollar and then looking at the odds ratio. An odds ratio of .70 means you now have only 70 cents—that's 30% less than what you had before.

    If we were predicting who voted in the last election and an interval variable's odds-ratio was .70, we would say that every one unit increase in that independent variable "reduced the likelihood of voting" by 30%.

    For a dummy variable, the interpretation is similar but includes the reference group. For example, if the odds ratio for "under 25" is .70, we would say: "Compared to eligible voters who are older than 25, individuals who are under 25 were 30% less likely to vote."

      • If an odds-ratio is between 1 and 2, there is a positive relationship, and we can convert that odds-ratio into an easy-to-understand percentage. Think of it like this: If I used to have a dollar and now I have $1.67, I now have 67% more.

    For example, in a voting model, if a variable's odds-ratio is 1.545, we would say that every one unit increase in that independent variable "increased the likelihood of voting" by about 54%. Alternatively, we could say it made voting about one and a half times as likely.

    For dummy variables, remember to note the reference group. So, if a model had both a Republican and a Democrat dummy variable, we could say: "Compared to independents (the omitted reference group), Democrats were 54% more likely to vote."

      • Finally, if we have an odds ratio of 2 or more, we can say that every one-unit increase in the independent variable increases the likelihood of the outcome by x times. If a predictor's odds ratio in a voting model was 2.434, we would say that each one unit increase in this variable made voting nearly two and a half times as likely.

        And if the independent variable is a dummy variable, we would say something like, "Compared to [everyone belonging to the group omitted from the regression model], [people in the dummy-variable group] were 2.4 times as likely to have voted."
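    The three phrasing rules above can be collected into one short Python helper. This is a sketch for practice, not something you need for the exams; the function name and cutoffs simply restate the rules from the notes:

```python
def describe_odds_ratio(or_value):
    """Turn a statistically significant odds ratio into the
    percent-change phrasing described in the notes above."""
    if or_value < 1:
        # Negative relationship: report the percentage decrease.
        return f"decreases the likelihood by about {(1 - or_value) * 100:.0f}%"
    elif or_value < 2:
        # Positive relationship under 2: report the percentage increase.
        return f"increases the likelihood by about {(or_value - 1) * 100:.0f}%"
    else:
        # Odds ratio of 2 or more: report it as "x times" instead.
        return f"increases the likelihood by about {or_value:.1f} times"

print(describe_odds_ratio(0.70))   # → decreases the likelihood by about 30%
print(describe_odds_ratio(1.545))  # → increases the likelihood by about 54%
print(describe_odds_ratio(2.434))  # → increases the likelihood by about 2.4 times
```

    For dummy variables, you would add the reference-group comparison ("Compared to...") in front of each phrase, as shown in the examples in the notes.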

    This is optional reading, and something you probably only want to read if you are interested in getting into the weeds of this statistical method at a level that will not be necessary for your tests. While this outtake is from one of the most-assigned political science textbooks in the country (this class used it for years), the detailed mathematics in the chapter are no more essential to your competent use of logistic regression than understanding the inner workings of your car engine is necessary for you to be an excellent driver. I am making it completely optional reading because its explanation of why we need to use logistic rather than linear regression with a dichotomous variable is helpful. Carefully read through the extended example that models something like the strength of a person's partisanship versus vote choice. The other useful part of the chapter is its explanation of why odds-ratios are the statistic that we use to interpret the influence of each variable in a logistic regression model.

    • After class, if you were engaged in class, chose to read the optional textbook chapter material on logistic regression, and still feel like you would benefit from more information on the basics of this type of regression, you can optionally watch this 12 min. screencast (https://youtu.be/uUf3h8ifZxE). The screencast explains in detail what bivariate logistic regression is, how it works, and why it is often used with ordinal dependent variables after they have been recoded into 0/1 dummy variables. This screencast covers concepts rather than any SPSS work; there is another screencast below that looks at how we use SPSS to run and interpret logistic regression models. Note that watching either of the screencasts on logistic regression is optional and should not be necessary if you attend class and put your best effort into the hands-on practice exercises.


    Topic 17 (Friday, 11/22) — Let's run and interpret bivariate logistic regression with predicted probability scenarios. This is the last new topic that we will cover this semester.

    While logistic regression will not be part of your end-of-term SPSS test, your final exam will ask you to interpret a logistic regression output table from SPSS, and you will need to complete a BlackBoard exercise on this topic (SPSS #6).

    • By the end of this week: If you have OARS accommodations that you intend to use for the final SPSS exam or for the final exam during finals week, please make sure to make arrangements with OARS. OARS has a deadline in place, and if you do not request accommodations in advance, you may not be able to use the testing center during the final period.  

    • Before Friday's class, carefully review the schedule's summary notes from the last time we met, which is when we first discussed the logic behind logistic regression and began to practice interpreting odds ratios in logistic regression tables. If you have any concerns about your grasp of the basics of logistic regression, you should review the optional materials linked above, including a textbook chapter and the screencast that covers the same ideas we talked about in class.

    • In class, we will continue to practice interpreting SPSS bivariate logistic regression output. If you have not done so already, you should take the time to carefully read and print out: this handout of SPSS logistic regression output with annotations. The document is very similar to the one I posted above for linear regression, but this time the dependent variable--support for torturing terrorism suspects to obtain information--has been recoded into a dummy variable. Respondents who said torture was sometimes, often, or always justified were coded "1"; individuals who said torture is never justified were coded "0."

    • In class, we will also practice calculating logistic regression scenarios. As with linear regression, interpreting logistic regression analysis in an interesting way is best done with "predicted probability" scenarios. There are more details on what this will involve below.

    Some key concepts about logistic regression predicted probability scenarios to remember:

    • Instead of assigning detailed readings from a methods textbook, below you will find a summary of key ideas and concepts to explain why and how we typically report at least some of our logistic regression results using "predicted probability" scenarios rather than the odds-ratios that SPSS reports by default:

    • Why do we need to calculate predicted probabilities?: While odds ratios can be interpreted directly, predicted probabilities are easier for most audiences to understand because they express results as simple percentages.

    • What exactly is a predicted probability, and how is it different than an odds-ratio, the latter of which is listed in SPSS's default output? Odds and probabilities are different ways of conveying the same information. Odds refers to how many times an outcome occurs compared to how many times it doesn't occur. Probability is how many times an outcome occurs compared to the total possible occurrences. If this sounds confusing, consider an example: If a team has a 50% chance of winning a game, we expect it to win 1/2 of its games, so the odds of that team winning are 1:1. If a team has a 75% chance of winning, it should win 3/4 of its games, which means its odds of winning are 3:1.
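    The two team examples can be checked with a pair of one-line conversions. A minimal Python sketch (the function names are mine):

```python
def prob_to_odds(p):
    # Odds: how often the outcome occurs vs. how often it does not.
    return p / (1 - p)

def odds_to_prob(odds):
    # Probability: occurrences relative to all possible occurrences.
    return odds / (1 + odds)

print(prob_to_odds(0.75))  # → 3.0   (a 75% chance of winning = 3:1 odds)
print(odds_to_prob(1.0))   # → 0.5   (1:1 odds = a 50% chance of winning)
```

    Note that the two functions are inverses: converting a probability to odds and back returns the original probability.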

    As was explained above (in the notes explaining how we interpret raw SPSS output for logistic regression), SPSS reports odds-ratios to show how a one-unit increase in an independent variable changes the likelihood of a dichotomous outcome. The reason we use odds-ratios in logistic regression is that, for most yes-no outcomes, a one unit increase in an independent variable does not have the same effect on the probability of an outcome across the independent variable's range. Again, an example will help to illustrate a complicated concept:

    Let’s say a football team has a 20% chance of losing when fully healthy. That means its odds of losing are just 1:4: one expected loss for every four expected wins.

    Now suppose each injury increases the odds of losing by 1.5 times. Because we multiply changes in odds, not probabilities, the same increase in odds does not translate into a linear change in probability. Using the odds formula:

    0 injuries: odds = 1:4 (probability of losing = 20%)

    1 injury: odds = 1.5:4 (probability of losing = 27.3%)

    2 injuries: odds = 2.25:4 (probability of losing = 36%)

    3 injuries: odds = 3.375:4 (probability of losing = 45.8%)

    4 injuries: odds = 5.06:4 (probability of losing = 55.9%)

    5 injuries: odds = 7.59:4 (probability of losing = 65.5%)

    6 injuries: odds = 11.39:4 (probability of losing = 74%)

    As you can see, the same increase in the odds of losing with each additional injury does not have the same effect on the probability of losing. But knowing how much each injury increases the odds of losing allows us to predict the probability of losing at any given number of injuries.

    The reason we go through the hassle of mathematically converting SPSS's default output of odds-ratios into predicted probabilities is that most people find probability scenarios a lot easier to understand. For most people--whether academics or everyday folks--it makes a lot more sense to convey information in probabilities, which can be expressed as percentages: In the example, a team with two injuries has a 36% chance of losing, while a team with six injuries can be expected to lose 74% of the time.
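    The injury table above can be reproduced by multiplying the odds and then converting back to probabilities. A short Python sketch using the same made-up baseline and odds ratio:

```python
base_odds = 0.25   # 1:4 odds of losing when fully healthy (a 20% chance)
odds_ratio = 1.5   # each injury multiplies the odds of losing by 1.5

for injuries in range(7):
    # Changes in odds are multiplicative, so apply the ratio once per injury.
    odds = base_odds * odds_ratio ** injuries
    # Convert the odds back into a probability for easier interpretation.
    prob = odds / (1 + odds)
    print(f"{injuries} injuries: probability of losing = {prob:.1%}")
```

    Run as written, this reproduces the progression in the table, from a 20% chance of losing with no injuries up to a 74% chance with six.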

    • How can we quickly calculate predicted probabilities for specific scenarios using nothing but SPSS output and the assistance of generative AI?: To calculate predicted probability scenarios like those in the football example, we have to mathematically transform logistic regression's output for the relevant variables. While you can do these transformations inside SPSS, it is a very complicated, multi-step process (other statistical programs make it much easier). Fortunately, we can use generative AI to do these calculations accurately as long as you give it the right SPSS output and an effective prompt. Here's how:

    To calculate a predicted probability scenario with the assistance of Claude.ai or ChatGPT, do the following:

    (1) Use SPSS point-and-click options to build a logistic regression model. But rather than running the model, use the paste option to put the command code for the logistic regression command into a syntax file.

    (2) Next, copy all of the independent variable names from the syntax you just pasted. And on a new line, write the word DESCRIPTIVES, paste in the variable names, and add a period to the end of the command.

    (3) Next, select the blocks of code for the logistic regression model and the full descriptives command, and run that code.

    (4) You now should have SPSS output that includes the regression coefficients table (the last table in the logistic regression output) followed below by a descriptive statistics table that lists all that model's independent variables.

    (5) Next, you need to create a text template that you will use for your scenarios. To do so, open up the descriptives table output table and copy the column of independent variable names. Paste those names into your text document, and add "= mean" after each of them. For example:
    Republican = mean
    Democrat = mean
    Age in Years = mean
    Male = mean
    and so on...

    (6) Copy the SPSS table that includes the regression coefficients (the last table in the logistic regression output; the one that lists your odds ratios) and paste it into either Claude or ChatGPT.

    (7) Then do the same thing with the descriptive statistics table output.

    (8) Now tell the AI engine to get ready to calculate the predicted probabilities:
    "I want you to calculate some predicted probability scenarios for me, using the SPSS output for my descriptives table and the logistic regression model. In each scenario, I will provide the settings for the independent variable values I want analyzed, assuming that all other variables in the regression model are set to their mean values. Are you ready for scenarios?"

    • Now you are ready to provide the scenario information and calculate predicted probabilities:

    Copy the scenario template that you created earlier, and paste that into the AI engine (you just told it to get ready for scenarios). Modify the values of the independent variables to fit your scenario, changing ONLY the values related to the scenario you want calculated. For example, if you want to know the probability that a 35-year-old male Republican thinks the US is losing its cultural identity, you will change the age variable from mean to 35, the Republican variable to 1, and the Male variable to 1.

    Leave all other variables the way they are (i.e., at their means) unless one of the variables in your scenario is mutually exclusive of another variable in the model. Staying with the same example, if you need the predicted probability that a 35-year-old male Republican thinks something, any other party identification variables need to be set to zero. For example:
    Republican = 1
    Democrat = 0
    Age in Years = 35
    Male = 1
    And the other variables would all be = mean.

    If you want to calculate a scenario that involves a reference category dummy variable, set all of the related dummy-coded variables to zero. So, if Democrats are the reference category for your other dummy-coded party variables and you want to calculate the predicted probability that a Democrat did or thinks something, the other party-identification variables will all be set to zero.

    Finally, if you have an interaction term in your regression model, its value may need to be changed to match the scenario. For example, let's say you had a three-variable model predicting whether teams will win a game: N.injuries (to key players), PlayingAway, and injuriesXplayingAway. The interaction term tests whether the effect of injuries on the likelihood of winning is different when playing away. If you were calculating a scenario for the probability of winning when playing away with three injuries, your scenario would include "injuriesXplayingAway = 3" because 3 (injuries) × 1 (dummy for playing away) = 3. If the scenario looked at playing at home, both PlayingAway and injuriesXplayingAway would be set to zero.
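    What the AI engine is actually doing with your pasted tables is simple arithmetic: multiply each scenario value by its B coefficient, add the constant, and run the total through the logistic function. A Python sketch with entirely made-up coefficients (these numbers are illustrative, not from any model used in this course):

```python
import math

# Hypothetical B coefficients copied from a logistic regression's
# coefficients table (the "constant" is the model's intercept).
coefs = {
    "constant": -1.20,
    "Republican": 0.80,
    "Democrat": -0.30,
    "Age in Years": 0.02,
    "Male": 0.45,
}

# Scenario: a 35-year-old male Republican. Democrat is set to 0 because
# the party dummies are mutually exclusive; in a real model, all other
# variables would be set to their means.
scenario = {"Republican": 1, "Democrat": 0, "Age in Years": 35, "Male": 1}

# The logit (log-odds) is the constant plus the sum of B * value.
logit = coefs["constant"] + sum(coefs[v] * val for v, val in scenario.items())

# The logistic function converts the log-odds into a predicted probability.
prob = 1 / (1 + math.exp(-logit))
print(f"Predicted probability: {prob:.1%}")  # → Predicted probability: 67.9%
```

    With these made-up numbers, the logit works out to 0.75, which the logistic function converts to a predicted probability of about 68%. The AI does exactly this, just with your model's real coefficients and means.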


    You will be happy to know that the block of material on logistic regression ends the new material for the course.


    This is what the end of the term schedule looks like:

    This year, Thanksgiving falls very late in the semester. Because of this, the week of the holiday break and afterward will be spent mostly reviewing or completing assessments intended to help you prepare for exams. The last week with new material in the course will be week 14.

    • Monday, November 17: First practice (but mandatory) SPSS exam. This is the first of two mandatory, but ungraded, SPSS exams to prepare you for the graded exam that you will take later in the semester. Remember that logistic regression will not be part of the end-of-term SPSS practice or final SPSS tests. You may bring a page of notes with you for the practice and final SPSS tests. The practice tests are mandatory; if not completed, zeroes will be entered as additional SPSS homework (BlackBoard) grades.

    • Wednesday, November 19 and Friday, November 21: We will cover logistic regression in class.

    • Monday, November 24: Second practice (ungraded, but timed and mandatory) SPSS test on BB. The concepts covered on this test will be drawn from the same list I used (see above) for your first practice test. Remember, you may bring a page of notes with you for the SPSS tests. This practice test and the final version will not have detailed reminders of how to use SPSS for different types of problems. The practice tests are mandatory; if not completed, zeroes will be entered as additional SPSS homework (BlackBoard) grades.

    • Wednesday, November 26 and Friday, November 28: No classes (Thanksgiving)

    • Monday, December 1. Normally, the SPSS test would be on this day, but concerns about air-travel delays have led me to put it on the last day of class. The expectation is that you will be in class on Monday, barring unplanned travel issues. This will be a day for you to ask questions about the final exam and then to work on either your logistic regression assignment or to practice more for the SPSS test. 

    • Tuesday, December 2, by 5 pm: SPSS #6 assignment on logistic regression due. This is the last SPSS Blackboard assignment of the semester.

    • Wednesday, December 3, SPSS test (10% of the course grade).  This test will be very similar in format to the practice test you took before the holiday break. The concepts that will be covered on this test will be drawn from the same list I posted for your first practice test.

    • Thursday, December 4. All of the BlackBoard SPSS workshops will be closed permanently at 10pm on Thursday.

    • Your final exam for this class will be held during the University's scheduled exam period: Saturday, December 6 at 8am. All students will be required to take the third unit exam during the final exam period. You also will be writing an essay. If you miss the SPSS test for any reason, you will take it during the last part of the final exam period.