Please note: At your instructor's discretion,
there may be minor alterations to the reading
assignments listed below. One of the major advantages to
providing you with an on-line readings archive is that
timely articles can be added or substituted when
appropriate. Opening documents downloaded from this
website will require that your computer have
Acrobat Reader. You will also need the
class-specific password to open individual files.
UNIT 3 ASSIGNMENT SCHEDULE
Week 12
Topic 13 (Monday, 11/4)—Determining whether two or
more groups' means and proportions are statistically
different with t-tests.
In most social science
studies, researchers want to explore whether variation in
one or more independent variables corresponds to
differences in a dependent variable. In the sample article
you were asked to read this week, it was hypothesized that
certain characteristics and views made a person more
likely to vote for a controversial Brazilian candidate
(measured with a dummy variable). In class, we asked
whether folks who vary by gender, partisanship, or race
have different average household incomes (an interval
variable). Now we will learn how to see whether differences
observed among groups in a sample are likely to hold for the
larger population.
-
What is a t-test? If
we just want to see how different groups of
respondents varied in how they answered a survey
question, we can split our data and then calculate
descriptive statistics for a dependent variable we
care about. And it can be helpful to display those
differences in bar charts, which can be quickly
created in a spreadsheet. However, how do we know if
any differences we are seeing in our sample are large
enough that we would expect that finding to hold for a
larger, more general population? To test whether this
is the case, we use t-tests. For example, we could use
a t-test to see if what we saw in our sample--that
Republicans make a little more money than
Democrats--would hold up if we repeatedly surveyed
representative samples of Americans.
-
Ahead of class, carefully read the first
sections of Chapter 8, "Bivariate Analyses"
in Carolyn Forestiere's textbook. Read just up
to the section on correlation analysis (i.e.,
just the first six pages of the chapter). Correlation
analysis will be covered after the Unit 2 test.
-
Before class, read
about how to do t-tests in SPSS: This topic
is covered in the how-to handout for statistical
analysis, which includes summaries of how to use
SPSS for both types of t-tests described in this
week's materials.
-
Print out and keep
handy this one-page handout of annotated SPSS
output for T-tests.
When and how do we use
independent sample t-tests?
-
Forestiere's chapter talks about the most commonly used
t-test, an "independent samples test." This test assumes you are
looking at whether two groups coded on the same independent variable have statistically
different means for a second variable.
Staying with the example I just gave, this test would only
be appropriate if you have a categorical variable
where respondents were coded something like:
1=Democrat, 2=Republican, 3=Independent, and 4=Other
party.
Some things to remember from
the video:
(1) To run an independent
samples t-test: Analyze -> Compare means ->
Independent samples T-test. Then, select a variable
whose mean you want to examine across two
subgroups. (A syntax sketch of this test follows this
list.)
(2) You next need to
specify which values of the grouping variable will be
compared (click on the button that says "Define
Groups"). In the sample video, Republicans were coded
1, Democrats 2, and Independents 3 in the original
dataset. To compare the means of Republicans versus
Independents, the values 1 and 3 would be specified.
(3) Make sure that you are
looking at the correct block of results and the correct
column to determine if the difference in means is
statistically significant. The significance test you
want is in the bottom block of output (not the "Group
Statistics," but rather the block "Independent
Samples Test"). In that block, look at the top row of
results ("Equal variances assumed") and find the column
labeled Sig. (2-tailed). To repeat, the one you are
looking for is in the row for "Equal variances assumed."
(4) Only if the two-tailed
significance statistic is SMALLER than .05 can we say
with any confidence that the mean values for the two
groups are statistically different and that we would
reach the same conclusion if we drew repeated samples
from the same population; conversely, a significance
statistic that is LARGER than .05 indicates that the two
groups do not have statistically different means.
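If you prefer typed commands to menus, here is a minimal
SPSS syntax sketch of an independent samples t-test. The
variable names (income and party) and the group codes are
hypothetical stand-ins for the example above:
* Compare mean income for the groups coded 1 and 3 on party.
T-TEST GROUPS=party(1 3)
  /VARIABLES=income.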
When and how do we use single-sample t-tests?
-
Forestiere's chapter does not discuss a second type
of t-test, the single sample t-test, which can do
everything an independent-sample T-test does and more.
Since she doesn't cover this second type of T-test,
please read through the paragraphs below very
carefully.
-
A single sample t-test determines whether the
mean value for a group on a dependent variable is different from a
specified "test value," such as the mean value for another
group (or for the sample as a whole).
Your textbook refers to the "DataPrac" survey, which
is a dataset that comes with your textbook so that you
can practice methods discussed in the book. We are not
using that dataset this semester; however, if you were
to analyze the DataPrac survey's variable D72, you
would see that the typical American (i.e., the survey
mean) had a response value of 7.17 on a 10-point scale
that measures religiosity. For this indicator,
respondents were asked to place themselves on a scale
where 1 means that God is "not at all important" in
their life and a ten indicates they believe God is
"very important" in their lives. Is the sample's mean
value for this variable different than the mean for
individuals who say they plan to vote for the
Democratic candidate in the next election? How about
Republicans? Is the religiosity of men lower or higher
than the national average for this item? How about
women?
-
We can answer each of these questions and even build
a table comparing them if we split our data on each of
the relevant independent variables and then run a single sample
T-Test (Analyze -> Compare Means -> One Sample
T-test) for each group we care about. For
each of the tests, we would enter the average for the
sample as a whole, 7.17, into the place in SPSS that
asks for the "Test value."
So, if we split the data by the DataPrac variable
D14 (partisanship) and run a one sample t-test in SPSS
with a test value of 7.17, we see that the mean score
for Democrats on the importance of God in their life
is 6.60, and the two-sided significance test reports
that the difference between the test value and the mean
for Democrats is significant at the .001 level. The
same test shows that the average Republican score for
the importance-of-God measure is 8.25, which is
significant at the <.001 level (SPSS displays this as
.000). In other words, if we surveyed similar samples
1,000 times, we would almost always find that the
average value for Democrats on this variable was lower
than the national average. And a separate t-test (run
automatically because the partisanship variable was
split) shows us that the average for Republicans should
almost always be higher than the national average. To
see if men or women also are different from the national
average for this religiosity variable, we would just
need to go back to Data -> Split File -> Compare Groups
and swap out the variable D14 for the gender variable
(we would leave the test value at 7.17, the average for
the full sample).
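Here is a minimal syntax sketch of that split-and-test
workflow. The variable names D14 and D72 come from the
DataPrac example above; treat the exact commands as an
assumption about how the menu steps translate to syntax:
SORT CASES BY D14.
SPLIT FILE LAYERED BY D14.
T-TEST /TESTVAL=7.17
  /VARIABLES=D72.
SPLIT FILE OFF.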
-
Optional: Watch after
class, if you need more guidance on
calculating a one sample t-test: https://youtu.be/paUIJ3Eh7JI
(a little over five minutes). In the video, a test is
run to see whether the average for the variable male
(coded 0/1) is different from the value of .50, which
is about the percentage of men we would expect to find
in a nationally representative sample (e.g., it would
be due to something other than sampling error if we were
to find that 60% of a 3000-person random sample was
male).
Some things to remember from
the video (so you don't need to watch it more than
once... or maybe even at all):
(1) To run a one sample
t-test: Analyze -> Compare means -> One
sample T-test. Then, select a variable whose mean you
want to examine.
(2) You next need to
specify a "test value" to which you want to compare the
mean for your variable of interest. In the video, the
mean for the variable male is .54, which is compared to
an expected value of .50. Per the commentary above, the
test value of .50 was used to see if there are more
males in this sample than one would expect to find in a
nationally representative survey.
(3) Only if the
significance statistic for the "two-sided p" result is
SMALLER than .05 can you say with any confidence that
the mean value observed in the sample is truly
DIFFERENT from the expected value; a significance
statistic that is LARGER than .05 indicates that the
observed and expected mean are not statistically
different. In this particular example, a value greater
than .05 would mean that the sample's proportion of men is
not larger than the 50% we would expect to find in a
nationally representative sample.
(4) It is not covered in this screencast, but
remember from the religiosity example above that if you
want to test whether the mean for a subgroup (perhaps
women, or Republicans, or Catholics) is different from
some value (maybe 50%, or the sample average, or the
mean of some other group you care about), you can
split your data to isolate the subgroup and then run
the t-test.
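The video's example, expressed as a syntax sketch (the
variable name male comes from the video; the rest is an
assumption):
T-TEST /TESTVAL=.50
  /VARIABLES=male.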
Wed, 11/6—Lab time to
work on SPSS #3
in BlackBoard, which will be due
by 6 pm on Friday.
Topic 14 (Friday, 11/8)—Determining whether and how much
two variables are "associated" with one another
-
In class,
we will focus on three of the many methods social
scientists use to determine how two variables are
associated: chi-square tests, correlation tests, and
bivariate regression.
-
In class, most of our time will be spent
continuing to look at bivariate correlation and--if
there's time--regression with one independent and one
dependent variable.
-
After class,
please finish Chapter 8 in your Forestiere textbook (you
previously read up through the section on t-tests, so start at the section on
correlation). Please read the political
science examples carefully. The reading will be
faster and easier to understand if you wait to
complete it until after we have covered correlation
in class.
-
After class,
please read the first 15 pages or so of Chapter 9,
"Regression," in your Forestiere textbook (up to page 199). Review the
section on regression with one independent and one
dependent variable. The
reading will be faster if you wait to complete it
until after we have begun to discuss regression in
class.
For
X2 (chi square) and other association statistics,
here are the key concepts you need to remember from
class:
-
Before class,
but after you have read about correlation tests in
the textbook, read through the block of
material below carefully, and quickly read a very
short piece on what a chi-square test is to give
you a clearer idea of what SPSS is doing when it runs
this kind of association test. Chi-squared
tests are not covered in your textbook,
so you need to review this statistical measure in
the material below and the assigned, short
reading.
-
Before class, print out and have handy this annotated SPSS output for a
Chi-Square test. In the sample, the
researcher is trying to see if a Brazilian's race (a
categorical variable) had anything to do with
whether or not they voted for the politician who was
elected president in 2018.
-
Because they are the most commonly used association tests in
political science and international relations
research, we are focusing mostly on correlation
and regression. They are the only association
measures covered in any detail in the Forestiere
textbook. Other
than correlation and regression, the only
association statistic you need to be familiar with
for this course is the chi-square (x2) test.
-
To calculate a Chi-square test, use SPSS's
Analyze -> Descriptive Statistics -> Crosstabs. If
you think that one variable is the cause of the
other, the independent variable (the cause)
typically goes into the rows, while the dependent
variable should be listed in the Columns window. To make
the table useful, go to the "Cells" option and, under
"Percentages," check only the "Row" box (leave the
"Observed" box checked under "Counts"). Then, in the
"Statistics" option, check the box for "Chi-square."
If you need more guidance on this
procedure, you can watch this screencast: https://youtu.be/7O3UTYL2A-I.
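For reference, here is a minimal syntax sketch of the
procedure just described; race and vote are hypothetical
variable names standing in for the Brazil example:
CROSSTABS /TABLES=race BY vote
  /CELLS=COUNT ROW
  /STATISTICS=CHISQ.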
-
To get a basic understanding of what a chi-square
test is and how association measures work, read
carefully just the first seven pages of this document (read
up to the section on "residuals"). Here is a summary of
what the reading says, with a simplified example:
The main point of the reading
is that a chi-square test provides a statistical test
to determine whether any association between two
nominal/ordinal variables that we see in our sample data is
due to chance. In other words, it estimates the
probability that repeated sampling would find that a
respondent's category on one variable has nothing
to do with their category on a second variable.
An example can provide a basic
idea of what a chi-square test looks at. Let's say
that we have a 1,000-person sample where exactly half of the
individuals have identified as women and half as men. This
being a sample from an odd, hypothetical US state, we also
have a sample with exactly 50% Democrats, 50% Republicans,
and no independents.
If gender has no association at
all with partisanship in our sample, we would expect to
see that 25% of our sample is made up of female Democrats,
25% female Republicans, 25% male Democrats, and the final
25% male Republicans.
However, a hypothetical
analysis might reveal that 30% of our sample is made up of
female Democrats and only 20% is made up of male Democrats.
Thus, in our sample, it looks like there is an association
between gender and partisanship (specifically, more women
than expected are Democrats and fewer are Republican).
The chi-squared test will tell
us (and ONLY tell us) whether the association between
gender and partisanship that we are seeing in our sample
is due to sampling-error chance. The p-value for the test
will tell us the probability that repeated
sampling would sometimes find that women are more likely
than men to identify as Republicans, which would be contrary
to our hypothesis and the finding in our sample.
A p-value of .05 or smaller
for the chi-square statistic tells us that there is
only a 5% chance or less that the association we are
seeing in our sample is due to chance (i.e., survey
error, which is a function of sample size) and that we
should expect repeated sampling to show a similar
association at least 95 percent of the time. Given
the magnitude of gender differences in the hypothetical
sample above and its size (n=1000), the chi-square test
would be significant in this case.
However, as is the case
with statistical techniques generally, if you are using
a very small sample or looking at a variable where you
have very few individuals in some response categories, a
chi-square test may not return a statistically
significant result. This is why it is important to
run frequencies on variables and think carefully about
whether response categories should be combined (e.g., it
is very common to see a multi-racial measure recoded
into a white/non-white dummy variable before analysis if
the sample is under 600 or so respondents).
If you want to be a
competent consumer of social science research, you
should be aware that there are other statistical
methods that can provide more accurate tests of
association when you are looking at the relationship
between any specific combination of two categorical,
dummy, or ordinal variables. If you are
curious, here is a summary of the association tests
SPSS can quickly compute: https://www.ibm.com/docs/en/spss-statistics/25.0.0?topic=crosstabs-statistics.
We are learning only about chi-square tests because
they are widely reported in both academic and everyday
publications.
For correlation statistics,
here are the key concepts you need to remember from
the chapter and class:
-
To run a correlation in SPSS, use Analyze ->
Correlate -> Bivariate. If you enter
more than two variables, you will get a "correlation
matrix," showing you the relationship between each
pair of variables.
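The equivalent syntax sketch, with hypothetical variable
names (SPSS defaults to Pearson correlations and
two-tailed significance tests):
CORRELATIONS /VARIABLES=var1 var2 var3.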
-
Correlation measures should be interpreted with close attention to whether or not they are statistically significant.
If the
p-value is higher than .05, we cannot conclude there is an association, regardless of how large the
correlation statistic is. For correlation,
the p-value tells you the probability that the
association found in the
sample could be zero or signed in the opposite
direction in repeated sampling. To say that one
variable is a statistically significant predictor of
another, the p-value needs to be .05 or less. In
SPSS, make sure to look at the p-value even if you
see two asterisks. For some odd reason, the default
setting in SPSS only adds two asterisks to
coefficients even when the p-value is <=.001,
which should be denoted with three asterisks.
-
Correlation does
NOT tell you how much a change in one variable
changes the other variable. It also cannot tell you which
variable may
be causing
the other to move. For example, being more
conservative is correlated with being more
religious, but there are theories to suggest that
causality could go either way.
-
Moreover, even if two variables are highly
correlated, it
could be that there is a third variable that is causing both x and y to
change in predictable ways even though those two variables have no actual
relationship. For example, in the US,
violent crime goes up in the same months that ice
cream consumption goes up, but
they don't have anything to do with each other
except that both are more prevalent on hot summer
days. "Omitted variable bias" is one of the reasons
we will be talking about multivariate regression
models next week.
-
Most of the association statistics range from -1 to
1, and a negative
correlation statistic means that increased values
in one variable are associated with declines in the
other variable. Typically, positive
correlation statistics are not marked with a plus sign.
-
The square of
the correlation coefficient (r-squared) is used
to estimate how much of the variation in one
variable is "explained" by the other one,
with a key caveat noted above: a missing variable
may be explaining some or all of the variation...
which is why we typically look at relationships
between two variables with multivariate regression
that includes one or more "control variables"
(more on that next week).
As a general guideline for thinking about
association statistics, like correlation:
<.10 means that there is a very weak or no
association between the variables;
.20 can be interpreted as a meaningful but modest
association;
.30 is a moderate association; and
>.40 is a strong association.
But in every case, you need to put these findings
into context (e.g., a .40 association between
being Republican and being conservative would be a
much weaker finding than you would anticipate, so it
wouldn't make sense to refer to this scenario as being
evidence of a very strong association).
-
Correlation
assumes that the association between two variables
is linear and the same for different values of the
variables.
There can be a close relationship between two
variables, but correlation won't measure it if the
relationship is not linear. Think about a
person's age and their physical independence. For a
while, each year means more physical independence,
but at a certain point, the relationship becomes
negative; there is a curvilinear relationship
between age and physical independence. With income
and happiness, increases in income initially push
happiness up consistently. However, at a certain
point (about $100K annually in today's
dollars), more income doesn't seem to improve or
decrease happiness. So, the relationship looks
sort of like a backwards 7. And then, there's the
relationship between time and the growth of invested
money, AKA, the miracle of compound
interest, which looks kind of like going along a
road with a small climb and then starting up a
mountain. These are three common non-linear
relationships. There are ways to address these types of
relationships, but we won't look at them. One
option that fits with what you will learn in this
class is to create dummy variables and interaction
terms at the key pivot points (for age, that might be
Under21, Age21to65, and OlderThan65; see the sketch
below).
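A minimal sketch of how those age dummies might be
created with SPSS syntax (the variable name age and the
cut points are hypothetical):
RECODE age (LOWEST THRU 20 = 1) (ELSE = 0) INTO Under21.
RECODE age (21 THRU 65 = 1) (ELSE = 0) INTO Age21to65.
RECODE age (66 THRU HIGHEST = 1) (ELSE = 0) INTO OlderThan65.
EXECUTE.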
For linear regression with one independent
variable (i.e., bivariate regression),
here are the key concepts you need to remember
from the chapter and class:
-
Linear (i.e., OLS--Ordinary Least Squares)
regression models report an R-square statistic,
which is interpreted as noted above in the section
on correlation. An
R-square statistic of .35 means that the
independent variable in the regression model
explains 35% of the variation in the dependent
variable (and fails to explain the other 65%). You
will get some sample language on reporting R-square
results in the section on regression with multiple
variables below.
-
Regression estimates how much a one unit
increase in the independent variable will
correspond to changes in the dependent variable. Specifically,
regression output includes a slope measure for each
independent variable. This statistic is
called the unstandardized regression
coefficient. In SPSS output,
unstandardized coefficients are listed in the "B"
column (make sure you are looking at the
first column of the last block of output).
This regression coefficient tells us how much "each
one-unit increase in the independent variable
corresponds with an x-unit increase (or decrease if
the coefficient is negative) in the dependent
variable." In plain English, we might say, "each
one-unit increase in the 10-point measure of
religiosity corresponds to a 1.34-point increase in
an individual's score on the 10-point
ideological-conservativeness scale."
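A minimal syntax sketch for a bivariate model like this
one (conservatism and religiosity are hypothetical
variable names):
REGRESSION /DEPENDENT conservatism
  /METHOD=ENTER religiosity.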
-
With regression, there is a statistical test for
each regression coefficient where the p-value
tells you the probability that the relationship
found in the sample could be
zero or even run in the opposite direction in
repeated sampling. To say that one variable is
a statistically significant predictor of another,
the p-value needs to be .05 or less. These models
also include a value for the y-intercept (in the output,
this is the unstandardized "Constant").
-
With regression results, you can predict the
value of the dependent variable at selected values
of an independent variable. The constant
(aka, the y-axis intercept) can be used to predict
the value of the dependent variable for a given
scenario with a simple formula: DV value =
Constant + (a specified value of the IV times the
regression coefficient). If the IV is a dummy
variable, the language used to interpret its
regression coefficient is: "Compared to the
reference category of (carefully describe anyone who
is not in the group), individuals who are in the
group had an x-point higher (or lower if the dummy
variable coefficient is negative) value on the
dependent variable." In plain English, this might
sound like, "Compared to non-Republicans--that is,
Democrats plus independents--Republicans' score on
the 10-point measure of religiosity was 2.4 points
higher."
-
As with correlation generally, regression models
assume that every one unit increase in the
independent variable will have the same effect on
the dependent variable. This is referred to
as the assumption of linearity. Examples of
how variables can be related with one another but
not have a linear relationship include time and
investments (over time, investment returns compound
so growth is exponential) and the curvilinear
relationship between age and physical independence.
Regression can handle these types of relationships
in a few different ways, one of which is using dummy
variables and interaction terms (this isn't the most
common way, but it is the only way that fits neatly
with concepts you are going to learn in this
class). If we suspected that age has a different
effect on income, a series of dummy variables (say,
reference group = under 35, with additional dummies
for 35-50, 51-65, 66-75, and over 75 years old)
likely would show that wage-earned income, on
average, quickly increases as one moves into the
middle-age categories and then declines among the
oldest groups.
-
Completely
optional: if you have attended class and
carefully read the textbook material on
correlation but feel like you would like to go
over the basics of this method one more time, you
can watch this 25-minute (12.5 at x2 speed)
screencast presentation covering the logic and
main concepts of correlation: https://youtu.be/pjDDBrunB1A.
Note: The screencast goes over the same conceptual
material we will have reviewed in class, and doesn't
cover the use of SPSS.
-
Completely
optional: if you have attended class and
carefully read the textbook chapter material on
bivariate regression but still feel like you would
like to better understand the basics of this
method, you can watch this 19-minute (10
at x2 speed) screencast presentation covering the
basics and logic of bivariate regression: https://youtu.be/K8A6xGIXPR8.
Note: The screencast goes over the same conceptual
material we will have reviewed in class, and doesn't
cover the use of SPSS.
Week 13
Topic 15 (Monday, 11/11; Wednesday 11/13)—Linear
regression with multiple independent variables,
including dummy and interval variables
Most of the topics below will be covered in Monday's
class. We will finish up any remaining material on
Wednesday and spend the rest of that class working on a
BlackBoard assignment that asks you to use SPSS to run and
interpret linear regression models.
-
SPSS #4
(posted to Blackboard) is due by 6pm on
Wednesday. This is the
assignment that gives you more practice coding dummy
variables and running/interpreting T-tests. The
assignment also covers the main concepts behind
Chi-squared tests and correlations as well as how to
run these analyses and interpret their results in
SPSS.
-
Ahead of Monday's class, starting with page
199, read to the end of Chapter 9, "Regression," in your Forestiere textbook (17pp).
-
Print out this document ahead of class: A handout of SPSS linear
regression output with annotations.
The handout covers the same topic as the
screencasts: using SPSS to predict the level of
support for torturing terrorism suspects, as measured
by a 7-point Likert scale (1 = "never justified"; 7 =
"always justified") that is being treated here as an
interval variable. You should retain your copy of this
document because you will find it handy when you
complete the BlackBoard assignment on regression.
-
Ahead of Wednesday's class, read the first
13 pages of this conference paper by Dr.
Setzler and Dr. Yanus (up to the note:
TABLE 3 ABOUT HERE). This is a first-draft,
working-paper version of a study written for a
conference section that is primarily interested in
gender and politics. The tone and setup of the paper
were targeted at that particular audience. A revised, more-focused version of
the study adopted a more neutral tone and was
published by one of the American Political Science
Association's journals; the article has been cited in over 100 other
published studies.
The reason you are being asked to read this particular
paper is that much of it focuses on what factors
predict whether a person values gender equality, as
measured by variables that can be treated as linear in
nature. In other words, it is a study that uses linear
regression (the published version uses only "logistic"
regression). The conference paper version also includes
a few dummy-coded dependent variables, so you will be
asked to reread the same paper when we get to the section
of our course devoted to binary logistic regression.
Below are summary notes for the major concepts
covered in class and your textbook to understand how
multivariate linear regression works and is
interpreted:
General concepts for multivariate linear
regression (i.e., more than one independent variable).
Here are the key ideas you need to remember from
the textbook chapter on regression and class:
-
All of the key concepts listed above for
bivariate regression (i.e., one independent
variable) apply to multivariate regression, too:
the R-square statistic, constant, and statistical
significance statistics are interpreted in the same
way. All regression assumes that each independent
variable has a linear effect on the dependent
variable (see what this means by reviewing the notes
under correlation and bivariate linear regression
that explain the "assumption of linearity").
Also, the individual unstandardized regression
coefficients are all interpreted in a similar way as
they are with bivariate regression, except
that the result for each independent variable captures
how much a one-unit increase in that independent
variable increases/decreases the value of the
dependent variable when the
influence of all other variables
in the model is held constant at their mean values.
For example, if you are looking at the effect of
each additional level of education on income and the
only other independent variable in the model is the
dummy variable Male, the regression results
for the education variable would be calculated with an
equation that controls for the effect of gender by
calculating .5 x the positive effect of being a male
(i.e., the regression coefficient for the gender
variable). Why .5? This is the mean value for Male.
The interpretation of multivariate regression
is a bit more complicated when you have interaction
variables (e.g., MaleXRepublican) or multiple
dummy variables coded from the same variable (e.g.,
race or political party dummies). These types of
variables are discussed below.
-
As with bivariate regression, multivariate
regression results allow you to predict the value
of the dependent variable under different
scenarios. The way this is done is to
assign specific values to the independent variables in
your scenario and use the independent variables' mean
values otherwise. More details on this below, but
this is the key idea for how scenarios work.
-
With multivariate regression, you determine
which variables are most important by comparing
their "standardized regression coefficients."
These are also called "betas" and are located in the
SPSS output column labeled as such. Recall that we
can compare standard deviations of different types
of measures in useful ways. So, comparing a 34 score
on the ACT to a 1500 on the SAT is not easy, but
comparing how many standard deviations each of those
scores is from its test's mean would tell you that
the ACT score represents relatively higher
performance. A beta tells us how much each
one-standard-deviation increase in an independent
variable increases/decreases the dependent variable's
value, measured in standard deviations. In other words,
the farther away a given independent variable's
beta is from 0 (betas can be positive or negative),
the more important that particular independent variable
is in predicting the dependent variable's value.
- With
multivariate regression, there is the added
assumption that each of the independent variables
is at least somewhat independent of the others. If
two or more independent variables are very highly
correlated, the statistical results in the model may
not be able to determine how changes in those two
variables separately influence the dependent variable.
This problem is known as multicollinearity.
To use and
interpret regression output with dummy variables,
here are the key concepts you need to remember from
class and your textbook:
-
Critical: to
interpret any dummy variable in a regression
model's results, you have to know what the
variable's reference category is.
-
Whenever you
interpret
a dummy variable, your interpretation should explicitly identify the reference
category.
For example, if a regression model only includes the
variable Latino, the interpretation of that
variable's regression results will
start with phrasing like, "Compared to the
typical non-Latino, the estimated household income
of a typical Latino is $1,300 less a year after
controlling for the other variables in the
regression model."
-
If there is just
one dummy variable in a model and it was coded
from an original variable that had just two
response categories, the reference category
is easy to identify. For example, we might have data
where respondents have been coded one if they
believe freedom is more important than equality and
zero if they think equality is more important than
freedom. If the regression model includes the dummy
independent variable FreedomIsMostImportant, that
variable should be interpreted with phrasing like
this: "Compared to respondents who prioritize
equality, those who think freedom is more important
had an x-point higher score on the 10-point
dependent variable measuring y."
-
If a regression
model includes a dummy variable derived from a
multi-category original variable, think carefully
about how many of those groups have dummy
variables in the model and thus what the proper
reference category is. Consider a
regression model using persons'
characteristics to predict how much they think NATO
is important to international security, measured on
a 5-point scale. Let's say this model includes
"Democrat" as its only partisan dummy variable. If
so, the reference group when interpreting the
Democrat coefficient is non-Democrats (i.e., since
neither independents nor Republicans are in the
model, both groups are the reference category). In
this example, the variable Democrat should
be interpreted with phrasing like this: "Compared to
Republican and independent respondents, Democrats had an
x-point higher score on the 5-point indicator for
seeing NATO as important to international security."
-
If you are running
a regression model to test a hypothesis comparing
two groups, one of those groups needs to be the
reference category. Consider the
example of looking at how partisanship shapes voting
for female candidates (something Dr. Setzler has
written a lot about, incidentally): if we want to
compare Democrats' likelihood of voting for a
female candidate to Republicans' likelihood, we
would need to add a second dummy variable to the
regression model for the people who are
independents. Once the regression model included
dummy variables for both independents and Democrats,
those variables' regression coefficients could be
compared to Republicans, who would be the omitted
reference group.
To use and interpret interactive-term
variables, here are the key concepts
you need to remember from class:
-
We create and use
an interaction term in a regression model when we
think that the influence of one independent
variable on the dependent variable depends on the
value of a second independent variable.
For example, we might want to analyze how being a
political science major and the number of hours
studied before an exam in general education
political science classes influence the typical
student's test score.
Let's say we collected a year's worth of survey data
on students' study habits and their test scores in
these types of classes. If we were to calculate a
regression model with just these two variables, we
presumably would find that both factors are
significant predictors of higher test grades but
that there are lots of other factors, too (i.e., our
model probably wouldn't have a high r-square
statistic).
If we saw that both being a major and studying
improved test scores, we might wonder if the effect
of each hour of additional study is different for
PSC and non-PSC majors. Maybe, studying pays off
more for non-majors because they have less of a
background in the subject area and more to learn. On
the other hand, maybe studying pays off more for PSC
majors because they have more interest in the
subject and are better able to retain information
about it.
To test either of these hypotheses, we need to
create and add an interaction term to our regression
model.
-
To create an
interactive term variable, you just create a new
variable that multiplies together each
respondent's value for the two relevant variables.
In the example above, we would use SPSS to create a
new variable with coding that looks something like:
COMPUTE NewVariable = first_IV * second_IV.
So, in this case:
COMPUTE PSCxHrsOfStudyBeforeExam = PSC *
HrsOfStudyBeforeExam.
EXECUTE.
The "interactive term" would then be added to the
regression model along with the two variables from
which it was formed (both of the original variables
and the interactive term must stay
in the regression model). And then we would
rerun the regression model and look at our results.
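Putting the pieces together, a minimal syntax sketch of
the full model (the exam-score variable name is
hypothetical):
REGRESSION /DEPENDENT ExamScore
  /METHOD=ENTER PSC HrsOfStudyBeforeExam PSCxHrsOfStudyBeforeExam.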
-
If an
unstandardized coefficient for the interactive
term is positive and significant, we know
that the combination of the two variables has more
of a positive effect on the dependent variable
than just the additive effect of each variable. In
the example, this would mean that each hour of study
is paying off more for PSC majors than non-majors.
-
If an interactive
term's coefficient is negative and significant,
it means that the combination of the two variables
has less than the full effect we would expect if we
were simply to add their separate effects together. In the
example, this would mean that the effect of each
additional hour of study is less for PSC majors than
non-majors.
-
If an interactive
term's
coefficient
is not statistically significant, the value
of the second independent variable does not
influence the relationship between the first
variable and the dependent variable. In the example,
this would mean that there is no difference in the
grade improvement payoff of additional studying for
PSC majors and non-majors.
In Wednesday's class, we will practice calculating
scenarios with regression output. What is a scenario, and
what key concepts do you need to remember from
class? The standard way to talk
about the effects of different variables in regression
models is to use a regression results table to explain
how a one-unit increase in a given independent variable
changes the value of the dependent variable when the
effect of all of the other variables is set at each
variable's mean value. While that makes for nice,
succinct tables, discussing situations that compare
different types of hypothetical individuals who are
otherwise similar can help to bring regression results
alive and make them more useful. For example, if
we were analyzing how much different kinds of people
support the legalization of marijuana, we could
calculate a regression model looking at the influence of
gender, religious denomination (i.e., dummy variables
for several of them), and age with controls for a
person's income and education. Using the model's
results, we could compare the level of support for a
30-year-old secular male versus a 60-year-old
evangelical Protestant female with comparable incomes and
educational backgrounds.
-
How do you calculate a regression scenario? To create
a regression results equation, first identify the
value of the unstandardized coefficient for the
"Constant" (use the value listed in the B column of
the SPSS results output box--the last one--that
lists each variable). Then, to that constant value
add the effect of each independent variable (i.e.,
its unstandardized regression coefficient)
multiplied by a value you specify. If you want
to control for any variables, their specified
values are their mean for the full sample.
-
Here's an example of a complex scenario (complex only
because the number of variables is pretty high).
The example uses the descriptive statistics and
linear regression models reported in the Setzler
and Yanus paper you were asked to read ahead of
class. If you hadn't just read that study, I
would have used a simpler example.
Let's use the article's
regression tables and descriptives to create scenario
comparing two people's score on the 7-point measure
looking at a person's indifference-to-gender-equality (as
measured by how unimportant a person thinks it is to fight
for gender equality). What is the predicted score for an
older Republican male without a college degree who is
otherwise similar to other Americans (at least on the
variables in the model)? What is the predicted score for a
female non-Republican (i.e., independent or Democrat) who
is otherwise identical to her male Republican peer?
To calculate the expected indifference-to-gender-equality
score for the Republican male, we would use the following
formula:
The model's unstandardized constant (1.046)
+ 1 times the unstandardized
regression coefficient for Republican (i.e., the scenario)
+ 1 times the
Male coefficient
+ 1 times the
No college degree coefficient
+ 1 times the
aged 45 and older coefficient
+ 0 times the
age 30-44 year coefficient
And now we'll add in the controls, using their mean
values in the scenario:
+ .792 times the White coefficient (i.e., 79.2% of this
sample was white)
+ .502 times the Religiosity coefficient
+ .441 times the Blue collar coefficient
+ .399 times the Rural coefficient
+ .500 times the Authoritarianism coefficient
+ .294 times the Racial animus coefficient
= The individual's expected value on the 7-point
indifference to gender inequality measure.
Formula in hand, you can plug the whole equation into the
super-cool online computer at https://www.wolframalpha.com/:
1.046 + (1 * .912) + (.792 * .119) + (1 * .444) + (1 *
-.097) + (1 * -.246) + (0 * -.078) + (.502 * -.082)+
(.441 * .190) + (.399 * .013) + (.500 * -.552) + (.294 *
2.69)
Wolfram Alpha tells us that an otherwise typical
older Republican male with
no college degree has an estimated
gender-equality-indifference score of 2.71.
Using the same equation in WolframAlpha and changing just
the scenario values for Republican and Male variables, we
see that the expected value for the female non-Republican
is 1.36 on the 7-point scale. That's just about half the
value we found for Republican males.
- Scenarios are helpful in understanding and
visualizing the effects of interaction terms. In
the example above, we might theorize that
not having a college degree makes men particularly
inclined towards sexist views. This is another way of
saying that the effect of educational attainment on
sexism is different by gender. To test this
hypothesis, we would create an interactive term (here,
the dummy MaleXNoCollege) and add it to our regression
model along with the variables Male and NoCollege. In
calculating our scenario, our equation would include:
(1 x the coefficient for Male) + (1 x the coefficient
for NoCollege) + (1 x the coefficient for
MaleXNoCollege). The scenario value for
MaleXNoCollege is equal to the scenario values for
Male and NoCollege multiplied together.
Finally, I have compiled several screencasts on
linear regression that cover the same topics we go
over in class. If you feel like you need additional
guidance, check out the optional resources
below. Watching any or all of the
screencasts should not be necessary if you attend class
and put your best effort into the hands-on practice
exercises:
- Optional: After
practicing in class, if you still feel like you need
guidance on calculating and interpreting linear
regression models in SPSS, review this 11 min.
screencast: https://youtu.be/xzl8OxPsM8s.
-
Also optional: This
13-minute screencast follows up on the last one with
a focus on how you
use SPSS to analyze dummy variables and how
output/tables with these variables are
interpreted: https://youtu.be/I2BEi_CkzK0.
-
Also optional:
https://youtu.be/3m66P8PaD3U. In about ten
minutes, this presentation
explains how we can use linear regression output
to predict the value of the dependent variable
with different scenarios involving the
independent variables (something I just covered in
the example above). The standard output in
regression will tell us how a one-unit or one
standard deviation increase in an independent
variable will change the dependent variable when the
effect of all of the other variables is set at each
variable's mean value. While that makes for nice
charts, scenarios can help to bring the data alive.
For example, what is the level of support for
torture on a seven-point scale for a 60-year-old
Republican male who attends church a lot versus a
30-year-old female secular Democrat? Here is a handout with the output and
calculations covered in the video.
-
One important
thing not covered in the last screencast is how
control variables work in regression scenarios. As noted
above, including controls is straightforward. Let's
say that we wanted to calculate the same regression
model and scenarios as the screencast, but control for
the effect of a person's education. If education
were a five-point measure, we would calculate its
mean for the full sample, and our scenario equation
would include one more added component: the mean of Edu5
multiplied by this variable's unstandardized
regression coefficient.
Friday, November 15. Class time will be used to finish
up our overview of linear regression and--if time is
available--for you to work on SPSS #5 in Blackboard.
Saturday,
November 16, by 5pm: SPSS #5 (posted in
BlackBoard).
This assignment focuses on linear regression, which is
the last topic that will be covered in your practice and
final SPSS tests.
Practice SPSS exam (Monday,
November 18). This
is the first of the two mandatory but ungraded SPSS exams
to prepare you for the graded exam that you will take
later in the semester.
For the practice and
final versions of the test, you could be asked to do any
of the
following
exercises (i.e., you may not be asked to do
all of these things on any one test, but you will not be
asked to do anything that is not listed below):
-
Create a new variable. You could be asked to create a
new dummy variable that combines information from
either one or two original variables. You should be
able to label the new variable and its response
categories. Important: You may bring a notecard (or 3x5" piece of paper) with you for the tests; the only thing
that can be written on that card is sample syntax
reminding you how to create and label a new
variable. (A sketch of that kind of syntax follows.)
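For reference, here is a minimal sketch of
create-and-label syntax (the variable names and codes are
hypothetical):
RECODE party (2 = 1) (ELSE = 0) INTO Republican.
VARIABLE LABELS Republican 'Republican dummy'.
VALUE LABELS Republican 0 'Not Republican' 1 'Republican'.
EXECUTE.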
-
Split a variable into its subgroups and compare the
frequency at which different subgroups for that
variable have a particular opinion.
-
Make a bar chart in SPSS (not Excel because
of concerns about how long that might take). Your bar
chart must show the percentage
of a single variable's subgroups that have
a particular opinion (e.g., what respondents whose
households make $10K or less think about an issue).
-
Compare two or more variables' means and standard
deviations, explaining what the standard deviations
tell us about the distribution of each variable. For
example, comparing the typical opinion and range of opinions
expressed by Republicans and Democrats about whether
"God has given the US a special role in human
history."
-
Statistically test whether the typical individuals in
each of two subgroups that are coded on the same
variable are statistically different from one another
with respect to a specific attitude. (For example, is
the typical African American more likely than the
typical Latino to think that "college is a gamble"
rather than a smart investment, and is any difference
statistically significant?) Hint: you need an independent-samples
t-test to answer this type of question.
-
Statistically test whether the share of a specific
group in the sample is statistically different than
what it should be given a known parameter for the US
as a whole. For example, African-Americans make up
12.1% of the US population. What percentage of this
sample is African American? Are African Americans
underrepresented in the sample (when analyzed without
the survey's weights turned on)? Is any difference
between the share of African Americans in the sample
and what the percentage should be statistically
significant? Hint: To test this, you need to split the
sample by race, use a one-sample t-test, use .121
(12.1%) as the test value, and look at the output for
African Americans only.
-
Statistically test whether two subgroups that are
coded on two different variables are statistically
different from one another or from Americans in general
with respect to a specific attitude. Hint: To answer
this question, you again would need to split your
dataset and use a one-sample test. The test value will
be the mean for another group or for the sample as a
whole, which would need to be calculated separately
unless your instructor gives you that value.
-
Analyze and interpret chi-squared statistics for two
variables. Recall that for this type of analysis, you
are working with a pair of variables that are dummy,
categorical, or ordinal measures.
-
Analyze and interpret the correlation statistics for
a handful of variables. You could be asked to
determine whether they could be combined or used as
independent variables in a regression model (i.e.,
would there likely be collinearity problems?). You
will need to interpret the relationship between pairs of
variables and their statistical significance.
-
Demonstrate your understanding of the limitations of
correlation. You may be asked to speculate about
whether we can determine if there likely is a causal
relationship between pairs of variables, with one
clearly causing the other (e.g., being more
conservative and being Republican). You may be asked
to identify whether any of several other variables may
suggest a spurious correlation between two variables
(for example, being male and disapproving of Joe Biden
are modestly, but still statistically, associated). You
might be asked why we see a very low or unexpected
association between two variables that probably have a
non-linear relationship (take a look at age and
household income in the dataset).
-
Run and interpret a linear regression model with
three independent variables. You will be asked to
identify and interpret the adjusted R-square
statistic. You also will be asked to identify the
equations for three scenarios that will involve
different levels of one of the independent variables
(including its mean, which you will need to calculate
with SPSS). The other two independent variables will
be set at their mean level of influence in the
equations.
-
Run and interpret the same linear regression model
examined before as well as up to 5 more independent
variables, including an interactive term and multiple
dummy variables coded from the same original variable.
You will discuss the model's r-square statistic and
how r-square statistics have changed as you have added
additional variables to the model and what that means.
-
For the same regression model, you will also be asked
about the statistical significance of different
independent variables, and the meaning of some
unstandardized and standardized (beta) regression
coefficients. You will be expected to identify the
correct reference groups when interpreting results for
one or more of the dummy variables. You will be asked
to interpret the results of an interactive term.
-
For the same regression model, you will be asked what
equations should be used to estimate the predicted
value of a dependent variable for two or three
individuals who are different in specific ways,
controlling for their other differences (i.e., with
some variables set at their mean level of influence).
You will need to make sure you know how to create
scenarios that involve dummy variables' reference
categories and non-mean values for an interactive
term. For at least the practice exams, you may be
asked to use Google to solve one or more scenario
models.
Topic 16 (Wednesday,
11/20)—Logistic regression and its
interpretation
-
In class, you will
be introduced to logistic regression, which is the
type of regression used when the dependent variable
is binary (i.e., when working with a dummy variable
as the dependent variable).
-
Ahead of class, take
another close look at this short conference paper by Dr.
Setzler and Dr. Yanus. The reason you
are being asked to read this particular study again is
that it has several dummy dependent
variables, so we will be able to use the same example
study when we look at logistic regression in the next
block of course materials. If you want to look
at another example, logistic regression is the type of
regression used in the article you previously printed
out on Brazil's 2018 election.
-
Ahead of class, please take the time to
carefully read and print out: this handout of SPSS logistic
regression output with annotations. The document is
very similar to the one I posted above for linear
regression, but this time the dependent
variable--support for torturing terrorism suspects to
obtain information--has been recoded into a dummy
variable. Respondents who said torture was sometimes,
often, or always justified were coded "1"; individuals
who said torture is never justified were coded "0."
-
In class, we will practice
interpreting logistic regression tables in the
Setzler and Yanus article. We also will
practice interpreting SPSS logistic regression output,
including pseudo R-square statistics,
statistical significance p-values, odds ratios (Exp(B)),
and Wald statistics.
-
Some key concepts to
remember from class:
-
If your dependent variable
is dichotomous, you need to use logistic
regression rather than linear (OLS) regression.
Dummy variables are used as
independent variables in OLS/linear regression
without any type of mathematical conversion, but when a dummy variable
is the dependent variable, we need a different
type of regression because each one-unit increase
in a given independent variable typically does not have the same effect on
a dummy dependent variable. For example,
assume that the number of injuries a football team
has will decrease its number of points in the
typical game. OLS regression will work well because
it will tell us how many points are lost with each
additional injury. However, if we wanted to know how
injuries will impact whether the team is likely to
win, we'd see that the first injury or two probably
has a modest effect; however, after a certain
threshold, each additional injury will cause the team's
odds of winning to go down substantially. And, after
a certain point, we'd see additional injuries won't
harm the team's prospects any more because they
already are going to lose. Logistic regression
models address the fact that the effect of an
increase in an independent variable on an outcome
varies in predictable ways that linear OLS
regression cannot capture.
-
Use binary (aka
binomial) logistic regression only with
dichotomous dependent variables
that have been coded zero or one. This
type of regression is often the best option when you
are dealing with ordinal or categorical variables,
but you need to convert those types of variables into
dummy dependent variables for this type of
regression. There are other types of logistic
regression designed for ordinal and multi-category
dependent variables, but they are less frequently
used and beyond the scope of this course.
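A minimal syntax sketch, with a hypothetical dummy
dependent variable and hypothetical predictors:
LOGISTIC REGRESSION VARIABLES torture_dummy
  /METHOD=ENTER age male republican.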
-
Interpreting
logistic regression output is not very intuitive, so why do we need to
learn this?
You see logistic regression models' estimates all of the time even if
you haven't realized it. Many things we
want to know about--what factors are most important
in determining who will win elections, whether
countries go to war under certain circumstances,
whether someone has been asked for a bribe--are
yes/no variables that require logistic regression.
And, as we have learned previously, it is very
common to convert ordinal variables--especially
Likert scales--into dummy dependent variables. When
researchers do this, they use logistic regression.
-
With SPSS's logistic regression output, the "pseudo R-square
statistic" (use the one labeled Nagelkerke) is
interpreted just like the R-square statistic
in OLS regression. So, if a regression model has a
pseudo R-square of .0897, it means that the
variables in that model collectively explain about 9
percent of what causes the outcome predicted by the
model.
-
To identify which
variables are most important in explaining the
outcome, compare the Wald statistics. The
variables with the largest Wald values explain more
of the variation in the dependent variable. You may
recall that there is a similar statistic for linear
(OLS) regression: standardized regression
coefficients, also called betas.
-
Remember, we only
interpret the specifics of any regression
coefficient--including odds ratios--if that
variable is significant (i.e., the p-value
for the odds ratio is equal to or less than .05). If
the p-value is greater than .05, you interpret the
variable by saying something like, "This independent
variable is not a statistically significant
predictor; we cannot be confident that repeated
sampling would show that this variable is
consistently associated with an increased (or
decreased if the odds ratio is less than one) likelihood
of the outcome."
-
In SPSS output,
the odds ratios are listed in the right-most
column of the last block of output, in the column
labeled Exp(B).
-
Independent
variables work the same way in logistic and linear
regression models. We interpret dummy
variables and interaction terms the same way for
both types of regression; it's the
phrasing and explanation of how these variables
influence the dependent variable that is different.
Specifically:
-
If an independent
variable's odds ratio is less
than one (and statistically significant),
there is a negative relationship between
the variables. Every one-unit
increase in the independent variable
corresponds to a decreased "likelihood" of the
dependent variable happening. One way to interpret an odds ratio is
to think about starting out with one dollar. If you used to
have a dollar and now have 70 cents, you could say
that you have 30% less than what you used to have,
or you could say that you now have only 70 percent of what
you once had. If we were predicting who voted in the
last election, and an interval variable's odds ratio
was .250, we would say that every one-unit increase
in that independent variable "reduced the
likelihood" of voting by 75%. For a dummy variable,
the interpretation is similar, only it notes the
reference group. For example, "Compared to eligible
voters who are older than 25, individuals who are
under 25 were only 25% as likely to vote."
-
If an odds ratio is
between 1 and 2, there is a positive
relationship, and we can still convert that
odds ratio into an easy-to-understand percentage.
If I used to have a dollar and now I have $1.67, we
could say that I now have about two-thirds (67%)
more than I had before. In a voting model, if a variable's
odds ratio is 1.545, we would say that every one-unit
increase in that independent variable
"increased the likelihood of voting" by over 50%.
Alternatively, we could say that a one-unit increase
in that independent variable made voting over one
and a half times as likely. Likewise, if a
variable's odds ratio was 1.256, we would say that
every one-unit increase in that independent variable
"increased the likelihood" of voting by just over
25%. Again, a dummy variable would need to note the
reference group. So, if a model had both a
Republican and a Democrat dummy variable, we could
say: "Compared to independents (the omitted
reference group), Democrats were over 25% more
likely to vote."
-
Finally, if we have
an odds ratio of 2 or
more, we can say that every one-unit increase
in the independent variable increases the
likelihood of the outcome
x-fold. If you were estimating a voting
model and a predictor's odds ratio was 2.434, we
would say that each one-unit increase in this
variable made voting nearly two and a half times
as likely. And if the independent
variable is a dummy variable, we would say something
like, "Compared to [everyone belonging to the group
omitted from the regression model], [people in the
dummy-variable group] were 2.4 times as likely to
have voted."
-
After class,
if you feel like you need more resources to
understand logistic regression, you have the option of
reading the first half of an
undergraduate research methods textbook
chapter on logistic regression. It
will be a faster and somewhat easier read
if you wait to complete this reading until after we
have covered logistic regression in class. Do not get hung up on the
overly complex explanations. While this excerpt is from one
of the most-assigned political science textbooks in the
country (this class used it for years), the detailed
mathematics in the chapter are no more essential to
your competent use of logistic regression than
understanding the inner workings of your car's engine is
necessary for you to be an excellent driver. I am
assigning it to you because its explanation of why we
need to use logistic rather than linear regression
with a dichotomous variable is helpful. Carefully read
through the extended example that discusses how we
model something like the strength of a person's
partisanship and vote choice. The other useful part of
the chapter is its explanation of why odds ratios are the
statistic that we use to interpret the influence of each variable in a
logistic regression model.
-
After class, if you were
engaged in class, read the optional textbook
chapter material on logistic regression, and still
feel like you would benefit from more information on the basics of this type
of regression, you can optionally
watch this 12 min. screencast
(https://youtu.be/uUf3h8ifZxE). The screencast
explains in detail what bivariate logistic
regression is, how it works, and why it is often
used with ordinal dependent variables after
they have been recoded into 0/1 dummy variables. This
screencast covers concepts rather than SPSS; there is
one below that looks at how we use SPSS to run and
interpret logistic regression models. Note that
watching either of the screencasts on logistic
regression is optional and should not be necessary if
you attend class and put your best effort into the
hands-on practice exercises.
Topic 16 (Friday, 11/22)—Doing and
interpreting bivariate logistic regression with SPSS and
scenarios. This is the last new topic that will
be covered this semester. While logistic regression will not be part of your
end-of-term SPSS test, your final exam will ask you to
interpret a logistic regression output table from SPSS,
and you will complete a Blackboard exercise on this topic,
starting on Wednesday.
-
By the end of
Wednesday's class: If you have OARS accommodations
that you intend to use for the final exam, you need
to let your instructor know now! OARS has a
deadline in place. If you do not request
accommodations in advance, you will not be able to use
them during the final period. If you intend to write
your research proposal during the final period, you
likely will require the full three hours to do A-level
work on the test and proposal; students who submit the
proposal ahead of time will take only a 70-minute test
during the final exam period.
-
Before Wednesday's
class, carefully review the schedule's summary
notes for Friday, 11/17, which is when we first
discussed the logic behind logistic regression and
practiced interpreting odds ratios in the logistic
regression tables from the Setzler and Yanus article
you were asked to read ahead of the classes on linear
regression and then again ahead of your introduction
to logistic regression. If you have any concerns about
your grasp of the basics of logistic regression, you should
review the optional materials linked above, including
a textbook chapter and two
screencasts that cover the same ideas we talked about
in class.
-
In class, we will continue to
practice interpreting SPSS bivariate logistic
regression output. If you have not done so already,
you should take the time to
carefully read and print out this handout of SPSS logistic
regression output with annotations. The document is
very similar to the one I posted above for linear
regression, but this time the dependent
variable--support for torturing terrorism suspects to
obtain information--has been recoded into a dummy
variable. Respondents who said torture was sometimes,
often, or always justified were coded "1"; individuals
who said torture is never justified were coded "0."
-
In class, we will practice calculating logistic
regression scenarios. As with linear
regression, interpreting logistic regression analysis
in an interesting way is best done with "predicted
probability" scenarios. While you can quickly
calculate linear regression scenarios with a
calculator or Google, you will be asked to download and use an instructor-created Excel
worksheet to estimate the predicted probabilities for logistic regression models.
If you do not have Excel on your computer, you can
download and use Office365 with your HPU credentials.
-
As I have emphasized in our classwork and discussions
on logistic regression, you are NOT
expected to be able to explain in any detail how logistic
regression works, any of the
statistical calculations SPSS uses to estimate "odds ratios,"
or the specific mathematical equations that
are used to convert an odds ratio into a specific
predicted probability.
For this reason, you were not required to closely
review the textbook chapter that was assigned as an
optional reading when we first discussed logistic
regression. Instead of assigning a detailed reading
from a methods textbook, here is a summary of key ideas and concepts that will help you to
better understand what you are looking at when you
interpret an odds ratio or calculate a "predicted
probability":
-
For linear regression, creating scenarios is a
straightforward process because it is easy to write
out a regression equation (see above for an
explanation and examples) to calculate specific
scenarios from raw SPSS output. Unfortunately, it is impossible to
create scenarios with raw logistic regression
output without doing additional mathematical
transformations. The key issue with
logistic regression is that every one-unit increase
in an independent variable is not expected to have the same
effect on the probability of the dependent variable.
Overcoming this issue requires calculations that
involve mathematically transformed odds that are
then converted into probabilities. Below, an example
looking at the impact of each additional injury on a
football team's probability of winning will
examine how odds can be converted into
probabilities.
-
With logistic
regression, scenarios involve calculating the predicted
probability of doing or believing
something at a given value of the independent
variable, with all other variables
held constant at their means.
-
What is a
predicted probability, and how is it
different from an odds ratio, the latter of which
is listed in SPSS's default output? The
answer to this question is explained in the scanned
textbook chapter that was optionally assigned when
you were introduced to logistic regression. Below is
a summary of its key ideas (with simplified
examples):
Odds and
probabilities are different ways of conveying the same
information, and you can mathematically transform any
odds statistic into a probability. Specifically,
the odds of an outcome refer to the number of times the
outcome is expected to occur compared to the number of
times the outcome is expected to not occur. Probability is
the number of times an outcome is expected to occur
compared to the maximum number of times the outcome could
possibly occur. For example, if a team has a 50% chance of
winning a game, we expect it to win 1/2 of its games.
Thus, the odds of the team winning are 1:1, indicating
that for every win, the team should have one loss. And, if
a team has a 75% chance of winning, we expect it to win 3/4
of its games, and its odds of winning are 3:1.
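If it helps to see this as simple formulas (these are just
a restatement of the definitions above): where p is the
probability of an outcome,
odds = p / (1 - p), and p = odds / (1 + odds).
So a 50% chance gives odds of .5/.5 = 1 (that is, 1:1),
and a 75% chance gives odds of .75/.25 = 3 (that is, 3:1).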
Logistic
regression output is typically reported in tables that
list "regression coefficients" or "odds ratios," the
latter of which are transformed (specifically, exponentiated) versions
of the raw coefficients. SPSS reports both values. Without
transformation, raw logistic coefficients are not
directly interpretable, and articles that report
statistical results in this format typically discuss
odds ratios or predicted probabilities in the body of the
article.
An odds
ratio tells us how much a one-unit increase in a given
independent variable decreases or increases the odds of
an outcome, which can then be converted into a predicted
probability. An odds ratio of .600 says that each increase in the
independent variable results in the odds of the outcome
being .600 times what they were before the increase (i.e.,
40% lower). An odds ratio of 1.20 says that each increase in the
independent variable results in the odds of the outcome
being 1.20 times what they were before the increase (i.e.,
20% higher).
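To make the "times what they were" language concrete,
here is a made-up illustration: if the odds of an outcome
are currently 3:1 and a variable's odds ratio is 1.20,
then a one-unit increase in that variable moves the odds
to 3 x 1.20 = 3.6:1. With an odds ratio of .600, the same
increase would instead shrink the odds to 3 x .600 = 1.8:1.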
So, let's say you are
planning to bet on a football team that has only a 20%
chance of losing when it has no injuries. Another way of
putting this is to say that the team has a 4/5 (80%)
probability of winning. So, for every four games we
expect that team to win, we would expect it to lose once,
which means its odds of winning are 4 to 1. When you bet
on this team, you might place a four-dollar bet. If the
team loses, you lose your 4 dollars; if it wins, you get
a dollar, plus your original four dollars.
What happens if your team
experiences injuries during the season? How will each
additional injury change the team's odds and probability
of winning?
We could run a binary
logistic regression model, and its output might say that
each additional injury a football team suffers will increase the
likelihood (specifically, the odds) of losing by 1.5
times. Using that odds ratio of 1.5, here is how the
team's odds of winning and probability of losing change
as each additional injury multiplies the odds of losing
by 1.5:
0 injuries, odds = 4/1 (probability of losing = 20%)
1 injury, odds = 4/1.5 (pr. of losing = 27.3%)
2 injuries, odds = 4/2.25 (pr. of losing = 36%)
3 injuries, odds = 4/3.4 (pr. of losing = 45.8%)
4 injuries, odds = 4/5.1 (pr. of losing = 55.9%)
5 injuries, odds = 4/7.6 (pr. of losing = 65.5%)
6 injuries, odds = 4/11.4 (pr. of losing = 74%)
7 injuries, odds = 4/17.1 (pr. of losing = 81%)
8 injuries, odds = 4/25.6 (pr. of losing = 86.5%)
9 injuries, odds = 4/38.4 (pr. of losing = 90.6%)
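If you want to check the arithmetic behind any row of this
list, take the team's starting odds of losing (1 to 4, or
.25), multiply by 1.5 once for each injury, and then
convert the result back into a probability. For two
injuries: .25 x 1.5 x 1.5 = .5625, and
.5625 / (1 + .5625) = 36%, which matches the row above.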
The reason we go through the hassle of mathematically converting odds into predicted probabilities in logistic regression scenarios is that most people find probabilities a lot easier to understand. Per the
hypothetical results above, the odds of a football team
with two injuries losing its next game are .56 to 1,
while the odds for a team with six injuries are 2.85 to 1.
In other words, a team with six injuries should, on
average, lose 2.85 games for every game it wins.
For most people, it makes a
lot more sense to convey the same information in
probabilities, which can be expressed as percentages: A
team with one injury has a 27.3% chance of losing, while a
team with six injuries can be expected to lose about 74%
of the time.
-
And in order to calculate
scenarios like those in the football example above, we
first have to transform logistic
regression's output for the relevant variables.
While you can do these transformations in SPSS, it
is very complicated to do so (other statistical
programs make it much easier). Fortunately, you can do
all of the mathematical work in an Excel worksheet
if you know what formulas to use, and a
spreadsheet is a good place to manipulate
different variable values to create scenarios
anyway. Your instructor has put together an Excel
spreadsheet that calculates predicted probabilities
for you, depending on what scenarios you choose.
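For the curious (you will not be tested on this), the
mathematical work the worksheet handles boils down to two
steps. First, it adds up the constant plus each raw B
coefficient multiplied by its scenario value to get a
predicted "log-odds": L = constant + B1 x value1 +
B2 x value2, and so on. Second, it converts that figure
into a predicted probability:
probability = exp(L) / (1 + exp(L)).
For example, if L worked out to exactly zero, the
predicted probability would be 1 / (1 + 1) = 50%.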
What to
remember from the screencast:
(1) First,
point-click-and-then-paste a logistic regression command
that includes all of your independent and any control
variables (plus your dependent variable). Don't run the
command yet.
(2) Then,
point-click-and-then-paste a descriptives command into
syntax, using any random variable. When the command is
in your syntax, copy and paste to replace the random
variable with all of the independent and any control
variables in your regression model syntax (omit the
dependent variable). The point here is to make sure that
you can run a regression model and create a set of
descriptive statistics that will list variable results
in the exact same order.
Two things to
note that aren't in the screencast. First, you will make
things easier if you delete STDDEV from the descriptives
syntax before you run it, because you will
need output only for each independent and control
variable's mean, minimum, and maximum values. Second, I
ran the descriptives command and then the logistic regression
in the screencast, but it will be easier to find
the results you need if you run the logistic regression
first, because we are interested in only the very last
block of its output.
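Putting steps (1) and (2) together with those notes, the
pasted syntax should end up looking something like the
sketch below (the variable names here are hypothetical
placeholders; yours will be the variables in your own
model):

LOGISTIC REGRESSION VARIABLES voted
  /METHOD=ENTER age education democrat republican.
DESCRIPTIVES VARIABLES=age education democrat republican
  /STATISTICS=MEAN MIN MAX.

Notice that the DESCRIPTIVES command lists the same
independent and control variables in the same order as the
regression command, omits the dependent variable (voted),
and no longer includes STDDEV.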
(3) Select
and run both of the commands, and then open up the Excel spreadsheet that your
instructor has created to assist you in creating
logistic regression scenarios (for the record, you
can compute scenarios in SPSS or even using Google or Wolfram
Alpha, but it will be much easier to use the
spreadsheet I have created for this purpose). That
spreadsheet is in the PPT folder and in one of the
subfolders in the workshop materials.
(4) Open the
last part of the SPSS logistic regression output, specifically
the block that lists the coefficients, by double-clicking
on it. Copy and paste the unstandardized logistic regression
coefficient for the Constant (the one in the B column) into the
appropriate worksheet cell. Then, copy all of the other variable
names together with
their coefficients (the ones right next to the variable
labels in the "B" column), and paste them into the
worksheet in the columns that are labeled for this
output.
(5) Now, it
is time to work on the scenarios portion of the Excel
worksheet. Go back to the SPSS output and
double-click on the descriptives output. Copy the means
for all of the variables, and paste them into the
worksheet column labeled "Scenario." We are doing this
because we want to be able to create scenarios that
involve some variables being set to certain values,
while the remaining variables are set to their mean values.
(6) Now, have
some fun creating scenarios. There are two ways that
scenarios are frequently used in research, and both of
them are used in the paper you read earlier this term on
what kinds of Brazilians voted for Bolsonaro in 2018,
which is why I assigned this article. First, there are a
couple of paragraphs in the article that compare
hypothetical individuals who are similar in all ways
except for a couple of characteristics in order to show
which variables had the most effect (partisanship and
ideology) and which had a small effect (sharing
Bolsonaro's illiberal views). Second, there are several bar
charts showing how the probability that different kinds
of Brazilians voted for Bolsonaro changed if an
independent variable was at its lowest versus highest
value. Those bar charts were created by using minimum and
maximum values as scenarios for each variable while all
other variables were held constant at their average values.
(7)
Important and not mentioned in the screencast: If you
want to create a scenario for a characteristic that is
captured by multiple dummy variables and a reference
category, enter one for the dummy variable you are
looking at and zero for the other dummy variables from
that set. If the group that you want to look at is
the reference category (i.e., it wasn't included in
the regression model), then enter zeroes for the
other groups. For the example in the
screencast, to determine the probability that a typical
independent was going to vote for Hillary Clinton in
2016, the scenario needed to enter zeroes for the
variables Democrat and Republican while leaving the mean
values for all other variables. Because the only partisan
dummy variables in the model were Democrat and
Republican, using zero for each of these variables in the
scenario returned the predicted probability of voting
for the typical independent when the scenario values for
all other variables were left at their mean values.
End of the term
schedule. This year, Thanksgiving falls very late
in the semester. Because of this, the week of the
holiday break and the days afterward will be spent mostly
reviewing or completing assessments intended to help you
prepare for exams.
-
Monday, November 25. Second practice
(ungraded), timed SPSS test on BB. Per
above, logistic regression will not be part of the
end-of-term SPSS practice or final test.
The concepts that will be covered on
this test will be drawn from the same list I used (see
above) for your first practice test. Remember, you may bring a
page of notes with you for the SPSS tests. This
practice test and the final version will not have
detailed reminders of how to use SPSS for different
types of problems.
-
Tuesday,
November 26, by 5 pm. SPSS assignment
on logistic regression due.
-
Wednesday,
November 27. No classes.
-
Thursday,
November 28. Thanksgiving.
-
Monday, December
2. SPSS test (10% of the course grade).
This test will be very similar in format to the
practice test you took before the holiday break. The
concepts that will be covered on this test will be
drawn from the same list I used (see above) for your
first practice test.
-
Wednesday, December 4. Course wrap-up.
-
Your final exam for this class will be held during the
exam period the University has scheduled for our
class. Consult the University's calendar to verify your
exact test time. All students will be required to take
the third unit exam during the final exam period.