Please note: At your instructor's discretion,
there may be minor alterations to the reading
assignments listed below. One of the major advantages to
providing you with an on-line readings archive is that
timely articles can be added or substituted when
appropriate. Opening documents downloaded from this
website will require that your computer have
Acrobat Reader. You will also need the
class-specific password to open individual files.
UNIT 3 ASSIGNMENT SCHEDULE
Week 12
For Monday (11/3), we
will continue with Topic 12, using spreadsheet software
to visually explore bivariate relationships between a
variable and one or more others.
- You were previously asked to print out and read this article on the election of
Brazilian president Jair Bolsonaro. Ahead of class, take a
look at Figure 2. Notice how a simple bar chart
can help us to see whether there is a relationship
between different independent variables and a dependent
variable (e.g., ideology and voting for Bolsonaro).
-
After class, if
you need additional guidance on creating Excel
charts to visually summarize SPSS results, you have
the option of
watching this screencast: https://youtu.be/T6kHpZ2oReQ.
It shows you how splitting data and calculating the
means for several different variables will allow you
to make a nice-looking chart in Excel to show your
results. Making this kind of a chart is a task you
will need to do for your next BlackBoard SPSS
Assignment.
-
Here are some of the
ideas covered in the optional screencast as well
as a few pro-tips on creating and formatting Excel
bar charts:
-
Before
you can create a bar chart in Excel, you first
need to calculate the statistics you need in SPSS:
(1) Split your dataset by an independent variable:
Data → Split File → Select the independent variable
(e.g., gender) → check “Compare groups.”
(2) Next, run a Frequencies command on your
dependent variable (e.g., whether someone voted for
a specific candidate). This will show you the
distribution of the dependent variable within each
group of the independent variable.
(3) You now have the data you need to create a chart
in Excel. Keep in mind that you will need to go back
to Data -> Split File and select the option to
"Analyze all cases" to turn off the data split (or
you can change the independent variable and then run
another frequency).
-
To
save time, if you plan to create a bar chart that
uses data from several different splits, consider
doing most of your SPSS work in syntax. For
example, if you want to show the percentage who
voted for a candidate when split by gender, race,
and then partisanship, use SPSS's point-click-paste
feature to generate a syntax template that splits
your dataset by one of your independent variables,
runs the frequency for your dependent variable, and
then unsplits the dataset (as described above). Once
you have a syntax template for the first
independent variable, you can copy, paste, and
adjust the template to quickly create the data you
need for each of your other independent variables.
A sketch of what that template might look like
appears below.
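Here is a minimal sketch of what that pasted-and-adjusted template might look like (gender and voted_candidate are hypothetical variable names; substitute the ones in your dataset):
* Split the dataset by the independent variable ("Compare groups").
SORT CASES BY gender.
SPLIT FILE LAYERED BY gender.
* Run the frequency for the dependent variable.
FREQUENCIES VARIABLES=voted_candidate.
* Turn the split off ("Analyze all cases").
SPLIT FILE OFF.
To repeat the process for race and then partisanship, copy the block and swap out gender for the next independent variable.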
-
To
create your Excel chart, enter (or paste) your
SPSS results and bar chart labels into a blank
spreadsheet:
The name of each independent variable group
goes at the top of a separate column. If there is more
than one category for the dependent variable, list
those categories in rows down the first column. For
example, if you were comparing a five-level measure
of household income for Democrats and Republicans,
your first column would list the five income ranges
starting in the second row. Columns two and three
would have “Democrat” and “Republican” entered at
the top, with the corresponding percentages of
respondents in each income category listed below
(see the sketch that follows).
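For instance, the worksheet for that income example might be laid out like this (the percentages are made up purely to illustrate the layout):
              Democrat   Republican
Under $25K       28%        16%
$25K-$50K        24%        19%
$50K-$75K        20%        21%
$75K-$100K       16%        22%
Over $100K       12%        22%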
-
To
create the Excel chart:
Select all of your data and labels, and
go to Insert → Charts → 2D Column (or Bar).
In
Excel (the steps will be different if you choose to
use Sheets), change the format of the numbers in your chart:
Format the original
data in the worksheet: Select all of the
numerical data that needs to be formatted → right-click (PC) or control+click (Mac) → select Format
Cells → Number → Decimal places = 0 (zero). This
changes chart labels (e.g., from 10.00 to 10) but not
the bar heights.
To
change the numerical range of an axis (say,
the auto-generated maximum is 90%, and you want it to
top out at 100%):
Double-click
the axis you want to change → right-click (PC)
or control+click (Mac) → Format Axis → adjust the
Minimum and Maximum bounds. If you don't see this
setting by default, choose the Format Axis tab that
looks like a little bar chart.
To add
an axis label (e.g., add "Percentage of
respondents" to a y-axis that only has numbers):
Click on the center of the chart (a blank part of it
so you select the full chart rather than an axis or a
bar) → select the green “+” (or "Chart Elements") icon
that appears beside the chart → check Axis Titles →
click inside the vertical axis box and type a label
(e.g., “% Trump Vote” or “Percentage Supporting
Trump”).
To add
a space between sets of bars that you want
grouped (e.g.,
between your gender and party identification data):
Insert a blank column or row between groups in the
spreadsheet where the data is.
To
change bar colors:
Select the bars → right-click (PC) or control+click
(Mac) → Format Data Series → use the
paint bucket or color options to choose new colors.
To add group subtitles
below groups of bars (e.g., add “Gender,”
"Partisanship," or "Education level"):
Highlight the cells above the “Men” and “Women”
columns that hold their data → use Excel's Merge
& Center option (Home tab → Alignment) →
type a label (here, “Gender”) in the merged cell.
Include these labels when you select your data and
insert the bar chart.
- After class, start on SPSS #3 in
BlackBoard. This workshop will give
you a little more coding practice and hands-on practice
with interpreting descriptive statistics, making
bivariate bar graphs, and interpreting t-tests. Most of
this assignment can be completed after this class, but
we will be covering t-tests on Wednesday, so leave that
part of the assignment for later in the week. You
will need to complete this SPSS
assignment by 5 pm this coming Monday, not Friday as
originally scheduled (the deadline was moved so we
could go over calculating t-tests in SPSS on Friday as
part of the other work we will be doing; the
assignment will not cover what we will be doing in
class on Friday).
Topic 13 (Wednesday, 11/5)—Determining whether two or
more groups' means are statistically different with
t-tests.
- Ahead of class, carefully read the first
sections of Chapter 8, "Bivariate Analyses" in
Carolyn Forestiere's textbook. Read just up
to the section on correlation analysis (i.e.,
just the first six pages of the chapter). The first part of the
chapter focuses on "statistical significance and
t-tests."
What is a statistical "test" of
whether two variables are associated, and how do we
know if the findings of that test are "significant"?
-
Why do we want
statistical tests to go along with the SPSS work we
have been doing so far? When we are working
with samples (e.g., a dataset that summarizes a single
survey of a sample drawn from a larger population), we
need to verify if what we are seeing in our dataset is
likely to hold up in repeated samples of the larger
population.
-
Statistical
significance matters because data analyses always
include some random variation. For example,
if you flip a coin 1500 times, are you always going to
get 750 heads, assuming that the coin is legitimate
and not weighted in any way?
No. Even though the true
probability of getting heads is 50%, with a sample of
1,500 flips, you will find that 19 out of 20 times you
flip a coin 1,500 times, the number of heads you end up
with will be somewhere between 712 and 788.
If you flip a coin 1,500 times, the number of heads you get
is a statistic that has a margin of error of 2.53% at a
95% level of confidence.
If you want to be 99% certain
of how many heads you will get when you flip a coin 1,500
times, your margin of error will be bigger. Statistically,
99 out of 100 times you flip a coin 1,500 times, you will
get somewhere between 700 and 800 heads.
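If you are curious where those numbers come from, they follow from the standard margin-of-error formula for a proportion:
margin of error = z x sqrt( p(1 - p) / n )
95% confidence: 1.96 x sqrt(.5 x .5 / 1500) ≈ .0253 (2.53%, or about ±38 flips, giving the 712-788 range)
99% confidence: 2.576 x sqrt(.5 x .5 / 1500) ≈ .0333 (about ±50 flips, giving the roughly 700-800 range)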
-
There are many
different tests of association, and you will be
learning just a few of them: the ones that
are most common in international relations and
political science. Whenever
we run a test—like a chi-square, correlation,
t-test, or regression—we get a p-value statistic
in our output, which tells us the probability of
seeing our results (or something even stronger) if
there were actually no real relationship in the larger
population.
-
By convention, we
typically consider a finding “statistically
significant” only
when a statistical test shows that there is a
very small probability that we would fail to find a
similar association in the same direction if we
analyzed many additional samples.
Barring unusual circumstances, the norm in
political science and international relations is to
highlight findings that have a p-value of .05 (5%),
.01 (1%), or .001 (0.1%)—that is, a 5%, 1%, or 0.1%
chance that the result would not be replicated in
repeated sampling.
When statistical results are
reported in a table or paper, you’ll often see either the
actual p-value or asterisks indicating the level of
significance. Unless otherwise noted, one asterisk (*)
means there is about a 5% chance that repeated sampling
would not find a similar result, two asterisks mean less
than a 1% chance, and three asterisks mean less than a
0.1% chance that the finding is due to random variation.
-
One important thing to always keep in mind is that
"statistical significance" is not the same thing as
substantive significance. When you are working
with large data samples, even very small differences
among groups may be statistically significant. It is
not uncommon to find instances where researchers
stress that a finding is "highly significant" even
though the finding itself is not substantively
important.
What
is a t-test?
This is a
test that we use to see if two different groups have
different means for a dependent variable. If we
just want to see how different groups of respondents
varied in how they answered a survey question, we can
split our data and then calculate descriptive statistics
for a dependent variable we care about. And it can be
helpful to display those differences in bar charts,
which can be quickly created in a spreadsheet.
However, how do we know if any differences we are
seeing in our sample are large enough that we would
expect the finding to hold for a larger, more general
population? To test whether this is the case, we use
t-tests. For example, we could use a t-test to see if
what we saw in our sample--that Republicans make a
little more money than Democrats--would hold up if we
repeatedly surveyed representative samples of Americans.
When and how do we use
independent sample t-tests?
-
Forestiere's chapter talks about the most commonly used
t-test, an "independent samples test." This test assumes you are
looking at whether two groups coded on the same independent variable have statistically
different means for a second variable.
Staying with the example I just gave, this test would only
be appropriate if you have a categorical variable
where respondents were coded something like:
1=Democrat, 2=Republican, 3=Independent, and 4=Other
party.
Some things to remember from
the video:
(1) To run an independent
samples t-test: Analyze -> Compare Means ->
Independent-Samples T Test. Then, select a variable
whose mean you want to examine across two subgroups.
(A syntax version of these steps is sketched after
this list.)
(2) You next need to
specify which values of the grouping variable will be
compared (click on the button that says "Define
Groups"). In the sample video, Republicans were coded
1, Democrats 2, and Independents 3 in the original
dataset. To compare the means of Republicans versus
Independents, the values 1 and 3 would be specified.
(3) Make sure that you are
looking at the correct block of results and the correct
column to determine if the difference in means is
statistically significant. The significance test you
want is in the bottom block of output (not the "Group
Statistics" block, but rather the "Independent
Samples Test" block). In that block, look at the top row of
results ("Equal variances assumed") and find the column
labeled Sig. (2-tailed). To repeat, the one you are
looking for is in the row for "Equal variances assumed."
(4) Only if the two-tailed
significance statistic is SMALLER than .05 can we say
with any confidence that the mean values for the two
groups are statistically different and that we would
reach the same conclusion if we drew repeated samples
from the same population; conversely, a significance
statistic that is LARGER than .05 indicates that the two
groups do not have statistically different means.
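Here is that syntax version (income and party are hypothetical variable names; 1 and 3 are the codes for the two groups being compared):
* Compare two groups' means on the dependent variable.
T-TEST GROUPS=party(1 3)
  /VARIABLES=income.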
When and how do we use single-sample t-tests?
-
Forestiere's chapter does not discuss a second type
of t-test, the single sample t-test, which can do
everything an independent-sample T-test does and more.
Since she doesn't cover this second type of T-test,
please read through the paragraphs below very
carefully.
-
A single sample t-test determines whether the
mean value for a group on a dependent variable is different than the
mean value for another group (or groups) that are
measured by other variables.
Your textbook refers to the "DataPrac" survey, which
is a dataset that comes with your textbook so that you
can practice methods discussed in the book. We are not
using that dataset this semester; however, if you were
to analyze the DataPrac survey's variable D72, you
would see that the typical American (i.e. the survey
mean) had a response value of 7.17 on a 10-point scale
that measures religiosity. For this indicator,
respondents were asked to place themselves on a scale
where 1 means that God is "not at all important" in
their life and a ten indicates they believe God is
"very important" in their lives.
Is the sample's mean value for
this religiosity measure different than the mean for
individuals who say they plan to vote for the Democratic
candidate in the next election? How about Republicans?
Is the religiosity of men lower or higher than the
national average for this item? How about women? And
what if we didn't want to compare the average
religiosity of women with that of the nation as a whole,
but instead wanted to compare their average with the
mean for Democrats?
We can answer each of these
questions and even build a table or bar chart comparing
them if we split our data on each of the relevant
independent variables and then run a single-sample
t-test (Analyze -> Compare Means -> One-Sample T Test)
for each group we care about. For most of the tests, we
would enter the average for the sample as a whole, 7.17,
into the place in SPSS that asks for the "Test value."
To compare women to Democrats, we would separately
calculate the mean religiosity for women and use that as
our test value.
So, if we split the data by
the DataPrac variable D14 (partisanship) and run a one-sample
t-test in SPSS with a test value of 7.17, we see
that the mean score for Democrats on the importance of God
in their life is 6.60, and the two-sided significance test
reports that the difference between the test value and
the mean for Democrats is significant at the .001 level.
The same test shows that the average Republican score for
the importance-of-God measure is 8.25, which is
significant at the p < .001 level. In other words, if we
surveyed similar samples 1,000 times, we would expect
to find that the average value for Democrats on this
variable was always lower than the national average. And
a separate t-test (because the partisanship variable was
split) shows us that the average for Republicans should
always be higher than the national average.
To see if men or women also
are different from the national average for this
religiosity variable, we would just need to go back to
Data-> Split File -> Compare Groups and swap out the
variable D14 for the gender variable (we would leave the
test value at 7.17, the average for the full sample).
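In syntax form, the whole religiosity example above takes just a few lines (D14 and D72 are the DataPrac variable names mentioned in the text):
* Split by partisanship ("Compare groups").
SORT CASES BY D14.
SPLIT FILE LAYERED BY D14.
* Test each party's mean religiosity against the full-sample mean of 7.17.
T-TEST /TESTVAL=7.17 /VARIABLES=D72.
SPLIT FILE OFF.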
-
Optional: Watch after
class, if you need more guidance on
calculating a one-sample t-test: https://youtu.be/paUIJ3Eh7JI
(a little over five minutes). In the video, a test is
run to see whether the average for the variable male
(coded 0/1) is different than the value of .50, which
is about the percentage of men we would expect to find
in a nationally representative sample (e.g., it would
be due to something other than sampling error if we were
to find that 60% of a 3,000-person random sample was
male).
Some things to remember from
the video (so you don't need to watch it more than
once... or maybe even at all):
(1) To run a one-sample
t-test: Analyze -> Compare Means -> One-Sample
T Test. Then, select a variable whose mean you
want to examine.
(2) You next need to
specify a "test value" to which you want to compare the
mean for your variable of interest. In the video, the
mean for the variable male is .54, which is compared to
an expected value of .50. Per the commentary above, the
test value of .50 was used to see if there are more
males in this sample than one would expect to find in a
nationally representative survey.
(3) Only if the
significance statistic for the "two-sided p" result is
SMALLER than .05 can you say with any confidence that
the mean value observed in the sample is truly
DIFFERENT than the expected value; a significance
statistic that is LARGER than .05 indicates that the
observed and expected means are not statistically
different. In this particular example, a value greater
than .05 would mean that the sample's portion of men is
not larger than the 50% we would expect to find in a
nationally representative sample.
(4) It is not covered in this screencast, but
remember from the religiosity example above that if you
want to test whether the mean for a subgroup (perhaps
women, or Republicans, or Catholics) is different than
some value (maybe 50%, or the sample average, or the
mean of some other group you care about), you can
split your data to isolate the subgroup and then run
the t-test.
Topic 14 (Friday, 11/7)—Determining whether and how much
two variables are "associated" with one another
-
SPSS 3 assignment
due by 5 pm this coming
Monday, not Friday as
originally scheduled
(the deadline was moved
so we could go over
calculating t-tests in
SPSS on Friday as part
of the other work we
will be doing; the
assignment will not cover what we will be
doing in class on Friday).
-
Before class,
please finish Chapter 8 in your Forestiere textbook (you
previously read up through the section on t-tests,
so start on the
section for correlation). Please read the
political science examples carefully.
-
Before class, but
after you have read about correlation tests in
the textbook, read through the block of
material below carefully and quickly read a very short
reading on what a chi-square test is to give you a
clearer idea of what SPSS is doing when it runs this kind
of association test. Chi-square tests are not
covered in your textbook, so you need to review
this statistical measure in the material below and
the assigned, short reading.
-
Before class,
print out and have handy this annotated SPSS output for a
Chi-Square test. In the sample, the
researcher is trying to see if a Brazilian's race (a
categorical variable) had anything to do with whether
or not they voted for the politician who was elected
president in 2018.
-
In class, we
will continue to focus on three of the many methods
social scientists use to determine how two variables
are associated: chi-square tests, correlation tests,
and bivariate regression.
-
After class,
please read the first 15 pages or so of Chapter 9,
"Regression," in
your Forestiere
textbook (up to
page 199). Review the section on regression
with one independent and one dependent variable. It will be faster
reading if you wait to complete this reading until after we
have begun to discuss regression in class. We may well not begin to
cover regression in Friday's class, but I want to
split up your reading so that you are not being
asked to read an excessive amount for next week.
For
X² (chi-square) and other association statistics,
here are the key concepts you need to remember from
class:
-
Because they are the most-used association tests in
political science and international relations
research, we will spend quite a lot of time
focusing on correlation and regression. They
are the only association measures covered in any
detail in the Forestiere textbook.
-
Other than
correlation and regression, the only association
statistic you need to be familiar with for this
course is the chi-square (X²) test.
-
To calculate a chi-square test, use SPSS's
Analyze -> Descriptive Statistics -> Crosstabs. If
you think that one variable is the cause of the
other, the independent variable (the cause)
typically goes into the Rows window, while the dependent
variable should be listed in the Columns window. To make
the table useful, go to the "Cells" option and, under
Percentages, check only the box for "Row" (leave the
box for observed counts checked as well). Then, in the
"Statistics" option, check the box for Chi-square.
If you need more guidance on this
procedure, you can watch this screencast: https://youtu.be/7O3UTYL2A-I.
A syntax version of the same steps is sketched below.
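Here is that sketch (race and voted_bolsonaro are hypothetical variable names):
* Independent variable in rows, dependent variable in columns.
CROSSTABS /TABLES=race BY voted_bolsonaro
  /CELLS=COUNT ROW
  /STATISTICS=CHISQ.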
-
To get a basic understanding of what a chi-square
test is and how association measures work, read
carefully just the first seven pages of this document (Read
up to the section "residuals"). Here is a summary of
what the reading says, with a simplified example:
The main point of the reading
is that a chi-square test provides a statistical test
to determine whether any association between two
nominal/ordinal variables that we see in our sample data
is due to chance. In other words, it estimates the
probability that repeated sampling would find that a
respondent's category on one variable has nothing
to do with their category on a second variable.
An example can provide a basic
idea of what a chi-square test looks at. Let’s say
that we have a 1,000-person sample where exactly half of the
individuals have identified as women and half as men. This
being a sample from an odd, hypothetical US state, we also
have a sample with exactly 50% Democrats, 50% Republicans,
and no independents.
If gender has no association
at all with partisanship in our sample, we would expect to see that
25% of our sample is made up of female Democrats, 25%
female Republicans, 25% male Democrats, and the final 25%
male Republicans.
However, a hypothetical
analysis of our sample might reveal that 30% of the
respondents are female Democrats and 30% of the
sample is male Republicans. Thus, in our sample, it looks
like there is an association between gender and
partisanship (specifically, more women than expected are
Democrats and fewer are Republicans).
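To see the arithmetic behind the test, compare the observed cell counts in this hypothetical sample with the counts we would expect under no association:
Expected count per cell = (row total x column total) / n = (500 x 500) / 1,000 = 250
Observed counts: 300 female Democrats, 200 female Republicans, 200 male Democrats, 300 male Republicans
X² = the sum of (observed - expected)² / expected = 4 x (50² / 250) = 40, with 1 degree of freedom
A chi-square of 40 on one degree of freedom has a p-value far below .001, so an association this large in a sample this size would easily clear the conventional significance thresholds.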
The chi-square test will tell
us (and ONLY tell us) whether the association we see in
the sample between gender and partisanship is possibly due
to sampling-error chance. Specifically, the p-value for
the chi-square test tells us the probability that
repeated sampling would sometimes find that women are more
likely than men to identify as Republicans, which would be
contrary to our hypothesis and to the finding in our sample.
As suggested above in the
explanation of "statistical significance," a p-value of
.05 or smaller for a chi-square statistic tells us that
there is only a 5% chance or less that the association
we are seeing in our sample is due to chance (i.e.,
survey error, which is a function of sample size) and that
we should expect repeated sampling to show a similar
association at least 95 percent of the time.
And, given the magnitude of
gender differences in the hypothetical sample above and
its size (n=1000), the chi-square test would be
significant in this case.
However, as is the case
with statistical techniques generally, if you are using
a very small sample or looking at a variable where you
have very few individuals in some of the response
categories, a chi-square test may not return a
statistically significant result. This is why it is
important to run frequencies on variables and think
carefully about whether response categories should be
combined (e.g., it is very common to see a multi-racial
measure be recoded into a white/non-white dummy variable
before analysis if the sample is under 600 or so
respondents).
If you want to be a
competent consumer of social science research, you
should be aware that there are other statistical
methods that can provide more accurate tests of
association when you are looking at the relationship
between any specific combination of two categorical,
dummy, or ordinal variables. If you are
curious, here is a summary of the association tests
SPSS can quickly compute: https://www.ibm.com/docs/en/spss-statistics/25.0.0?topic=crosstabs-statistics.
We are learning only about chi-square tests because
they are widely reported in both academic and everyday
publications.
Bivariate
Correlation
-
While chi-squared tests are common in the social
sciences, most political science and
international relations research instead examines
associations between categorical variables by
creating dummy variables and running correlation
tests among them. For example, rather than
calculating a chi-squared test to see whether a
three-category measure of party identification
(Democrat, Independent, Republican) is associated
with a five-category measure of religious
affiliation (e.g., Catholic, evangelical Protestant,
mainline Protestant, secular, other), many social
scientists create a “correlation matrix” using party
and denominational dummy variables. This approach
allows researchers to see how consistently adherents
of each denomination predict a person’s political
identity and whether those relationships are
positive or negative. For instance, identifying as
secular would likely be positively correlated with
being a Democrat, while identifying as an
evangelical Protestant would likely be negatively
correlated with being a Democrat.
-
Correlation tests can be used to test
associations between any combination of interval
(aka continuous) and dummy variables. If
one of your variables is an ordinal measure, that
measure can be analyzed with correlation if the
ordinal measure is treated as an interval
variable or recoded into a dummy variable (the choice
of which should be based on either logic or the
distribution of responses across the ordinal
measure).
For correlation statistics,
here are the key concepts you need to remember from
the chapter and class:
-
To run a correlation in SPSS, use Analyze ->
Correlate -> Bivariate. If you enter
more than two variables, you will get a "correlation
matrix," showing you the relationship between each
pair of variables. (The equivalent syntax is sketched
below.)
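In syntax, a correlation matrix is a one-liner (the three variable names here are hypothetical dummies):
CORRELATIONS /VARIABLES=democrat secular evangelical.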
-
Correlation measures should be interpreted with close attention to whether or not they are statistically significant.
If the
p-value is higher than .05, there is no association, regardless of how large the
correlation statistic is. For correlation,
the p-value tells you the probability that the
association found in the sample could be zero or
signed in the opposite direction in repeated
sampling. To say that one
variable is a statistically significant predictor of
another, the p-value needs to be .05 or less. In
SPSS, make sure to look at the p-value even if you
see two asterisks. For some odd reason, the default
setting in SPSS only adds two asterisks to
coefficients even when the p-value is <= .001,
which should be denoted with three asterisks.
-
Correlation does
NOT tell you how much a change in one variable
changes the other variable. It also cannot tell you which
variable may
be causing
the other to move. For example, being more
conservative is correlated with being more
religious, but there are theories to suggest that
causality could go either way.
-
Moreover, even if
two variables are highly correlated, it could be
that there is a third variable that is causing both x and y to
change in predictable ways even though those two variables have no actual
relationship. Social scientists often use
the term spurious
to describe a situation where the
mathematical correlation between two variables is
due entirely to a third variable. For example, in
the US, violent crime goes up in the same months
that ice cream consumption also goes up in a
population, but they don't have anything to do with
each other except that both are more prevalent on
hot summer days. "Omitted variable bias" is one of
the reasons we will be talking about multivariate
regression models next week.
-
Most association statistics range from -1 to
1, and a negative
correlation statistic means that increased values
on one variable are associated with declines in the
other variable. Typically, positive
correlation statistics are not marked with a plus sign.
-
The square of
the correlation coefficient (r-squared) is used
to estimate how much of the variation in one
variable is "explained" by the other one,
with a key caveat noted above: a missing variable
may be explaining some or all of the variation...
which is why we typically look at relationships
between two variables with multivariate regression
that includes one or more "control variables"
(more on that next week).
As a general guideline for thinking about
association statistics like correlation:
<.10 means that there is a very weak or no
association between the variables;
.20 can be interpreted as a meaningful but modest
association;
.30 is a moderate association; and
>.40 is a strong association.
But in every case, you need to put these findings
into context (e.g., a .40 association between
being Republican and being conservative would be a
much weaker finding than you would anticipate, so it
wouldn't make sense to refer to this scenario as
evidence of a very strong association).
There can be a close
relationship between two variables, but correlation
won't measure it if the relationship is not linear. Think
about a person's age and their physical independence. For
a while, each year means more physical independence, but at
a certain point, the relationship becomes negative; there
is a curvilinear relationship
between age and physical independence. Calculating a
correlation measure for this type of relationship will
miss the connection.
With income
and happiness, increases in income initially push
happiness up consistently. However, at a certain point (about $100K annually in today's
dollars), more income doesn't seem to improve or
decrease happiness (if you are interested, the full
article is archived here). So, the relationship looks
sort of like a backwards 7. Economists refer to this
particular relationship as a diminishing returns curve.
And then,
there's the relationship between time and the growth of
invested money, AKA, the miracle of compound interest,
which looks kind of like going along a road with a small
climb and then starting up a mountain.
There are
advanced ways to test these kinds of non-linear but
clearly meaningful relationships, but those are usually
covered in graduate-level statistics courses. Fortunately,
we can use tools we already have learned to explore
non-linear bivariate associations. For example, to see
whether there’s a diminishing-returns pattern in income’s
effect on happiness, we could divide a sample of Americans
into seven household income categories and calculate each
group’s mean on a five-point happiness scale. A bar chart
of these means would likely show that happiness increases
as income rises, but by smaller amounts at higher income
levels. We could then run a few t-tests to see where the
differences in mean happiness between income groups stop
being statistically significant.
For linear regression with one independent
variable (i.e., bivariate regression),
here are the key concepts you need to remember
from the chapter and class:
-
Linear (i.e., OLS--Ordinary Least Squares)
regression models report an R-square statistic,
which is interpreted as noted above in the section
on correlation. An
R-square statistic of .35 means that the
independent variable in the regression model
explains 35% of the variation in the dependent
variable (and doesn't explain the other 65%). You
will get some sample language on reporting R-square
results in the section on regression with multiple
variables below.
-
Regression estimates how much a one-unit
increase in the independent variable will
correspond to changes in the dependent variable. Specifically,
regression output includes a slope measure for each
independent variable. This statistic is
called the unstandardized regression
coefficient. In SPSS output,
unstandardized coefficients are listed in the "B"
column; make sure you are looking at the
first column in the last block of output.
This regression coefficient tells us how much "each
one-unit increase in the independent variable
corresponds with an x-unit increase (or decrease if
the coefficient is negative) in the dependent
variable." In plain English, we might say, "each one-unit
increase in the 10-point measure of
religiosity corresponds to a 1.34-point increase in
an individual's measure on the 10-point
ideological-conservativeness scale." (A syntax sketch
for running this kind of model follows.)
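Here is that sketch (conservatism and religiosity are hypothetical variable names):
* Dependent variable first, then the independent variable.
REGRESSION /DEPENDENT conservatism
  /METHOD=ENTER religiosity.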
-
With regression, there is a statistical test for
each regression coefficient where the p-value
tells you the probability that the relationship
found in the sample could be
zero or even run in the opposite direction in
repeated sampling. To say that one variable is
a statistically significant predictor of another,
the p-value needs to be .05 or less. These models
also include a value for the y-intercept (in the output,
this is the unstandardized "Constant").
-
With regression results, you can predict the
value of the dependent variable at selected values
of an independent variable. The constant
(aka, the y-axis intercept) can be used to predict
the value of the dependent variable for a given
scenario with a simple formula: DV value =
Constant + (a specified value of the IV times the
regression coefficient). If the IV is a dummy
variable, the language used to interpret its
regression coefficient is: "Compared to the
reference category of (carefully describe anyone who
is not in the group), individuals who are in the
group had an x-point higher (or lower if the dummy
variable coefficient is negative) value on the
dependent variable." In plain English, this might
sound like, "Compared to non-Republicans--that is,
Democrats plus independents--Republicans' score on
the 10-point measure of religiosity was 2.4 points
higher."
-
As with correlation generally, regression models
assume that every one unit increase in the
independent variable will have the same effect on
the dependent variable. This is referred to
as the assumption of linearity. Examples of
how variables can be related to one another but
not have a linear relationship include time and
investments (over time, investment returns compound
so growth is exponential) and the curvilinear
relationship between age and physical independence.
Regression can handle these types of relationships
in a few different ways, one of which is using dummy
variables and interaction terms (this isn't the most
common way, but it is the only way that fits neatly
with concepts you are going to learn in this
class). If we suspected that age has a non-linear
effect on income, a series of dummy variables (say,
reference group = under 35, with additional dummies
for 35-50, 51-65, 66-75, and over 75 years old)
likely would show that wage-earned income, on
average, quickly increases as one moves out of the
youngest age group and then levels off or declines
across the oldest groups.
-
Completely
optional: if you have attended class and
carefully read the textbook material on
correlation but feel like you would like to go
over the basics of this method one more time, you
can watch this 25-minute (12.5 at x2 speed)
screencast presentation covering the logic and
main concepts of correlation: https://youtu.be/pjDDBrunB1A.
Note: The screencast goes over the same conceptual
material we will have reviewed in class, and doesn't
cover the use of SPSS.
-
Completely
optional: if you have attended class and
carefully read the textbook chapter material on
bivariate regression but still feel like you would
like to better understand the basics of this
method, you can watch this 19-minute (10
at x2 speed) screencast presentation covering the
basics and logic of bivariate regression: https://youtu.be/K8A6xGIXPR8.
Note: The screencast goes over the same conceptual
material we will have reviewed in class, and doesn't
cover the use of SPSS.
Week 13
Topic 15 (Monday, 11/10; Wednesday, 11/12; and Friday, 11/14)—Linear
regression with multiple independent variables,
including dummy and interval variables
Monday's class will
be spent reviewing correlation and linear regression. On
Wednesday, we will spend more time on regression, leaving
Friday so that we can practice calculating scenarios (a term
that will make more sense by mid-week).
Important: The concepts
and SPSS work we cover this week will be the last
material that will appear on your SPSS test at the end
of the term (logistic regression--which we will be
learning about next week--will be a topic covered on
your last BlackBoard workshop and on the final exam).
-
SPSS 3 assignment
due by 5 pm Monday
(moved so we could go
over calculating t-tests
in SPSS last Friday).
-
SPSS #4
(posted to Blackboard) is due by 5pm on
Wednesday. This is the
assignment that gives you more practice coding dummy
variables and running/interpreting T-tests. The
assignment also covers the main concepts behind
Chi-squared tests and correlations as well as how to
run these analyses and interpret their results in
SPSS.
-
Ahead of Monday's class, starting with page
199, read to the end of Chapter 9, "Regression," in your Forestiere textbook (17pp).
-
Ahead of Monday's class, print
out this document: A handout of SPSS linear
regression output with annotations.
The handout matches up with the same topic as the
screencasts: using SPSS to predict the level of
support for torturing terrorism suspects, as measured
by a 7-point Likert scale (1 = "never justified"; 7 =
"always justified") that is being treated here as an
interval variable. You should retain your copy of this
document because you will find it handy when you
complete the BlackBoard assignment on
regression.
- Ahead of class,
take another quick look at this article on the 2018 Brazilian
Election. Read the abstract and
intro quickly and then go straight to the methods and
findings section. Read through the interpretation of the
logistic regression results.
The reason you are being asked to quickly read this
particular study again is that it has several dummy
dependent variables, so we will be able to use the
same example study when we look at logistic regression
in the next block of course materials. For both linear
and logistic regression, the examples I show with SPSS
in class will be based on this article and its dataset.
Below are summary notes for the major concepts
covered in class and your textbook to understand how
multivariate linear regression works and is
interpreted:
General concepts for multivariate linear
regression (i.e., regression with more than one
independent variable): here are the key ideas you need
to remember from the textbook chapter on regression
and class:
-
All of the key concepts listed above for
bivariate regression (i.e., one independent
variable) apply to multivariate regression, too:
The R-Square statistic, constant, statistical
significance statistics, are interpreted in the same
way. All regression assumes that each independent
variable has a linear effect with the dependent
variable (see what this means by reviewing the notes
under correlation and bivarate linear regression
that explain the "assumption of linearity").
Also, the individual unstandardized regression
coefficients are all interpreted a similar way as
they are with bivariate regression except
that results for each independent variable caculates
how much a one-unit increase in that independent
variable increases/decreases the value of the
dependent variable when the
influence of all other variables
in the model is held constant at their mean value.
For example, if you are looking at the effect of
each additional level of education on income and the
only other independent variable in the model is the
dummy variable Male, the regression results
for the education variable would be calculated with an
equation that controls for the effect of gender by
calculating .5 x the positive effect of being a male
(i.e., the regression coefficient for the gender
variable). Why .5? This is the mean value for male.
The interpretation of multivariate regression is a
bit more completed when you have interaction
variables (i.e. MaleXRepublican) or multiple
dummy variables for the same variable (e.g.,
race or political party dummies). These type of
variables are discussed below.
-
As with bivariate regression, multivariate
regression results allow you to predict the value
of the dependent variable under different
scenarios. The way this is done is to
assign the values you want for the variables that
define your scenario and to use the independent
variables' mean values otherwise. More details on
this below, but this is the key idea for how
scenarios work.
-
With multivariate regression, you determine
which variables are most important by comparing
their "standardized regression coefficients."
These are also called "betas" and are located in the
SPSS output column labeled as such. Recall that we
can compare standard deviations of different types
of measures in useful ways. So, comparing a 34 score
on the ACT to a 1500 on the SAT is not easy, but
comparing how many standard deviations each of
those scores is from its test mean would tell you that
the ACT score represents relatively higher
performance. A beta tells us how much each one
standard deviation increase in an independent variable
increases/decreases the dependent variable's value,
measured in standard deviations. In other words,
the farther away a given independent variable's
beta is from 0 (beta can be positive or negative),
the more important that particular independent variable
is in predicting the dependent variable's value.
- With
multivariate regression, there is the added
assumption that each of the independent variables
is at least somewhat independent of the others. If
two or more independent variables are very highly
correlated, the statistical results in the model may
not be able to determine how changes in those two
variables influence the dependent variable. See
the note above on multicollinearity.
To use and
interpret regression output with dummy variables,
here are the key concepts you need to remember from
class and your textbook:
-
Critical: to
interpret any dummy variable in a regression
model's results, you have to know what the
variable's reference category is.
-
Whenever you
interpret
a dummy variable, your interpretation should explicitly identify the reference
category.
For example, if a regression model only includes the
variable Latino, the interpretation of that
variable's regression results will
start with phrasing like, "Compared to the
typical non-Latino, the estimated household income
of a typical Latino is $1,300 less a year after
controlling for the other variables in the
regression model."
-
If there is just
one dummy variable in a model and it was coded
from an original variable that had just two
response categories, the reference category
is easy to identify. For example, we might have data
where respondents have been coded one if they
believe freedom is more important than equality and
zero if they think equality is more important than
freedom. If the regression model includes the dummy
independent variable FreedomIsMostImportant, that
variable should be interpreted with phrasing like
this: "Compared to respondents who prioritize
equality, those who think freedom is more important
had an x-point higher score on the 10-point
dependent variable measuring y."
-
If a regression
model includes a dummy variable derived from a
multi-category original variable, think carefully
about how many of those groups have dummy
variables in the model and thus what the proper
reference category is. Consider a
(logistic) regression model looking at a person's
characteristics to predict how much they think NATO
is important to international security, measured on
a 5-point scale. Let's say this model includes
"Democrat" as its only partisan dummy variable. If
so, the reference group when interpreting the
Democrat coefficient is non-Democrats (i.e., since
both independents and Republicans are not in the
model, both groups are the reference category). In
this example, the variable Democrat should
be interpreted with phrasing like this: "Compared to
Republican and independent respondents, Democrats had
an x-point higher score on the 5-point indicator for
seeing NATO as important to international security."
-
If you are running
a regression model to test a hypothesis comparing
two groups, one of those groups needs to be the
reference category. Sticking with the
example of looking at how partisanship shapes voting
for female candidates (something Dr. Setzler has
written a lot about, incidentally), if we want to
compare the likelihood of Democrats voting for a
female candidate to Republicans' likelihood, we
would need to add a second dummy variable to the
regression model for the people who are
independents. Once the regression model included
dummy variables for both independents and Democrats,
those variables' regression coefficients could be
compared to Republicans, who would be the omitted
reference group.
To use and interpret interactive-term
variables, here are the key concepts
you need to remember from class:
-
We create and use
an interaction term in a regression model when we
think that the influence of one independent
variable on the dependent variable depends on the
value of a second independent variable.
For example, we might want to analyze how being a
political science major and the number of hours
studied before an exam in general education
political science classes influence the typical
student's test score.
Let's say we collected a year's worth of survey data
on students' study habits and their test scores in
these types of classes. If we were to calculate a
regression model with just these two variables, we
presumably would find that both factors are
significant predictors of higher test grades but
that there are lots of other factors, too (i.e., our
model probably wouldn't have a high r-square
statistic).
If we saw that both being a major and studying
improved test scores, we might wonder if the effect
of each hour of additional study is different for
PSC and non-PSC majors. Maybe, studying pays off
more for non-majors because they have less of a
background in the subject area and more to learn. On
the other hand, maybe studying pays off more for PSC
majors because they have more interest in the
subject and are better able to retain information
about it.
To test either of these hypotheses, we need to
create and add an interaction term to our regression
model.
-
To create an
interactive term variable, you just create a new
variable that multiplies together each
respondent's values for the two relevant variables.
In the example above, we would use SPSS to create a
new variable with syntax that looks something like:
COMPUTE NewVariable = first_IV * second_IV.
So, in this case:
COMPUTE PSCxHrsOfStudyBeforeExam = PSC *
HrsOfStudyBeforeExam.
The "interactive term" would then be added to the
regression model along with the two variables from
which it was formed (both of the original variables
and the interactive term must stay
in the regression model). And then we would
rerun the regression model and look at our results,
as sketched below.
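Putting those steps together in syntax, with ExamScore as a hypothetical name for the dependent variable:
* Create the interaction term.
COMPUTE PSCxHrsOfStudyBeforeExam = PSC * HrsOfStudyBeforeExam.
EXECUTE.
* Both original variables AND the interaction term go into the model.
REGRESSION /DEPENDENT ExamScore
  /METHOD=ENTER PSC HrsOfStudyBeforeExam PSCxHrsOfStudyBeforeExam.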
-
If the
unstandardized coefficient for the interactive
term is positive and significant, we know
that the combination of the two variables has more
of a positive effect on the dependent variable
than just the additive effect of each variable. In
the example, this would mean that each hour of study
is paying off more for PSC majors than non-majors.
-
If an interactive
term's coefficient is negative and significant,
it means that the combination of the two variables
has less than the effect we would expect if we
simply added their separate effects together. In the
example, this would mean that the effect of each
additional hour of study is less for PSC majors than
non-majors.
-
If an interactive
term's
coefficient
is not statistically significant, the value
of the second independent variable does not
influence the relationship between the first
variable and the dependent variable. In the example,
this would mean that there is no difference in the
grade improvement payoff of additional studying for
PSC majors and non-majors.
In Wednesday's class (or Friday's if we are running
behind), we will practice calculating scenarios with
regression output. What is a scenario, and
what key concepts do you need to remember from
class? The standard way to talk
about the effects of different variables in regression
models is to use a regression results table to explain
how a one-unit increase in a given independent variable
changes the value of the dependent variable when the
effect of all of the other variables is set at each
variable's mean value. While that makes for nice,
succinct tables, discussing situations that compare
different types of hypothetical individuals who are
otherwise similar can help to bring regression results
alive and make them more useful. For example, if
we were analyzing how much different kinds of people
support the legalization of marijuana, we could
calculate a regression model looking at the influence of
gender, religious denomination (i.e., dummy variables
for several of them), and age with controls for a
person's income and education. Using the model's
results, we could compare the level of support for a
30-year-old secular male versus a 60-year-old
evangelical Protestant female with comparable incomes
and educational backgrounds.
-
How do you calculate a regression scenario? To create
a regression results equation, first identify the
value of the unstandardized coefficient for the
"Constant" (use the value listed in the B column of
the SPSS results output box--the last one--that
lists each variable). Then, to that constant value
add the effect of each independent variable (i.e.,
its unstandardized regression coefficient)
multiplied by a value you specify. If you want
to control for any variables, their specified
values are their mean for the full sample.
-
Here's an example of a complex scenario (complex only
because the number of variables is pretty high).
The example uses the descriptive statistics and
linear regression models reported in the Setzler
and Yanus paper you were asked to read ahead of
class. If you hadn't just read that study, I
would have used a simpler example.
Let's use the article's
regression tables and descriptives to create a scenario
comparing two people's scores on the 7-point
indifference-to-gender-equality measure (as
measured by how unimportant a person thinks it is to fight
for gender equality). What is the predicted score for an
older Republican male without a college degree who is
otherwise similar to other Americans (at least on the
variables in the model)? What is the predicted score for a
female non-Republican (i.e., independent or Democrat) who
is otherwise identical to her male Republican peer?
To calculate the expected indifference-to-gender-equality
score for the Republican male, we would use the following
formula:
The model's unstandardized constant (1.046)
+ 1 times the unstandardized
regression coefficient for Republican (i.e., the scenario)
+ 1 times the
Male coefficient
+ 1 times the
No college degree coefficient
+ 1 times the
aged 45 and older coefficient
+ 0 times the
age 30-44 year coefficient
And now we'll add in the controls, using their mean
values in the scenario:
+ .792 times the White coefficient (i.e., 79.2% of this
sample is white)
+ .502 times the Religiosity coefficient
+ .441 times the Blue collar coefficient
+ .399 times the Rural coefficient
+ .500 times the Authoritarianism coefficient
+ .294 times the Racial animus coefficient
= The individual's expected value on the 7-point
indifference-to-gender-equality measure.
Formula in hand, you can plug the whole equation into the
super-cool online computer at https://www.wolframalpha.com/:
1.046 + (1 * .912) + (.792 * .119) + (1 * .444) + (1 *
-.097) + (1 * -.246) + (0 * -.078) + (.502 * -.082)+
(.441 * .190) + (.399 * .013) + (.500 * -.552) + (.294 *
2.69)
Wolfram Alpha tells us that, all other things being
typical, an older Republican male with
no college degree has an estimated
gender-equality-indifference score of 2.71.
Using the same equation in WolframAlpha and changing just
the scenario values for Republican and Male variables, we
see that the expected value for the female non-Republican
is 1.36 on the 7-point scale. That's just about half the
value we found for Republican males.
- Scenarios are helpful in understanding and
visualizing the effects of interaction terms. In
the example above, we might theorize that not
having a college degree makes men particularly
inclined towards sexist views. This is another way of
saying that the effect of educational attainment on
sexism differs by gender. To test this
hypothesis, we would create an interactive term (here,
the dummy MaleXNoCollege) and add it to our regression
model along with the variables Male and NoCollege. In
calculating our scenario, our equation would include:
(1 x the coefficient for Male) + (1 x the coefficient
for NoCollege) + (1 x the coefficient for
MaleXNoCollege). The scenario value for
MaleXNoCollege is equal to the scenario values for
Male and NoCollege multiplied together.
Finally, I have compiled several screencasts on
linear regression that cover the same topics we will
go over in class. If you feel like you need
additional guidance, check out the optional
resources below. Watching any
or all of the screencasts should not be necessary if you
have attended class and put your best effort into the
hands-on practice exercises:
- Optional: After
practicing in class, if you still feel like you need
guidance on calculating and interpreting linear
regression models in SPSS, review this 11-minute
screencast: https://youtu.be/xzl8OxPsM8s.
-
Also optional: This
13-minute screencast follows up on the last one with
a focus on how you
use SPSS to analyze dummy variables and how
output/tables with these variables are
interpreted: https://youtu.be/I2BEi_CkzK0.
-
Also optional:
https://youtu.be/3m66P8PaD3U. In about ten
minutes, this presentation
explains how we can use linear regression output
to predict the value of the dependent variable
with different scenarios involving the
independent variables (something I just covered in
the example above). The standard output in
regression will tell us how a one-unit or one
standard deviation increase in an independent
variable will change the dependent variable when the
effect of all of the other variables is set at each
variable's mean value. While that makes for nice
charts, scenarios can help to bring the data alive.
For example, what is the level of support for
torture on a seven point scale for a 60-year old
Republican male who attends church a lot versus a
30-year old female secular Democrat? Here is a handout with the output and
calculations covered in the video.
-
One important
thing not covered in the last screencast is how
control variables work in regression scenarios. As noted
above, including controls is straightforward. Let's
say that we wanted to calculate the same regression
model and scenarios as the screencast, but control for
the effect of a person's education. If education
were a five-point measure, we would calculate its
mean for the full sample, and our scenario equation
would include one more added component: the mean of Edu5
multiplied by this variable's unstandardized
regression coefficient.
Friday, November 14.
Class time will be used to finish up our overview of
linear regression and--if time is available--for you to
work on SPSS #5 in Blackboard.
Sunday,
November 16, by 5pm: SPSS #5 in
BlackBoard is due.
This assignment focuses on linear regression,
which is the last topic that will be covered in
your practice and final SPSS tests.
Monday, November 17:
First practice SPSS
exam. This is the
first of two mandatory but ungraded SPSS exams that will
prepare you for the graded exam you will take later in
the semester. Remember that logistic regression will not
be part of the end-of-term SPSS practice or final SPSS
tests. You may bring a page of notes with you for the
practice and final SPSS tests.
For the practice and
final versions of the tests, you could be asked to do
any of the following exercises (i.e., you may not be
asked to do all of these things on any one test, but
you will not be asked to do anything that is not
listed below):
-
Create a new variable. You could be asked to create a
new dummy variable that combines information from
either one or two original variables. You should be
able to label the new variable and its response
categories.
-
Split a variable into its subgroups and compare the
frequency at which different subgroups for that
variable have a particular opinion.
-
Make a bar chart in SPSS (not Excel because
of concerns about how long that might take). Your bar
chart must show the percentage of a single
variable's subgroups that have a particular opinion
(e.g., what respondents in households that make $10K
or less think about an issue).
-
Compare two or more variables' means and standard
deviations, explaining what the standard deviations
tell us about the distribution of each variable. For
example, comparing the typical opinion and the range
of opinions expressed by Republicans and Democrats
about whether
"God has given the US a special role in human
history."
-
Statistically test whether the typical individuals in
each of two subgroups that are coded on the same
variable are statistically different from one another
with respect to a specific attitude. (For example, is
the typical African American more inclined than the
typical Latino to think that "college is a gamble"
rather than a smart investment, and is any difference
statistically significant?) Hint: you need an
independent-samples t-test to answer this type of
question.
-
Statistically test whether the share of a specific
group in the sample is statistically different from
what it should be given a known parameter for the US
as a whole. For example, African Americans make up
12.1% of the US population. What percentage of this
sample is African American? Are African Americans
underrepresented in the sample (when analyzed without
the survey's weights turned on)? Is any difference
between the share of African Americans in the sample
and what the percentage should be statistically
significant? Hint: To test this, you need to split the
sample by race, use a one-sample t-test,
use .121 (12.1%) as the test value, and look at the
output for African Americans only.
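If it helps, here is a minimal sketch of the key t-test step in syntax, assuming you have already created a 0/1 dummy for African American respondents (the variable name here is only illustrative):
T-TEST
  /TESTVAL=.121
  /VARIABLES=BlackDummy.
Because the mean of a 0/1 dummy is simply that group's share of the sample, testing that mean against .121 tests whether the sample share differs significantly from the known population figure.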
-
Statistically test whether two subgroups that are
coded on two different variables are statistically
different from one another or Americans in general
with respect to a specific attitude. Hint: To answer
this question, you again would need to split your
dataset and use a one-sample t-test. The test value
will be the mean for another group or for the sample
as a whole, which would need to be calculated
separately unless your instructor gives you that
value.
-
Analyze and interpret chi-squared statistics for two
variables. Recall that for this type of analysis, you
are working with a pair of variables that are dummy,
categorical, or ordinal measures. Typically, this
measure will be used with variables that have no more
than 5 response categories each.
-
Analyze and interpret the correlation statistics for
a handful of variables. You could be asked to
determine whether they could be used together as
independent variables in a regression model (i.e.,
would there likely be collinearity problems?). You
will need to interpret the correlation between pairs
of variables and their statistical significance.
-
Demonstrate your understanding of the limitations of
correlation. You may be asked to speculate
about whether we can determine if there likely is a
causal relationship between pairs of variables, with
one clearly causing the other (e.g., being more
conservative and being Republican). You may be asked
to identify whether any of several other variables may
suggest a spurious correlation between two variables
(for example, being male and disapproving of Joe Biden
are modestly, but still statistically significantly,
associated). You
might be asked why we see a very low or unexpected
association between two variables that probably have a
non-linear relationship (take a look at age and
household income in the dataset).
-
Run and interpret a linear regression model with
three independent variables. You will be asked to
identify and interpret the adjusted R-square
statistic. You also will be asked to identify the
equations for three scenarios that will involve
different levels of one of the independent variables
(including its mean, which you will need to calculate
with SPSS). The other two independent variables will
be set at their mean level of influence in the
equations.
-
Run and interpret the same linear regression model
examined before with up to 5 more independent
variables added, including an interactive term and
multiple dummy variables coded from the same original
variable. You will discuss the model's r-square
statistic, how the r-square statistic has changed as
you have added variables to the model, and what that
change means.
-
For the same regression model, you will also be asked
about the statistical significance of different
independent variables, and the meaning of some
unstandardized and standardized (beta) regression
coefficients. You will be expected to identify the
correct reference groups when interpreting results for
one or more of the dummy variables. You will be asked
to interpret the results of an interactive term.
-
For the same regression model, you will be asked what
equations should be used to estimate the predicted
value of a dependent variable for two or three
individuals who are different in specific ways,
controlling for their other differences (i.e., with
some variables set at their mean level of influence).
You will need to make sure you know how to create
scenarios that involve dummy variables' reference
categories and non-mean values for an interactive
term. For at least the practice exams, you may be
asked to use Google to solve one or more scenario
models.
Topic 16 (Wednesday, 11/19) — Logistic regression and
its interpretation
-
In class, you will
be introduced to logistic regression, which
is the type of regression used when the dependent
variable is binary (i.e., when working with
a dummy variable as the dependent variable).
-
Ahead of class, take
another quick look at this article on the 2018 Brazilian
Election. Read the abstract and intro
quickly and then go straight to the methods and
findings section. Read through the interpretation of
the logistic regression results.
-
Ahead of class, please take the time to
carefully read and print out: this handout of SPSS logistic
regression output with annotations. The document is
very similar to the one I posted above for linear
regression, but this time the dependent
variable--support for torturing terrorism suspects to
obtain information--has been recoded into a dummy
variable. Respondents who said torture was sometimes,
often, or always justified were coded "1"; individuals
who said torture is never justified were coded "0."
-
In class, we will start to practice
interpreting logistic regression tables in the
Setzler and Yanus article. We also will
practice interpreting SPSS logistic regression output,
including interpreting pseudo R-square statistics,
statistical-significance p-values, odds-ratios (Exp(B)),
and Wald statistics.
Some key concepts
about logistic regression to remember
Here's why we need it: Dummy variables work fine as
independent variables in OLS regression, but when a
dummy variable is the dependent variable, we need a
different approach. The problem is that each one-unit
increase in an independent variable typically does not
have the same effect on a binary outcome. For example,
assume that the
number of injuries a football team has will decrease its
number of points in the typical game. OLS regression
works well here because it tells us how many points are
lost with each additional injury. However, if we wanted
to know how injuries impact whether the team is likely
to win (a yes/no outcome), the relationship is more
complicated. The first injury or two probably has a
modest effect on winning. But after a certain threshold,
each additional injury causes the team's odds of winning
to drop substantially. Eventually, additional injuries
don't matter anymore because the team is already
overwhelmed and going to lose. Logistic regression is
designed to capture these non-linear patterns that OLS
regression cannot handle.
-
Use bivariate (aka binary) logistic regression
only with dichotomous dependent variables that have
been coded zero or one. This type of regression
is often the best option when you are dealing with
ordinal or categorical variables, but you need to
convert those types of variables into dummy dependent
variables to use this type of regression. There are
other types of logistic regression designed for
ordinal and multi-category dependent variables, but
they are less frequently used and beyond the scope of
this course.
-
Interpreting logistic regression output is not
very intuitive, so why do we even need to learn
this? You see logistic regression models'
estimates all of the time even if you haven't realized
it. Many things we want to know about--what factors
are most important in determining who will win
elections, whether countries go to war under certain
circumstances, whether someone has been asked for a
bribe--are yes/no variables that require logistic
regression. And, as we have learned previously, it is
very common to convert ordinal variables--especially
Likert scales--into dummy dependent variables. When
researchers do this, they use logistic regression.
-
Logistic regression provides information that is
very similar to what we get when we use linear
regression; however, the key statistics have
different names and are in a different place in the
SPSS output:
-
With SPSS's logistic regression output, the "pseudo r-square
statistic" (use the one labeled Nagelkerke), is
interpreted just like the r-square statistic
in OLS regression. So, if a regression model
has a pseudo R-square of .0897, it means that the
variables in that model collectively explain about 9
percent of what causes the outcome predicted by the
model.
-
To identify which
variables are most important in explaining the
outcome, compare the independent variables' "Wald"
statistics. The variables with the largest
Wald values explain more of the variation in the
dependent variable. You may recall that there is a
similar statistic for linear (OLS) regression:
standardized regression coefficients, also called
betas.
-
With logistic
regression, we do NOT directly interpret the
unstandardized regression coefficients in
the first column of the variables table. Instead, we interpret
how much a change in each independent variable
affects the dependent variable by looking at the
odds ratios.
To interpret this, think about starting with one
dollar and then looking at the odds ratio. An odds
ratio of .70 means you now have only 70 cents—that's
30% less than what you had before.
If we were predicting who voted in the last election
and an interval variable's odds-ratio was .70, we
would say that every one unit increase in that
independent variable "reduced the likelihood of
voting" by 30%.
For a dummy variable, the interpretation is similar
but includes the reference group. For example, if the
odds ratio for "under 25" is .70, we would say:
"Compared to eligible voters who are older than 25,
individuals who are under 25 were 30% less likely to
vote."
-
If an odds-ratio is between 1 and 2, there is a
positive relationship, and we can convert that
odds-ratio into an easy-to-understand percentage.
Think of it like this: If I used to have a dollar
and now I have $1.67, I now have 67% more.
For example, in a voting model, if a variable's
odds-ratio is 1.545, we would say that every one-unit
increase in that independent variable "increased the
likelihood of voting" by about 54%. Alternatively, we
could say it made the likelihood more than one and a
half times as great.
For dummy variables, remember to note the reference
group. So, if a model had both a Republican and a
Democrat dummy variable, we could say: "Compared to
independents (the omitted reference group), Democrats
were 54% more likely to vote."
-
Finally, if we have an odds ratio of 2 or more,
we can say that every one-unit increase in the
independent variable increases the likelihood of
the outcome by x times. If you were running
a voting model and a predictor's odds ratio was
2.434, we would say that each one-unit increase in
this variable "increased the likelihood of voting"
by nearly two and a half times.
And if the independent variable is a dummy variable,
we would say something like, "Compared to [everyone
belonging to the group omitted from the regression
model], [people in the dummy-variable group] were 2.4
times as likely to have voted."
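A quick rule of thumb that ties all three cases together: the percentage change in the likelihood is (odds ratio - 1) x 100. So an odds ratio of .70 gives (.70 - 1) x 100 = -30% (30% less likely), 1.545 gives about +54%, and 2.434 gives about +143%, which is the same as saying nearly two and a half times as likely.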
This is optional reading, and something you probably
only want to read if you are interested in getting
into the weeds on this statistical method at a level
that is not going to be necessary for your tests.
While this outtake is from one of the most-assigned
political science textbooks in the country (this class
used it for years), the detailed mathematics in the
chapter are no more essential to your competent use of
logistic regression than understanding the inner
workings of your car engine is necessary for you to be
an excellent driver. I am assigning it to you as
completely optional
reading because its explanation of why we need to use
logistic rather than linear regression with a
dichotomous variable is helpful. Carefully read through
the extended example that talks about how we model
something like the strength of a person's partisanship
versus vote choice. The other useful part of the chapter
is its explanation for why odds-ratios are the statistic that we use to
interpret the influence of each variable in a logistic
regression model.
-
After class,
if you were engaged in class, have read the optional
textbook chapter material on logistic regression, and
still feel like you would benefit from more
information on the basics of this type of regression,
you can optionally watch this 12-minute screencast
(https://youtu.be/uUf3h8ifZxE). The screencast
explains in detail what bivariate logistic
regression is, how it works, and why it is often
used with ordinal dependent variables after they
have been recoded into 0/1 dummy variables. This
screencast covers concepts rather than any SPSS work;
there is another screencast below that looks at how we
use SPSS to run and interpret logistic regression
models. Note that watching either of the screencasts
on logistic regression is optional and should not be
necessary if you attend class and put your best effort
into the hands-on practice exercises.
Topic 17 (Friday, 11/22) — Let's
do and interpret bivariate logistic regression with
predicted probability scenarios. This is the last
new topic that we will cover this semester.
While logistic regression will not be part of your end-of-term SPSS
test, your final exam will ask you to interpret a logistic
regression output table from SPSS, and you will need to
complete a BlackBoard exercise on this topic (SPSS #6).
-
By the end of
this week: If you have OARS accommodations
that you intend to use for the final SPSS exam or
for the final exam during finals week, please make
sure to make arrangements with OARS. OARS
has a deadline in place, and if you do not request
accommodations in advance, you may not be able to use
the testing center during the final
period.
-
Before Friday's class, carefully review the schedule's summary
notes from last time we met, which is when we first
discussed the
logic behind logistic regression and began to practice interpreting odds ratios in logistic regression tables. If
you have any concerns about your grasp on the basics
of logistic regression, you should review the optional
materials linked above, including a textbook chapter and the
screencast that covers the same ideas we talked about
in class.
-
In class, we will continue to
practice interpreting SPSS bivariate logistic
regression output. If you have not done so already,
you should take the time to
carefully read and print out: this handout of SPSS logistic
regression output with annotations (the same handout
described under Topic 16 above).
-
In class, we will also practice calculating logistic
regression scenarios. As with linear
regression, interpreting logistic regression analysis
in an interesting way is best done with "predicted
probability" scenarios. There are more details on what
this will involve below.
Some key concepts
about logistic regression predicted probability
scenarios to remember:
-
Instead of
assigning detailed readings from a methods textbook,
below you will find a summary of key ideas and
concepts to explain why and how we typically report
at least some of our logistic regression results
using "predicted probability" scenarios rather than
the odds-ratios that SPSS reports by default:
-
What exactly is a predicted probability, and how
is it different from an odds-ratio, the latter of
which is listed in SPSS's default output? Odds
and probabilities are different ways of conveying the
same information. Odds refers to how many times an
outcome occurs compared to how many times it doesn't
occur. Probability is how many times an outcome occurs
compared to the total possible occurrences. If this
sounds confusing, consider an example: If a team has a
50% chance of winning a game, we expect it to win 1/2
of its games, so the odds of that team winning are
1:1. If a team has a 75% chance of winning, it should
win 3/4 of its games, which means its odds of winning
are 3:1.
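The general conversion formulas are: odds = probability / (1 - probability), and probability = odds / (1 + odds). Checking the example: a 75% chance of winning gives .75 / .25 = 3, or odds of 3:1, and odds of 3:1 give 3 / (3 + 1) = 75%.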
As was explained above (in the notes explaining how we
interpret raw SPSS output for logistic regression), SPSS
reports odds-ratios to show how a one-unit increase in
an independent variable changes the likelihood of a
dichotomous outcome. The reason we use odds-ratios
in logistic regression is because for most yes-no
outcomes, a one unit increase in an independent
variable does not have the same effect on the
probability of an outcome across the independent
variable's range. Again, an example will help to
illustrate a complicated concept:
Let’s say a football team has a 20% chance of losing
when fully healthy. That means its odds of losing are
just 1:4: one expected loss for every four expected
wins.
Now suppose each injury multiplies the odds of losing
by 1.5. Because we multiply changes in odds, not
probabilities, the same increase in odds does not
translate into a linear change in probability. Using the
odds formula:
0 injuries: odds = 1:4 (probability of losing = 20%)
1 injury: odds = 1.5:4 (probability of losing = 27.3%)
2 injuries: odds = 2.25:4 (probability of losing = 36%)
3 injuries: odds = 3.375:4 (probability of losing =
45.8%)
4 injuries: odds = 5.06:4 (probability of losing =
55.9%)
5 injuries: odds = 7.59:4 (probability of losing =
65.5%)
6 injuries: odds = 11.39:4 (probability of losing =
74%)
As you can see, the same multiplicative increase
in the odds of losing with each additional
injury does not have the same effect on the probability
of losing, but knowing how much each injury
increases the odds of losing allows us to predict the
probability of losing at a given number of
injuries.
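If you want to reproduce the table yourself, the calculation is: odds of losing after k injuries = (1/4) x 1.5^k, and probability of losing = odds / (1 + odds). For two injuries, that is .25 x 1.5 x 1.5 = .5625 (the 2.25:4 shown above), and .5625 / 1.5625 = 36%.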
The reason we go through the hassle of
mathematically converting SPSS's default output of
odds-ratios into predicted probabilities is that most
people find probability scenarios a lot easier to
understand. For most people--whether academics or
everyday folks--it makes a lot more sense to convey
information in probabilities, which can be expressed as
percentages: In the example, a team with two injuries
has a 36% chance of losing, while a team with six
injuries can be expected to lose 74% of the time.
-
How can we quickly calculate predicted
probabilities for specific scenarios using nothing
but SPSS output and the assistance of generative
AI?: To calculate predicted probability
scenarios like those in the football example, we have
to mathematically transform logistic regression's
output for the relevant variables. While you can do
these transformations inside SPSS, it is a very
complicated, multi-step process to do so (other
statistical programs make it much easier).
Fortunately, we can use generative AI to accurately do
these calculations as long as you give it the right
SPSS output and an effective prompt. Here's how:
To calculate a predicted probability scenario with
the assistance of Claude.ai or ChatGPT, do the
following:
(1) Use SPSS's point-and-click options to build a
logistic regression model. But rather than running the
model, use the paste option to put the command code for
the logistic regression into a syntax file.
(2) Next, copy all of the independent variable names
from the syntax you just pasted. And on a new line,
write the word DESCRIPTIVES, paste in the variable
names, and add a period to the end of the command.
(3) Next, select the blocks of code for the logistic
regression model and the full DESCRIPTIVES command, and
run that code (a sketch of what this combined syntax
might look like appears after these steps).
(4) You now should have SPSS output that includes the
regression coefficients table (the last table in the
logistic regression output) followed below by a
descriptive statistics table that lists all of that model's
independent variables.
(5) Next, you need to create a text template that you
will use for your scenarios. To do so, open up the
descriptives output table and copy the column of
independent variable names. Paste those names into your
text document, and add "= mean" after each of them. For
example:
Republican = mean
Democrat = mean
Age in Years = mean
Male = mean
and so on...
(6) Copy the SPSS table that includes the regression
coefficients table (the last table in the logistic
regression output; the one that lists your odds ratio)
and paste it into either Claude or ChatGPT.
(7) Then do the same thing with the descriptive
statistics table output.
(8) Now tell the AI engine to get ready to calculate
the predicted probabilities:
"I want you to calculate some predicted probability
scenarios for me, using the SPSS output for my
descriptives table and the logistic regression model.
In each scenario, I will provide the settings for the
independent variable values I want analyzed, assuming
that all other variables in the regression model are
set to their mean values. Are you ready for
scenarios?"
- Now you are ready to
provide the scenario information and calculate
predicted probabilities:
Copy the scenario
template that you created earlier, and paste that into
the AI engine (you just told it to get ready
for scenarios). Modify
the values of the independent variables to fit your
scenario, changing ONLY the values related to the
scenario you want calculated. For example, if
you want to know the probability that a 35-year-old
male Republican thinks the US is losing its cultural
identity, you will change the age variable from mean to
35, the Republican variable to 1, and the Male variable
to 1.
Leave all other
variables the way they are (i.e., at their means)
unless one of the variables in your scenario is
mutually exclusive of another variable in the model.
Staying with the same example, if you need the predicted
probability that a 35-year-old male Republican thinks
something, any other party identification variables need
to be set to zero. For example:
Republican = 1
Democrat = 0
Age in Years = 35
Male = 1
And the other variables would all be = mean.
If you want to
calculate a scenario that involves a reference
category dummy variable, set all of the related
dummy-coded variables to zero. So, if
Democrats are the reference category for your other
dummy-coded party variables and you want to calculate
the predicted probability that a Democrat did or
thinks something, the other party-identification
variables will all be set to zero.
Finally, if you have
an interaction term in your regression model, its
value may need to be changed to match the scenario.
For example, let's say you had a three-variable model
predicting whether teams will win a game: N.injuries (to
key players), PlayingAway, and injuriesXplayingAway. The
interaction term is testing whether the effect of
injuries on the likelihood of winning is different when
playing away. If you were calculating a scenario for the
probability of winning when playing away with three
injuries, your scenario will include
"injuriesXplayingAway = 3" because 3 (injuries) × 1
(dummy for playing away) = 3. If the scenario looked at
playing at home, both PlayingAway and
injuriesXplayingAway would be set to zero.
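Putting that example into the scenario template format, the away-game scenario would read:
N.injuries = 3
PlayingAway = 1
injuriesXplayingAway = 3
while the home-game version would keep N.injuries = 3 but set both PlayingAway and injuriesXplayingAway to 0, with any other variables in the model left at their means.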
You will be happy to know that the block of material on
logistic regression ends the new material for the
course.
This is what the end-of-term schedule looks like:
This year, Thanksgiving falls very late in the semester.
Because of this, the week of the holiday break and
afterward will be spent mostly reviewing or completing
assessments intended to help you prepare for exams. The
last week with new material in the course will be week
14.
-
Monday, November
17: First practice
(but mandatory) SPSS exam.
This is the first of two mandatory but ungraded SPSS
exams to prepare you for the graded exam that you will
take later in the semester. Remember that logistic
regression will not be part of the end-of term SPSS
practice or final SPSS tests.
You
may bring a page of notes with you for the practice
and final SPSS tests. The practice tests are
mandatory; if not completed, zeroes will be entered as
additional SPSS homework (BlackBoard) grades.
-
Wednesday, November 19 and Friday, November 21: We
will cover logistic regression
in class.
-
Monday,
November 24. Second practice (ungraded,
but timed and mandatory) SPSS test on
BB. The
concepts that will be covered on this test will be
drawn from the same list I used (see above) for your
first practice test. Remember,
you may bring a page of notes with you for the SPSS
tests. This practice test and the final
version will not have
detailed reminders of how to use SPSS for different
types of problems. The practice tests are mandatory;
if not completed, zeroes will be entered as additional
SPSS homework (BlackBoard) grades.
-
Wednesday,
November 26 and Friday, November 28: No
classes (Thanksgiving)
-
Monday,
December 1. Normally, the SPSS test would be
on this day, but concerns about air-travel delays have
led me to put it on the last day of class. The
expectation is that you will be in class on Monday,
barring unplanned travel issues. This will be a day
for you to ask questions about the final exam and then
to work on either your logistic regression assignment
or to practice more for the SPSS test.
-
Tuesday,
December 2 by 5 pm. SPSS
#6 assignment on logistic
regression due.
This is the last SPSS Blackboard assignment of the
semester.
-
Wednesday, December
3, SPSS test (10% of the course
grade).
This test will be very similar in format to the
practice test you took before the holiday break. The
concepts that will be covered on this test will be
drawn from the same list I posted for your first
practice test.
-
Thursday, December 4.
All of the BlackBoard SPSS workshops will be
closed permanently at 10pm on Thursday.
-
Your final exam for this class will be when the
University has scheduled our exam period: Saturday,
December 6 at 8am. All students will be required to
take the third unit exam during the final exam period.
You also will be writing an essay. If you miss the
SPSS test for whatever reason, you will take that test
during the last part of the final exam period.