Unit 3: Using
SPSS to analyze datasets (materials
for the other course units are accessible from the course
homepage).
Materials you will find useful for this
unit:
The course's big deliverables:
The grading rubric for your final
presentation of the thesis. Note: some aspects
of this assignment will be changed due to social
distancing and masking requirements. We likely will do
the presentations using virtual meeting software.
Thesis assignments (these may be modified as their due
dates approach, so don't print them out way ahead of
time):
Thesis assignment 4: A draft of the thesis section describing your hypotheses, data, variables, and methodology (which will require you to attach a codebook that you will create with this template).
Thesis assignment 5: A draft of the thesis's findings (including tables and figures) and conclusions
Professional Development Assignments (these may
be modified as their due dates approach, so don't print
them out way ahead of time):
Prof. Development 1: Mandatory mentor meetings (Specifics and grade rubric distributed by email)
Prof. Development 2: A series of BlackBoard assignments on using SPSS and interpreting its output.
Some initial observations about this part of the course
Starting when we get back from break, a significant block of class time will be spent completing SPSS workshops that (re)teach you how to analyze different types of variables and their relationships with one another using SPSS. The materials you will need for the workshops are in a folder in the PPTs/Assignments folder that is linked from the course homepage (or they soon will be if they aren't ready yet). The workshop assignments will be uploaded in parts to Blackboard after we have reviewed the relevant materials in class. Keep in mind:
No one expects you
to be an expert in either advanced statistics or
SPSS going into this block of material. I
know that it has been a long time since many of you
have taken a research methods class or used SPSS,
and it is not unusual to have some students taking
PSC 2019 and PSC 4099 at the same time. You will
have lots of practice computing, visually
summarizing, and interpreting statistics before you
will need to complete the major thesis assignments
that require advanced statistical work.
All of the resources you will need to do the statistical work for your thesis are in or linked to this schedule. As you can see below, I have put a lot of details and all kinds of links to resources into this part of the course schedule. Fear not. All of the materials are listed here because I want you to have all of the resources you may need for your thesis located in one place.
I am aware that students have different learning styles, and the multitude of resources below reflects this understanding. Everything that you need to be able to do to complete the statistical work for your senior thesis is covered in screencasts and other materials that are linked below. And we will be reviewing almost all of these materials with hands-on exercises during the workshop.
Outside of class, you need to continue to make progress on your own project to meet several deadlines.
In week 10, the second week back from break, you are going to give your first presentation. The presentation is going to require you to summarize the front end of your thesis project and to demonstrate that you have coded your study's dependent, independent, and control variables correctly.
In week 11, a week after your presentation, you will need to submit Thesis assignment #4, which is a draft of your study's research methodology. If you would like to review an HPU student-authored sample of a descriptive statistics table and a well-written methodology section, see this paper:
Madison Deane '24: "Fate, Randomness, and Economic Policy Attitudes." See pp. 7-9.
Week 8 is the mid-term
break: No classes
Week 9 (10/15, 10/17)
Note: You will have
an SPSS assignment due at the end of week 9 on
Friday, October 18 at 5pm.
It will be posted on Blackboard. This
assignment will cover the material from the week's
workshop.
The first week of the SPSS workshop will cover
several statistical techniques necessary to produce the
type of "descriptive statistics table" that social
scientists typically use to summarize the central tendency
and distribution of their project's variables. You will
need to use these skills to create a table that will be
submitted as part of the presentation you will
deliver during Week 10. At that time, you also will need
to submit a draft copy of your codebook (that
you will create with this template). If you would like a
good example of a descriptive statistics table, see page 9
of Maggie Selman's paper, which is linked above. The same paper includes the study's codebook in Appendix A, which starts on page 21.
This week, we will review what a standard deviation is and what information it tells us. We will also talk about why categorical and ordinal variables typically are recoded so that every response (or subset of collapsed responses) is recoded as a dummy variable. Social scientists can and frequently do analyze categorical and ordinal dependent variables; however, we will not be covering these types of regression.
We also will review a couple of bivariate analysis techniques. Researchers often begin to look at the relationship between their independent and dependent variables with figures visually showing how different categories of their independent variables correspond to different values for their dependent variable/s in the sample they are working with. So, we will practice using SPSS in combination with Excel to create bar charts showing how different independent variable groups are distributed on a dependent variable. As you will see in the research article you will be re-reading for this week (the one on why so many Brazilians voted for an illiberal presidential candidate), researchers also sometimes combine bar-chart figures with confidence intervals to both show and measure the relationship between their independent and dependent variables. You will not be required to create this type of figure for your thesis, and we will not practice doing so. However, below, you will be given the resources to make this type of bar chart in case you have time to incorporate this type of work later in the term or for a conference paper.
Bar charts can only show us how different values or
groups on our independent variable/s may be associated
with the values of dependent variables in our sample. To determine
how consistent the relationship is between each
independent and dependent variable in a study, social
scientists typically employ formal statistical tests.
Specifically, these tests assess how likely it is that
what we are seeing in our sample would be found in
repeated sampling. You learned about the most common
bivariate association tests in research methods (e.g.,
chi-square and T-tests), but
our workshop
will focus on just bivariate
correlation analysis because this is the only
type of bivariate statistics that will be required for
your thesis work. Using this statistical procedure is
critical because it allows us to anticipate problems with
multicollinearity,
which is when two or more of our independent variables are
so closely correlated that a regression model will not be
able to accurately assess the influence of each
independent variable by itself.
Now that you know what we will be doing in week one,
here's what I would like you to do before we start this
week to get the most out of our class meetings:
Before our first meeting this week, please make sure that you have printed out and are familiar with this article: "Did Brazilians Vote for Jair Bolsonaro Because They Share his Most Controversial Views?"
Why do I keep asking you to use the same sample
article?
First, I value your time and know that most of your
reading in this class is for your own project. Using
just a few sample articles is meant to cut down on
your "homework" time.
Second, the inspiration for this article occurred
in a senior seminar class a few years ago; this
study was written to be a senior-seminar length
manuscript that mostly uses only the statistical
techniques that we learn in this course (and those
that we don't learn are covered in optional
screencasts below).
Third, the article provides a sample
methodology section for you to consult as you work
on Thesis Assignment 4, starting on page 4
(the one thing related to your assignment that is
missing from the article is a summary statistics
table).
Finally, the article provides a good example of how you need to pay close attention to how the theory, methodology, and findings sections fit together: Notice that the same main arguments and theory-derived hypotheses and variables are the focus of each section of the article. Notice also that the ordering of the theories, their hypotheses, and the variables used to test those hypotheses is the same in each section. Incidentally, one mistake in this article is that the author (me) has the variables listed in reverse order in the figures and tables.
As we are working this week and next, and perhaps again later in the term, you likely will need to review at least some of the screencasts and other material linked below. If you are prompted below to read or watch material before we meet in class to talk about a statistical method, please do the work, or you may well find that it is difficult to understand what we are talking about in class.
Much of the material below is optional, but it should be reviewed if you need more guidance on particular methods. The students in this class have varying degrees of experience and comfort using SPSS, so you will need to tailor your class preparation this week and next based on how familiar you already are with using SPSS for univariate analysis, correlation, linear (aka OLS) regression, and logistic regression.
Here are a set of resources for the statistical concepts
and SPSS methods we will be covering in the first week
of the workshop:
Making a descriptive statistics table and interpreting its data
Important: Every student will need to include a descriptive statistics table in their presentations and thesis. The table will need to include sections with your dependent, independent, and control variables. Here is a sample of what a descriptive statistics table should look like. This table came from a study that Drs. Setzler and Yanus presented at a conference. The paper looked at how much the attributes that predict sexist views also predicted voting for Donald Trump in 2016. This study had no control variables, so none were listed in the table. Incidentally, a revised version of the paper was published.
Like the other statistical methods we will be reviewing in detail, using SPSS's commands to generate descriptive statistics is covered in the third of the three instructor-compiled handouts on using SPSS for senior thesis work. If you want to read more on how to use SPSS's point-and-click method to generate univariate statistics, here's a document from another source that goes over those procedures.
Some important ideas to know about descriptive statistics and what is reported in a descriptive statistics table:
What needs to go
into the table? At a minimum, descriptive
statistics tables typically report each variable's
maximum and minimum values as well as its mean and
usually its standard deviation.
For your thesis work, you will also need to report the number of observations for each variable because this information is how we will verify that you do not have a serious issue with your coding (or that you have one or more variables that were only asked of part of the sample). If you have a variable where a large share of the respondents are missing data, make sure you have coded that variable correctly. If a variable was only asked of part of your sample, you will need to think through how you are going to deal with the situation.
It also is
standard practice for descriptive statistics
tables to report standard deviations.
The standard deviation is a widely used
measure (in all fields) for looking at the
distribution of an interval variable across
its range of values.
A straightforward way to explain a standard deviation (SD) is to say that it is a measure of how far away from the mean most respondents' answers are scattered. Roughly two-thirds of answers fall within one standard deviation of the mean, that is, in the range running from one SD below to one SD above the average respondent's answer.
Comparing both means and standard deviations for different groups is useful. For example, if Professor Washington's advisees all have an average GPA of 3.0 with a standard deviation of zero, we know that each of his advisees has received a B in every one of their courses so far. If Prof. Jones's advisees also have an average GPA of 3.0, but the SD in the GPAs of her advisees is 1, the SD tells us that around two-thirds of her students have GPAs between 2.0 and 4.0. In short, there is no difference in the average GPAs of the students advised by Professors Washington and Jones, but the grades are much more spread out among Professor Jones's advisees. If you were a very strong student, you might prefer to be one of Jones's advisees, since lots of them are earning A grades. If you have been earning lots of Cs, you might want to be advised by Professor Washington, since his advisees appear to earn Bs in every class.
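The advisee comparison above can be sketched numerically. Here is a short Python illustration; the GPA lists are made-up values chosen so that both groups average 3.0:

```python
import statistics

# Hypothetical advisee GPAs: both groups average 3.0, but only
# Prof. Jones's group has any spread around that mean
washington = [3.0, 3.0, 3.0, 3.0, 3.0]
jones = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 3.0, 4.0, 5.0, 3.0]

print(statistics.mean(washington), statistics.pstdev(washington))  # 3.0 0.0
print(statistics.mean(jones), round(statistics.pstdev(jones), 1))  # 3.0 1.1
```

Identical means, very different standard deviations: exactly the situation the two professors' advisees illustrate.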
Remember how to interpret the means of dummy (aka "binary," "dichotomous," and "0/1") variables. By convention, we report the standard deviation for each dummy variable, too, even though the SD of a dummy variable is not useful information. By itself, the mean value of a dummy variable reports its distribution in your sample; e.g., a value of .37 for the dummy variable Democrat indicates that 37% of the sample identifies as a Democrat.
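To see why the mean of a dummy variable works this way, here is a tiny sketch with made-up 0/1 responses:

```python
# Made-up 0/1 responses: 1 = identifies as a Democrat
democrat = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]

# The mean of a 0/1 variable is just the share of respondents coded 1
share = sum(democrat) / len(democrat)
print(share)  # 0.4 -> 40% of this toy sample identifies as a Democrat
```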
Descriptive statistics tables report means, so what do you do if you have an independent variable that is an ordinal variable? You have two choices. You can treat it as a continuous variable and report its mean (that is what is done for several variables in the article on Brazilians voting for Bolsonaro that you read). Alternatively, you can create one or more dummy variables out of an ordinal variable, which is often done when levels of education are a key independent variable.
The mean and standard deviation of a categorical variable are useless. Categorical variables should be recoded and analyzed as separate dummy variables in the descriptive statistics table. For example, if you have a party variable with three categories, a mean of 1.3 doesn't mean anything, so you'd want to create three dummy variables, one for each party, when putting this information into the descriptive statistics table. If your analysis would benefit from also summarizing the distribution of a categorical (also called "nominal") variable, you will want to make a separate table or figure reporting its "frequencies" (in percentages).
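Recoding a three-category party variable into one dummy per category, as described above, can be sketched like this (the party labels and responses are invented for illustration):

```python
# Invented categorical responses for a three-category party variable
parties = ["Dem", "Rep", "Indep", "Dem", "Rep", "Dem"]

# Build one 0/1 dummy per category
dummies = {p: [1 if resp == p else 0 for resp in parties]
           for p in ("Dem", "Rep", "Indep")}

print(dummies["Dem"])                      # [1, 0, 0, 1, 0, 1]
print(sum(dummies["Dem"]) / len(parties))  # 0.5 -> half the sample is a Democrat
```

Each dummy's mean is now interpretable as the share of the sample in that category, which is exactly what the descriptive statistics table needs.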
After we have talked about using SPSS to generate descriptive statistics, watch this optional screencast if you want a step-by-step refresher on how to make a descriptive statistics table with SPSS: https://www.youtube.com/watch?v=UavaZ8cXuWg&feature=youtu.be (6 min, 22 sec).
Even if you already know how to do this, there are some handy shortcuts covered in the video that may make watching it worth your time:
(1) Use SPSS to generate the results you need for a descriptive statistics table: Analyze -> Descriptive Statistics -> Descriptives. Then, select just the variables you want and use the check-boxes in "Options" to generate results for only the mean, standard deviation, minimum value, and maximum value. These statistics are generally reported in the order just listed, in a table where each variable is in a separate row and the four statistics are each in a separate column.
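Although you will produce the table with SPSS, the four statistics from step (1) are easy to sketch in Python with toy data (the variable names and values below are invented):

```python
import statistics

# Toy stand-in for a thesis dataset: variable name -> responses
data = {
    "support_policy": [1, 2, 3, 4, 5, 3, 2, 4],   # 1-5 scale treated as interval
    "female":         [1, 0, 1, 1, 0, 0, 1, 0],   # 0/1 dummy
}

# One row per variable: mean, SD, minimum, maximum -- the column
# order described in step (1)
for name, values in data.items():
    print(name,
          round(statistics.mean(values), 2),
          round(statistics.stdev(values), 2),
          min(values), max(values))
```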
A pro trick: For the step above, when using a PC, you can select multiple variables at a time to add or remove from the list of variables you want to analyze by holding the control button down as you select each of them. You can select a range of variables by holding the shift key down as you select variables. On a Mac, depending on your trackpad settings, you usually need to press the control button in combination with the shift key to select a range of variables.
Here's an even more useful trick: To make it faster to find the variables you want to analyze, you can change the view of the variable list so that variable names are listed alphabetically. To do so, hover over the variable list, right-click, and select the option to see variable names. Repeat this step to order them alphabetically. On a Mac, again depending on your trackpad settings, what is a right-click on a PC typically involves pressing the control button while clicking your trackpad.
In fact, it is so fast and easy to sort and find variables in the descriptives window that this usually is the best way to create a list of your study's variables any time you need one. Just point, click, and paste a descriptives command. In syntax, you can move the variables around if you think you are going to want them in a different order.
(2) As noted above, for a 0/1-coded (i.e., "dummy") variable only, calculating its mean will reveal the percentage of respondents belonging to the category. For example, if you have a variable coded 1 = female, 0 = male, a mean of .326 indicates that 32.6 percent of the sample is female.
The fastest way to create descriptive statistics for a variable's subgroups (i.e., for each response category) is to split your dataset on that variable and then run the descriptives command. You could use this strategy, for example, if you wanted to compare different racial/ethnic groups' average income or the frequency at which people who identify with different political parties agreed with a particular statement.
After we have analyzed subgroups in class, if you need additional information on splitting a dataset and running analyses, watch this optional screencast: https://youtu.be/YWgz0bKcq-M (a little over five minutes). This technique allows you to easily generate descriptive statistics--including means--across different groups. The example in the video looks at mean levels of support for torturing terrorism suspects among people belonging to different political parties.
Some things to
remember from the video:
(1) To tell SPSS that your analyses should be run on different groups for a variable:
Data -> Split file -> Compare groups.
Then, select a group variable to work with. In the video, the data is "split" so that analyses will be run for individuals grouped by their partisanship.
(2) Once you are done analyzing groups, you MUST turn off the group comparisons to get SPSS to go back to normal:
Data -> Split file -> Analyze all cases.
If you don't do this step, your subsequent analyses will keep analyzing subgroups.
(3) And something important I left out of the screencast: Sometimes, you will want to create a variable to use just for this procedure. For example, in your analyses, you almost always want to use dummy variables for party identification (i.e., create dummies for Republicans, Democrats, and independents). If you wanted to split your dataset by partisanship so you could quickly generate descriptive statistics for a table comparing how Democrats and Republicans answered questions in the survey, you could create a categorical variable (e.g., Dem=1, Rep=2, Indep=3) to split your data so that a descriptives or frequency command would report results first for Democrats, then for Republicans, and then for the remaining group. If you do not create this extra variable and instead split your data on your Democrat dummy, the stats you get for the non-Democrats would lump Republicans and independents together rather than giving you the Democrat-vs.-Republican stats you are looking for.
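The split-file-then-describe workflow amounts to grouping rows by a category and summarizing each group. A rough Python equivalent, using invented survey rows:

```python
from collections import defaultdict
import statistics

# Invented survey rows: (party, support for some policy on a 1-5 scale)
rows = [("Dem", 4), ("Rep", 2), ("Dem", 5), ("Indep", 3),
        ("Rep", 1), ("Dem", 4), ("Rep", 2), ("Indep", 3)]

# Group responses by party, as SPSS's split file does
groups = defaultdict(list)
for party, support in rows:
    groups[party].append(support)

# Report a mean per subgroup, like running descriptives after a split
for party in sorted(groups):
    print(party, round(statistics.mean(groups[party]), 2))
```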
Making bar graphs in Excel that visually show the
relationships between two or more variables.
Important: Every student's thesis and presentations will need to include a bar chart examining the relationship between their independent and dependent variable/s (if you have control variables, they do not go into these analyses).
Here is an example of a bivariate bar chart; it comes from a study looking at why women and men differed in their level of support for civil liberties in the US and Canada over a ten-year period. When you look at the figure, consider how much easier it is for the reader to analyze the association between gender and support for civil liberties in a figure than would be the case with a complex table listing lots of percentages.
When you are comparing how different groups vary with respect to another variable (for example, the percentage of women vs. men, Democrats vs. Republicans, and whites vs. non-whites who voted for Joe Biden in 2020), the most visually appealing way to do so is to generate means or frequencies as instructed above, and then to combine your SPSS results for subgroups into a single, nicely formatted chart created in Excel (and then added to your paper or presentation via a screenshot of the figure).
In class, we will review how to use SPSS and Excel
together to make a bar chart. If you need a refresher after class,
watch this optional screencast (https://youtu.be/T6kHpZ2oReQ) to review how to use
Excel to make bar charts from frequency or
descriptives output. In the example, data
from four separate, unattractive SPSS frequency charts
are combined into a single figure that compares the
percentage of men, women, Republican, and
non-Republican voters who voted for Donald Trump in
2016. Important: These combined Excel charts are the kind of figures that should be in your thesis and presentations rather than SPSS-generated figures.
(1) The fastest way to generate statistics for different subgroups is to use the split file option described above. If you want to compare mean values for different groups (say, average years of education), split the file on the grouping variable (say, men vs. women, as in the example video) and run Analyze -> Descriptive Statistics -> Descriptives. If you want to compare the percentage of each group that did something, run Analyze -> Descriptive Statistics -> Frequencies. For the latter, you want to use the "valid percent" response data in your Excel chart.
(2) Enter the data in Excel so that the auto-generated chart (Insert -> choose bar chart) will be able to see how the data is organized with respect to the labels. In the example in the video, we have something like this (note that the partisanship and gender labels are each centered in "merged cells" that span the group categories under them).
Using bivariate correlation analysis to look at the association between two variables and to identify potential problems with multicollinearity later on.
Important: Every student's early presentations will need to include a correlation matrix that includes each independent and (if you have them) control variable. You will not include a correlation matrix in the final presentation or paper, and you should see Dr. Setzler before analyzing the table's results in your papers or talks. The matrix can be an SPSS results screenshot rather than a formatted table. This is an in-class verification step to make sure that you do not have two or more independent variables so closely correlated that you will have problems in regression analyses (more on multicollinearity below).
Use the SPSS command Analyze -> Correlate -> Bivariate to look at the association between any two interval variables (including dummy variables). If you have little idea what correlation or a best-fitting/regression line is, take a look at this brief summary or watch the first few minutes of this screencast (https://youtu.be/X2cbmF-SR3I; 14min 05sec).
Print out and review this one-page handout of annotated SPSS output for correlation.
Here is a summary of key ideas to know about what a bivariate correlation statistic is and does:
Note that you will not need to summarize any bivariate correlations in your presentations or thesis findings papers.
The main reason we will be reviewing and using correlation analysis is to identify any independent variables in your project that are so closely associated that a regression model may not work well for these variables. Also, many of the concepts involved with correlation apply to the regression statistics you will be using in your projects.
Remember that a correlation test measures how consistently, and in which direction, but not how much, an increase in one dummy/interval variable predicts an increase or decrease (if the correlation is negative) in the value of another dummy/interval variable. Note: if we want to know how much an increase in one variable, by itself, corresponds to an increase/decrease in the value of another variable, we would need to run a regression analysis.
Association measures, including correlation coefficients, also come with a significance statistic, reported as a p-value. We use the p-value to determine whether an association uncovered in our sample would be reliably found in repeated samples of the larger population or could easily be due to chance. With a significance statistic SMALLER than .05, we can say with confidence that there is a meaningful association; if the significance statistic is LARGER than .05, we cannot be confident that we would consistently find an association between our variables in repeated sampling of the larger population.
As for the size of the correlation coefficient itself:
<.10 means that there is a very weak association between the variables;
.20 can be interpreted as a meaningful but modest association;
.30 is a moderate association; and
>.40 is a strong association.
But, in every case, you need to put correlation findings into context (e.g., a .40 association between being Republican and being conservative would be a much weaker finding than you would anticipate, so it wouldn't make sense to refer to this scenario as evidence of a very strong association). Most association statistics range from -1 to 1, and a negative statistic means that increased values on one variable are associated with declines in the value of the other variable.
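For a concrete sense of what a correlation coefficient summarizes, here is Pearson's r computed by hand in Python; the x and y values are invented, and SPSS's Analyze -> Correlate -> Bivariate reports this same statistic:

```python
def pearson_r(x, y):
    """Pearson correlation: covariance over the product of the SDs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Invented values where y tends to rise with x, but not perfectly
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 6, 5, 8, 7]

print(round(pearson_r(x, y), 2))  # 0.9 -> a strong positive association
```

Note that r tells you the consistency and direction of the relationship; it would take a regression slope to tell you how much y changes per unit of x.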
Correlation and regression make some assumptions about the structure of the involved variables--specifically, they must have a linear relationship. Measuring two variables' correlation--whether with bivariate correlation or some type of linear regression--means that you are assuming that the relationship between one variable and another is linear. This means that you think that an increase in the value of one variable consistently corresponds to the same amount of increase or decrease in another variable. It is important to remember that two variables can be closely related to one another without the relationship being linear. As an example, age and income from work are related, but the relationship is curved; each year of life after, say, 16, typically corresponds to an increase in wages up to a certain age, let's say 60, and then wage income typically declines with each additional year of age. Similarly, in saving for retirement, your investments at a young age will generate some returns, but if you stick with it, the gains will be much greater later on because the relationship between time and savings is exponential, not linear. In short, if you have reason to think that the relationship between your independent and dependent variable isn't linear, see me for assistance; there are statistical ways to handle non-linear relationships.
You may have heard somewhere that "correlation is not causation." It's correct. Correlation (or regression, for that matter) can't tell us that an increase in one variable is causing an increase in another variable if it is plausible that the reverse is true (i.e., y causing x is perhaps as likely as x causing y). It is our theory that guides what we think the independent and dependent variables are.
Some key ideas to know about multicollinearity
What exactly is the problem with multicollinearity? Imagine that you think that one of two men who are close friends is committing a certain type of fairly distinctive crime in the evenings. He clearly isn't the only one committing this type of crime, because there have been incidents where neither he nor his friend was present. Your suspect and his friend hang out together a lot, and when they've been in an area at night, there frequently is evidence the next day that the criminal act has occurred. You would like to pin the blame on your main suspect, but you have a problem. While you believe that your main suspect has been committing all of the crimes and his friend hasn't been committing any of them, it's also possible that both men have been committing these crimes, taking turns. Or they both could be doing the crimes together. Or it may be that you've identified the wrong guy entirely and it's the second man doing all of the crime that happens when they are together. To reliably test your hypothesis that it is just the first man committing crimes, ideally you will have lots of instances where only one of the two was in an area on a given night, so you can see whether a crime happened that night. If they are such good friends that you don't have very many of these observations, you may have to determine who is guilty by using just a few observations (something we try to avoid with statistics) or by splitting responsibility and assigning the crime to both of them equally whenever both were present (which could be flat out wrong). That's what multicollinearity is, and it is why we look at how correlated our independent variables are with each other.
How do we know how much each independent variable is correlated with each other independent and (if we have them) control variable? Before you run any regression model, you should use Analyze -> Correlate -> Bivariate to create a correlation matrix that includes all of your independent variables.
If two independent variables or any pair of independent and control variables have a correlation higher than .7, you want to think very carefully about interpreting these variables' results in a regression model, especially if the results look different than what you would expect. If you have a very large sample--let's say 4000 observations or more--even with two quite highly correlated variables, you are still going to have lots of observations where the variables are not moving in the same direction all of the time. However, in a smaller sample of 1500 observations or fewer, you need to consider the possibility of multicollinearity distorting your results for the highly correlated independent variables.
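The .7 screening rule above can be automated: compute r for every pair of independent variables and flag the high pairs. A sketch with invented 0/1 and interval data (pearson_r here is an ordinary hand-coded Pearson correlation):

```python
from itertools import combinations

def pearson_r(x, y):
    # Ordinary Pearson correlation, written out by hand
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Invented independent variables; conservative and republican agree
# on 7 of 8 cases, so they are highly correlated
ivs = {
    "conservative": [1, 1, 1, 0, 0, 1, 0, 0],
    "republican":   [1, 1, 1, 0, 0, 1, 0, 1],
    "age":          [18, 45, 30, 60, 22, 35, 50, 28],
}

# Flag any pair whose correlation exceeds the .7 rule of thumb
for (n1, v1), (n2, v2) in combinations(ivs.items(), 2):
    r = pearson_r(v1, v2)
    if abs(r) > 0.7:
        print("possible multicollinearity:", n1, n2, round(r, 2))
```

In a real project you would read these pairs off the SPSS correlation matrix instead; the logic of the check is the same.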
If you have two independent variables that are very highly correlated (.80 or higher), unless you have a very large dataset like the CCES, you will either need to combine them (perhaps into an additive index) or eliminate one of the variables, because the two measures are effectively capturing the same thing even if it is not clear why. We will talk specifics later on when we turn to interpreting regression models, but for now just note that the decision to drop a variable to avoid multicollinearity (let's say that being conservative and being Republican are very highly correlated, which is the case these days) typically is based on a combination of its relative theoretical relevance and how removing specific variables changes the explanatory power of the model as a whole (i.e., its r-squared statistic).
Week 10
(October 22, 24): Student presentations
This week, you will give your first practice presentation. It should run a minimum of 7 minutes and no more than ten.
The main purpose of this presentation is for you to get feedback on how the different parts of your thesis are coming together and for me to assist you in identifying any variable coding problems.
When you present, you will need to present and turn in a descriptive statistics table listing, in order, your dependent, independent, and any control variables. All of these variables should be coded as either interval or 0/1 dummy variables.
Your presentations will not be graded, but your preparation and engagement while other students are presenting should be consistent with the professionalism/participation grade you hope to earn. Note that everyone must be present, regardless of whether you are giving a talk that day or not. You will have a graded preliminary presentation eventually, but that will be toward the end of the semester.
You need to use a PPT (or similar program) for your presentation. For an idea of what your presentation PPT should look like, take a look at this sample presentation (I have removed the parts covering statistical techniques that we will learn about later). It was prepared for an academic conference where Prof. Setzler presented findings from a project you have been asked to read about several times. As you can see in the sample presentation, you want short summary points on the PPT rather than long blocks of material that you will read to the audience. It is important to use a consistently formatted PPT because it keeps you organized and allows you to look up at an outline on the big screen, so you aren't staring at notes and instead look like you are guiding your audience.
You should address the following topics:
Introduction:
Quickly, what is your main research question?
Quickly, why is it interesting/worth answering?
Concepts and theory. Based on previous academic scholarship:
Quickly, what do we already know about your topic, and what do we still need to know?
Data, measurement, and hypotheses:
What do you anticipate the answer/s to your research question will be? Specifically, you want a set of clear, concise hypotheses.
Briefly, what dataset are you using? Where and when did it come from? Is there anything special that we need to know about the sample? Make sure to review the sample presentation to look at what you need here.
Concisely, how exactly are you measuring/coding your dependent variable (what you want to explain) and independent variable/s (the things you think influence the outcome of your dependent variable)?
Also, note very briefly your control variables (other influences that are of secondary or no importance to the study but that likely influence the dependent variable/s)
Preliminary findings:
You must have a descriptive statistics table, which needs to include the number of respondents for each item in your study as well as each variable's minimum and maximum values (this requirement helps me to identify coding issues). Later on, your descriptive statistics table should look like what you see in research articles (i.e., created in a word processing program rather than being SPSS output, like this example). For the first practice presentations only, SPSS output is fine (let's make sure that you've got the right variables in there before you spend a lot of time formatting). You do not need to explain this table. Show it, say, "Here is some summary information about each of the variables in my study. Let me give you a minute to look at it," and count out 10-15 seconds mentally so your audience has enough time to look over the table.
Please note: if your primary focus is to compare two or more groups (e.g., women vs. men or people of different ethnicities), you may combine your descriptive statistics and bivariate analyses tables (and only the tables). For example, if you were looking to see whether men and women have different levels of patriotism, you could generate summary statistics (i.e., means) for men and then for women across all of your variables.
You need to have at least one bivariate analysis, specifically, a bar chart that you must create in Excel. The chart should show us the relationship between one or more of your independent variables and your dependent variable before controlling for any other variables that may influence the relationship.
Conclusion:
Very briefly, tell us what you are going to do next. For this first presentation, that can be a sentence that says, "The next step in this study is going to be to use regression analysis to get a better sense of how important each of my independent variables is in explaining X and whether what we are seeing in these bivariate analyses holds up once we isolate the influence of each of my independent variables."
Weeks 11/12 (10/29,
10/31, 11/5, 11/7): SPSS workshop on multivariate
analysis.
During these two weeks, we will continue to review
using SPSS to analyze different types of variables and
their relationships with one another. Specifically, we
will be using linear and logistic regression analysis to
generate the statistics that social scientists typically
use to explain and compare the influence of several
different variables on some type of attitude or
behavior.
On either Tuesday 10/29 or Thursday 10/31 (a week after your presentation), please submit electronic and hard copies of Thesis assignment 4: a draft of the thesis section describing your hypotheses, data, variables, and methodology (which requires you to attach a revised version of your codebook and a descriptive statistics table with information about each of your study's variables).
After we have covered
linear and logistic regression in class, a
BlackBoard assignment on regression analysis
will be due Monday 11/4.
Here are a set of resources for the statistical
concepts and SPSS methods we cover in the workshop. Please
note that you are asked to read or watch some of these
materials ahead of class:
The first block of materials provides an overview of doing linear (AKA
OLS) regression analysis with SPSS. Most
senior seminar papers will rely on logistic rather
than linear regression; however, the materials
on logistic regression assume that you understand the
core concepts of linear regression.
Important: It is very unlikely you will need to include any linear regression in your thesis because thesis students typically use dichotomous dependent variables.
So, why review linear regression? You will need to know how to compute and interpret linear regression output to do the workshop assignment. We are reviewing the type of regression used with a continuous (interval) dependent variable because it is easier to understand several key regression concepts if you first learn about them using bivariate and multivariate linear regression. These ideas include: unstandardized regression coefficients, their slope, and their statistical significance, as well as the measure of overall model fit (r-square), the line of best fit, standardized coefficients (betas), and how dummy variables and interactive terms work.
Ahead of our meeting on linear regression, read and print out this handout of SPSS linear (aka OLS) regression output with annotations (the variables are the same ones discussed in the screencast below, which looks at using linear regression to measure the influence of different variables on a person's level of support for torture (1 = "never justified"; 7 = "always justified")). The annotations remind you how to interpret R-square, a model's constant (aka its intercept), unstandardized regression coefficients, and standardized coefficients for linear regression models.
After our meeting/s on linear regression, you have the option of watching this screencast (https://youtu.be/X2cbmF-SR3I; 14 min 05 sec). It doesn't cover any SPSS work; instead, it reviews the basic concepts behind correlation, regression, and multiple (aka multiple linear or OLS--ordinary least squares) regression. The first five minutes explain bivariate correlation (i.e., how consistent a positive or negative association between two variables is), while the next five minutes look at how we measure the typical impact of one variable on another (i.e., the regression slope). The final five minutes or so go over multiple regression (i.e., how we simultaneously look at the influence of multiple variables).
After our meeting/s on linear regression, you also have the option of watching this screencast (https://youtu.be/xzl8OxPsM8s; about 11 min), which walks you through the process of using SPSS to calculate and interpret the output of a linear regression model (the model looks at a 7-point measure of support for torturing terrorism suspects to gain information). (Note: At the start, the screencast references a longer version of the video that has since been pulled from my website.)
What to remember from the screencast (but with made-up, extended examples to illustrate how different kinds of coefficients are interpreted):
(1) When you have a continuous/interval dependent variable and want to use linear regression, use this command: Analyze -> Regression -> Linear. You will then need to identify your dependent variable and all independent and (if you have them) control variables. SPSS doesn't know or care whether you see a variable as being a control or independent variable, so put them in the model together (but in the same order as you have them in your descriptive statistics table).
(2) Really important: If your model is going to include dummy variables, carefully think through what your reference categories will be as you select the model's independent variables! Remember, if you use dummy variables, you must omit one category to serve as the reference category. If you put in a dummy variable for males, the reference category in your output statistics will be non-males. If you put in dummy variables for both males and non-males, you do not have a reference category, and SPSS will drop the last dummy gender variable you entered as it does its calculations, even though this may not be the variable you wanted to use as a reference category. Finally, if you put in dummy variables for both Republicans and Democrats, your reference category will be people who are neither Republican nor Democrat, which may not be the comparison you are looking for if, for example, you have hypothesized that Republicans will be the partisans most supportive of using torture on terrorism suspects. If you are making that hypothesis, your model should include dummy variables for independents and Republicans so that your results will show you how different each of these groups is from the reference category (Democrats) and whether those differences are statistically significant.
(3) Once you have your results, you will want to focus on two sections of the output. The first step in interpreting regression output is to figure out how well the variables in the model collectively explain variation in the dependent variable. To assess a model's "fit," use the adjusted R-square statistic from the SPSS output. An example of interpretation: If a model's adjusted R-square is .365 and we are predicting how much a person supports torturing terrorism suspects on a 1-7 point measure, we would say that "the variables in this model collectively account for about 37 percent of the variation in how much people support torture." Alternatively, depending on what hypotheses we were testing, we might interpret the same statistic by saying, "Most of what predicts the level of support for torture--nearly two-thirds of the explanation--is due to factors not considered by this model" [1 - .365 = .635].
(4) Then, go to the coefficients table at the end of the output, and look at each independent variable to determine whether it is statistically significant. With a regression analysis, the statistical test assesses whether repeated sampling would find that the influence of a given variable is zero or runs in the direction opposite of what the coefficient's sign says (a minus sign for a negative relationship and no sign for a positive one). So, the column labeled "Sig." lists the probability that a given coefficient is not actually different from zero or that the relationship is signed in the wrong direction. By convention, we want this probability (aka "the p-value") to be less than .05. If a variable's p-value is greater than .05, we will not interpret that coefficient because we aren't sure the variable has a real effect on the dependent variable.
(5) Next, we want to know how increases in each independent variable are predicted to influence the estimated value of the dependent variable when all other variables are held at a constant influence. To do so, we look at the unstandardized coefficients (listed in the "B" column) and interpret each independent variable that is statistically significant. (We typically will have little to say about control variables, if we have any, because they are in the model solely to make sure we have isolated the influence of our independent variables.) For example, suppose that we are trying to explain what variables predict a person's level of support for torturing terrorism suspects on the seven-point measure. If we had unstandardized coefficients of .451** for our dummy variable Republican (assuming it is the only partisan variable in the model) and -.134* for edu4 (measured as a four-unit, interval variable), a suitable interpretation of these results would be:
"Compared to non-Republicans (the reference category), Republicans' level of support for torture was around half a point higher on the seven-point scale, controlling for the influence of other variables. On the other hand, each increase in education modestly decreased support for torture. Respondents with the highest level of education had around a half-point lower level of support when compared to the least educated" (i.e., each level of education reduced a person's score by .134; going up three units--from 1 to 4--is equal to 3 x -.134 = -.402). Note that we always need to look carefully at what the reference category is when using dummy variables. If our model had included dummies for both Republicans and Democrats, the comparison here would be to independents, since they were the only partisan group not in the model.
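The arithmetic behind that interpretation is simple enough to check by hand, but here it is as a short Python sketch (the coefficients are the made-up values from the example above, not real output):

```python
# Hypothetical coefficients from the torture example in the text.
coef_republican = 0.451   # dummy: Republican vs. non-Republican (reference)
coef_edu4 = -0.134        # per one-unit step on the four-point education scale

# Gap between the most and least educated respondents (edu4 = 4 vs. edu4 = 1):
education_effect = (4 - 1) * coef_edu4
print(f"Republican vs. non-Republican: {coef_republican:+.3f} points on the 7-point scale")
print(f"Least to most educated:        {education_effect:+.3f} points on the 7-point scale")
```

Notice that a three-step change in education roughly cancels out the Republican effect; that kind of comparison is exactly what these interpretations are for.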
(6) As a last step, we will want to consider how the predictors rank in their influence on the dependent variable. What factors most determine its value? We determine this by comparing the variables' "standardized" coefficients, which are listed in the "Beta" column of the results. These statistics each measure how many standard deviations the dependent variable increases or decreases with each one-standard-deviation increase in the applicable independent variable (e.g., going from having an average level of education to roughly the 84th percentile). Most often, these statistics will be less than one, indicating that a one-standard-deviation increase in a given independent variable corresponds to less than a one-standard-deviation increase or decrease in the dependent variable's value.
For example, when predicting support for torture--as the model does in the screencast--a one-standard-deviation increase in racism produces a much larger increase in support than a one-standard-deviation increase in religiosity or education. In interpreting a model that included these variables, we might say, "The standardized coefficients for the model indicate that the most powerful predictor of how much a person supports torturing terrorists is that individual's level of racial animosity."
The
default regression table shows how a one-unit
increase in a given independent variable changes
the value of the dependent variable when the
effect of all of the other variables is set at
each variable's mean value. While that
makes for nice tables, discussing specific
scenarios can help to bring the data alive.
In a regression
scenario, we change the values of one or more
variables in the standard regression equation to
something other than their means. For
example, what is the level of support for torture
on a seven-point scale for a 60-year-old
Republican male who attends church a lot versus a
30-year-old female secular non-Republican, holding
other variables constant at their means? To
calculate the expected level of torture for the
first individual, we would use the following
formula:
The model's unstandardized constant (i.e., the
expected level of support for torture when all
other variables in the model have a value of
zero, including variables whose range doesn't
include zero)
+ (60 times the unstandardized regression
coefficient for AgeInYears)
+ (1 times the Republican coefficient)
+ (0 times the Female coefficient)
+ (6 times the ReligAttend6 coefficient)
+ (mean value of Edu5 x the Edu5 coefficient)
+ (mean value of VotedLastElection x the
VotedLastElection coefficient)
= The expected value of TortureOK7 for this
individual
And then you would calculate a second equation, changing only the scenario variables' values (but not the constant, the regression coefficients, or the means of variables not in the scenario) to fit the characteristics of the other hypothetical individual. So, to calculate the level of support for torture for the hypothetical woman described above, the equation's value 60 would be replaced by 30. And it would be 0 x the Republican coefficient, 1 x the Female coefficient, and 1 x ReligAttend6.
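The two scenario equations above can also be sketched in a few lines of Python. This is purely an illustration: the constant, the coefficients, and the sample means below are all invented, and your own SPSS output supplies the real numbers.

```python
# A sketch of the scenario equations: constant + sum of (value x coefficient),
# with any variable not named in the scenario held at its (invented) mean.
constant = 2.10
coefs = {"AgeInYears": 0.010, "Republican": 0.451, "Female": -0.120,
         "ReligAttend6": 0.080, "Edu5": -0.110, "VotedLastElection": 0.050}
means = {"Edu5": 2.9, "VotedLastElection": 0.75}  # variables held at their means

def expected_torture7(scenario):
    """Predicted value of TortureOK7 for one hypothetical individual."""
    values = {**means, **scenario}
    return constant + sum(coefs[name] * values[name] for name in coefs)

older_repub_male = {"AgeInYears": 60, "Republican": 1, "Female": 0, "ReligAttend6": 6}
younger_secular_woman = {"AgeInYears": 30, "Republican": 0, "Female": 1, "ReligAttend6": 1}

print(round(expected_torture7(older_repub_male), 2))
print(round(expected_torture7(younger_secular_woman), 2))
```

The only thing that changes between the two calls is the scenario values themselves, which is exactly the point made in the text: the constant, the coefficients, and the means of non-scenario variables stay fixed.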
After class and
reading through the example I just gave,
watch this optional
screencast if you need more guidance on the process of using
linear regression output and some basic math
to calculate the expected value of your
dependent variable under different scenarios: https://youtu.be/3m66P8PaD3U. You also
might want to review and print out this resource
for future reference: a handout with the output and
calculations covered in the video.
Pro-tip: If you are using linear regression
in your thesis and have many independent
variables, it's a lot faster to calculate
scenarios with the assistance of Excel. You
can copy and paste in regression and descriptives
results directly from SPSS output and then use
Excel formulas to do all of the math. Here's an Excel spreadsheet
with every variable set at its mean. Notice that
you can just toggle variables to different values
to create all kinds of scenarios. And here's
a screencast on using the spreadsheet: https://youtu.be/A10yOJleGNw
(Again, there's no reason at all to watch this
screencast if you are using logistic regression in
your thesis; there is a separate spreadsheet and
screencast specifically for logistic regression
scenarios below).
This block of materials provides help with logistic regression analysis, which is used when your dependent variable is dichotomous. Most senior seminars involve this type of analysis. Even if you are only using this type of regression, you should review the previous block of material on linear regression because this method is an extension of those concepts.
Ahead of our meeting on logistic regression, please take the time to read and print out: this handout of SPSS logistic regression output with annotations. The document is very similar to the one I posted above for linear regression, but this time the dependent variable--support for torturing terrorism suspects to obtain information--has been recoded into a dummy variable. Respondents who said torture was sometimes, often, or always justified were coded "1"; individuals who said torture is never justified were coded "0."
And make sure that you print out a copy
of this one-page document,
which is a sample of what a logistic regression
table in a research paper
for HPU political science classes should look
like. The first column in the table
shows you how to write up the results for the same
model reported in the annotated SPSS output above.
The second and third columns summarize the results
(made up for the purpose of this exercise) of two
additional regression models that the author has run
separately, first on a male-only sample and then for
women. Putting the three regression models into a
single table is way more efficient than pasting raw
SPSS output with lots of irrelevant information, and
displaying the regression models side-by-side allows
the author to see and discuss the different effects
that his independent variables have on support for
torture among men and women. If you have multiple dependent variables or are looking at regression models for different groups (e.g., one regression model for women and one for men), put the regression model results into the same table.
After attending the class where we talk about
logistic regression, if you feel like you
need to review more of the basics about why we need
to use a special kind of regression to predict the
likelihood of dichotomous outcomes, watch this
12 min. screencast
(https://youtu.be/uUf3h8ifZxE)
that explains in detail what bivariate
logistic regression is, how it works, and why it
is often used with ordinal dependent variables
after they have been recoded into 0/1 dummy
variables. This screencast covers concepts; the next
one will look at how we use SPSS to run and
interpret logistic regression models.
Once you are sure you understand what logistic regression is, you can review this 15-minute video (https://youtu.be/78VreRsq5XY) on running and interpreting the output of logistic regression in SPSS.
Helpful hints from the video (but with a completely made-up example to illustrate how coefficients with different values should be interpreted). While the video looks at what causes someone to agree that "generally speaking, men make better political leaders than women," in the example below I discuss the results from a model used in a paper I co-authored about who intended to vote for Donald Trump a month out from the 2016 presidential election:
(1) How to run this type of regression. When you have a dichotomous dependent variable and want to use regression to predict how changes in an independent variable increase/decrease the likelihood of an outcome, you use this method:
SPSS: Analyze -> Regression -> Binary Logistic
(2) Really important: If your model includes dummy variables, carefully think through what your reference categories will be! (I included the same note above for OLS regression but repeat it here because students often struggle with this concept.) If you put in a dummy variable for males, the reference category in your output statistics will be non-males. If you put in dummy variables for both Republicans and Democrats, your reference category will be people who are neither Republican nor Democrat, which may not be the comparison you are looking for if you have hypothesized that Republicans will be the most supportive of torture. If you are making that hypothesis, your model should include dummy variables for Democrats and independents so that your results will show you how different each of these groups is from the reference category (Republicans) and whether that difference is statistically significant.
(3) You will ignore most of the SPSS output, going straight to the "pseudo" R-square statistic first. This statistic tells you how well the variables in the model as a whole explain the "likelihood" of the outcome you are predicting. Report just one of the two R-square statistics that SPSS lists in the model results: the Nagelkerke statistic is the most like OLS regression's R-square, so use it. Note that researchers typically label this statistic "pseudo-R-square" when reporting the results of logistic regression because it is not actually the mathematical square of a Pearson's R (correlation) statistic, the way it is for linear regression.
Here's an example of how to interpret a pseudo-R-square:
Let's say the model of who intended to vote for Donald Trump (yes or no) gave us output with a Nagelkerke R-square of .142. In writing up our findings, we might say, "The model's pseudo-R-square (Nagelkerke) indicates that the predictors [i.e., the independent variables] in the model collectively account for just over 14% of the variation in whether or not someone intended to vote for then-candidate Donald Trump." Another way to interpret the same results would be: "While numerous studies have suggested that each of the characteristics in the model was an important predictor of who voted for Trump in 2016, the model's pseudo-R-square (Nagelkerke) statistic indicates that over 85% [i.e., 1.0 - .142 = .858] of the factors that led only some individuals to vote for Trump lie beyond the indicators examined here."
(4) Then, go to the coefficients table to look at the odds ratios, which tell you how much a one-unit increase in each independent variable changes the "likelihood" of the outcome you are predicting when all other variables are held at a constant influence. Odds ratios are listed in the Exp(B) column of the SPSS output. An odds ratio higher than 1.0 indicates that increases in the value of that independent variable increase the likelihood of the outcome predicted by the model.
Here is an example of three factors that were significantly correlated with supporting Donald Trump in 2016; however, I am making up their coefficients to review how odds-ratios falling into three different value ranges are interpreted:
Let's say we were predicting whether someone intended to vote for Trump and our results had an odds ratio [Exp(B)] of 12.321 for a dummy independent variable identifying Republican respondents (assuming it is the only partisanship variable in the model).
For our second variable, our output reports an odds ratio of 1.137 for edu4, a four-point measure of educational attainment. (Here, we assume that it is an interval variable; if we didn't want to assume that each one-unit increase in education has the same effect on the dependent variable, we would need to recode that variable into a series of dummy indicators, say "more than high school," "college degree or more," and "advanced degree," leaving "high school or less" as our reference category.)
Finally, we have a third variable--Torture7--created from a 7-point item asking how much respondents agreed with the statement that "torture should be used to obtain information from suspected terrorists." For this variable, our output reports an odds ratio of .911.
To recap, a truncated version of our output for this example reads:
Variable      Exp(B)   (Make sure you get the right column's data!)
Republican    12.321
Edu4          1.137
Torture7      .911
A suitable interpretation of the odds-ratio results would involve the following steps.
First, consider how to interpret odds ratios with a value between 1 and 2. Odds ratios in this range indicate a positive relationship whose effect can be stated as an x-percent increase in the likelihood of the outcome occurring. For this example, you would say: "Each increase in education on a four-point scale increased the likelihood of voting for Trump by about 14%" (i.e., 1.137 - 1.0 = .137, or 13.7%). While it would be correct to say that each one-unit increase in education made a Trump vote 1.14 times as likely, it is clearer and more elegant to express the same finding as a percentage increase.
For odds ratios greater than 2, the interpretation typically is expressed as an x-times increase in the likelihood of the outcome occurring with every one unit of change in the independent variable. In this example we could say, "Compared to all other respondents, the typical Republican was over 12 times as likely to vote for Trump, after taking the influence of other variables into account."
Odds ratios lower than one indicate a negative relationship. Students sometimes struggle with odds ratios between zero and one because the odds ratio is still a positive number. Think about what you would say if you started with a dollar and, after a bet, have 60 cents: you could say that you now have only 60% of what you had, or that you have 40% less than what you had. In the example output, you would say, "Being more supportive of torture made a person less likely to support Trump. On its seven-point scale, each additional level of support for using torture corresponded to a 9 percent decrease in the likelihood of voting for him" (i.e., 1 - .911 = .089).
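The percentage interpretations above are just arithmetic on the odds ratio; here is the same arithmetic as a short Python sketch, using the made-up coefficients from this example:

```python
# Check the odds-ratio arithmetic from the invented Trump-vote example.
def pct_change(odds_ratio):
    """Percent change in the likelihood of the outcome per one-unit increase."""
    return (odds_ratio - 1.0) * 100

print(f"edu4 (1.137):    {pct_change(1.137):+.1f}% per unit")
print(f"Torture7 (.911): {pct_change(0.911):+.1f}% per unit")
# Republican (12.321) is best reported as 'over 12 times as likely'
# rather than as a percentage change.
```

Note that this one-unit percentage cannot simply be multiplied for multi-unit changes; the compounding needed for that case is covered next.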
(5) Remember: odds ratios lower than 1.0 indicate that increases in the value of an independent variable decrease the likelihood of the outcome. These results generally are reported as percentages, calculated by subtracting the odds ratio from one. An example: Assume we are predicting whether someone intends to vote for Trump and we run a model that switches out the dummy variable Republican for the new dummy variable Democrat. Let's say our results have odds ratios of .101 for the dummy variable Democrat (again assuming it is the only partisanship variable in the model) and .913 for edu4 (as defined above). A suitable interpretation of these results would be:
"Compared to other respondents, Democrats were about 90% less likely to vote for Trump [i.e., 1.0 - .101 = .899] after taking the influence of other variables into account. Increases in education also decreased the likelihood of supporting him. Respondents with the highest level of education were approximately 24% less likely to vote for Trump than persons with only a high school degree or less schooling."
For the new Democrat variable, another way to say the same thing using the odds ratio of .101 would be: "Compared to other respondents, Democrats were approximately 90 percent less likely to vote for Donald Trump, controlling for the other predictors in the model."
To calculate the statistic in the statement about education's influence, we first need to figure out what the first unit of change does (going from edu4 = 1 to edu4 = 2) and then compound that effect to take into account the two additional units of change that separate people with the lowest and highest levels of education. In other words, we need to solve this problem: .913 (i.e., 1 - .087) raised to the third power. If you aren't mathematically inclined, Google will do this for you; just search for "(1-.087) to the third power". The answer is .761, or 76.1 percent. So, 100 percent - 76 percent = 24 percent, which is the statistic noted above.
How do we get to this statistic again? Going from 1 to 2 on edu4 makes a respondent 91.3% as likely to vote for Trump. Then, those with a score of 3 on edu4 are just 91.3% as likely to vote for Trump as those with a score of 2 (i.e., .913 x .913 = .834). Finally, those with a score of 4 on edu4 are just 91.3% as likely as those with a score of 3: 91.3% of 83.4% as likely (i.e., .913 x .834 = .761). Of course, this is the same result that you get by calculating (1-.087)^3, which Google will compute for you if you search for: (1-.087)^3.
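The compounding can also be checked in a couple of lines of Python (using the same made-up .913 odds ratio for edu4):

```python
# Moving from edu4 = 1 to edu4 = 4 applies the per-unit odds ratio three
# times (.913 is the invented Exp(B) from the example in the text).
odds_ratio = 0.913
units = 3  # from the lowest (1) to the highest (4) education category

remaining = odds_ratio ** units          # share of the original likelihood left
pct_less_likely = (1 - remaining) * 100  # reported as '% less likely'
print(f"{remaining:.3f} of the original likelihood, i.e. about {pct_less_likely:.0f}% less likely")
```

This reproduces the 24 percent figure from the write-up and makes clear why you multiply rather than add the per-unit effects.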
(6) If appropriate for your hypotheses, consider how the predictors rank in how consistently they influence the dependent variable. And here's something that is not covered in the video but may be useful to think about: there is some debate about how best to rank the relative influence of each variable in a logistic regression model, but one common approach is to compare their "Wald scores," much as we compare "betas" in linear regression. For logistic regression, this approach tells us which variables most consistently predict the outcome, but not the magnitude of their influence, the way a standardized coefficient does in linear regression.
Calculating scenarios from logistic
regression results: As with linear
regression, interpreting
logistic regression analysis in an interesting
way is best done with scenarios. For the thesis and
oral presentations involving regression, students who have run logistic regression
models should take the time to see how much a minimum-maximum-change in
each independent (but not
control) variable changes the probability
of believing or doing what your dependent variable measures.
For linear regression, creating scenarios is a straightforward process because it is easy to use Excel to calculate specific scenarios from SPSS output (see the example above). Unfortunately, you cannot create scenarios from raw logistic regression output without doing additional mathematical transformations.
With logistic
regression, scenarios involve calculating the
predicted probability of doing or believing
something at a given value of the independent
variable, with the effect of all other variables
held constant at their mean.
What is a predicted probability, and how is it different than an odds-ratio, the latter of which is listed in SPSS's default output? To answer this question, it helps to first understand that odds and probabilities are different ways of conveying the same information, and you can mathematically transform any odds statistic into a probability. Specifically, the odds of an outcome refers to the number of times the outcome is expected to occur compared to the number of times the outcome is expected to not occur. Probability is the number of times an outcome is expected to occur compared to the maximum number of times the outcome could possibly occur.
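The conversion between the two framings is simple arithmetic: odds of X to 1 correspond to a probability of X/(1+X), and a probability p corresponds to odds of p/(1-p) to 1. As a minimal sketch (in Python, purely for illustration; the course itself uses SPSS and Excel):

```python
# Converting between odds and probabilities, as described above.
def odds_to_probability(odds):
    """Odds of X to 1 -> probability that the outcome occurs."""
    return odds / (1 + odds)

def probability_to_odds(p):
    """Probability -> odds of the outcome, expressed as X to 1."""
    return p / (1 - p)

# An outcome with odds of 1 to 1 happens half the time:
print(odds_to_probability(1.0))   # 0.5
# A 75% probability corresponds to odds of 3 to 1:
print(probability_to_odds(0.75))  # 3.0
```

Because the two functions are inverses, you can move back and forth between odds and probabilities without losing any information.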
The reason we go through the hassle of mathematically converting odds into predicted probabilities in logistic regression scenarios is that most people find probabilities much easier to understand. Per the results of a hypothetical scenario that will be posted below, the odds of a football team with two injuries losing their next game are .56 to 1, while the odds for a team with six injuries are 2.85 to 1. In other words, a team with six injuries should, on average, lose 2.85 games for every game it wins.
For most people, it makes a lot more sense to convey the same information in probabilities, which can be expressed as percentages: a team with two injuries has a 36% chance of losing, while a team with six injuries can be expected to lose almost 75% of the time.
An odds ratio tells us how much a one-unit increase in a given independent variable decreases or increases the odds of an outcome, which can then be converted into a predicted probability. An odds ratio of .600 says that each one-unit increase in the independent variable results in the odds of the outcome being .600 times what they were before the increase (i.e., 40% lower). An odds ratio of 1.20 says that each one-unit increase results in the odds of the outcome being 1.20 times what they were before the increase (i.e., 20% higher).
So, let's say you are planning to bet on a football team that has only a 20% chance of losing when it has no injuries. Another way of putting this is that the team has a 4/5 probability of winning. So, for every five games that team plays, we would expect it to win four and lose one, which means its odds of winning are 4 to 1. When you bet on this team, you might place a four-dollar bet. If the team loses, you lose your 4 dollars; if it wins, you get a dollar, plus your original four dollars.
What happens if your team
experiences injuries during the season? How will each
additional injury change the team's odds and probability
of winning?
We could run a binary logistic regression model, and its output might say that each additional injury a football team experiences increases the likelihood (specifically, the odds) of losing by 1.5 times. Using that odds ratio of 1.5, here is how the odds and probability of the team winning change as each additional injury increases the odds of losing by 1.5 times:
0 injuries, odds = 4/1 (probability of losing = 20%)
1 injury, odds = 4/1.5 (pr. of losing = 27.3%)
2 injuries, odds = 4/2.25 (pr. of losing = 36%)
3 injuries, odds = 4/3.4 (pr. of losing = 45.8%)
4 injuries, odds = 4/5.1 (pr. of losing = 55.9%)
5 injuries, odds = 4/7.6 (pr. of losing = 65.5%)
6 injuries, odds = 4/11.4 (pr. of losing = 74%)
7 injuries, odds = 4/17.1 (pr. of losing = 81%)
8 injuries, odds = 4/25.6 (pr. of losing = 86.5%)
9 injuries, odds = 4/38.4 (pr. of losing = 90.6%)
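If you want to verify the table, the same arithmetic can be sketched in a few lines of Python (the 4-to-1 starting odds and the 1.5 odds ratio are the hypothetical values from the example above):

```python
# Reproducing the injury table: each injury multiplies the odds of
# losing by the (hypothetical) odds ratio of 1.5.
base_win, base_lose = 4.0, 1.0  # 0 injuries: odds of winning are 4 to 1
odds_ratio = 1.5

for injuries in range(10):
    lose = base_lose * odds_ratio ** injuries
    p_losing = lose / (base_win + lose)  # losses expected per total games
    print(f"{injuries} injuries, odds = 4/{lose:.2f} (pr. of losing = {p_losing:.1%})")
```

Notice that although the odds of losing grow by a constant multiple, the probability of losing climbs more slowly as it approaches 100%, which is exactly why odds and probabilities must not be confused.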
The bottom line is this: odds ratios allow
us to communicate how each one-unit increase
in an independent variable influences the
likelihood of the outcome. However, most
people don't refer to odds in everyday
language, so it is a lot more intuitive if you
explain regression results by talking about
the probabilities of an outcome under
different scenarios.
And in order to calculate scenarios like those in the football example above, we first have to transform logistic regression's output for the relevant variables. While you can do these transformations in SPSS, doing so is very complicated (other statistical programs make it much easier). Fortunately, you can do all of the mathematical work in an Excel worksheet if you know which formulas to use, and a spreadsheet is a good place to manipulate different variable values to create scenarios anyway. Your instructor has put together an Excel spreadsheet that calculates predicted probabilities for you, depending on what scenarios you choose.
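For the curious, the transformation the spreadsheet performs is the standard inverse-logit calculation: plug the scenario values into the model's equation to get the log-odds, then convert the log-odds to a probability. Here is a minimal Python sketch of that calculation, using made-up coefficients rather than anything from actual SPSS output:

```python
import math

# Predicted probability from unstandardized logistic regression
# coefficients (the "B" column in SPSS output) and scenario values.
def predicted_probability(constant, coefs, scenario):
    """coefs and scenario are dicts keyed by variable name: coefs holds
    the B coefficients, scenario the values chosen for the scenario."""
    log_odds = constant + sum(coefs[v] * scenario[v] for v in coefs)
    return 1 / (1 + math.exp(-log_odds))  # inverse logit

# Hypothetical values, purely for illustration:
coefs = {"injuries": 0.405, "home_game": -0.7}
scenario = {"injuries": 3, "home_game": 0.5}  # other variables at their means
print(predicted_probability(-1.386, coefs, scenario))
```

The spreadsheet does exactly this: it multiplies each B coefficient by its scenario value, adds the constant, and runs the total through the inverse-logit formula.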
What to remember from the screencast:
(1) First, point-click-and-then-paste a logistic regression command that includes all of your independent and any control variables (plus your dependent variable). Don't run the command yet.
(2) Then,
point-click-and-then-paste a descriptives command into
syntax, using any random variable. When the command is
in your syntax, copy and paste to replace the random
variable with all of the independent and any control
variables in your regression model syntax (omit the
dependent variable). The point here is to make sure
that you can run a regression model and create a set
of descriptive statistics that will list variable
results in the exact same order.
Two things to note that aren't in the screencast. First, you will make things easier if you delete STDDEV from the descriptives syntax before you run it, because you will need output only for each independent and control variable's mean, minimum, and maximum values. Second, I ran the descriptives command and then logistic regression in the screencast, but it will be easier to find the results you need if you run the logistic regression first, because we are interested only in the very last block of its output.
(3) Select
and run both of the commands, and then open up the Excel spreadsheet that your
instructor has created to assist you in creating
logistic regression scenarios (for the record,
you can compute scenarios in SPSS or even using
Google or Wolfram Alpha, but it will be
much easier to use the spreadsheet I have created
for this purpose). That spreadsheet is in the PPT
folder and in one of the subfolders in the workshop
materials.
(4) Open the last part of the SPSS logistic regression output, specifically the block that lists the coefficients, by double-clicking on it. Copy and paste the unstandardized logistic regression coefficient (the one in the B column) for the Constant into the appropriate worksheet cell. Then, copy all of the other variable names together with their coefficients (the ones right next to the variable labels in the "B" column), and paste them into the worksheet columns that are labeled for this output.
(5) Now it is time to work on the scenarios portion of the Excel worksheet. Go back to the SPSS output and double-click on the descriptives output. Copy the means for all of the variables, and paste them into the worksheet column labeled "Scenario." We are doing this because we want to create scenarios in which some variables are set to certain values while the remaining variables are held at their mean values.
(6) Now, have some fun creating scenarios. There are two ways that scenarios are frequently used in research, and both of them appear in the paper you read earlier this term on what kinds of Brazilians voted for Bolsonaro in 2018, which is why I assigned that article. First, there are a couple of paragraphs in the article that compare hypothetical individuals who are similar in all ways except for a couple of characteristics, in order to show which variables had the most effect (partisanship and ideology) and which had a small effect (sharing Bolsonaro's illiberal views). Second, there are several bar charts showing how the probability that different kinds of Brazilians voted for Bolsonaro changed if an independent variable was at its lowest versus highest value. Those bar charts were created by using minimum and maximum values as scenarios for each variable while all other variables were held constant at their average values.
(7) Important and not mentioned in the screencast: If you want to create a scenario for a variable that is represented by multiple dummy variables and a reference category, enter zero for the other dummy variables in the model and one for the dummy variable you are looking at. If the group that you want to look at is the reference category (i.e., it wasn't included in the regression model), then enter zeroes for all of the other groups. For the example in the screencast, to determine the probability that a typical independent was going to vote for Hillary Clinton in 2016, the scenario needed zeroes for the variables Democrat and Republican while leaving the mean values for all other variables. Because the only partisan dummy variables in the model were Democrat and Republican, using zero for each of them returned the predicted probability of voting for the typical independent, with the scenario values for all other variables left at their means.
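To make the dummy-variable rule concrete, here is a hypothetical sketch in Python (the coefficients and the mean below are invented for illustration; they are not the values from the screencast):

```python
import math

# Hypothetical B coefficients for a model of voting for a candidate,
# with "independent" as the reference category for partisanship.
b = {"constant": -0.2, "democrat": 1.9, "republican": -2.1, "age": 0.01}

# Scenario for a typical independent: both party dummies set to zero,
# the remaining variable left at its (made-up) mean.
scenario = {"democrat": 0, "republican": 0, "age": 47.5}

log_odds = b["constant"] + sum(b[v] * scenario[v] for v in scenario)
prob = 1 / (1 + math.exp(-log_odds))  # inverse logit
print(f"Predicted probability for a typical independent: {prob:.1%}")
```

Setting both dummies to zero is what selects the reference category; to look at Democrats instead, you would set democrat to 1 and republican to 0.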
Week 12 (11/5, 11/7): Mostly SPSS lab time.
After we have covered
linear and logistic regression in class, a
BlackBoard assignment on regression analysis
will be due Monday 11/4.
On Tuesday, we will finish up any remaining material on logistic regression and predicted probabilities. The remainder of the week will be devoted to lab time for you to conduct statistical analyses.
You will have a separate BlackBoard assignment covering the interpretation of predicted probabilities from logistic regression as well as how to make bar charts using these probabilities. This assignment will be due Wednesday 11/6.
Thursday will be devoted to lab time for you to conduct statistical analyses for your own projects, similar to those you have completed during the SPSS workshop. If you have any major concerns about your project, this is a good time to see Dr. Setzler!
Looking ahead to the start of Unit 4:
Week 13 is when you will start presentations that
include all of your statistical results. See the next
unit in the course schedule for details.
To make it easier to find things, I have broken up
the assignments calendar into multiple units. The
material for the next part of the course can be
accessed by going to the
course homepage and following the appropriate
links.