SENIOR SEMINAR






Unit 3: Using SPSS to analyze datasets (materials for the other course units are accessible from the course homepage).

Materials you will find useful for this unit:

The course's big deliverables:


Thesis assignments (these may be modified as their due dates approach, so don't print them out way ahead of time):


Professional Development Assignments (these may be modified as their due dates approach, so don't print them out way ahead of time):


Some initial observations about this part of the course

  • Starting when we get back from break, a significant block of class time will be spent completing SPSS workshops that (re)teach you how to analyze different types of variables and their relationships with one another using SPSS. The materials you will need for the workshops are in a folder in the PPTs/Assignments folder that is linked from the course homepage (or they soon will be if they aren't ready yet). The workshop assignments will be uploaded in parts to Blackboard after we have reviewed the relevant materials in class. Keep in mind:

    • No one expects you to be an expert in either advanced statistics or SPSS going into this block of material. I know that it has been a long time since many of you have taken a research methods class or used SPSS, and it is not unusual to have some students taking PSC 2019 and PSC 4099 at the same time. You will have lots of practice computing, visually summarizing, and interpreting statistics before you will need to complete the major thesis assignments that require advanced statistical work.

    • All of the resources you will need to do the statistical work for your thesis are in or linked to this schedule. As you can see below, I have put a lot of details and all kinds of links to resources into this part of the course schedule. Fear not. All of the materials are listed here because I want you to have all of the resources you may need for your thesis located in one place.

    • I am aware that students have different learning styles, and the multitude of resources below reflects this understanding. Everything that you need to be able to do to complete the statistical work for your senior thesis is covered in screencasts and other materials that are linked below. And we will be reviewing almost all of these materials with hands-on exercises during the workshop.

  • Outside of class, you need to continue to make progress on your own project to meet several deadlines.

    • In week 10, the second week back from break, you are going to give your first presentation. The presentation is going to require you to summarize the front end of your thesis project and to demonstrate that you have coded your study's dependent, independent, and control variables correctly.

    • In week 11, a week after your presentation, you will need to submit Thesis assignment #4, which is a draft of your study's research methodology. If you would like to review an HPU student-authored sample of a descriptive statistics table and a well-written methodology section, see either of these papers:

Madison Deane '2024: "Fate, Randomness, and Economic Policy Attitudes." See pp. 7-9.

Maggie Selman '2024: "Beyond the Binary: How Knowledge from Religion, Science, and Personal Contact Influence Hostility Toward Transgender Rights." See pp. 7-10.


Week 8 is the mid-term break: No classes

Week 9 (10/15, 10/17)

Note: You will have an SPSS assignment due at the end of week 9 on Friday, October 18 at 5pm. It will be posted on Blackboard. This assignment will cover the material from the week's workshop.

The first week of the SPSS workshop
will cover several statistical techniques necessary to produce the type of "descriptive statistics table" that social scientists typically use to summarize the central tendency and distribution of their project's variables. You will need to use these skills to create a table that will be submitted as part of the presentation you will deliver during Week 10. At that time, you also will need to submit a draft copy of your codebook (which you will create with this template). If you would like a good example of a descriptive statistics table, see page 9 of Maggie Selman's paper, which is linked above. The same paper includes the study's codebook in Appendix A, which starts on page 21.

This week, we will review what a standard deviation is and what information it tells us. We will also talk about why categorical and ordinal variables typically are recoded so that every response (or subset of collapsed responses) becomes a dummy variable. Social scientists can and frequently do analyze categorical and ordinal dependent variables; however, we will not be covering those types of regression.

We also will review a couple of bivariate analysis techniques. Researchers often begin to look at the relationship between their independent and dependent variables with figures visually showing how different categories of their independent variables correspond to different values for their dependent variable/s in the sample they are working with. So, we will practice using SPSS in combination with Excel to create bar charts showing how different independent variable groups are distributed on a dependent variable. As you will see in the research article you will be re-reading for this week (the one on why so many Brazilians voted for an illiberal presidential candidate), researchers also sometimes combine bar-chart figures with confidence intervals to both show and measure the relationship between their independent and dependent variables. You will not be required to create this type of figure for your thesis, and we will not practice doing so. However, below, you will be given the resources to make this type of bar chart in case you have time to incorporate this type of work later in the term or for a conference paper.

Bar charts can only show us how different values or groups on our independent variable/s may be associated with the values of dependent variables in our sample. To determine how consistent the relationship is between each independent and dependent variable in a study, social scientists typically employ formal statistical tests. Specifically, these tests assess how likely it is that what we are seeing in our sample would be found in repeated sampling. You learned about the most common bivariate association tests in research methods (e.g., chi-square and T-tests), but our workshop will focus on just bivariate correlation analysis because this is the only type of bivariate statistics that will be required for your thesis work. Using this statistical procedure is critical because it allows us to anticipate problems with multicollinearity, which is when two or more of our independent variables are so closely correlated that a regression model will not be able to accurately assess the influence of each independent variable by itself.

Now that you know what we will be doing in week one, here's what I would like you to do before we start this week to get the most out of our class meetings:

  • Before our first meeting this week, please make sure that you have printed out and are familiar with this article: "Did Brazilians Vote for Jair Bolsonaro Because They Share his Most Controversial Views?"

    Why do I keep asking you to use the same sample article?

    • First, I value your time and know that most of your reading in this class is for your own project. Using just a few sample articles is meant to cut down on your "homework" time.

    • Second, the inspiration for this article occurred in a senior seminar class a few years ago; this study was written to be a senior-seminar-length manuscript that mostly uses only the statistical techniques that we learn in this course (and those that we don't learn are covered in optional screencasts below).

    • Third, the article provides a sample methodology section for you to consult as you work on Thesis Assignment 4, starting on page 4 (the one thing related to your assignment that is missing from the article is a summary statistics table).

    • Finally, the article provides a good example of how you need to pay close attention to how the theory, methodology, and findings sections fit together: Notice that the same main arguments and theory-derived hypotheses and variables are the focus of each section of the article. Notice also that the ordering of the theories, their hypotheses, and the variables used to test those hypotheses is the same in each section. Incidentally, one mistake in this article is that the author (me) has the variables listed in reverse order in the figures and tables.

  • As we are working this week and next, and perhaps again later in the term, you likely will need to review at least some of the screencasts and other material linked below. If you are prompted below to read or watch material before we meet in class to talk about a statistical method, please do the work, or you may well find it difficult to understand what we are talking about in class.

  • Much of the material below is optional, but it should be reviewed if you need more guidance on particular methods. The students in this class have varying degrees of experience and comfort using SPSS, so you will need to tailor your class preparation this week and next based on how familiar you already are with using SPSS for univariate analysis, correlation, linear (aka OLS) regression, and logistic regression.


Here are a set of resources for the statistical concepts and SPSS methods we will be covering in the first week of the workshop:

Making a descriptive statistics table and interpreting its data

  • Important: Every student will need to include a descriptive statistics table in their presentations and thesis. The table will need to include sections with your dependent, independent, and control variables. Here is a sample of what a descriptive statistics table should look like. This table came from a study that Drs. Setzler and Yanus presented at a conference. The paper looked at how much the attributes that predict sexist views also predicted voting for Donald Trump in 2016. This study had no control variables, so none were listed in the table. Incidentally, a revised version of the paper was published.

  • Like the other statistical methods we will be reviewing in detail, using SPSS's commands to generate descriptive statistics is covered in the third of the three instructor-compiled handouts on using SPSS for senior thesis work. If you want to read more on how to use SPSS's point-and-click method to generate univariate statistics, here's a document from another source that goes over those procedures.

  • Some important ideas to know about descriptive statistics and what is reported in a descriptive statistics table:

    • What needs to go into the table? At a minimum, descriptive statistics tables typically report each variable's maximum and minimum values as well as its mean and usually its standard deviation.

    • For your thesis work, you will also need to report the number of observations for each variable because this information is how we will verify that you do not have a serious issue with your coding (or that you have one or more variables that were only asked of part of the sample). If you have a variable where a large share of the respondents are missing data, make sure you have coded that variable correctly. If a variable was only asked of part of your sample, you will need to think through how you are going to deal with that situation.

    • It also is standard practice for descriptive statistics tables to report standard deviations. The standard deviation is a widely used measure (in all fields) for looking at the distribution of an interval variable across its range of values.

A straightforward way to explain a standard deviation (SD) is to say that it measures how far from the mean most respondents' answers are scattered. Roughly two-thirds of respondents' answers fall within one standard deviation of the mean, that is, in a range running from one SD below to one SD above the average respondent's answer.

Comparing both means and standard deviations for different groups is useful. For example, if Professor Washington's advisees all have an average GPA of 3.0 with a standard deviation of zero, we know that each of his advisees has received a B in every one of their courses so far. If Prof. Jones's advisees also have an average GPA of 3.0, but the SD in the GPAs of her advisees is 1, the SD tells us that around two-thirds of her students have GPAs between 2.0 and 4.0. In short, there is no difference in the average GPAs of the students advised by Professors Washington and Jones, but the distribution of advisee grades is much more spread out among Professor Jones's students. If you were a very strong student, you might prefer to be one of Jones's advisees, since lots of them are earning A grades. If you have been earning lots of Cs, you might want to be advised by Professor Washington, since his advisees appear to earn Bs in every class.

    • Remember how to interpret the means of dummy (aka "binary," "dichotomous," and "0/1") variables. By convention, we report the standard deviation for each dummy variable, too, even though the SD information for a dummy variable is not useful. By itself, the mean value for a dummy variable reports its distribution in your sample; e.g., a value of .37 for the dummy variable Democrat indicates that 37% of the sample identifies as a Democrat.

    • Descriptive statistics tables report means, so what do you do if you have an independent variable that is an ordinal variable? You have two choices. You can treat it as a continuous variable and report its mean (that is what is done for several variables in the article on Brazilians voting for Bolsonaro that you read). Alternatively, you can create one or more dummy variables out of an ordinal variable, which is often done when levels of education are a key independent variable.

    • The means and standard deviations of a categorical variable are useless. Categorical variables should be recoded and analyzed as separate dummy variables in descriptive statistics tables. For example, if you have a party variable with three categories, a mean of 1.3 doesn't mean anything, so you'd want to create three dummy variables, one for each party, when putting this information into the descriptive statistics table. If your analysis would benefit from also summarizing the distribution of a categorical (also called "nominal") variable, you will want to make a separate table or figure reporting its "frequencies" (in percentages), since the mean and standard deviation do not communicate useful information for these types of variables.

Even if you already know how to do this, there are some handy shortcuts covered in the video that may make watching it worth your time:

(1) Use SPSS to generate the results you need for a descriptive statistics table: Analyze -> Descriptive Statistics -> Descriptives. Then, select just the variables you want and use the check-boxes in "options" to generate results for only the mean, standard deviation, min. value, and max value. These statistics are generally reported in the order just listed in a table where each variable is in a separate row and the four statistics are each in a separate column.

A pro trick: For the step above, when using a PC, you can select multiple variables at a time to add or remove from the list of variables you want to analyze by holding the control button down as you select each of them. You can select a range of variables by holding the shift key down as you select variables. For a Mac only, depending on your trackpad settings, you usually need to press the control button in combination with the shift key to select a range of variables.

Here's an even more useful trick: To make it faster to find the variables you want to analyze, you can change the view of the variable list so that variable names are listed alphabetically. To do so, hover over the variable list, right-click, and select the option to see variable names. Repeat this step to order them alphabetically. For a Mac only, again depending on your trackpad settings, what is a right-hand mouse click on a PC typically involves pressing the control button while clicking your trackpad.

In fact, it is so fast and easy to sort and find variables in the descriptives statistics window that this usually is the best way to create a list of your study's variables any time you need one. Just point, click, and then paste a descriptives command. In syntax, you can move the variables around if you think you are going to want them in a different order.

(2) As noted above, for a 0-1 coded (i.e., "dummy") variable only, calculating its mean will reveal the percentage of respondents belonging to the category. For example, if you have a variable coded 1 = female, 0 = male, a mean of .326 indicates that 32.6 percent of the sample is female.
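If you would rather type than point and click, the steps above can also be run as SPSS syntax. Here is a minimal sketch; the variable names (age, educ, female) are placeholders for your own study's variables:

```spss
* Generate the four statistics typically reported in a
* descriptive statistics table (mean, SD, min., max.).
* Variable names below are placeholders; substitute your own.
DESCRIPTIVES VARIABLES=age educ female
  /STATISTICS=MEAN STDDEV MIN MAX.
```

Pasting a command like this from the Descriptives dialog (the "Paste" button) also lets you reorder the variable list before running it, as noted above.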

  • The fastest way to create descriptive statistics for a variable's subgroups (i.e., for each response category) is to split your dataset on that variable and then run the descriptives command. You could use this strategy, for example, if you wanted to compare different racial/ethnic groups' average income or the frequency at which people who identify with different political parties agreed with a particular statement.

  • After we have analyzed subgroups in class, if you need additional information on splitting a dataset and running analyses, watch this optional screencast: https://youtu.be/YWgz0bKcq-M (a little over five minutes). This technique allows you to easily generate descriptive statistics--including means--across different groups. The example in the video looks at mean levels of support for torturing terrorism suspects for people belonging to different political parties.

Some things to remember from the video:

(1) To tell SPSS that your analyses should be run on different groups for a variable:

   Data-> Split file ->Compare groups.

Then, select a group variable to work with. In the video, the data is "split" so that analyses will be run for individuals grouped by their partisanship.

(2) Once you are done analyzing groups, you MUST turn off the group comparisons to get SPSS to go back to normal:

   Data-> Split file -> Analyze all cases.

If you don't do this step, your subsequent analyses will keep analyzing subgroups.

(3) And something important I left out of the screencast: Sometimes, you will want to create a variable to use just for this procedure. For example, in your analyses, you almost always want to use dummy variables for party identification (i.e., create dummies for Republicans, Democrats, and independents). If you wanted to split your dataset by partisanship so you could quickly generate descriptive statistics for a table comparing how Democrats and Republicans answered questions in the survey, you could create a categorical variable (e.g., Dem=1, Rep=2, Indep.=3) to split your data so that a descriptives or frequency command would report results first for Democrats, then for Republicans, and then for the other groups. If you do not create this extra variable and instead split your data on your Democrat dummy, the stats you get for the non-Democrats would include Republicans and independents rather than the Democrat-vs.-Republican stats you are looking for.
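The whole split-file sequence above can also be run as syntax. A sketch, assuming a placeholder three-category partisanship variable called party3 (1=Dem, 2=Rep, 3=Indep.) and a placeholder outcome variable:

```spss
* Run descriptives separately for each partisanship group.
* party3 and support_torture are placeholder variable names.
SORT CASES BY party3.
SPLIT FILE LAYERED BY party3.
DESCRIPTIVES VARIABLES=support_torture
  /STATISTICS=MEAN STDDEV MIN MAX.
* Turn the split OFF so later analyses use the full sample.
SPLIT FILE OFF.
```

Note that SPLIT FILE OFF at the end is the syntax equivalent of Data -> Split file -> Analyze all cases; forgetting it causes the same problem described above.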


Making bar graphs in Excel that visually show the relationships between two or more variables.

  • Important: Every student's thesis and presentations will need to include a bar chart examining the relationship between their independent and dependent variable/s (if you have control variables, they do not go into these analyses).

  • Here is an example of a bivariate bar chart; it comes from a study looking at why women and men differed in their level of support for civil liberties in the US and Canada over a ten-year period. When you look at the figure, consider how much easier it is for the reader to analyze the association between gender and support for civil liberties in a figure than would be the case with a complex table listing lots of percentages.

  • When you are comparing how different groups vary with respect to another variable (for example, the percentage of women vs. men, Democrats vs. Republicans, and whites vs. non-whites who voted for Joe Biden in 2020), the most visually appealing way to do so is to generate means or frequencies as instructed above, and then to combine your SPSS results for subgroups into a single, nicely formatted chart created in Excel (and then added to your paper or presentation via a screenshot of the figure).

  • In class, we will review how to use SPSS and Excel together to make a bar chart. If you need a refresher after class, watch this optional screencast (https://youtu.be/T6kHpZ2oReQ) to review how to use Excel to make bar charts from frequency or descriptives output. In the example, data from four separate, unattractive SPSS frequency charts are combined into a single figure that compares the percentage of men, women, Republican, and non-Republican voters who voted for Donald Trump in 2016. Important: These Excel-made charts are the kind of figures that should be in your thesis and presentations rather than SPSS-generated figures.

Some things to remember from the video:

(1) The fastest way to generate statistics for different subgroups is to use the split file option described above. If you want to compare mean values for different groups (say, average years of education), split the file on the group variable (say men vs. women, as in the example video) and run the descriptives command. If you want to compare the percentage of each group that did something, run the frequencies command. For the latter, you want to use the "valid" response data in your Excel chart.

(2) Enter the data in Excel so that the auto-generated chart (insert -> choose bar chart) will be able to see how the data is organized with respect to the labels. In the example in the video, we have something like this (note that the partisanship and gender labels are each centered in "merged cells" that span the group categories under them):

             Partisanship                  Gender
             Republican   Non-Republican   Men    Women
Vote Trump   85.0         21.1             41.1   35.4

(3) Two important things to note: First, with frequencies, your charts should note the percentage of respondents in each group (reporting the count for frequency tables and charts makes no sense). Second, there is no need to include two columns of data when one will do. In this case, it just clutters things up to add a second bar to each group noting the percentage of voters who didn't vote for Trump since that number is obvious from just the bars for the share who did vote for Trump.

(4) After you have auto-generated the chart, you can double-click on the bars to access options to change their color, width, etc. Once you have the chart editing options open, you can also use "add chart element" to add labels to the y-axis (in the example, we needed to note that the vertical axis referred to the percentage of each group who voted for Trump).


(5) To change the format of the numbers in your chart axes or in any numbers shown on the chart, change the format of the numbers in your original data (select all of the data -> right-hand mouse click -> format cells -> number, and then select how many decimal places you want, including none).

(6) This is not mentioned in the screencast, but one thing to know about making bar charts in Excel is that if you make a figure with vertical bars, your data rows in the spreadsheet will be used to create bars (running left to right) in the same order (top to bottom) as your data is listed. However, it is better to have horizontal bars if you are going to have a lot of bars. Unfortunately, if you make a bar chart with horizontal bars, Excel puts the data from the top spreadsheet row in the bar closest to the x-axis, which is to say at the bottom of the figure. So, if you want your variables to run in a certain top-to-bottom order in a bar chart with horizontal bars, list your data to go from bottom to top in the spreadsheet. You would think you would be able to click a button to reverse the order of the bars, but that isn't an option.

Using bivariate correlation analysis to look at the association between two variables and to identify potential problems with multicollinearity later on.

  • Important: Every student's early presentations will need to include a correlation matrix that includes each independent and (if you have them) control variable. You will not include a correlation matrix in the final presentation or paper, and you should see Dr. Setzler before analyzing the table's results in your papers or talks. The matrix can be an SPSS results screenshot rather than a formatted table. This is an in-class verification step to make sure that you do not have two or more independent variables so closely correlated that you will have problems in regression analyses (more on multicollinearity below).

  • Use the SPSS command Analyze -> Correlate -> Bivariate to look at the association between any two interval variables (including dummy variables). If you have little idea what correlation or a best-fitting/regression line is, take a look at this brief summary or watch the first few minutes of this screencast (https://youtu.be/X2cbmF-SR3I; 14min 05sec).

  • Print out and review this one-page handout of annotated SPSS output for correlation.
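The Analyze -> Correlate -> Bivariate dialog can also be pasted as syntax, which is handy for re-running the matrix as you add variables. A sketch, with placeholder variable names standing in for your own independent and control variables:

```spss
* Correlation matrix for all independent (and any control) variables.
* Variable names below are placeholders; substitute your own.
CORRELATIONS
  /VARIABLES=educ income ideology female
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.
```

Every variable listed is correlated with every other, so one command produces the full matrix you will screenshot for your presentation.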

  • Here is a summary of key ideas to know about what a bivariate correlation statistic is and does: 

    • Note that you will not need to summarize *any* bivariate correlations in your presentations or thesis findings.

    • The main reason we will be reviewing and using correlation analysis is to identify any independent variables in your project that are so closely associated that regression analysis for these variables may not work well in a regression model. Also, many of the concepts involved with correlation apply to the regression analysis statistics you will be using in your projects.

    • Remember that a correlation test measures how consistently, and in which direction, but not by how much, an increase in one dummy/interval variable predicts an increase (or decrease, if the correlation is negative) in the value of another dummy/interval variable. If we want to know how much an increase in one variable, by itself, corresponds to an increase or decrease in the value of another variable, we would need to run a regression analysis.

    • Association measures, including correlation coefficients, also report a significance statistic, which is reported as a p-value. We use the p-value to determine whether an association uncovered in the sample would reliably be found in repeated samples of the larger population. With a significance statistic that is SMALLER than .05, we can say with confidence that there is a meaningful association; if the significance statistic is LARGER than .05, we cannot be confident that we would consistently find an association between our variables in repeated sampling of the larger population.

    • How do you interpret the size of the correlation statistic?
      • <.10 means that there is a very weak association between the variables;

      • .20 can be interpreted as a meaningful but modest association;

      • .30 is a moderate association; and

      • >.40 is a strong association.

      • But, in every case, you need to put correlation findings into context (e.g., a .40 association between being Republican and being conservative would be a much weaker finding than you would anticipate, so it wouldn't make sense to refer to this scenario as evidence of a very strong association). Most association statistics range from -1 to 1, and a negative statistic means that increased values in one variable are associated with declines in the value of the other variable.

      Correlation and regression make some assumptions about the structure of the involved variables--specifically, they must have a linear relationship. Measuring two variables' correlation--whether with bivariate correlation or some type of linear regression--means that you are assuming that the relationship between one variable and another is linear. That is, you think that an increase in the value of one variable consistently corresponds to the same amount of increase or decrease in another variable. It is important to remember that two variables can be closely related to one another without the relationship being linear. As an example, age and income from work are related, but the relationship is curved; each year of life after, say, 16 typically corresponds to an increase in wages up to a certain age, let's say 60, and then wage income typically declines with each additional year of age. Similarly, in saving for retirement, your investments at a young age will generate some returns, but if you stick with it, the gains will be much greater later on because the relationship between time and savings is exponential, not linear. In short, if you have reason to think that the relationship between your independent and dependent variable isn't linear, see me for assistance; there are statistical ways to handle non-linear relationships.

    • You may have heard somewhere that "correlation is not causation." It's correct. Correlation (or regression, for that matter) can't tell us that an increase in one variable is causing an increase in another variable if it is plausible that the reverse is true (i.e., y causing x is perhaps as likely as x causing y). It is our theory that guides what we think the independent and dependent variables are.

    • Moreover, even if two variables are highly correlated, it could be that there is another variable that could be causing both x and y to change in predictable ways even though they have no actual relationship. Crime goes up when ice cream consumption goes up in a population, but they don't have anything to do with each other except that both are more prevalent in the summer when there are many more hours of sunlight. "Omitted variable bias" refers to a situation where the correlation between two variables is spurious, which is to say due to a third variable. Concern about omitted variable bias is why we will be talking about multivariate regression models later on.

  • Some key ideas to know about multicollinearity

    • What exactly is the problem with multicollinearity? Imagine that you think that one of two men who are close friends is committing a certain type of pretty unique crime in the evening. He clearly isn't the only one committing this type of crime because there have been incidents where neither he nor his friend was present. Your suspect and his friend hang out together a lot, and when they've been in an area at night, there frequently is evidence the next day that the criminal act has occurred. You would like to pin the blame on your main suspect, but you have a problem. While you believe that your main suspect has been committing all of the crimes and his friend hasn't been committing any of them, it's also possible that both men have been committing these crimes, taking turns. Or they both could be doing the crimes together. Or, it may be possible that you've identified the wrong guy entirely and it's the second man doing all of the crime that happens when they are together. To reliably test your hypothesis that it is just the first man committing crimes, ideally you will have lots of instances where only one of the guys was in an area on a given night, so you can see if a crime happened that night. If they are such good friends that you don't have very many of these observations, you may have to determine who is guilty by using just a few observations (something we try to avoid with statistics) or by splitting responsibility and assigning the crime to both of them equally whenever both were present (which could be flat out wrong). That's what multicollinearity is, and it is why we look at how correlated our independent variables are with each other.

    • How do we know how much each independent variable is correlated with each of the other independent and (if we have them) control variables? Before you run any regression model, you should use Analyze -> Correlate -> Bivariate to create a correlation matrix that includes all of your independent variables.

    • If two independent variables, or any pair of independent and control variables, have a correlation higher than .7, you want to think very carefully about interpreting these variables' results in a regression model, especially if the results look different than what you would expect. If you have a very large sample--let's say 4,000 observations or more--even with two quite highly correlated variables, you are still going to have lots of observations where the variables are not moving in the same direction all of the time. However, in a smaller sample of 1,500 observations or fewer, you need to consider the possibility of multicollinearity distorting your results for the highly correlated independent variables.

    • If you have two independent variables that are really highly correlated (.80 or higher), unless you have a very large dataset like the CCES, you will either need to combine them (perhaps into an additive index) or eliminate one of the variables, because the two measures are effectively capturing the same thing even if it is not clear why. We will talk specifics later on when we turn to interpreting regression models, but here just note that the decision to drop a variable to avoid multicollinearity (let's say that being conservative and being Republican are very highly correlated, which is the case these days) typically is based on a combination of its relative theoretical relevance and how removing specific variables changes the explanatory power of the model as a whole (i.e., its r-square statistic).


Week 10 (October 22, 24): Student presentations

  • This week, you will give your first practice presentation. It should run a minimum of 7 minutes and no more than ten. The main purpose of this presentation is for you to get feedback on how the different parts of your thesis are coming together and for me to assist you in identifying any variable coding problems.

  • When you present, you will need to present and turn in a descriptive statistics table listing, in order, your dependent, independent, and any control variables. All of these variables should be coded as either interval or 0/1 dummy variables.

  • Your presentations will not be graded, but your preparation and engagement while other students are presenting should be consistent with the professionalism/participation grade you hope to earn. Note that everyone must be present, regardless of whether you are giving a talk that day or not. You will have a graded preliminary presentation eventually, but that will be toward the end of the semester. 

  • You need to use a PPT (or similar program) for your presentation. For an idea of what your presentation PPT should look like, take a look at this sample presentation (I have removed the parts covering statistical techniques that we will learn about later). It was prepared for an academic conference where Prof. Setzler presented findings from a project you have been asked to read about several times. As you can see in the sample presentation, you want short summary points on the PPT rather than long blocks of material that you will read to the audience. It is important to use a consistently formatted PPT because it keeps you organized and allows you to glance up at an outline on the big screen, so that instead of staring at notes, you look like you are guiding your audience.

  • You should address the following topics:

Introduction:

  • Quickly, what is your main research question?

  • Quickly, why is it interesting/worth answering?

Concepts and theory. Based on previous academic scholarship:

  • Quickly, what do we already know about your topic, and what do we still need to know?

Data, measurement, and hypotheses:

  • What do you anticipate the answer/s to your research question will be? Specifically, you want a set of clear, concise hypotheses.

  • Briefly, what dataset are you using? Where and when did it come from? Is there anything special that we need to know about the sample? Make sure to review the sample presentation to look at what you need here. 

  • Concisely, how exactly are you measuring/coding your dependent variable (what you want to explain) and independent variable/s (the things you think influence the outcome of your dependent variable)?

  • Also, note very briefly your control variables (other influences that are of secondary or no importance to the study but that likely influence the dependent variable/s).

Preliminary findings:

  • You must have a descriptive statistics table, which needs to include the number of respondents for each item in your study as well as each variable’s minimum and maximum values (this requirement helps me to identify coding issues). Later on, your descriptive statistics table should look like what you see in research articles (i.e., created in a word processing program rather than being SPSS output, like this example). For the first practice presentations only, SPSS output is fine (let's make sure that you've got the right variables in there before you spend a lot of time formatting). You do not need to explain this table. Show it, say "Here is some summary information about each of the variables in my study. Let me give you a minute to look at it," and count 10-15 seconds out mentally so your audience has enough time to look over the table.

Please note: if your primary focus is to compare two or more groups (e.g., women vs. men or folks of different ethnicities), you may combine your descriptive statistics and bivariate analyses into a single table (and only the table). For example, if you were looking to see if men and women have a different level of patriotism, you could generate summary statistics (i.e., means) for men and then for women across all of your variables.

  • You need to have at least one bivariate analysis: specifically, a bar chart that you must create in Excel. The chart shows us what the relationship is between one or more of your independent variables and your dependent variable before controlling for any other variables that may influence the relationship.

Conclusion:

  • Very briefly, tell us what you are going to do next. For this first presentation, that can be a sentence that says, "The next step in this study is going to be to use regression analysis to get a better sense about how important each of my independent variables is in explaining X and whether what we are seeing in these bivariate analyses holds up once we isolate the influence of each of my independent variables."


Weeks 11/12 (10/29, 10/31, 11/5, 11/7): SPSS workshop on multivariate analysis.

During these two weeks, we will continue to review using SPSS to analyze different types of variables and their relationships with one another. Specifically, we will be using linear and logistic regression analysis to generate the statistics that social scientists typically use to explain and compare the influence of several different variables on some type of attitude or behavior.

  • On either Tuesday, 10/29 or Thursday, 10/31 (a week after your presentation), please submit electronic and hard copies of Thesis assignment 4: A draft of the thesis section describing your hypotheses, data, variables, and methodology (which requires you to attach a revised version of your codebook and a descriptive statistics table with information about each of your study's variables).

  • After we have covered linear and logistic regression in class, a BlackBoard assignment on regression analysis will be due Monday 11/4.

  • You will have a separate BlackBoard assignment covering the interpretation of predicted probabilities from logistic regression as well as how to make bar charts using these probabilities. This assignment will be due Wednesday 11/6.


Here are a set of resources for the statistical concepts and SPSS methods we cover in the workshop.
Please note that you are asked to read or watch some of these materials ahead of class:

The materials in the first block of material provide an overview of doing linear (AKA OLS) regression analysis with SPSS. Most senior seminar papers will rely on logistic rather than linear regression; however,  the materials on logistic regression assume that you understand the core concepts of linear regression. 

  • Important: It is very unlikely you will need to include any linear regression in your thesis because thesis students typically use dichotomous dependent variables.

  • So, why review linear regression? You will need to know how to compute and interpret linear regression output to do the workshop assignment. We are reviewing the type of regression used with a continuous (interval) dependent variable because it is easier to understand several key regression concepts if you first learn about them using bivariate and multivariate linear regression. These ideas include: unstandardized regression coefficients, their slope, and their statistical significance; the measure of overall model fit (r-square); the line of best fit; standardized coefficients (betas); and how dummy variables and interactive terms work.

  • Ahead of our meeting on linear regression, read and print out this handout of SPSS linear (aka OLS) regression output with annotations. The variables are the same ones discussed in the screencast below, which looks at using linear regression to measure the influence of different variables on a person's level of support for torture (1 = "never justified"; 7 = "always justified"). The annotations remind you how to interpret R-square, a model's constant (aka its intercept), unstandardized regression coefficients, and standardized coefficients for linear regression models.

  • After our meeting/s on linear regression, you have the option of watching this screencast (https://youtu.be/X2cbmF-SR3I; 14 min 05 sec). It doesn't cover any SPSS work; instead, it reviews the basic concepts behind correlation, regression, and multiple (aka multiple linear or OLS, ordinary least squares) regression. The first five minutes explain bivariate correlation (i.e., how consistent a positive or negative association between two variables is), while the next five minutes look at how we measure the typical impact of one variable on another (i.e., the regression slope). The final five minutes or so go over multiple regression (i.e., how we simultaneously look at the influence of multiple variables).

  • After our meeting/s on linear regression, you also have the option of watching this screencast (https://youtu.be/xzl8OxPsM8s; about 11 min), which walks you through the process of using SPSS to calculate and interpret the output of a linear regression model (the model looks at a 7-point measure of support for torturing terrorism suspects to gain information). (Note: At the start, the screencast references a longer version of the video that has since been pulled from my website.)

What to remember from the screencast (but with made-up, extended examples to illustrate how different kinds of coefficients are interpreted):

(1) When you have a continuous/interval dependent variable and want to use linear regression, use this command in SPSS: Analyze -> Regression -> Linear. You will then need to identify your dependent variable and all independent and (if you have them) control variables. SPSS doesn't know or care whether you see a variable as being a control or independent variable, so put them in the model together (but in the same order as you have them in your descriptive statistics table).

(2) Really important: If your model is going to include dummy variables, carefully think through what your reference categories will be as you select the model's independent variables! Remember, if you use dummy variables, you must omit one category to use as a reference category. If you put in a dummy variable for males, the reference category in your output statistics will be non-males. If you put in dummy variables for both males and non-males, you do not have a reference category, and the last dummy gender variable you entered will be dropped by SPSS as it does its calculation, even though this may not be the variable you wanted to use as a reference category. Finally, if you put in dummy variables for both Republicans and Democrats, your reference category will be people who are neither Republican nor Democrat, which may not be the comparison you are looking for if, for example, you have hypothesized that Republicans will be the most supportive partisans of using torture on suspected terrorists. If you are making that hypothesis, your model should include dummy variables for independents and Republicans so that your results will show you how different each of these groups is from the reference category (Democrats) and whether those differences are statistically significant.
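To see why the omitted category matters, here is a small, made-up Python illustration (not course material) of dummy coding a three-category party variable with Democrats left out as the reference category:

```python
def make_party_dummies(parties):
    """Code two 0/1 dummies, leaving Democrats out as the reference category."""
    return [
        {"Republican": int(p == "Republican"),
         "Independent": int(p == "Independent")}
        for p in parties
    ]

# A handful of hypothetical respondents.
sample = ["Democrat", "Republican", "Independent", "Republican", "Democrat"]
rows = make_party_dummies(sample)

for party, row in zip(sample, rows):
    print(party, row)

# A Democrat is the respondent who is 0 on BOTH dummies. If we added a third
# dummy for Democrats, the three columns would always sum to exactly 1 for
# every respondent, so SPSS would have to drop one to estimate the model.
```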

(3) Once you have your results, you will want to focus on two sections of the output. The first step in interpreting regression output is to use your SPSS output to figure out how well the variables in the model collectively explain variation in the dependent variable. To assess a model's "fit,” use SPSS’s adjusted R-squared statistic from the output. An example of interpretation: If a model's adjusted R-square is .365 and we are predicting how much a person supports torturing terrorism suspects on a 1-7 point measurement, we would say that, "the variables in this model collectively account for about 37 percent of the variation in how much people support torture." Alternatively, depending on what hypotheses we're testing, we might interpret the same statistic by saying, "Most of what predicts the level of support for torture--nearly two-thirds of the explanation--is due to factors not considered by this model" [1 - .365 = .635].

(4) Then, go to the coefficients table at the end of the output, and look at each independent variable to determine if it is significant or not. With a regression analysis, the statistical test assesses whether repeated sampling would find that the influence of a given variable is neither zero nor in a direction opposite of what the coefficient's sign says (a minus sign for a negative relationship and no sign for a positive one). So, the column “Sig.” lists the probability that a given coefficient is not actually different from zero or that the relationship is signed in the wrong direction. By convention, we want this probability (aka, “the p-value”) to be less than .05. If a variable’s p-value is greater than .05, we will not interpret that coefficient because we aren’t sure the variable has a real effect on the dependent variable.

(5) Next, we want to know how increases in each independent variable are predicted to influence the estimated value of the dependent variable when all other variables are held at a constant influence. To do so, we need to look at the unstandardized coefficients (listed in the “B” column) so that we can interpret each of the independent variables that are statistically significant (we typically will have little to say about control variables, if we have any, because they are in the model solely to make sure we have isolated the influence of our independent variables). For example, suppose that we are trying to explain which variables predict a person's level of support for torturing terrorism suspects on the seven-point measure. If we had unstandardized coefficients of .451** for our dummy variable Republican (assuming it is the only partisan variable in the model) and -.134* for edu4 (measured as a four-unit, interval variable), a suitable interpretation of these results would be:

"Compared to other non-Republicans (the reference category), Republicans' level of support for torture was around a half of a point higher on the seven-point scale, controlling for the influence of other variables. On the other hand, each increase in education modestly decreased support for torture. Respondents with the highest levels of education had around a half a point lower level of support when compared to the least educated" (i.e., each level of education reduced a person's score by -.137; thus going up three units--from 1 to 4--is equal to 3 x -.134 = -.402). Note that we always need to look carefully at what the reference category is when using dummy variables. If our model had included dummies for both Republicans and Democrats, the comparison here would be to independents, since they were the only partisan group not in the model.

(6) As a last step, we will want to consider how the predictors rank in their influence on the dependent variable. What factors most determine the value of the dependent variable? We determine this by comparing the variables' "standardized" coefficients, which are listed in the "Beta" column of the results. These statistics each measure how many standard deviations the dependent variable increases or decreases with each one-standard-deviation increase in the applicable independent variable (e.g., going from having an average level of education to roughly the 84th percentile). Most often, these statistics will be less than one, suggesting that a one-standard-deviation increase in a given independent variable corresponds to less than a one-standard-deviation increase/decrease in the dependent variable's value.

For example, when predicting support for torture--as the model does in the screencast--a one-standard-deviation increase in racism causes a much larger increase in support than a one-standard-deviation increase in religiosity or education. In interpreting a model that included these variables, we might say, "The standardized coefficients for the model indicate that the most powerful predictor of how much a person supports torturing terrorists is that individual's level of racial animosity."

  • Predicting regression scenarios in the write up of your findings is a way to provide analysis of your results that is a lot more interesting for readers than telling them what is obvious from looking at the table that summarizes the regression model.
    • The default regression table shows how a one-unit increase in a given independent variable changes the value of the dependent variable when all of the other variables are held at their mean values. While that makes for nice tables, discussing specific scenarios can help to bring the data alive.

    • In a regression scenario, we change the values of one or more variables in the standard regression equation to something other than their means. For example, what is the level of support for torture on a seven-point scale for a 60-year old Republican male who attends church a lot versus a 30-year old female secular non-Republican, holding other variables constant at their means? To calculate the expected level of torture for the first individual, we would use the following formula:
      The model's unstandardized constant (i.e., the expected level of support for torture when all variables in the model have a value of zero, including variables whose range doesn't actually include zero)
      + (60 times the unstandardized regression coefficient for AgeInYears)
      + (1 times the Republican coefficient)
      + (0 times the Female coefficient)
      + (6 times the ReligAttend6 coefficient)
      + (mean value of Edu5 x the Edu5 coefficient)
      + (mean value of VotedLastElection x the VotedLastElection coefficient)
      = The expected value of TortureOK7 for this individual

And then you would calculate a second equation, changing only the scenario variables' values (but not the constant, the regression coefficients, or the means for variables not in the scenario) to fit the characteristics of the other hypothetical individual. So, to calculate the level of support for torture for the hypothetical woman described above, the equation's value of 60 would be replaced by 30. And it would be 0 x the Republican coefficient, 1 x the Female coefficient, and 1 x the ReligAttend6 coefficient.
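For anyone who wants to see the two equations above worked out in code, here is a Python sketch. Every coefficient and mean below is invented for illustration, so in practice you would substitute the values from your own SPSS output:

```python
# Invented coefficients for the scenario example above (not real results).
coefs = {
    "constant": 2.10,
    "AgeInYears": 0.010,
    "Republican": 0.85,
    "Female": -0.30,
    "ReligAttend6": 0.12,
    "Edu5": -0.15,
    "VotedLastElection": 0.05,
}
# Invented sample means for the variables held out of the scenario.
means = {"Edu5": 3.2, "VotedLastElection": 0.8}

def expected_torture7(scenario):
    """Constant plus each coefficient times the value chosen for that variable
    (variables not in the scenario are held at their means)."""
    total = coefs["constant"]
    for var, b in coefs.items():
        if var == "constant":
            continue
        total += b * scenario.get(var, means.get(var, 0))
    return total

# 60-year-old Republican male who attends church a lot...
man = expected_torture7({"AgeInYears": 60, "Republican": 1, "Female": 0,
                         "ReligAttend6": 6})
# ...versus a 30-year-old secular non-Republican woman.
woman = expected_torture7({"AgeInYears": 30, "Republican": 0, "Female": 1,
                           "ReligAttend6": 1})
print(round(man, 2), round(woman, 2))
```

Notice that the second call only changes the scenario variables; the constant, the coefficients, and the means for Edu5 and VotedLastElection stay exactly the same, just as described above.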

    • After class and after reading through the example I just gave, watch this optional screencast if you need more guidance on the process of using linear regression output and some basic math to calculate the expected value of your dependent variable under different scenarios: https://youtu.be/3m66P8PaD3U. You also might want to review and print out this resource for future reference: a handout with the output and calculations covered in the video.

    • Pro-tip: If you are using linear regression in your thesis and have many independent variables, it's a lot faster to calculate scenarios with the assistance of Excel. You can copy and paste regression and descriptive statistics results directly from SPSS output and then use Excel formulas to do all of the math. Here's an Excel spreadsheet with every variable set at its mean. Notice that you can just toggle variables to different values to create all kinds of scenarios. And here's a screencast on using the spreadsheet: https://youtu.be/A10yOJleGNw (again, there's no reason to watch this screencast if you are using logistic regression in your thesis; there is a separate spreadsheet and screencast specifically for logistic regression scenarios below).


This block of materials provides help with logistic regression analysis, which is used when your dependent variable is dichotomous. Most senior seminars involve this type of analysis. Even if you are only using this type of regression, you should review the previous block of material on linear regression because this method is an extension of those concepts.

  • Important: Every student will need to report some type of multivariate regression results in their theses and presentation; almost all theses will use logistic regression because student projects typically use one or more dichotomous dependent variables.

  • Ahead of our meeting on logistic regression, please take the time to read and print out: this handout of SPSS logistic regression output with annotations. The document is very similar to the one I posted above for linear regression, but this time the dependent variable--support for torturing terrorism suspects to obtain information--has been recoded into a dummy variable. Respondents who said torture was sometimes, often, or always justified were coded "1"; individuals who said torture is never justified were coded "0."

  • And make sure that you print out a copy of this one-page document, which is a sample of what a logistic regression table in a research paper for HPU political science classes should look like. The first column in the table shows you how to write up the results for the same model reported in the annotated SPSS output above. The second and third columns summarize the results (made up for the purposes of this exercise) of two additional regression models that the author has run separately, first on a male-only sample and then for women. Putting the three regression models into a single table is way more efficient than pasting raw SPSS output with lots of irrelevant information, and displaying the regression models side-by-side allows the author to see and discuss the different effects that his independent variables have on support for torture among men and women. If you have multiple dependent variables or are looking at regression models for different groups (e.g., one regression model for women and one for men), put the regression model results into the same table.

  • After attending the class where we talk about logistic regression, if you feel like you need to review more of the basics about why we need to use a special kind of regression to predict the likelihood of dichotomous outcomes, watch this 12 min. screencast (https://youtu.be/uUf3h8ifZxE), which explains in detail what bivariate logistic regression is, how it works, and why it is often used with ordinal dependent variables after they have been recoded into 0/1 dummy variables. This screencast covers concepts; the next one will look at how we use SPSS to run and interpret logistic regression models.

Helpful hints from the video (but with a completely made-up example to illustrate how coefficients with different values should be interpreted). While the video looks at what causes someone to agree that "generally speaking, men make better political leaders than women," in the example below I discuss the results from a model used in a paper I co-authored about who intended to vote for Donald Trump a month out from the 2016 presidential election:

(1) How to run this type of regression. When you have a dichotomous dependent variable and want to use regression to predict how changes in an independent variable increase/decrease the likelihood of an outcome, you use this method:

SPSS: Analyze -> Regression ->Binary Logistic

(2) Really important: If your model includes dummy variables, carefully think through what your reference categories will be! (I included the same note above for OLS regression, but repeat it here because students often struggle with this concept.) If you put in a dummy variable for males, the reference category in your output statistics will be non-males. If you put in dummy variables for both Republicans and Democrats, your reference category will be people who are neither Republican nor Democrat, which may not be the comparison you are looking for if you have hypothesized that Republicans will be the most supportive of torture. If you are making that hypothesis, your model should include dummy variables for Democrats and independents so that your results will show you how different each of these groups is from the reference category (Republicans) and whether that difference is statistically significant.

(3) In the SPSS output, you will ignore most of what you see, going straight to the “pseudo” R-squared statistic first. This statistic tells you how well the variables in the model as a whole explain the "likelihood" of the outcome you are predicting. Report just one of the two R-squared statistics that SPSS lists in the model results. The Nagelkerke is the most like OLS regression’s R-square, so use it. Note that researchers typically label this statistic "pseudo-R-square" when reporting the results of logistic regression because it is not actually the mathematical square of a Pearson's R (correlation) statistic, the way it is for linear regression.

Here's an example of how to interpret a pseudo-R-square:
Let's say the model of who intended to vote for Donald Trump (yes or no) gave us SPSS output with a Nagelkerke R-square of .142. In writing up our findings, we might say, "The model's pseudo-R-square (Nagelkerke) indicates that the predictors [i.e., the independent variables] in the model collectively account for just over 14% of the variation in whether or not someone intended to vote for then-candidate Donald Trump." Another way to interpret the same results would be: "While numerous studies have suggested that each of the characteristics in the model were important predictors of who voted for Trump in 2016, the model's pseudo-R-square (Nagelkerke) statistic indicates that over 85% [i.e., 1.0 - .142 = .858] of the factors that led only some individuals to vote for Trump lie beyond the indicators examined here."

(4) Then, go to the coefficients table and look at the odds ratios to determine how much a one-unit increase in each independent variable changes the "likelihood" of the outcome you are predicting, when all other variables are held at a constant influence. Odds ratios are listed in the Exp(B) column of the SPSS output. An odds ratio higher than 1.0 indicates that increases in the value of that independent variable increase the likelihood of the outcome predicted by the model.

Here is an example of three factors that were significantly correlated with supporting Donald Trump in 2016; however, I am making up their coefficients to review how odds ratios falling into three different value ranges are interpreted:

  • Let's say we were predicting whether someone intended to vote for Trump and our results had an odds ratio [Exp(B)] of 12.321 for a dummy independent variable identifying Republican respondents (assuming it is the only partisanship variable in the model).

  • For our second variable, our output reports an odds ratio of 1.137 for edu4, which is a four-point measure of educational attainment. (Here, we assume that it is an interval variable; if we didn't want to assume that each one-unit increase in education has the same effect on the dependent variable, we would need to recode that variable into a series of dummy indicators, say "more than high school," "college degree or more," and "advanced degree," leaving "high school or less" as our reference category.)

  • Finally, we have a third variable--Torture7--created from a 7-point item asking how much respondents agreed with a statement that "torture should be used to obtain information from suspected terrorists." For this variable, our output reports an odds ratio of .911.

To recap, a truncated version of our output for this example reads:

Variable      Exp(B)   (Make sure you get the right column's data!)
Republican    12.321
Edu4          1.137
Torture7      .911

A suitable interpretation of the odds-ratio results would involve the following steps.

First, consider how you should interpret odds ratios with a value between 1 and 2. Odds ratios in this range indicate a positive relationship whose effect can be stated as a percentage increase in the likelihood of the outcome occurring. For this example, you would say: "Each increase in education on a four-point scale increased the likelihood of voting for Trump by about 14%" (i.e., 1.137 - 1.0 = .137, or 13.7%). While it would be correct to say that each one-unit increase in education made a Trump vote 1.14 times as likely, it is clearer and more elegant to express the same finding as a percentage increase.

For odds ratios of two or greater, the interpretation typically is expressed as an x times increase in the likelihood of the outcome occurring with every one unit of change in the independent variable. In this example, we could say, "Compared to all other respondents, the typical Republican was over 12 times as likely to vote for Trump, after taking the influence of other variables into account."

Odds ratios that are lower than one indicate a negative relationship. Students sometimes struggle with odds ratios between zero and one because the odds ratio is still a positive number. Think about what you would say if you started off with a dollar and after a bet you have 60 cents. You could say that you now have only 60% of what you had or that you have 40% less than what you had. In the example output, you would say, "Being more supportive of torture made a person less likely to support Trump. On its seven-point scale, each additional level of support for using torture corresponded to a 9 percent decrease in the likelihood of voting for him" (i.e., 1 - .911 = .089).
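The three interpretation rules above can be boiled down to a few lines of code. This Python sketch is mine, not course material; it uses the made-up odds ratios from the example output and prints the kind of phrase each value range calls for:

```python
def describe_odds_ratio(name, oddsratio):
    """Turn an Exp(B) value into the style of phrase described above."""
    if oddsratio >= 2.0:
        # Large positive effects read best as "x times as likely."
        return f"{name}: about {oddsratio:.1f} times as likely"
    if oddsratio > 1.0:
        # Between 1 and 2, report a percentage increase (OR minus 1).
        return f"{name}: about a {(oddsratio - 1) * 100:.0f}% increase in likelihood"
    # Below 1, report a percentage decrease (1 minus OR).
    return f"{name}: about a {(1 - oddsratio) * 100:.0f}% decrease in likelihood"

for var, oratio in [("Republican", 12.321), ("Edu4", 1.137), ("Torture7", 0.911)]:
    print(describe_odds_ratio(var, oratio))
```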

(5) Remember: odds ratios that are lower than 1.0 indicate that increases in the value of an independent variable decrease the likelihood of the outcome. These results generally are reported as percentages, which are calculated by subtracting the odds ratio from one. An example: Assume we are predicting whether someone intends to vote for Trump and we run a model that switches out the dummy variable Republican for the new dummy variable Democrat. Let's say our results have odds ratios of .101 for the dummy variable Democrat (again assuming it is the only partisanship variable in the model) and .913 for edu4 (as defined above). A suitable interpretation of these results would be:

"Compared to other respondents, Democrats were about 90% less likely to vote for Trump [i.e., 1.0 - .101= .899] after taking the influence of other variables into account. Increases in education also decreased the likelihood of supporting him. Respondents with the highest level of education were approximately 24% less likely to vote for Trump than persons with only a high school degree or less schooling.

For the new Democrat variable, another way to convey the same finding, using the odds ratio of .101 directly, would be to say: "Compared to other respondents, Democrats were only about one-tenth as likely to vote for Donald Trump, controlling for the other predictors in the model."

To calculate the statistic in the statement about education’s influence, we first need to figure out what one unit of change does (going from edu4 = 1 to edu4 = 2) and then compound that effect to account for the two additional units of change that separate people with the lowest and highest levels of education. In other words, we need to raise .913 (i.e., 1 - .087) to the third power. If you aren't math inclined, Google will do this for you; just do this search: "(1-.087) to the third power". The answer is .761, or 76.1 percent. So, 100 percent - 76 percent = 24 percent, which is the statistic noted above.

How do we get to this statistic again? Going from 1 to 2 on edu4 makes a respondent 91.3% as likely to vote for Trump. Then, those with a score of 3 on edu4 are just 91.3% as likely to vote for Trump as those with a score of 2 (i.e., .913 x .913 = .834). Finally, those with a score of 4 on edu4 are just 91.3% as likely as those with a score of 3, which works out to 91.3% of 83.4% (i.e., .913 x .834 = .761). Of course, this is the same result that you get by calculating (1-.087)^3, which is a result that you can have calculated for you if you just do a Google search for: (1-.087)^3.
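You can check this compounding arithmetic in any tool you like; here it is worked out in a few lines of Python, using the .913 odds ratio from the education example:

```python
# Verify the education calculation: an odds ratio of .913 compounded over the
# three one-unit steps that separate edu4 = 1 from edu4 = 4.

odds_ratio = 0.913
steps = 3
remaining = odds_ratio ** steps          # .913 * .913 * .913

print(round(remaining, 3))               # 0.761
print(round((1 - remaining) * 100))      # 24 (percent decrease)
```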

(6) If appropriate for your hypotheses, consider how the predictors rank in how consistently they influence the dependent variable. And here's something that is not covered in the video but may be useful to think about. There is some debate about how best to rank the relative influence of each variable in a logistic regression model, but one common approach is to compare their "Wald scores," much as we do with "betas" in linear regression. For logistic regression, this approach tells us which variables most consistently predict the outcome, but not the magnitude of their influence the way a standardized coefficient does in linear regression.

  • Calculating scenarios from logistic regression results: As with linear regression, interpreting logistic regression results in an interesting way is best done with scenarios. For the thesis and oral presentations involving regression, students who have run logistic regression models should take the time to see how much a minimum-to-maximum change in each independent (but not control) variable changes the probability of believing or doing what your dependent variable measures.

    • For linear regression, creating scenarios is a straightforward process because it is easy to use Excel to calculate specific scenarios from SPSS output (see the example above). Unfortunately, you cannot create scenarios from raw logistic regression output without doing additional mathematical transformations.

    • With logistic regression, scenarios involve calculating the predicted probability of doing or believing something at a given value of the independent variable, with the effect of all other variables held constant at their mean.

    • What is a predicted probability, and how is it different from an odds ratio, which is what SPSS lists in its default output? To answer this question, it helps to first understand that odds and probabilities are different ways of conveying the same information, and you can mathematically transform any odds statistic into a probability. Specifically, the odds of an outcome refers to the number of times the outcome is expected to occur compared to the number of times it is expected not to occur. Probability is the number of times an outcome is expected to occur compared to the maximum number of times it could possibly occur.
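These two definitions boil down to a pair of one-line conversions. A minimal sketch in Python (the function names are mine; SPSS does not expose functions like these):

```python
# Odds and probability are two expressions of the same information.

def odds_to_probability(odds):
    # e.g., odds of winning of 4 to 1 -> 4 expected wins out of every 5 games
    return odds / (1 + odds)

def probability_to_odds(p):
    # e.g., an 80% chance of winning -> odds of winning of 4 to 1
    return p / (1 - p)

print(odds_to_probability(4))      # 0.8
print(probability_to_odds(0.8))    # approximately 4
```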

The reason we go through the hassle of mathematically converting odds into predicted probabilities in logistic regression scenarios is that most people find probabilities a lot easier to understand. In the hypothetical scenario presented below, the odds of a football team with two injuries losing its next game are .56 to 1, while the odds for a team with six injuries are 2.85 to 1. In other words, a team with six injuries should, on average, lose 2.85 games for every game it wins.

For most people, it makes a lot more sense to convey the same information as probabilities, which can be expressed as percentages: a team with two injuries has a 36% chance of losing, while a team with six injuries can be expected to lose 74% of the time.

An odds ratio tells us how much a one-unit increase in a given independent variable increases or decreases the odds of an outcome, which can then be converted into a predicted probability. An odds ratio of .600 says that each increase in the independent variable results in the odds of the outcome being .600 times what they were before the increase (i.e., 40% less likely). An odds ratio of 1.20 says that each increase in the independent variable results in the odds of the outcome being 1.20 times what they were before the increase (i.e., 20% more likely).

So, let's say you are planning to bet on a football team that has only a 20% chance of losing when it has no injuries. Another way of putting this is to say that the team has a 4-in-5 probability of winning. For every four games we expect the team to win, we would expect it to lose one, which means its odds of winning are 4 to 1. When you bet on this team, you might place a four-dollar bet: if the team loses, you lose your four dollars; if it wins, you get a dollar, plus your original four dollars.

What happens if your team experiences injuries during the season? How will each additional injury change the team's odds and probability of winning?

We could run a binary logistic regression model, and its output might say that each additional injury a football team suffers increases the likelihood (specifically, the odds) of losing by 1.5 times. Using that odds ratio of 1.5, here is how the odds and probability of the team winning change with each additional injury:
0 injuries, odds = 4/1 (pr. of losing = 20%)
1 injury, odds = 4/1.5 (pr. of losing = 27.3%)
2 injuries, odds = 4/2.25 (pr. of losing = 36%)
3 injuries, odds = 4/3.4 (pr. of losing = 45.8%)
4 injuries, odds = 4/5.06 (pr. of losing = 55.9%)
5 injuries, odds = 4/7.6 (pr. of losing = 65.5%)
6 injuries, odds = 4/11.4 (pr. of losing = 74%)
7 injuries, odds = 4/17.1 (pr. of losing = 81%)
8 injuries, odds = 4/25.6 (pr. of losing = 86.5%)
9 injuries, odds = 4/38.4 (pr. of losing = 90.6%)
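If you want to check any row of this table, the whole thing can be reproduced in a few lines of Python. This is just a sketch; the starting odds of 4 to 1 and the odds ratio of 1.5 come straight from the example:

```python
# Reproduce the injury table: with no injuries the odds of winning are 4 to 1
# (so the odds of losing are 1/4), and each injury multiplies the odds of
# losing by the odds ratio of 1.5.

starting_odds_of_losing = 1 / 4
for injuries in range(10):
    odds = starting_odds_of_losing * 1.5 ** injuries
    pr_losing = odds / (1 + odds)        # convert odds to a probability
    print(f"{injuries} injuries: pr. of losing = {pr_losing:.1%}")
```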

    • The bottom line is this: odds ratios allow us to communicate how each one-unit increase in an independent variable influences the likelihood of the outcome. However, most people don't refer to odds in everyday language, so it is a lot more intuitive if you explain regression results by talking about the probabilities of an outcome under different scenarios.

    • And in order to calculate scenarios like those in the football example above, we first have to transform logistic regression's output for the relevant variables. While you can do these transformations in SPSS, it is very complicated to do so (other statistical programs make it much easier). Fortunately, you can do all of the mathematical work in an Excel worksheet if you know what formulas to use, and a spreadsheet is a good place to manipulate different variable values to create scenarios anyway. Your instructor has put together an Excel spreadsheet that calculates predicted probabilities for whatever scenarios you choose.

What to remember from the screencast:

(1) First, point-click-and-then-paste a logistic regression command that includes all of your independent and any control variables (plus your dependent variable). Don't run the command yet.

(2) Then, point-click-and-then-paste a descriptives command into syntax, using any variable as a placeholder. Once the command is in your syntax, replace the placeholder variable with all of the independent and any control variables in your regression model syntax (omit the dependent variable). The point here is to make sure that the regression model and the descriptive statistics will list variable results in the exact same order.

Two things to note that aren't in the screencast. First, you will make things easier if you delete STDDEV from the descriptives syntax before you run it, because you will need output only for each independent and control variable's mean, minimum, and maximum values. Second, I ran the descriptives command and then the logistic regression in the screencast, but it will be easier to find the results you need if you run the logistic regression first, because we are interested only in the very last block of its output.

(3) Select and run both of the commands, and then open the Excel spreadsheet that your instructor has created for logistic regression scenarios (for the record, you can compute scenarios in SPSS or even using Google or Wolfram Alpha, but it will be much easier to use the spreadsheet I have created for this purpose). That spreadsheet is in the PPT folder and in one of the subfolders in the workshop materials.

(4) Open the last part of the SPSS logistic regression output, specifically the block that lists the coefficients, by double-clicking on it. Copy and paste the unstandardized logistic regression coefficient (the one in the B column) for the Constant into the appropriate worksheet cell. Then, copy all of the other variable names together with their coefficients (the ones right next to the variable labels in the "B" column), and paste them into the worksheet columns that are labeled for this output.

(5) Now it is time to work on the scenarios portion of the Excel worksheet. Go back to the SPSS output and double-click on the descriptives output. Copy the means for all of the variables, and paste them into the worksheet column labeled "Scenario." We are doing this because we want to be able to create scenarios in which some variables are set to specific values while the remaining variables are held at their mean values.

(6) Now, have some fun creating scenarios. There are two ways that scenarios are frequently used in research, and both of them appear in the paper you read earlier this term on what kinds of Brazilians voted for Bolsonaro in 2018, which is why I assigned this article. First, a couple of paragraphs in the article compare hypothetical individuals who are similar in all ways except for a couple of characteristics in order to show which variables had the most effect (partisanship and ideology) and which had a small effect (sharing Bolsonaro's illiberal views). Second, there are several bar charts showing how the probability that different kinds of Brazilians voted for Bolsonaro changed if an independent variable was at its lowest versus highest value. Those bar charts were created by using minimum and maximum values as scenarios for each variable while all other variables were held constant at their average values.

(7) Important and not mentioned in the screencast: If you want to create a scenario for a variable that is represented by multiple dummy variables and a reference category, enter one for the dummy variable you are looking at and zero for the other dummy variables in the set. If the group that you want to look at is the reference category (i.e., it wasn't included in the regression model), then enter zeroes for the other groups. For the example in the screencast, to determine the probability that a typical independent was going to vote for Hillary Clinton in 2016, the scenario needed zeroes for the variables Democrat and Republican while leaving the mean values for all other variables. Because the only partisan dummy variables in the model were Democrat and Republican, entering zero for each of them returned the predicted probability of voting for Clinton for the typical independent, as long as the scenario values for all other variables were left at their mean values.
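For students who want to see the math the scenario spreadsheet is doing behind the scenes, here is a sketch in Python. The coefficients below are made up for illustration only; they are not the screencast's actual output, and the variable names are just hypothetical stand-ins:

```python
import math

# Hypothetical unstandardized logistic regression output (B column):
# a constant plus coefficients for two party dummies and edu4.
coefficients = {"constant": 0.50, "democrat": 2.20, "republican": -2.30, "edu4": -0.09}

def predicted_probability(scenario):
    """Convert a scenario's logit into a probability: p = 1 / (1 + e^-logit)."""
    logit = coefficients["constant"] + sum(
        coefficients[name] * value for name, value in scenario.items()
    )
    return 1 / (1 + math.exp(-logit))

# A "typical independent" scenario: both party dummies at zero,
# edu4 held at a made-up mean of 2.6.
independent = {"democrat": 0, "republican": 0, "edu4": 2.6}
print(round(predicted_probability(independent), 3))
```

This is exactly the transformation described earlier: the coefficients produce a logit for a given combination of variable values, and the logistic function turns that logit into a predicted probability you can report as a percentage.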


Week 12 (11/5, 11/7): Mostly SPSS lab time.

  • After we have covered linear and logistic regression in class, a BlackBoard assignment on regression analysis will be due Monday 11/4.

  • On Tuesday, we will finish up any remaining material on logistic regression and predicted probabilities. The remainder of the week will be devoted to lab time for you to conduct statistical analyses.

  • You will have a separate BlackBoard assignment covering the interpretation of predicted probabilities from logistic regression as well as how to make bar charts using these probabilities. This assignment will be due Wednesday 11/6.

  • Thursday will be devoted to lab time for you to conduct statistical analyses for your own projects similar to those you have completed during the SPSS workshops. If you have any major concerns about your project, this is a good time to see Dr. Setzler!


Looking ahead to the start of Unit 4:

Week 13 is when you will start presentations that include all of your statistical results. See the next unit in the course schedule for details.


To make it easier to find things, I have broken up the assignments calendar into multiple units. The material for the next part of the course can be accessed by going to the course homepage and following the appropriate links.