Finish the second part of the statistics project

Description

See attached. I already finished the first part.



Individual Empirical Project – Fall 2023
In this project, you will come up with your own substantive research question and try to
answer it with real data and the econometric methods you learn in this class. Your question
needs to be feasible, and just as important, it needs to be a question you care about.
Group Size
You can work individually (recommended) or in teams of 2.
Time frame
This project has 3 parts. The due dates for each written submission are:
Research Question & Annotated Bibliography write-up: Wednesday, Oct 11
Data Description write-up: Monday, Nov 13
Final report write-up: TBA (to be determined by the registrar and announced in early October; placeholder due date: Friday, December 8th, noon)
Overall Project Grade
Your project grade will be computed using the following:
12.5% Research question write-up
12.5% Data description write-up
75% Final project write-up
Types of Research Questions – Examples
The goal of this project is to get you to use the tools we develop in class, in an application
that you care about. The core tool to be used is multivariate regression analysis. This tool
is used in many contexts; here are a few common categories (and there are many others!):
1. Estimating causal effects of a policy. If we increase the EITC, what impacts will
that have on labor supply? If the minimum wage goes up, what impact will that
have on wages and employment in the restaurant sector? When states adopted the
Medicaid expansion as part of Obamacare, what impact did that have on insurance
rates? These are all questions of causal impacts of policy changes. There are several
strategies to estimate these impacts, with one leading strategy that is well suited for
our toolkit being:
a. Difference-in-Difference analysis. This is one strategy that can be used to
estimate causal effects of a policy. To do so, you need to identify a “before”
and “after” period (corresponding to when the policy was implemented, or
significantly changed), and a “treated” and “control” group. (These groups
could be different geographic areas, or could be different demographic
groups.) We will discuss this strategy more in class.
2. Documenting and understanding demographic disparities. There are a lot of
unequal outcomes in economic and social life, and our tools can be used to better
understand these. For example, in the US women’s wages are on average about 80%
of their male counterparts’ wages. Our regression tool can be used to examine
questions about whether this is driven by differences in age, education, occupation,
parenthood, etc.
3. Predicting an outcome or outcomes. Sometimes we are most interested in
making a prediction. For example, when a judge decides whether to release a
defendant before trial, she wants to predict the probability that the defendant will
show up for the trial and not get into trouble beforehand. When a tax auditor is
trying to decide whether to audit a filing, they need to predict the expected amount
of tax fraud they will uncover.
a. What variables impact our predictions of expected outcomes? This is a
variant on the main question, where we are interested not only in predicting
an outcome, but in understanding how the predictions change as we compare
different values of explanatory variables. E.g., what is the relationship
between firm size and expected amount of tax fraud?
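To make the difference-in-difference idea from category 1a concrete, here is a minimal sketch of the arithmetic. It is written in Python purely for illustration (the course itself uses Stata), and the data, group labels, and function names are all hypothetical. The estimate is the change in the treated group's average outcome minus the change in the control group's average outcome.

```python
from statistics import mean

# Hypothetical toy data: (group, period, outcome). All values are invented
# purely to illustrate the arithmetic.
rows = [
    ("treated", "before", 10.0), ("treated", "before", 12.0),
    ("treated", "after", 15.0), ("treated", "after", 17.0),
    ("control", "before", 9.0), ("control", "before", 11.0),
    ("control", "after", 11.0), ("control", "after", 13.0),
]

def cell_mean(group, period):
    """Average outcome for one group in one period."""
    return mean(y for g, p, y in rows if g == group and p == period)

def diff_in_diff():
    """Change for the treated group minus change for the control group."""
    treated_change = cell_mean("treated", "after") - cell_mean("treated", "before")
    control_change = cell_mean("control", "after") - cell_mean("control", "before")
    return treated_change - control_change

did_estimate = diff_in_diff()
print(did_estimate)  # 3.0 for the toy data above
```

With no other controls, this number equals the coefficient on the interaction term in a regression of the outcome on a treated indicator, an after indicator, and their product. The regression form is what lets you add control variables and obtain standard errors.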
Data
OPTION 1:
I recommend using data from the IPUMS project: (https://www.ipums.org/). This is an
amazing collection of micro datasets from both current and historical periods for the US
and across the world. If you use one of the IPUMS datasets, you will need to build in some
time to get an account approved, and allow for a round or two of learning what data you
want to download and clean. This source of data will require a bit more work than the
choices below, but is likely to provide the greatest learning opportunity.
OPTION 2:
You can also use one of these rich data sets that are available on the course web site along
with documentation:
● The Consumer Expenditure Survey (CE) is collected every quarter by the US
Bureau of Labor Statistics from a large nationally representative sample of
American households. It includes numerous measures of income, purchases,
household characteristics, and individual characteristics. This data is used by the
government to compute poverty thresholds for social programs and revise the
Consumer Price Index (CPI) market basket of goods and services. Businesses and
academic researchers use the data to learn about consumer spending habits and
trends. I’ve posted a large subset of the CE data collected in the first quarter of 2013.
● The General Social Survey (GSS) has been collecting data on American knowledge,
opinions, and behavior every two years since 1972. It asks many of the same
questions every round giving us our best source of data for examining trends in
these factors. It also asks new questions on timely issues every round. I’ve posted
the data and documentation for the 2012 wave.
OPTION 3:
With my pre-approval, you may also use data you find yourself, though this is often
a more difficult proposition. If you want to go this route, you need to obtain my written
approval by Friday, October 6th. (To do so, you will need to reach out to me in advance of
that date with your idea, and most likely meet in person during my office hours in advance
of this date). You will not be allowed to conduct any experiment or survey that would
require approval from a Human Subjects Review Board.
Part 1: The Research Question & Annotated Bibliography
You should write about 400-600 words discussing the question you hope to answer. Why is
this question interesting? What do you think the answer might be and why? I encourage
you to look at the documentation for the data sets listed above for ideas. Here are a few
examples of feasible questions:
● How do characteristics of owners influence the revenue firms make?
● How does formal education affect people’s knowledge of current events?
● What kinds of families have multiple cars?
● Does smoking affect grades in high school?
● What characteristics of countries determine how many gold medals they win at the
Olympics?
● How do different NBA player performance statistics (e.g., points or rebounds per
game) influence salaries?
There are four issues you may have to address in your research question:
1. Population of interest: You might want to focus your study on only certain groups,
geographic areas, etc. This may be only a subset of the population covered by the
dataset. You want to be clear about what your population of interest is.
2. Causality vs. correlation: Are you trying to estimate how changes in one thing (call
this an independent variable) affect another thing (call that a dependent variable)?
This would be a statement about causation, and as many of you have already heard,
correlation does not necessarily imply causation. It is much easier to describe how
two things might be correlated, but it’s much more interesting and useful to
estimate causal effects.
3. Time ordering: In general, if you are estimating a causal effect, your determinants
should be established before the measures of what they might be affecting. If you
are looking at the effects of formal education on knowledge of current events, you
should make sure formal education has been completed before you measure
knowledge of current events. On the other hand, it doesn’t make sense to look at
how college education affects whether people smoke since most people who smoke
start before they finish high school, and few people quit smoking in college.
4. Confounders: Are there variables that might influence both your independent and
dependent variables and thus induce a correlation? For example, people with high IQs might
be more likely to get more formal education and also keep up better with current
events. People from poor families might be more likely to smoke and also receive
poor grades in school. You will need to account for these potential confounding
variables in your analysis.
The question you choose is not set in stone. You will be able to modify or change your
research question later on if you need to.
Also, you should write up 2-3 pages (750-1500 words) discussing related literature. This can be
papers whose findings are related to your research question, or which motivate or speak to your
question. In your literature review it is very important that you connect every article you
cite explicitly to your empirical project—if you cannot articulate concisely the relevance of an
article to your research question, you should not be citing that article in your literature review.
The grading rubric for the Research question write-up will be based on the following:
Rubric for Research Question & Annotated Bibliography
Clearly stated research question (20 pts)
Policy or social science importance of question (10 pts)
Feasible to answer? (20 pts)
Relevant literature? (35 pts)
Appropriate consideration of: Population of interest, Causation vs. correlation, etc.? (15 pts)
Final Grade
Part 2: Data Description
Create your analysis sample from your raw data. What are the relevant variables? There
should be 10-15 variables incorporated into your paper. How many observations do you
have of each? You should have at least 100 observations and ideally quite a few more.
Describe each variable’s distribution using a tabulation if it is categorical or summary
statistics and a histogram if it is continuous. Are your continuous variables normally
distributed? At least one of your variables should be an indicator variable (only takes a
value of 0 or 1). Describe the distribution of one of your continuous variables conditional
on one of your indicators being zero and being one. Do not show histograms for categorical
variables. Talk about the variables themselves rather than the name or number they might
have in the survey (e.g., Don’t talk about q55 but instead talk about income or grades or
smoking status).
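The descriptive steps in the paragraph above can be sketched as follows. This is an illustration only, in Python rather than the Stata .do file the assignment asks for, and the variable names (income, has_children, region) and every value are hypothetical. It shows summary statistics for a continuous variable, the same statistics conditional on an indicator being 0 or 1, and a tabulation of a categorical variable as fractions of the sample.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical analysis sample; variable names and values are invented.
sample = [
    {"income": 30.0, "has_children": 1, "region": "South"},
    {"income": 50.0, "has_children": 0, "region": "West"},
    {"income": 40.0, "has_children": 1, "region": "South"},
    {"income": 60.0, "has_children": 0, "region": "Northeast"},
    {"income": 20.0, "has_children": 1, "region": "West"},
    {"income": 70.0, "has_children": 0, "region": "South"},
]

def summarize(values):
    """Count, mean, and sample standard deviation of a continuous variable."""
    values = list(values)
    return {"n": len(values), "mean": mean(values), "sd": stdev(values)}

def tabulate(values):
    """Fraction of the sample falling in each category."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}

# Continuous variable: whole sample, then conditional on the indicator.
income_all = summarize(r["income"] for r in sample)
income_by_children = {
    flag: summarize(r["income"] for r in sample if r["has_children"] == flag)
    for flag in (0, 1)
}

# Categorical variable: tabulation rather than a histogram.
region_shares = tabulate(r["region"] for r in sample)

print(income_all)
print(income_by_children)
print(region_shares)
```

In the write-up these numbers would go into a formatted table and be described in words, using the variable's meaning (income, region) and not its survey code.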
Formally test differences in means for some of your variables in two subgroups. e.g., Are
families with children more likely to own more cars than families without children? Do
countries in colder regions of the world win more gold medals per capita at the Winter
Olympics than countries in warmer regions? The results of these tests should give some
insight into your research question and motivate the regression analysis you perform in
part 3.
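As a rough illustration of a formal test of a difference in means, the sketch below computes a two-sample t statistic for the cars-and-children example. It is in Python for illustration only (the assignment itself calls for Stata), the data are invented, and the function name welch_t is hypothetical. It assumes the unequal-variance (Welch) form of the standard error, which is one common choice and not necessarily the one used in class.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Return (difference in means, standard error, t statistic).

    The standard error uses each group's own sample variance, so the two
    groups are not assumed to share a common variance.
    """
    diff = mean(group_a) - mean(group_b)
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return diff, se, diff / se

# Hypothetical counts of cars owned; values are invented for illustration.
cars_with_children = [2, 2, 3, 1, 2, 3, 2, 1]
cars_without_children = [1, 1, 2, 0, 1, 2, 1, 1]

diff, se, t_stat = welch_t(cars_with_children, cars_without_children)
print(f"difference = {diff:.3f}, se = {se:.3f}, t = {t_stat:.2f}")
```

The sketch stops at the t statistic. Turning it into a p-value requires the t distribution with the appropriate degrees of freedom, which statistical software reports directly.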
Your report should be primarily English sentences with supporting tables and
figures and should run about 2500-3500 words (3500 at most), plus Tables and Figures. It
should include the title, research question, motivation, and literature review that you
developed earlier, and you should incorporate any new ideas you’ve had since then and any
feedback you received. If your prior presentation of literature was in the form of an
annotated bibliography, then by now you should re-work it into a proper literature review.
Your tables should look pretty and should not simply be copied and pasted from Stata
output. In addition to your report, you should create and hand in a single Stata .do file that
performs all of the descriptive analysis.
Your grade for this part will be computed using the following rubric:
Rubric for Data Description
Question (10 pts)
    Clear (3 pts)
    Important (4 pts)
    Feasible (3 pts)
Literature Review (5 pts)
Sample (15 pts)
    Clear explanation of source of data (5 pts)
    Describe unit of observation (5 pts)
    Number of observations (5 pts)
Description of continuous variables, at least 3 (30 pts)
    Include relevant variables (10 pts)
    Means and standard deviations (7 pts)
    Histograms (7 pts)
    Clear English sentences & write-up (6 pts)
Description of categorical variables (25 pts)
    Include relevant variables (5 pts)
    Tabulations, aka fractions of sample in each category (10 pts)
    Clear English sentences & write-up (10 pts)
Statistical Tests (15 pts)
    Appropriate (10 pts)
    Well explained (5 pts)
Final Grade
Part 3: Final Project
Use an appropriate combination of multiple regression models and (if appropriate) more
advanced methods to answer your research question. You should estimate regression
models with one to three different dependent variables and control for potential
confounding variables. Carefully interpret the signs and magnitudes of coefficients in your
preferred regression specifications. If there are additional potential confounders that you
do not have in your data, discuss how omitting them may bias your results. Discuss
potential intervening variables if you have any.
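When discussing how an omitted confounder may bias your results, the standard omitted-variable-bias result for the simplest case (one included regressor, one omitted confounder) is a useful reference point. The notation below is an assumption made for illustration and is not taken from the course materials: y is the dependent variable, x the main independent variable, and z the confounder missing from the data.

```latex
\begin{align*}
\text{True model:}\quad & y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + u_i \\
\text{Confounder on regressor:}\quad & z_i = \delta_0 + \delta_1 x_i + v_i \\
\text{Estimated without } z:\quad & y_i = \tilde\beta_0 + \tilde\beta_1 x_i + \tilde u_i \\
\text{Expected coefficient:}\quad & E[\tilde\beta_1] = \beta_1 + \beta_2\,\delta_1
\end{align*}
```

The bias term is the product of the confounder's effect on the outcome and its association with the included regressor, so its sign can often be argued from theory even when z is unobserved. For instance, if z raises y and is positively related to x, the estimated coefficient on x overstates the true effect.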
● Include the title, research question, motivation, and literature review you developed
earlier.
● Include the equation for your preferred regression model specification.
● Include tables of regression results (including coefficients and standard errors) in
your final report.
● The maximum length for your report is 15 single-spaced pages of text (7500
words), and up to an additional 5 pages of tables and figures. You will likely
produce more during your research process, but it is important to know how to
identify the most important results. (Your list of references can be in addition to the
15+5 page limit.)
In addition to your final report, you should submit a single .do file that performs all of
your tests and regression analysis, and if it is not too large upload a copy of your analysis
data file (this is the file you will run your analysis on, after cleaning and organizing your
data). If your analysis data file is very large (say, N > 30,000), instead upload the output
from “describe” and “summarize” results for the dataset.
Final Project Grading Rubric
Question (10 pts)
    Clear (3 pts)
    Important (4 pts)
    Feasible (3 pts)
Literature Review (5 pts)
    Well-connected, sufficient motivation (5 pts)
Basic Summary Stats & Descriptive results (10 pts)
    Appropriate to motivate/give context for research (5 pts)
    Well executed and described (5 pts)
Regression models (50 pts)
    At least 3; overall quality of write-up (5 pts)
    Dependent make sense? (5 pts)
    “Main” independent variables make sense? (5 pts)
    Include potential confounders? Hope yes (10 pts)
    Include big intervening variables? Hope no (5 pts)
    Interpret signs and magnitudes (10 pts)
    Interpret statistical significance (10 pts)
More advanced methods (up to 5 bonus points)
    Appropriate & Well explained (5 pts)
Discussion/Conclusion (25 pts)
    Main take away (10 pts)
    Limitations of approach (10 pts)
    Policy relevance (5 pts)
Final Grade
