Effective data analysis requires both a solid foundation in analysis theory and technical skill with relevant data analysis software. In this course, students focus on building technical skills using Stata, a data analysis package widely used in economics, political science, epidemiology, sociology, and related policy-oriented fields. Students begin with basic commands and learn reproducible workflows and file organization. Much of the course focuses on the intermediate skills required to understand datasets, prepare data for analysis, and present findings in tables and graphs. Throughout the course, students will learn how to use AI effectively to speed up the process of learning Stata code and techniques, and will be coached on maintaining independent coding competence in the absence of AI support. Students will undertake weekly tasks to build competence and will complete a final project relevant to their primary discipline.
Understand the differences between dataset types and their key features and uses
Construct a Stata project architecture, including folder systems, master do-files, modular scripts, and log files
Prepare data for analysis: import data; interpret labels, codebooks, missingness, and survey weights; and clean, merge, and reshape datasets
Create output tables and visualizations
Demonstrate effective and responsible use of large language models (LLMs) for improving Stata skills without harming independent coding competence
Evaluate how data construction and visualization choices shape research findings
· Opening, saving, and managing datasets (use, save, replace, clear, tempfile)
· Core exploratory commands: summarize, tabulate, list
· Logical conditions and data manipulation (if, &, |, generate, replace, drop)
· Writing clear, annotated do-files and organizing project folders
· Using relative file paths to ensure reproducibility across computers
· Creating log files to document work
· Using Stata help files and AI tools for learning and debugging
· Defensive coding practices using capture and assert
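As an illustration of how the topics above fit together, the following is a minimal do-file sketch, not a required template. It assumes Stata 16 or later and uses the auto example dataset that ships with Stata; the folder name logs, the log file name week1.log, and the variable price_k are hypothetical names chosen for the example.

```stata
* Minimal do-file sketch. Folder, file, and variable names are hypothetical.
version 16
clear all
capture log close                  // close any open log; ignore the error if none is open
capture mkdir "logs"               // create the folder; ignore the error if it already exists
log using "logs/week1.log", replace text   // relative path, so the project can move between computers

sysuse auto, clear                 // example dataset shipped with Stata

* Core exploratory commands
summarize price mpg
tabulate foreign
list make price if price > 10000 & foreign == 1

* Data manipulation with a defensive check
generate price_k = price / 1000
replace price_k = round(price_k, 0.1)
assert price_k > 0 & !missing(price_k)     // stops the do-file if the check fails

* Save to a temporary file that Stata deletes when the do-file ends
tempfile cleaned
save `cleaned', replace

log close
```

Here capture is used only where a failure is expected and harmless (no log open, folder already present), while assert is used where a failure should halt the run.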
· Macro vs. micro data
· Survey data vs. administrative data
· Sampling frames and population coverage
· Panel data and multi-level data structures
· Variable types (numeric, string, categorical)
· Survey weights and why they matter
· Importing data
· Appending datasets across time or sources
· Data cleaning: checking consistency, outliers, and missing values
· Renaming, recoding, and labeling variables and values
· Using codebooks and documentation to guide cleaning choices
· Identifying duplicates and unique identifiers
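The cleaning steps listed above might look something like the following sketch, again using the auto example dataset shipped with Stata. The new names repair_rec and repair_cat and the three-category grouping are assumptions made for illustration, not a recommended coding of these data.

```stata
sysuse auto, clear

* Identifiers and duplicates
isid make                          // errors out if make does not uniquely identify rows
duplicates report make

* Missing values
misstable summarize                // lists variables that contain missing values
count if missing(rep78)

* Renaming, recoding, and labeling (hypothetical names and categories)
rename rep78 repair_rec
recode repair_rec (1 2 = 1 "Poor") (3 = 2 "Average") (4 5 = 3 "Good"), ///
    generate(repair_cat)
label variable repair_cat "Repair record, three categories"

* Confirm the recode, showing missing values explicitly
tabulate repair_rec repair_cat, missing
```

Because recode with generate() leaves the original variable untouched, the cross-tabulation at the end can be used to verify that each original value landed in the intended category.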
· Grouped calculations using egen
· Working with string variables
· Group-wise operations using by
· Local and global macros
· Looping structures (foreach, forvalues, while) with varlists and numlists
· Merging datasets and collapsing data to higher levels
· Using frames to manage multiple datasets within a project
· Accessing saved results (e.g., r(N), r(levels))
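Several of the techniques above can be combined in a short sketch like the one below, which uses the auto example dataset shipped with Stata. The variable names mean_price and avg_mpg and the macro names vars, groups, and group_means are hypothetical. Frames are not shown here; the sketch uses preserve and a tempfile instead.

```stata
sysuse auto, clear

* Grouped calculation: attach the group mean to every row
bysort foreign: egen mean_price = mean(price)

* Local macro holding a variable list, then a loop over it
local vars price mpg weight
foreach v of local vars {
    quietly summarize `v'
    display "`v': N = " r(N) ", mean = " %9.2f r(mean)
}

* Loop over a number list
forvalues i = 1/3 {
    display "iteration `i'"
}

* Saved results from levelsof
levelsof foreign, local(groups)
display "levels of foreign: `r(levels)'"

* Collapse to group level, then merge the result back onto the original rows
preserve
collapse (mean) avg_mpg = mpg, by(foreign)
tempfile group_means
save `group_means'
restore
merge m:1 foreign using `group_means'
assert _merge == 3                 // every row should find its group
drop _merge
```

The assert after merge is one way to make a silent mismatch between the two datasets fail loudly rather than pass unnoticed.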
· Creating summary and results tables
· Using user-written packages appropriately
· Exporting results to Word and Excel
· Adding notes and metadata to outputs for clarity
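One possible way to produce a summary table and send values to Excel is sketched below using only built-in commands (tabstat and putexcel). The folder name tables, the file name summary.xlsx, and the cell layout are assumptions for the example; user-written packages offer other routes.

```stata
sysuse auto, clear

* Summary table in the Results window
tabstat price mpg, by(foreign) statistics(mean sd n)

* Store results in locals before other commands overwrite r()
quietly summarize price
local mean_price = r(mean)
local n_obs = r(N)

* Write a small annotated table to Excel (hypothetical path and layout)
capture mkdir "tables"
putexcel set "tables/summary.xlsx", replace
putexcel A1 = "Statistic"
putexcel B1 = "Value"
putexcel A2 = "Mean price (USD)"
putexcel B2 = `mean_price'
putexcel A3 = "Observations"
putexcel B3 = `n_obs'
putexcel A5 = "Source: auto.dta example data shipped with Stata"
```

The final line shows one simple way to carry a source note along with the exported numbers.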
· One-way and two-way graphs
· Layering plots to compare groups or trends
· Customization, labeling, and consistent visual styles
· Using figures to diagnose outliers, distributions, and patterns
· Reproducible graph generation and exporting
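A layered graph that is built and exported entirely from code could be sketched as follows, using the auto example dataset shipped with Stata. The folder name figures, the file name mpg_weight.png, and the titles are hypothetical choices.

```stata
sysuse auto, clear

* Two scatter layers (one per group) plus a fitted line over all cars
twoway (scatter mpg weight if foreign == 0) ///
       (scatter mpg weight if foreign == 1) ///
       (lfit mpg weight), ///
       legend(order(1 "Domestic" 2 "Foreign" 3 "Linear fit")) ///
       title("Fuel economy by vehicle weight") ///
       xtitle("Weight (lbs.)") ytitle("Miles per gallon")

* Export from code so the figure can be regenerated exactly
capture mkdir "figures"
graph export "figures/mpg_weight.png", replace width(2000)
```

Keeping the export command in the do-file, rather than saving from the Graph window, means that rerunning the script reproduces the same figure.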
Students will conduct data analysis and visualization related to a research question of their choosing. The appropriate type of analysis will depend on the student’s academic discipline. This project entails:
· Identifying a research question of interest
· Locating and accessing at least three separate datasets that are relevant and appropriate for answering this question
· Compiling and preparing the data for analysis, demonstrating the skills learned in this course
· Conducting the analysis and preparing tables, graphs, or other outputs to convey the findings of the analysis
· Presenting the question, methods, and findings to the class
· The submission for grading should include:
o A written description of the motivation and research question
o A description of the data used, including source, collection method, contents, and justification of applicability
o A description of the preparation process and analysis methods
o Polished output, including tables and graphs, and a description of the findings
o Datasets, do-files, and log files
The entire written submission should not exceed 2,000 words.