Tim Essam (tessam@usaid.gov) • Laura Hughes (lhughes@usaid.gov) inspired by RStudio’s awesome Cheat Sheets (rstudio.com/resources/cheatsheets) geocenter.github.io/StataTraining updated January 2016
Disclaimer: we are not affiliated with Stata. But we like it.
Data Processing
with Stata 14.1 Cheat Sheet
For more info see Stata’s reference manual (stata.com)
CC BY NC
frequently used
commands are
highlighted in yellow
display price[4]
display the 4th observation in price; only works on single values
levelsof rep78
display the unique values for rep78
Explore Data
duplicates report
report the number of duplicate observations (rows that are identical across all variables)
describe make price
display variable type, format,
and any value/variable labels
ds, has(type string)
lookfor "in."
search for variable types,
variable name, or variable label
isid mpg
check if mpg uniquely
identifies the data
histogram mpg, frequency
plot a histogram of the distribution of a variable
count
count if price > 5000
count the number of rows (observations); can be combined with logic
VIEW DATA ORGANIZATION
inspect mpg
show histogram of data,
number of missing or zero
observations
summarize make price mpg
print summary statistics
(mean, stdev, min, max)
for variables
codebook make price
overview of variable type, stats,
number of missing/unique values
SEE DATA DISTRIBUTION
BROWSE OBSERVATIONS WITHIN THE DATA
gsort price mpg
sort in ascending order, first by price then by miles per gallon
gsort -price -mpg
sort in descending order, first by price then by miles per gallon
list make price if price > 10000 & price < .
list the make and price for observations with price > $10,000
clist ...
same as list, in compact form
browse or Ctrl + 8
open the data editor
Missing values are treated as the largest
positive number. To exclude missing values,
ask whether the value is less than "."
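The note on missing values can be checked with a short sketch. It assumes the built-in auto dataset, where rep78 contains some missing values; the threshold of 3 is arbitrary.

```stata
* Assumes the auto dataset shipped with Stata; rep78 has some missing values.
sysuse auto, clear

* Missing (.) is treated as larger than any number,
* so this count also includes rows where rep78 is missing
count if rep78 > 3

* Either of these conditions excludes the missing values
count if rep78 > 3 & rep78 < .
count if rep78 > 3 & !missing(rep78)
```

The last two counts should agree with each other and be smaller than the first.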
histogram mpg, frequency
Summarize Data
bysort rep78: tabulate foreign
for each value of rep78, apply the command tabulate foreign
collapse (mean) price (max) mpg, by(foreign)
calculate mean price & max mpg by car type (foreign)
replaces data
tabstat price weight mpg, by(foreign) stat(mean sd n)
create compact table of summary statistics
table foreign, contents(mean price sd price) f(%9.2fc) row
create a flexible table of summary statistics
(f() formats numbers; row displays stats for all data)
tabulate rep78, mi gen(repairRecord)
one-way table: number of rows with each value of rep78
(mi includes missing values; gen() creates a binary variable for every rep78 value, named with the prefix repairRecord)
tabulate rep78 foreign, mi
two-way table: cross-tabulate number of observations
for each combination of rep78 and foreign
see help egen
for more options
egen meanPrice = mean(price), by(foreign)
calculate mean price for each group in foreign
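Because collapse replaces the data in memory while egen only adds a column, the two are often combined with preserve and restore. The sketch below assumes the auto dataset; preserve and restore are standard Stata commands that are not otherwise covered on this sheet.

```stata
sysuse auto, clear

* egen adds the group mean as a new column and keeps every row
egen meanPrice = mean(price), by(foreign)

* collapse replaces the data in memory, so take a snapshot first
preserve
collapse (mean) price (max) mpg, by(foreign)
list
restore

* after restore, the original observations are back
count
```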
Create New Variables
pctile mpgQuartile = mpg, nq(4)
create a variable holding the quartile cut points of mpg (use xtile to assign each observation to a quartile)
generate totRows = _N
bysort rep78: gen repairTot = _N
_N creates a total count of observations (per group)
generate id = _n
bysort rep78: gen repairIdx = _n
_n creates a running index of observations (in a group)
generate mpgSq = mpg^2
gen byte lowPr = price < 4000
create a new variable. Useful also for creating binary
variables based on a condition (generate byte)
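One common use of _n and _N is picking the first or last row in each group. This is a minimal sketch assuming the auto dataset; sorting by price inside each group and the isLast variable are illustrative choices, not part of the sheet.

```stata
sysuse auto, clear

* Sort by price within each rep78 group so the running index has a defined order
bysort rep78 (price): gen repairIdx = _n
bysort rep78: gen repairTot = _N

* The row with index 1 is the cheapest car in its repair-record group
list make rep78 price if repairIdx == 1

* The row whose index equals the group total is the most expensive one
gen byte isLast = repairIdx == repairTot
list make rep78 price if isLast
```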
Change Data Types
Stata has 6 data types, and data can also be missing:
byte: true/false
int, long, float, double: numbers
string: words
missing: no data
To convert between numbers & strings:
string to number
destring foreignString, gen(foreignNumeric)    "1" → 1
gen foreignNumeric = real(foreignString)    "1" → 1
encode foreignString, gen(foreignNumeric)    "foreign" → 1
number to string
tostring foreign, gen(foreignString)    1 → "1"
gen foreignString = string(foreign)    1 → "1"
decode foreign, gen(foreignString)    1 → "foreign"
recast double mpg
generic way to convert between storage types
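The conversions above can be tried as a round trip. This sketch assumes the auto dataset, where foreign is a labeled numeric variable; the rep78String and rep78Numeric names are made up for illustration.

```stata
sysuse auto, clear

* labeled number -> string holding the label text
decode foreign, gen(foreignString)

* string -> labeled number; encode assigns codes alphabetically starting at 1,
* so the new codes need not match the original 0/1 coding
encode foreignString, gen(foreignNumeric)

* plain number -> string of digits, and back again
tostring rep78, gen(rep78String)
destring rep78String, gen(rep78Numeric)

* compare the storage types of the originals and the converted copies
describe foreign foreignString foreignNumeric rep78 rep78String rep78Numeric
```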
Basic Data Operations
Arithmetic
+   add (numbers), combine (strings)
-   subtract
*   multiply
/   divide
^   raise to a power
Logic
&   and
|   or
! or ~   not
==   equal (== tests if something is equal; = assigns a value to a variable)
!= or ~=   not equal
<   less than
<=   less than or equal to
>   greater than
>=   greater than or equal to
Example data:
make            foreign   price
Chevy Colt      0          3,984
Buick Riviera   0         10,372
Honda Civic     1          4,499
Volvo 260       1         11,995
if foreign != 1 & price >= 10000
both conditions must hold: selects Buick Riviera
if foreign != 1 | price >= 10000
either condition may hold: selects Chevy Colt, Buick Riviera, and Volvo 260
Import Data
sysuse auto, clear
load system data (Auto data); for many examples, we use the auto dataset
use "yourStataFile.dta", clear
load a dataset from the current directory
import excel "yourSpreadsheet.xlsx", /*
*/ sheet("Sheet1") cellrange(A2:H11) firstrow
import an Excel spreadsheet
import delimited "yourFile.csv", /*
*/ rowrange(2:11) colrange(1:8) varnames(2)
import a .csv file
webuse set "https://github.com/GeoCenter/StataTraining/raw/master/Day2/Data"
webuse "wb_indicators_long"
set web-based directory and load data from the web
Set up
pwd
print current (working) directory
cd "C:\Program Files (x86)\Stata13"
change working directory
dir
display filenames in working directory
fs *.dta
list all Stata files in working directory (fs is a user-written command: ssc install fs)
capture log close
close any log file that is still open
log using "myDoFile.log", replace
create a new log file to record your work and results
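The set-up commands are typically placed at the top of a .do file. The sketch below shows one possible skeleton; the log file name myLog.log is hypothetical, and the text option (plain-text log instead of SMCL) is a choice rather than a requirement.

```stata
* Close any log left open by an earlier run; capture suppresses
* the error that log close would raise if no log is open
capture log close

* Show where files will be read from and written to
pwd

* Start a plain-text log (hypothetical file name), overwriting any old copy
log using "myLog.log", replace text

* ... analysis commands go here ...
sysuse auto, clear
summarize price mpg

* Stop logging at the end of the do-file
log close
```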
Basic Syntax
All Stata commands have the same format (syntax):
[by varlist1:] command [varlist2] [=exp] [if exp] [in range] [weight] [using filename] [, options]
Example:
bysort rep78 : summarize price if foreign == 0 & price <= 9000, detail
by varlist1: apply the command across each unique combination of variables in varlist1
command: the function; what are you going to do to the varlists?
varlist2: column(s) to apply the command to
=exp: save output as a new variable
if exp: condition; only apply the function if something is true
in range: apply to specific rows
weight: apply weights
using filename: pull data from a file (if not loaded)
, options: special options for the command
In this example, we want a detailed summary
with stats like kurtosis, plus mean and median
To find out more about any command – like what options it takes – type help command
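As an illustration of the syntax pattern, each line in the sketch below exercises one element. It assumes the auto dataset; pricePerLb and the autoCopy tempfile are made-up names, and the block should be run together from a .do file because tempfile relies on a local macro.

```stata
sysuse auto, clear

* command varlist
summarize price mpg

* command varlist if exp
summarize price if foreign == 0 & price <= 9000

* command varlist in range
list make price in 1/5

* command varlist [weight], options
summarize price [aweight = weight], detail

* command newvar = exp
generate pricePerLb = price / weight

* by varlist: command
bysort foreign: summarize price

* command using filename (here a temporary copy of the data)
tempfile autoCopy
save `autoCopy'
describe using `autoCopy'
```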
Useful Shortcuts
F2: describe data
Ctrl + 8: open the data editor
Ctrl + 9: open a new .do file
Ctrl + D: highlight text in .do file, then Ctrl + D executes it in the command line
clear
delete data in memory
AT COMMAND PROMPT
PgUp PgDn: scroll through previous commands
Tab: autocompletes variable name after typing part
cls: clear the console (where results are displayed)
search mdesc
find the package mdesc to install
ssc install mdesc
install the package mdesc; needs to be done once
packages contain extra commands that expand Stata’s toolkit
underlined parts are shortcuts: use "capture" or "cap"
Tim Essam (tessam@usaid.gov) • Laura Hughes (lhughes@usaid.gov) inspired by RStudio’s awesome Cheat Sheets (rstudio.com/resources/cheatsheets) geocenter.github.io/StataTraining updated March 2016
Disclaimer: we are not affiliated with Stata. But we like it. CC BY NC
Data Transformation
with Stata 14.1 Cheat Sheet
For more info see Stata’s reference manual (stata.com)
Save & Export Data
save "myData.dta", replace
save data in Stata format, replacing the data if
a file with same name exists
saveold "myData.dta", replace version(12)
save a Stata 12-compatible file
export excel "myData.xls", /*
*/ firstrow(variables) replace
export data as an Excel file (.xls) with the
variable names as the first row
export delimited "myData.csv", delimiter(",") replace
export data as a comma-delimited file (.csv)
Manipulate Strings
GET STRING PROPERTIES
charlist make
display the set of unique characters within a string (user-defined package)
display length("This string has 29 characters")
return the length of the string
display substr("Stata", 3, 5)
return up to 5 characters of the string, starting at character 3 (here, "ata")
display strpos("Stata", "a")
return the position in Stata where a is first found
FIND MATCHING STRINGS
display strmatch("123.89", "1??.?9")
return true (1) or false (0) if string matches pattern
list make if regexm(make, "[0-9]")
list observations where make matches the regular
expression (here, records that contain a number)
list if regexm(make, "(Cad\.|Chev\.|Datsun)")
return all observations where make contains
"Cad.", "Chev." or "Datsun" (the backslash makes the period literal)
list if inlist(word(make, 1), "Cad.", "Chev.", "Datsun")
return all observations where the first word of
make is one of the listed words
TRANSFORM STRINGS
display trim("   leading / trailing spaces   ")
remove extra spaces before and after a string
display stritrim(" Too   much   Space")
replace consecutive spaces with a single space
display strlower("STATA should not be ALL-CAPS")
change string case; see also strupper, strproper
display strtoname("1Var name")
convert string to Stata-compatible variable name
display regexr("My string", "My", "Your")
replace string1 ("My") with string2 ("Your")
replace make = subinstr(make, "Cad.", "Cadillac", 1)
replace first occurrence of "Cad." with Cadillac
in the make variable
display real("100")
convert string to a numeric or missing value
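Several of the string functions above are often chained when cleaning a text column. A minimal sketch, assuming the auto dataset; brand, hasDigit and makeLength are illustrative variable names.

```stata
sysuse auto, clear

* The first word of make is the brand abbreviation, e.g. "Cad." or "Chev."
gen brand = word(make, 1)

* Expand one abbreviation; the last argument limits it to the first occurrence
replace brand = subinstr(brand, "Cad.", "Cadillac", 1)

* Flag models whose name contains a digit
gen byte hasDigit = regexm(make, "[0-9]")

* Store the length of each make string
gen makeLength = length(make)

list make brand hasDigit makeLength in 1/10
```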
Combine Data
ADDING (APPENDING) NEW DATA
webuse coffeeMaize2.dta, clear
save coffeeMaize2.dta, replace
webuse coffeeMaize.dta, clear
load demo data
append using "coffeeMaize2.dta", gen(filenum)
add observations from "coffeeMaize2.dta" to
current data and create variable "filenum" to
track the origin of each observation
(both datasets should contain the same variables (columns))
MERGING TWO DATASETS TOGETHER
(both datasets must contain a common variable (id))
ONE-TO-ONE
webuse ind_age.dta, clear
save ind_age.dta, replace
webuse ind_ag.dta, clear
merge 1:1 id using "ind_age.dta"
one-to-one merge of "ind_age.dta"
into the loaded dataset and create
variable "_merge" to track the origin
MANY-TO-ONE
webuse hh2.dta, clear
save hh2.dta, replace
webuse ind2.dta, clear
merge m:1 hid using "hh2.dta"
many-to-one merge of "hh2.dta"
into the loaded dataset and create
variable "_merge" to track the origin
_merge code
1 (master): row only in ind2
2 (using): row only in hh2
3 (match): row in both
FUZZY MATCHING: COMBINING TWO DATASETS WITHOUT A COMMON ID
reclink
match records from different data sets using probabilistic matching
ssc install reclink
jarowinkler
create distance measure for similarity between two strings
ssc install jarowinkler
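After a merge, it is usually worth inspecting _merge before continuing. The sketch below does not use the demo files; instead it assumes the auto dataset and splits it into two pieces that share the key make. The tempfile name prices is hypothetical, and the block should be run together from a .do file.

```stata
* Build a "using" file: make plus price
sysuse auto, clear
keep make price
tempfile prices
save `prices'

* Build the "master" data: make plus mpg
sysuse auto, clear
keep make mpg

* make uniquely identifies rows in both pieces, so a 1:1 merge applies
merge 1:1 make using `prices'

* Check where each row came from: 1 = master only, 2 = using only, 3 = both
tabulate _merge

* Here every row should match; stop with an error if one does not
assert _merge == 3
drop _merge
```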
Reshape Data
webuse set https://github.com/GeoCenter/StataTraining/raw/master/Day2/Data
webuse "coffeeMaize.dta"
load demo dataset
TIDY DATASETS have each observation in its own row and each variable in its own column. When datasets are tidy, they have a consistent, standard format that is easier to manipulate and analyze.
WIDE: one row per country (Malawi, Rwanda, Uganda); columns country, coffee2011, coffee2012, maize2011, maize2012
LONG (TIDY): one row per country and year (2011, 2012); columns country, year, coffee, maize
melt converts wide to long; cast converts long to wide
MELT DATA (WIDE → LONG)
reshape long coffee@ maize@, i(country) j(year)
convert a wide dataset to long
coffee@ maize@: reshape variables starting with coffee and maize
i(country): unique id variable (key)
j(year): create new variable which captures the info in the column names
CAST DATA (LONG → WIDE)
reshape wide coffee maize, i(country) j(year)
convert a long dataset to wide
coffee maize: create new variables named coffee2011, maize2012...
i(country): what will be the unique id variable (key)
j(year): create new variables with the year added to the column name
xpose, clear varname
transpose rows and columns of data, clearing the data and saving
old column names as a new variable called "_varname"
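The melt and cast steps can be tried on a small hand-typed table. The numbers below are invented for illustration and are not taken from the coffeeMaize demo file; only the column layout follows the sheet.

```stata
clear

* A small wide dataset with made-up values
input str6 country coffee2011 coffee2012 maize2011 maize2012
"Malawi" 10 12 30 33
"Rwanda" 20 21 15 14
"Uganda" 40 44 25 27
end

* Melt: one row per country and year; year is built from the column suffix
reshape long coffee maize, i(country) j(year)
list, sepby(country)

* Cast: back to one row per country, with the year appended to each name
reshape wide coffee maize, i(country) j(year)
list
```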
Label Data
label list
list all labels within the dataset
label define myLabel 0 "US" 1 "Not US"
label values foreign myLabel
define a label and apply it to the values in foreign
Value labels map string descriptions to numbers. They allow the
underlying data to be numeric (making logical tests simpler)
while also connecting the values to human-understandable text.
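A short sketch of how a value label behaves once applied, assuming the auto dataset. The "text":labelname form used in the last line is Stata's way of referring to a value by its label.

```stata
sysuse auto, clear

label define myLabel 0 "US" 1 "Not US"
label values foreign myLabel

* Tables show the label text
tabulate foreign

* The stored values are still 0 and 1
tabulate foreign, nolabel

* Logical tests use the numbers
count if foreign == 1

* or look the number up from the label text
count if foreign == "Not US":myLabel
```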
Replace Parts of Data
CHANGE COLUMN NAMES
rename (rep78 foreign) (repairRecord carType)
rename one or multiple variables
CHANGE ROW VALUES
replace price = 5000 if price < 5000
replace all values of price that are less than $5,000 with 5000
recode price (0 / 5000 = 5000)
change all prices of $5,000 or less to $5,000
recode foreign (0 = 2 "US")(1 = 1 "Not US"), gen(foreign2)
change the values and value labels then store in a new
variable, foreign2
REPLACE MISSING VALUES
mvdecode _all, mv(9999)
replace the number 9999 with missing value in all variables;
useful for cleaning survey datasets
mvencode _all, mv(9999)
replace missing values with the number 9999 for all variables;
useful for exporting data
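The recode and missing-value commands can be compared side by side on copies of a variable. This is a sketch assuming the auto dataset; priceFloor and priceFloor2 are illustrative names, and min inside the recode rule stands for the smallest value present.

```stata
sysuse auto, clear

* Two ways to raise every price below 5000 up to 5000, stored in new variables
recode price (min/5000 = 5000), gen(priceFloor)
gen priceFloor2 = max(price, 5000)

* Both approaches should agree on every row
assert priceFloor == priceFloor2

* Encode missing rep78 as 9999, then turn the code back into missing
mvencode rep78, mv(9999)
count if rep78 == 9999
mvdecode rep78, mv(9999)
count if missing(rep78)
```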
Select Parts of Data (Subsetting)
FILTER SPECIFIC ROWS
drop if mpg < 20
drop observations based on a condition
drop in 1/4
drop rows 1-4
keep in 1/30
opposite of drop; keep only rows 1-30
keep if inlist(make, "Honda Accord", "Honda Civic", "Subaru")
keep the specified values of make
keep if inrange(price, 5000, 10000)
keep values of price between $5,000 – $10,000 (inclusive)
sample 25
sample 25% of the observations in the dataset
(use set seed # command for reproducible sampling)
SELECT SPECIFIC COLUMNS
drop make
remove the 'make' variable
keep make price
opposite of drop; keep only columns 'make' and 'price'
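Row and column filters can be combined into a reproducible subset. A minimal sketch assuming the auto dataset; the seed value is arbitrary, and preserve/restore (standard Stata commands not covered on this sheet) bring the full data back afterwards.

```stata
sysuse auto, clear

* Fix the random-number seed so sample draws the same rows on every run
set seed 12345

preserve

* Rows: mid-priced cars only; columns: three variables
keep if inrange(price, 5000, 10000)
keep make price mpg

* Keep a random 25% of the remaining rows
sample 25
list

* Bring back the full dataset
restore
```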
