Let’s assume that you have information about individual people who enter and leave different facilities. Now let’s assume that you want to compute the avearge daily population for each facility in your dataset. This is a tricky problem because we only have information about which facility the person was at, when they entered and when they departed. We need to convert this individual-level data into facility-level data. This is a common problem with administrative data cleaning.

How do you compute the daily population for each facility? How do you count the number of individuals in custody at each facility for everyday?

Here is some example data.

*Example generated by -dataex-...to install: ssc install dataex
clear
input int(bookin bookout) long facility
19061 19541 75
19113 19536 36
19143 19661 49
19199 19675 72
19222 19670 36
19278 19766 57
19279 19556 36
19311 19666 85
19334 19542 49
19337 19612 15
19340 19682 34
19347 19606 87
19385 19517 33
19418 19535 57
19419 19661 30
19423 19606 36
19426 19600 1
19428 19536 36
19430 19677 72
19436 19561 112
19443 19666 85
19444 19694 55
19444 19564 112
19447 19675 74
19449 19575 49
end

format %tdnn/dd/CCYY bookin
format %tdnn/dd/CCYY bookout
label values facility facility_destrung
label def facility_destrung 1 "FAC1", modify
label def facility_destrung 15 "FAC2", modify
label def facility_destrung 30 "FAC3", modify
label def facility_destrung 33 "FAC4", modify
label def facility_destrung 34 "FAC5", modify
label def facility_destrung 36 "FAC6", modify
label def facility_destrung 49 "FAC7", modify
label def facility_destrung 55 "FAC7", modify
label def facility_destrung 57 "FAC8", modify
label def facility_destrung 72 "FAC9", modify
label def facility_destrung 74 "FAC10", modify
label def facility_destrung 75 "FAC11", modify
label def facility_destrung 85 "FAC12", modify
label def facility_destrung 87 "FAC13", modify
label def facility_destrung 112 "FAC14", modify
label var bookin "Book-in Date"
label var bookout "Book-out Date"
label var facility "Facility Number"

And here is a solution.

gen long obs_no = _n
reshape long book, i(obs_no) j(movement) string
gen change = cond(movement == "in", 1, -1)
rename book date
collapse (sum) change, by(facility date)
by facility (date), sort: gen census = sum(change)
//    IDENTIFY YEARLY AVERAGES
gen year = yofd(date)
//    ADD A 1/1 TO THE BEGINNING AND A 12/31 TO THE END OF EACH YEAR
gen expander = 1
by facility year (date), sort: replace expander = expander + 1 if _n == 1 & doy(date) != 1
by facility year (date): replace expander = expander + 1 if _n == _N & date != mdy(12, 31, year)
expand expander
by facility year (date), sort: replace date = mdy(1, 1, year) if _n == 1 & expander > 1
by facility year (date), sort: replace date = mdy(12, 31, year) if _n == _N & expander > 1
by facility year (date), sort: assert date[_N] == mdy(12, 31, year) & date[1] == mdy(1, 1, year)
//    SET CENSUS AS OF 1/1 TO CENSUS AT END OF PRECEDING YEAR, UNLESS THERE WAS AN ACTUAL
//    CENSUS CALCULATION FOR THAT DATE
by facility (date), sort: replace census = census[_n-1] if date == mdy(1, 1, year) & expander > 1 & _n > 1
//    SET CENSUS AS OF 1/1 TO 0 IF THERE WASS NO ACTUAL CENSUS CALCULATION FOR THAT DATE
//    AND THERE IS NO EARLIER YEAR'S DATA
by facility (date): replace census = 0 if date == mdy(1, 1, year) & expander > 1 & _n == 1

//    NOW WE CAN CALCULATE YEARLY AVERAGES
by facility year (date), sort: gen interval = date[_n+1] - date
by facility year (date): replace interval = 1 if date == mdy(12, 31, year)
by facility year: egen person_days = total(interval*census)
by facility year: egen days_reckoned = total(interval)
gen yearly_average_census = person_days/days_reckoned

Thanks for reading–I hope that this was helpful. I would like to thank the experts on the Statalist for help developing a solution to this problem.