This package is demonstrated with its example data. We will read a subset of the CASEN survey that covers the communes with commune-level representativeness in the Los Ríos Region. These data ship with the casen package for demonstration purposes. The complete surveys do not: they must be downloaded, and the package provides a function to do so.
library(casen)
casen_2017_los_rios
## # A tibble: 8,055 × 7
## expc varstrat varunit ytotcorh sexo comuna pobreza
## <dbl> <dbl> <dbl> <dbl> <dbl+lbl> <dbl+lbl> <dbl+lbl>
## 1 53 141283 14128301184 959765 2 [Mujer] 14101 [Valdivia] 3 [No pobres]
## 2 53 141283 14128301184 959765 2 [Mujer] 14101 [Valdivia] 3 [No pobres]
## 3 53 141283 14128301184 959765 2 [Mujer] 14101 [Valdivia] 3 [No pobres]
## 4 105 141283 14128301184 283273 2 [Mujer] 14101 [Valdivia] 2 [Pobres no…
## 5 105 141283 14128301184 283273 1 [Hombre] 14101 [Valdivia] 2 [Pobres no…
## 6 105 141283 14128301184 283273 2 [Mujer] 14101 [Valdivia] 2 [Pobres no…
## 7 105 141283 14128301184 353083 2 [Mujer] 14101 [Valdivia] 3 [No pobres]
## 8 105 141283 14128301184 260000 2 [Mujer] 14101 [Valdivia] 2 [Pobres no…
## 9 105 141283 14128301184 260000 2 [Mujer] 14101 [Valdivia] 2 [Pobres no…
## 10 105 141283 14128301184 260000 1 [Hombre] 14101 [Valdivia] 2 [Pobres no…
## # … with 8,045 more rows
## # ℹ Use `print(n = ...)` to see more rows
The CASEN survey can be downloaded from the website of the Ministry of Social Development (official) or from GitHub (unofficial).
No examples are given with the complete CASEN survey, because CRAN limits the time in seconds that an example may take.
# all available years, saved to the folder casen-formato-stata
descargar_casen_mds("casen-formato-stata")
# only the year 2017, saved to the folder casen-formato-stata
descargar_casen_mds(2017, "casen-formato-stata")
# read the CASEN survey in DTA (Stata) format
# (the download must first be extracted with WinRAR or a similar tool)
library(haven)
read_stata("casen-formato-stata/Casen 2017.dta")
Along the same lines as the previous section, the CASEN survey can also be downloaded in R format.
# all available years
descargar_casen_github("casen-formato-r")
# only the year 2017
descargar_casen_github(2017, "casen-formato-r")
# read the CASEN survey in RDS (R) format
readRDS("casen-formato-r/2017.rds")
The added value of the casen package is that it offers an easy way to compute estimates that account for the degrees of freedom in addition to the complex design.
In two-stage sampling, how to compute the degrees of freedom is an unresolved question, but the casen package uses the following definition from INE: “The degrees of freedom correspond to the difference between the number of non-empty clusters (where the variable of interest is present) and the number of non-empty strata”.
In small estimation domains the loss of degrees of freedom becomes quite relevant. This matters a great deal when estimates are grouped by region, by sex, by region and sex, and so on.
The svyby function from the survey package does not compute the degrees of freedom of each domain correctly; it uses a single fixed value for all domains. The functions in the casen package call svyby but add a series of programmed steps so that each estimation domain gets the appropriate degrees of freedom.
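The INE definition quoted above can be sketched in base R. This is only an illustration on made-up data, not the code the casen package runs internally: the helper `degrees_of_freedom()` and the toy values are assumptions, and only the column names `varstrat`, `varunit`, `sexo` and `ytotcorh` are borrowed from the example data.

```r
# Toy data (hypothetical): two strata, five clusters, one missing income value
datos <- data.frame(
  varstrat = c(1, 1, 1, 1, 2, 2, 2, 2),
  varunit  = c(11, 11, 12, 12, 21, 21, 22, 23),
  sexo     = c("Mujer", "Hombre", "Mujer", "Mujer",
               "Hombre", "Hombre", "Mujer", "Hombre"),
  ytotcorh = c(100, 200, NA, 150, 300, 250, 120, 180)
)

# Hypothetical helper: non-empty clusters minus non-empty strata,
# counting only rows where the variable of interest is present
degrees_of_freedom <- function(d, variable) {
  d <- d[!is.na(d[[variable]]), ]
  length(unique(d$varunit)) - length(unique(d$varstrat))
}

# One value for the whole sample
df_total <- degrees_of_freedom(datos, "ytotcorh")

# One value per estimation domain (here, each level of sexo)
df_por_dominio <- sapply(split(datos, datos$sexo), degrees_of_freedom,
                         variable = "ytotcorh")

df_total
df_por_dominio
```

In this toy case the whole sample has more degrees of freedom than either domain, which is the loss that a single fixed value for all domains would hide.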
# complex design built from the example data
cd <- configuracion_disenio(casen_2017_los_rios, "ytotcorh", c("comuna", "sexo"), "expc")
cd$disenio
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (67) clusters.
## Called via srvyr
## Sampling variables:
## - ids: varunit
## - strata: varstrat
## - weights: expc
## Data variables: ytotcorh (dbl), comuna (dbl+lbl), sexo (dbl+lbl), expc (dbl),
## varunit (dbl), varstrat (dbl)
cd$grupos
## # A tibble: 12 × 2
## comuna sexo
## <dbl+lbl> <dbl+lbl>
## 1 14101 [Valdivia] 2 [Mujer]
## 2 14101 [Valdivia] 1 [Hombre]
## 3 14104 [Los Lagos] 1 [Hombre]
## 4 14104 [Los Lagos] 2 [Mujer]
## 5 14107 [Paillaco] 1 [Hombre]
## 6 14107 [Paillaco] 2 [Mujer]
## 7 14108 [Panguipulli] 2 [Mujer]
## 8 14108 [Panguipulli] 1 [Hombre]
## 9 14201 [La Unión] 1 [Hombre]
## 10 14201 [La Unión] 2 [Mujer]
## 11 14204 [Río Bueno] 1 [Hombre]
## 12 14204 [Río Bueno] 2 [Mujer]
The estadistica_descriptiva() function accounts for the complex design and the degrees of freedom. You can group by one or more variables, and in addition to the mean the function returns the confidence interval.
# mean, median and 70th percentile
media_agrupada(cd)
## `mutate_if()` ignored the following grouping variables:
## • Column `comuna`
## # A tibble: 12 × 8
## # Groups: comuna_etiqueta [6]
## comuna_etiq…¹ sexo_…² comuna_co…³ sexo_…⁴ comuna media…⁵ media…⁶ media…⁷
## <dbl+lbl> <chr> <dbl+lbl> <chr> <dbl+lbl> <dbl> <dbl> <dbl>
## 1 14101 [Valdi… Mujer 14101 [Val… 2 14101 [Val… 1.26e6 1.03e6 1.50e6
## 2 14101 [Valdi… Hombre 14101 [Val… 1 14101 [Val… 1.35e6 1.07e6 1.62e6
## 3 14104 [Los L… Hombre 14104 [Los… 1 14104 [Los… 7.19e5 6.39e5 7.98e5
## 4 14104 [Los L… Mujer 14104 [Los… 2 14104 [Los… 7.31e5 6.40e5 8.21e5
## 5 14107 [Paill… Hombre 14107 [Pai… 1 14107 [Pai… 7.08e5 4.23e5 9.92e5
## 6 14107 [Paill… Mujer 14107 [Pai… 2 14107 [Pai… 6.89e5 4.52e5 9.26e5
## 7 14108 [Pangu… Mujer 14108 [Pan… 2 14108 [Pan… 6.31e5 5.38e5 7.24e5
## 8 14108 [Pangu… Hombre 14108 [Pan… 1 14108 [Pan… 6.85e5 6.29e5 7.41e5
## 9 14201 [La Un… Hombre 14201 [La … 1 14201 [La … 8.05e5 7.07e5 9.02e5
## 10 14201 [La Un… Mujer 14201 [La … 2 14201 [La … 7.52e5 6.64e5 8.39e5
## 11 14204 [Río B… Hombre 14204 [Río… 1 14204 [Río… 7.73e5 6.61e5 8.84e5
## 12 14204 [Río B… Mujer 14204 [Río… 2 14204 [Río… 7.14e5 6.08e5 8.20e5
## # … with abbreviated variable names ¹comuna_etiqueta, ²sexo_etiqueta,
## # ³comuna_codigo, ⁴sexo_codigo, ⁵media_ytotcorh, ⁶media_ytotcorh_inf,
## # ⁷media_ytotcorh_sup
mediana_agrupada(cd)
## `mutate_if()` ignored the following grouping variables:
## • Column `comuna`
## # A tibble: 12 × 8
## # Groups: comuna_etiqueta [6]
## comuna_etiq…¹ sexo_…² comuna_co…³ sexo_…⁴ comuna media…⁵ media…⁶ media…⁷
## <dbl+lbl> <chr> <dbl+lbl> <chr> <dbl+lbl> <dbl> <dbl> <dbl>
## 1 14101 [Valdi… Mujer 14101 [Val… 2 14101 [Val… 900000 794484 1005866
## 2 14101 [Valdi… Hombre 14101 [Val… 1 14101 [Val… 914646 810887 1047226
## 3 14104 [Los L… Hombre 14104 [Los… 1 14104 [Los… 613990 579550 645000
## 4 14104 [Los L… Mujer 14104 [Los… 2 14104 [Los… 635936 582273 665000
## 5 14107 [Paill… Hombre 14107 [Pai… 1 14107 [Pai… 553684 450000 710646
## 6 14107 [Paill… Mujer 14107 [Pai… 2 14107 [Pai… 532732 393687 996667
## 7 14108 [Pangu… Mujer 14108 [Pan… 2 14108 [Pan… 532472 531391 552156
## 8 14108 [Pangu… Hombre 14108 [Pan… 1 14108 [Pan… 560000 531391 632064
## 9 14201 [La Un… Hombre 14201 [La … 1 14201 [La … 624669 564167 699167
## 10 14201 [La Un… Mujer 14201 [La … 2 14201 [La … 573938 522321 698745
## 11 14204 [Río B… Hombre 14204 [Río… 1 14204 [Río… 707156 577642 757785
## 12 14204 [Río B… Mujer 14204 [Río… 2 14204 [Río… 631806 513536 730000
## # … with abbreviated variable names ¹comuna_etiqueta, ²sexo_etiqueta,
## # ³comuna_codigo, ⁴sexo_codigo, ⁵mediana_ytotcorh, ⁶mediana_ytotcorh_inf,
## # ⁷mediana_ytotcorh_sup
percentiles_agrupados(cd)
## `mutate_if()` ignored the following grouping variables:
## • Column `comuna`
## # A tibble: 12 × 8
## # Groups: comuna_etiqueta [6]
## percentil comuna_et…¹ sexo_…² comuna_co…³ sexo_…⁴ comuna media…⁵ media…⁶
## <dbl> <dbl+lbl> <chr> <dbl+lbl> <chr> <dbl+lbl> <dbl> <dbl>
## 1 0.7 14101 [Val… Mujer 14101 [Val… 2 14101 [Val… 1268480 101816.
## 2 0.7 14101 [Val… Hombre 14101 [Val… 1 14101 [Val… 1300000 100874.
## 3 0.7 14104 [Los… Hombre 14104 [Los… 1 14104 [Los… 853162 25455.
## 4 0.7 14104 [Los… Mujer 14104 [Los… 2 14104 [Los… 853162 50598.
## 5 0.7 14107 [Pai… Hombre 14107 [Pai… 1 14107 [Pai… 796676 215415.
## 6 0.7 14107 [Pai… Mujer 14107 [Pai… 2 14107 [Pai… 792246 224112.
## 7 0.7 14108 [Pan… Mujer 14108 [Pan… 2 14108 [Pan… 700000 57176.
## 8 0.7 14108 [Pan… Hombre 14108 [Pan… 1 14108 [Pan… 784167 40130.
## 9 0.7 14201 [La … Hombre 14201 [La … 1 14201 [La … 821331 24642.
## 10 0.7 14201 [La … Mujer 14201 [La … 2 14201 [La … 786368 58745.
## 11 0.7 14204 [Río… Hombre 14204 [Río… 1 14204 [Río… 836333 42714.
## 12 0.7 14204 [Río… Mujer 14204 [Río… 2 14204 [Río… 769578 56308.
## # … with abbreviated variable names ¹comuna_etiqueta, ²sexo_etiqueta,
## # ³comuna_codigo, ⁴sexo_codigo, ⁵mediana_ytotcorh, ⁶mediana_ytotcorh_err_est
The casen functions are designed to be used together with those of dplyr. A simple example is computing the percentage of poor households.
Let us look at the specific case of the mutate and filter functions from dplyr.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# convert pobreza to a binary variable
casen_2017_los_rios %>%
  mutate(pobreza = ifelse(pobreza <= 2, 1, 0)) %>%
  configuracion_disenio("pobreza", "comuna", "expc") %>%
  media_agrupada()
## # A tibble: 6 × 5
## comuna_etiqueta comuna_codigo media_pobreza media_pobreza_inf media_pobreza_…¹
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Valdivia 14101 0.0753 0.0513 0.0993
## 2 Los Lagos 14104 0.156 0.104 0.209
## 3 Paillaco 14107 0.209 -0.0754 0.494
## 4 Panguipulli 14108 0.133 0.0928 0.173
## 5 La Unión 14201 0.132 0.0656 0.198
## 6 Río Bueno 14204 0.170 0.0736 0.266
## # … with abbreviated variable name ¹media_pobreza_sup
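The recode works because the mean of a 0/1 indicator is a proportion. A minimal base R sketch of that idea follows, using made-up households and weights. It reproduces only the weighted point estimate, not the design-based confidence interval that media_agrupada() reports, and the data frame `hogares` is hypothetical.

```r
# Hypothetical toy data: pobreza codes 1 and 2 are treated as poor,
# following the recode used above; expc is the expansion factor
hogares <- data.frame(
  pobreza = c(1, 2, 3, 3, 3),
  expc    = c(10, 20, 30, 20, 20)
)

# Binary indicator: 1 = poor, 0 = not poor
hogares$pobre <- ifelse(hogares$pobreza <= 2, 1, 0)

# The weighted mean of the indicator is the weighted share of poor households:
# (10 + 20) / (10 + 20 + 30 + 20 + 20)
proporcion_pobres <- weighted.mean(hogares$pobre, hogares$expc)
proporcion_pobres
```

Multiplying the result by 100 gives the percentage.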
The casen package provides two functions to get the most out of R's modelling functions while accounting for the complex design and the degrees of freedom.
Model fit:
# model: ytotcorh = b0 + b1 comuna + b2 sexo + e
mod <- modelo_lineal_generalizado(cd, "ytotcorh ~ comuna + sexo")
summary(mod)
##
## Call:
## svyglm(formula = stats::as.formula(modelo), design = disenio$disenio)
##
## Survey design:
## Called via srvyr
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55655084.7 13884444.7 4.008 0.000192 ***
## comuna -3860.7 977.4 -3.950 0.000232 ***
## sexo -58713.5 23514.9 -2.497 0.015672 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.491064e+12)
##
## Number of Fisher Scoring iterations: 2
Beyond dplyr, casen works well with the Tidyverse packages and other related packages.
For example, the broom package produces a table with the confidence intervals for the model's estimated betas.
library(broom)
library(survey)
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
##
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
##
## dotchart
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
# we use ddf and degf from the survey package to perform the same
# calculation that Stata does
mod_conf <- confint_tidy(mod, ddf = degf(cd$disenio))
## Warning: confint_tidy is now deprecated and will be removed from a future
## release of broom. Please use the applicable confint method.
mod_conf
## # A tibble: 3 × 2
## conf.low conf.high
## <dbl> <dbl>
## 1 27830036. 83480134.
## 2 -5820. -1902.
## 3 -105838. -11589.
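To see why the degrees of freedom matter for these intervals, here is a minimal base R sketch. It assumes a Wald-type interval, that is, the estimate plus or minus a t quantile times the standard error; the helper `t_interval()` and the numbers passed to it are hypothetical and do not come from the model above.

```r
# Hypothetical helper: Wald-type confidence interval based on a t quantile
t_interval <- function(estimate, std_error, df, level = 0.95) {
  t_crit <- qt(1 - (1 - level) / 2, df = df)
  c(lower = estimate - t_crit * std_error,
    upper = estimate + t_crit * std_error)
}

# Same estimate and standard error, different degrees of freedom
ic_few_df  <- t_interval(10, 2, df = 5)
ic_many_df <- t_interval(10, 2, df = 500)

ic_few_df
ic_many_df
```

With few degrees of freedom the interval is noticeably wider, so using one large fixed value for every domain would understate the uncertainty in small domains.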
The output above can be tidied to present the betas together with their confidence intervals.
# tidy the model output
mod_betas <- tidy(mod)

# bind the columns
mod_betas <- bind_cols(mod_betas, mod_conf)

# clean the column names
mod_betas <- clean_names(mod_betas)

mod_betas
## # A tibble: 3 × 7
## term estimate std_error statistic p_value conf_low conf_high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 55655085. 13884445. 4.01 0.000192 27830036. 83480134.
## 2 comuna -3861. 977. -3.95 0.000232 -5820. -1902.
## 3 sexo -58714. 23515. -2.50 0.0157 -105838. -11589.
The casen package provides a dataset with the correspondence between commune codes.
Let us see how to convert the codes of the 1990 CASEN survey to SUBDERE 2017 codes.
codigos_casen %>%
  filter(valido_desde == 1990)
## # A tibble: 323 × 4
## # Groups: codigo_casen, codigo_subdere_2017 [323]
## valido_desde valido_hasta codigo_casen codigo_subdere_2017
## <dbl> <dbl> <int> <chr>
## 1 1990 2017 11201 11201
## 2 1990 2003 5605 05602
## 3 1990 2003 13504 13502
## 4 1990 2003 3304 03302
## 5 1990 2003 1307 01107
## 6 1990 2003 10402 10202
## 7 1990 2003 4104 04103
## 8 1990 2003 9101 09201
## 9 1990 2003 2301 02101
## 10 1990 2003 8204 08302
## # … with 313 more rows
## # ℹ Use `print(n = ...)` to see more rows
# read_sav() comes from the haven package
casen1990 <- read_sav("casen-formato-spss/1990.sav") %>%
  mutate(comu = as.integer(comu)) %>%
  left_join(
    codigos_casen %>% filter(valido_desde == 1990) %>% select(starts_with("codigo")),
    by = c("comu" = "codigo_casen")
  )
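The join above is a lookup from old survey codes to SUBDERE 2017 codes. The following base R sketch shows the same idea on a few code pairs taken from the table printed earlier. The `encuesta` data frame and the code 99999 are made up for illustration, and `match()` stands in for `left_join()` only to keep the example free of dependencies.

```r
# A few code pairs copied from the codigos_casen rows shown above
codigos <- data.frame(
  codigo_casen        = c(5605L, 13504L, 3304L),
  codigo_subdere_2017 = c("05602", "13502", "03302"),
  stringsAsFactors    = FALSE
)

# Hypothetical survey extract; 99999 is an invented code with no match
encuesta <- data.frame(
  comu = c("13504", "5605", "5605", "99999"),
  stringsAsFactors = FALSE
)

# The key must have the same type on both sides, hence as.integer()
encuesta$comu <- as.integer(encuesta$comu)

# Look up each survey code; unmatched codes become NA, as in a left join
posicion <- match(encuesta$comu, codigos$codigo_casen)
encuesta$codigo_subdere_2017 <- codigos$codigo_subdere_2017[posicion]

encuesta
```

Keeping the SUBDERE codes as character preserves leading zeros such as "05602".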