Package 'statar' reference manual

Title:	Tools Inspired by 'Stata' to Manipulate Tabular Data
Description:	A set of tools inspired by 'Stata' to explore data.frames ('summarize', 'tabulate', 'xtile', 'pctile', 'binscatter', elapsed quarters/month, lead/lag).
Authors:	Matthieu Gomez [aut, cre]
Maintainer:	Matthieu Gomez <[email protected]>
License:	GPL-2
Version:	0.7.6
Built:	2025-03-11 03:30:23 UTC
Source:	https://github.com/matthieugomez/statar

Elapsed dates (monthly, quarterly)

Description

Elapsed dates (monthly, quarterly)

Usage

as.quarterly(x)

is.quarterly(x)

as.monthly(x)

is.monthly(x)
as.quarterly(x)

is.quarterly(x)

as.monthly(x)

is.monthly(x)

Arguments

x

a vector

Details

Monthly and quarterly dates are stored as integers, representing the number of elapsed calendar periods since 01/01/1970. As yearmonth and yearqtr the package zoo, these dates are printed in a way that fits their frequency (YYYqq, YYYmMM). The only difference is that, monthly, and quarterly are integers, which removes issues due to floating points (particularly important when merging). This also allows to use arithmetic on perios, ie date + 1 adds one period rather than one day.

Methods to convert from and to Dates or POSIXlt are provided. In particular, you may use lubridate week month and year to extract information from elapsed dates.

Examples

library(lubridate)
library(dplyr)
date <- mdy(c("04/03/1992", "01/04/1992", "03/15/1992"))  
datem <- as.monthly(date)
is.monthly(datem)
as.quarterly(date)
as.character(datem)
datem + 1
df <- tibble(datem)
# filter(df, month(datem) == 1)
seq(datem[1], datem[2])
as.Date(datem)
as.POSIXlt(datem)
as.POSIXct(datem)
week(datem)
library(lubridate)
library(dplyr)
date <- mdy(c("04/03/1992", "01/04/1992", "03/15/1992"))  
datem <- as.monthly(date)
is.monthly(datem)
as.quarterly(date)
as.character(datem)
datem + 1
df <- tibble(datem)
# filter(df, month(datem) == 1)
seq(datem[1], datem[2])
as.Date(datem)
as.POSIXlt(datem)
as.POSIXct(datem)
week(datem)

Add rows corresponding to gaps in some variable

Description

Add rows corresponding to gaps in some variable

Usage

fill_gap(
  x,
  ...,
  full = FALSE,
  roll = FALSE,
  rollends = if (roll == "nearest") c(TRUE, TRUE) else if (roll >= 0) c(FALSE, TRUE) else
    c(TRUE, FALSE)
)
fill_gap(
  x,
  ...,
  full = FALSE,
  roll = FALSE,
  rollends = if (roll == "nearest") c(TRUE, TRUE) else if (roll >= 0) c(FALSE, TRUE) else
    c(TRUE, FALSE)
)

Arguments

`x`	A data frame
`...`	a time variable
`full`	A boolean. When full = FALSE (default), the function creates rows corresponding to all missing times between the min and max of `...` within each group. When full = TRUE, the function creates rows corresponding to all missing times between the min and max of `...` in the whole dataset.
`roll`	When roll is a positive number, values are carried forward. roll=TRUE is equivalent to roll=+Inf. When roll is a negative number, values are rolled backwards; i.e., next observation carried backwards (NOCB). Use -Inf for unlimited roll back. When roll is "nearest", the nearest value is used. Default to FALSE (no rolling)
`rollends`	A logical vector length 2 (a single logical is recycled). When rolling forward (e.g. roll=TRUE) if a value is past the last observation within each group defined by the join columns, rollends[2]=TRUE will roll the last value forwards. rollends[1]=TRUE will roll the first value backwards if the value is before it. If rollends=FALSE the value of i must fall in a gap in x but not after the end or before the beginning of the data, for that group defined by all but the last join column. When roll is a finite number, that limit is also applied when rolling the end

Examples

library(dplyr)
library(lubridate)
df <- tibble(
    id    = c(1, 1, 1, 1),
    datem  = as.monthly(mdy(c("01/01/1992", "02/01/1992", "04/01/1992", "7/11/1992"))),
    value = c(4.1, 4.5, 3.3, 3.2)
)
df %>% group_by(id) %>% fill_gap(datem)
df %>% group_by(id) %>% fill_gap(datem, roll = 1)
df %>% group_by(id) %>% fill_gap(datem, roll = "nearest")
df %>% group_by(id) %>% fill_gap(datem, roll = "nearest", full = TRUE)
library(dplyr)
library(lubridate)
df <- tibble(
    id    = c(1, 1, 1, 1),
    datem  = as.monthly(mdy(c("01/01/1992", "02/01/1992", "04/01/1992", "7/11/1992"))),
    value = c(4.1, 4.5, 3.3, 3.2)
)
df %>% group_by(id) %>% fill_gap(datem)
df %>% group_by(id) %>% fill_gap(datem, roll = 1)
df %>% group_by(id) %>% fill_gap(datem, roll = "nearest")
df %>% group_by(id) %>% fill_gap(datem, roll = "nearest", full = TRUE)

Check whether a data.frame is a panel

Description

Check whether a data.frame is a panel

Usage

is.panel(x, ...)
is.panel(x, ...)

Arguments

`x`	a data frame
`...`	a time variable

Value

The function is.panel check that there are no duplicate combinations of the variables in ... and that no observation is missing for the last variable in ... (the time variable).

Examples

library(dplyr)
df <- tibble(
    id1    = c(1, 1, 1, 2, 2),
    id2   = 1:5,
    year  = c(1991, 1993, NA, 1992, 1992),
    value = c(4.1, 4.5, 3.3, 3.2, 5.2)
)
df %>% group_by(id1) %>% is.panel(year)
df1 <- df %>% filter(!is.na(year))
df1 %>% is.panel(year)
df1 %>% group_by(id1) %>% is.panel(year)
df1 %>% group_by(id1, id2) %>% is.panel(year)
library(dplyr)
df <- tibble(
    id1    = c(1, 1, 1, 2, 2),
    id2   = 1:5,
    year  = c(1991, 1993, NA, 1992, 1992),
    value = c(4.1, 4.5, 3.3, 3.2, 5.2)
)
df %>% group_by(id1) %>% is.panel(year)
df1 <- df %>% filter(!is.na(year))
df1 %>% is.panel(year)
df1 %>% group_by(id1) %>% is.panel(year)
df1 %>% group_by(id1, id2) %>% is.panel(year)

Join two data frames together

Description

Join two data frames together

Usage

join(
  x,
  y,
  kind,
  on = intersect(names(x), names(y)),
  suffixes = c(".x", ".y"),
  check = m ~ m,
  gen = FALSE,
  inplace = FALSE,
  update = FALSE,
  type
)
join(
  x,
  y,
  kind,
  on = intersect(names(x), names(y)),
  suffixes = c(".x", ".y"),
  check = m ~ m,
  gen = FALSE,
  inplace = FALSE,
  update = FALSE,
  type
)

Arguments

`x`	The master data.frame
`y`	The using data.frame
`kind`	The kind of (SQL) join among "full" (default), "left", "right", "inner", "semi", "anti" and "cross".
`on`	Character vectors specifying variables to match on. Default to common names between x and y.
`suffixes`	A character vector of length 2 specifying suffix of overlapping columns. Defaut to ".x" and ".y".
`check`	A formula checking for the presence of duplicates. Specifying 1~m (resp m~1, 1~1) checks that joined variables uniquely identify observations in x (resp y, both).
`gen`	Name of new variable to mark result, or the boolean FALSE (default) if no such variable should be created. The variable equals 1 for rows in master only, 2 for rows in using only, 3 for matched rows.
`inplace`	A boolean. In case "kind"= "left" and RHS of check is 1, the merge can be one in-place.
`update`	A boolean. For common variables in x and y not specified in "on", replace missing observations by the non missing observations in y.
`type`	Deprecated

Value

A data.frame that joins rows in master and using datases. Importantly, if x or y are not keyed, the join may change their row orders.

Examples

library(dplyr)
x <- data.frame(a = rep(1:2, each = 3), b=1:6)
y <- data.frame(a = 0:1, bb = 10:11)
join(x, y, kind = "full")
join(x, y, kind = "left", gen = "_merge")
join(x, y, kind = "right", gen = "_merge")
join(x, y, kind = "inner", check = m~1)
join(x, y, kind = "semi")
join(x, y, kind = "anti")
y <- rename(y, b = bb)
join(x, y, kind = "full", on = "a")
join(x, y, kind = "full", on = "a", suffixes = c("",".i"))
y <- data.frame(a = 0:1, bb = 10:11)
join(x, y, kind = "left", check = m~1)
x <- data.frame(a = c(1,2), b=c(NA, 2))
y <- data.frame(a = c(1,2), b = 10:11)
join(x, y, kind = "left", on = "a",  update = TRUE)
join(x, y, kind = "left", on = "a", check = m~1,  update = TRUE)
library(dplyr)
x <- data.frame(a = rep(1:2, each = 3), b=1:6)
y <- data.frame(a = 0:1, bb = 10:11)
join(x, y, kind = "full")
join(x, y, kind = "left", gen = "_merge")
join(x, y, kind = "right", gen = "_merge")
join(x, y, kind = "inner", check = m~1)
join(x, y, kind = "semi")
join(x, y, kind = "anti")
y <- rename(y, b = bb)
join(x, y, kind = "full", on = "a")
join(x, y, kind = "full", on = "a", suffixes = c("",".i"))
y <- data.frame(a = 0:1, bb = 10:11)
join(x, y, kind = "left", check = m~1)
x <- data.frame(a = c(1,2), b=c(NA, 2))
y <- data.frame(a = c(1,2), b = 10:11)
join(x, y, kind = "left", on = "a",  update = TRUE)
join(x, y, kind = "left", on = "a", check = m~1,  update = TRUE)

Count number of non missing observations

Description

Count number of non missing observations

Usage

n_narm(...)
n_narm(...)

Arguments

...

a sequence of vectors, matrices and data frames.

Examples

                         
n_narm(1:100, c(NA, 1:99))
n_narm(1:100, c(NA, 1:99))

Weighted quantile of type 2 (similar to Stata _pctile)

Description

Weighted quantile of type 2 (similar to Stata _pctile)

Usage

pctile(x, probs = c(0.25, 0.5, 0.75), wt = NULL, na.rm = FALSE)
pctile(x, probs = c(0.25, 0.5, 0.75), wt = NULL, na.rm = FALSE)

Arguments

`x`	A vector
`probs`	A vector of probabilities
`wt`	A weight vector
`na.rm`	Should missing values be returned?

Plot the mean of y over the mean of x within bins of x.

Description

Plot the mean of y over the mean of x within bins of x.

Usage

stat_binmean(
  mapping = NULL,
  data = NULL,
  geom = "point",
  position = "identity",
  show.legend = NA,
  inherit.aes = TRUE,
  na.rm = FALSE,
  n = 20,
  ...
)
stat_binmean(
  mapping = NULL,
  data = NULL,
  geom = "point",
  position = "identity",
  show.legend = NA,
  inherit.aes = TRUE,
  na.rm = FALSE,
  n = 20,
  ...
)

Arguments

`mapping`	Set of aesthetic mappings created by `aes()`. If specified and `inherit.aes = TRUE` (the default), it is combined with the default mapping at the top level of the plot. You must supply `mapping` if there is no plot mapping.
`data`	The data to be displayed in this layer. There are three options: If `NULL`, the default, the data is inherited from the plot data as specified in the call to `ggplot()`. A `data.frame`, or other object, will override the plot data. All objects will be fortified to produce a data frame. See `fortify()` for which variables will be created. A `function` will be called with a single argument, the plot data. The return value must be a `data.frame`, and will be used as the layer data. A `function` can be created from a `formula` (e.g. `~ head(.x, 10)`).
`geom`	The geometric object to use to display the data, either as a `ggproto` `Geom` subclass or as a string naming the geom stripped of the `geom_` prefix (e.g. `"point"` rather than `"geom_point"`)
`position`	Position adjustment, either as a string naming the adjustment (e.g. `"jitter"` to use `position_jitter`), or the result of a call to a position adjustment function. Use the latter if you need to change the settings of the adjustment.
`show.legend`	logical. Should this layer be included in the legends? `NA`, the default, includes if any aesthetics are mapped. `FALSE` never includes, and `TRUE` always includes. It can also be a named logical vector to finely select the aesthetics to display.
`inherit.aes`	If `FALSE`, overrides the default aesthetics, rather than combining with them. This is most useful for helper functions that define both data and aesthetics and shouldn't inherit behaviour from the default plot specification, e.g. `borders()`.
`na.rm`	If `FALSE` (the default), removes missing values with a warning. If `TRUE` silently removes missing values.
`n`	number of x-bins. Default to 20. Set to zero if you want to use distinct value of x for grouping.
`...`	Other arguments passed on to `layer()`. These are often aesthetics, used to set an aesthetic to a fixed value, like `colour = "red"` or `size = 3`. They may also be parameters to the paired geom/stat.

Value

a data.frame with additional columns:

`xtile`	bins for x
`x`	mean of x
`y`	mean of y

Examples

library(ggplot2)
g <- ggplot(iris, aes(x = Sepal.Width , y = Sepal.Length)) + stat_binmean(n = 10)
g + stat_smooth(method = "lm", se = FALSE)
ggplot(iris, aes(x = Sepal.Width , y = Sepal.Length, color = Species)) + stat_binmean(n = 10)
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, weight = Petal.Length)) + stat_binmean(n = 10)
library(ggplot2)
g <- ggplot(iris, aes(x = Sepal.Width , y = Sepal.Length)) + stat_binmean(n = 10)
g + stat_smooth(method = "lm", se = FALSE)
ggplot(iris, aes(x = Sepal.Width , y = Sepal.Length, color = Species)) + stat_binmean(n = 10)
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, weight = Petal.Length)) + stat_binmean(n = 10)

A package for applied research

Description

A package for applied research

Gives summary statistics (corresponds to Stata command summarize)

Description

Gives summary statistics (corresponds to Stata command summarize)

Usage

sum_up(df, ..., d = FALSE, wt = NULL)
sum_up(df, ..., d = FALSE, wt = NULL)

Arguments

`df`	a data.frame
`...`	Variables to include. Defaults to all non-grouping variables. See the select documentation.
`d`	Should detailed summary statistics be printed?
`wt`	Weights. Default to NULL.

Value

a data.frame

Examples

library(dplyr)
N <- 100
df <- tibble(
  id = 1:N,
  v1 = sample(5, N, TRUE),
  v2 = sample(1e6, N, TRUE)
)
sum_up(df)
sum_up(df, v2, d = TRUE)
sum_up(df, v2, wt = v1)
df %>% group_by(v1) %>% sum_up(starts_with("v"))
library(dplyr)
N <- 100
df <- tibble(
  id = 1:N,
  v1 = sample(5, N, TRUE),
  v2 = sample(1e6, N, TRUE)
)
sum_up(df)
sum_up(df, v2, d = TRUE)
sum_up(df, v2, wt = v1)
df %>% group_by(v1) %>% sum_up(starts_with("v"))

Returns cross tabulation

Description

Returns cross tabulation

Usage

tab(x, ..., wt = NULL, na.rm = FALSE, sort = TRUE)
tab(x, ..., wt = NULL, na.rm = FALSE, sort = TRUE)

Arguments

`x`	a vector or a data.frame
`...`	Variable(s) to include. If length is two, a special cross tabulation table is printed although a long data.frame is always (invisibly) returned.
`wt`	Frequency weights. Default to NULL.
`na.rm`	Remove missing values. Default to FALSE
`sort`	Boolean. Default to TRUE

Value

a data.frame sorted by variables in ..., and with columns "Freq.", "Percent", and "Cum." for counts.

Examples

# setup
library(dplyr)
N <- 1e2 ; K = 10
df <- tibble(
  id = sample(c(NA,1:5), N/K, TRUE),
  v1 =  sample(1:5, N/K, TRUE)                       
)
# one-way tabulation
df %>% tab(id)
df %>% tab(id, wt = v1)
# two-way tabulation
df %>% tab(id, v1)
df %>% filter(id >= 3) %>% tab(id)
# setup
library(dplyr)
N <- 1e2 ; K = 10
df <- tibble(
  id = sample(c(NA,1:5), N/K, TRUE),
  v1 =  sample(1:5, N/K, TRUE)                       
)
# one-way tabulation
df %>% tab(id)
df %>% tab(id, wt = v1)
# two-way tabulation
df %>% tab(id, v1)
df %>% filter(id >= 3) %>% tab(id)

Create unique names within a list, a data.frame, or an environment

Description

Create unique names within a list, a data.frame, or an environment

Usage

tempname(where = globalenv(), n = 1, prefix = ".temp", inherits = TRUE)
tempname(where = globalenv(), n = 1, prefix = ".temp", inherits = TRUE)

Arguments

`where`	A chracter vector, list or an environment
`n`	An integar that specifies length of the output
`prefix`	A character vector that specifies prefix for new name
`inherits`	Should the name unique also in the enclosing frames of the environment?

Examples

tempname(c("temp1", "temp3"), 4)
tempname(globalenv())
tempname(data.frame(temp = 1), n = 3)
tempname(c("temp1", "temp3"), 4)
tempname(globalenv())
tempname(data.frame(temp = 1), n = 3)

lead and lag with respect to a time variable

Description

lead and lag with respect to a time variable

Usage

tlead(x, n = 1L, time, default = NA)

tlag(x, n = 1L, time, default = NA)
tlead(x, n = 1L, time, default = NA)

tlag(x, n = 1L, time, default = NA)

Arguments

`x`	a vector of values
`n`	a positive integer of length 1, giving the number of positions to lead or lag by. When the package lubridate is loaded, it can be a period when using with time (see the lubridate function minutes, hours, days, weeks, months and years)
`time`	time variable
`default`	value used for non-existant rows. Defaults to `NA`.

Examples

date <- c(1989, 1991, 1992)
value <- c(4.1, 4.5, 3.3)
tlag(value, 1, time = date) #  returns value in year - 1
library(lubridate)
date <- as.monthly(mdy(c("01/04/1992", "03/15/1992", "04/03/1992")))
tlag(value, time = date) 
library(dplyr)
df <- tibble(
   id    = c(1, 2, 2),
   date  = date,
   value = value
)
df %>% group_by(id) %>% mutate(valuel = tlag(value, n = 1, time = date))
date <- c(1989, 1991, 1992)
value <- c(4.1, 4.5, 3.3)
tlag(value, 1, time = date) #  returns value in year - 1
library(lubridate)
date <- as.monthly(mdy(c("01/04/1992", "03/15/1992", "04/03/1992")))
tlag(value, time = date) 
library(dplyr)
df <- tibble(
   id    = c(1, 2, 2),
   date  = date,
   value = value
)
df %>% group_by(id) %>% mutate(valuel = tlag(value, n = 1, time = date))

Winsorize a numeric vector

Description

Winsorize a numeric vector

Usage

winsorize(
  x,
  probs = NULL,
  cutpoints = NULL,
  replace = c(cutpoints[1], cutpoints[2]),
  verbose = TRUE
)

winsorise(
  x,
  probs = NULL,
  cutpoints = NULL,
  replace = c(cutpoints[1], cutpoints[2]),
  verbose = TRUE
)
winsorize(
  x,
  probs = NULL,
  cutpoints = NULL,
  replace = c(cutpoints[1], cutpoints[2]),
  verbose = TRUE
)

winsorise(
  x,
  probs = NULL,
  cutpoints = NULL,
  replace = c(cutpoints[1], cutpoints[2]),
  verbose = TRUE
)

Arguments

`x`	A vector of values
`probs`	A vector of probabilities that can be used instead of cutpoints. Quantiles are computed as the inverse of the empirical distribution function (type = 1)
`cutpoints`	Cutpoints under and above which are defined outliers. Default is (median - five times interquartile range, median + five times interquartile range). Compared to bottom and top percentile, this takes into account the whole distribution of the vector.
`replace`	Values by which outliers are replaced. Default to cutpoints. A frequent alternative is NA.
`verbose`	Boolean. Should the percentage of replaced values printed?

Examples

                         
v <- c(1:4, 99)
winsorize(v)
winsorize(v, replace = NA)
winsorize(v, probs = c(0.01, 0.99))
winsorize(v, cutpoints = c(1, 50))
v <- c(1:4, 99)
winsorize(v)
winsorize(v, replace = NA)
winsorize(v, probs = c(0.01, 0.99))
winsorize(v, cutpoints = c(1, 50))

Bin variable in groups (similar to Stata xtile)

Description

Bin variable in groups (similar to Stata xtile)

Usage

xtile(x, n = NULL, probs = NULL, cutpoints = NULL, wt = NULL)
xtile(x, n = NULL, probs = NULL, cutpoints = NULL, wt = NULL)

Arguments

`x`	A vector
`n`	A numeric specifying number of quantiles. Can be used instead of cutpoints
`probs`	A vector of probabilities that an be used instead of cutpoints. Quantiles are computed as the inverse of the empirical distribution function (type = 1)
`cutpoints`	Cutpoints to use when `nq` is not specified. For instance `cutpoints = 0.4` creates two groups, one for observations equal or below 0.4, one for observations superior to 0.4.
`wt`	A variable specifying weight in case the option n_quantiles is specified.

Value

An integer vector representing groups corresponding to cutpoints. Includes missing values when present in the original vector.

Examples

x <- c(NA, 1:10)                   
xtile(x, n = 3) # 3 groups based on terciles
xtile(x, probs = c(0.3, 0.7)) # 3 groups based on two quantiles
xtile(x, cutpoints = c(2, 3)) # 3 groups based on two cutpoints
x <- c(NA, 1:10)                   
xtile(x, n = 3) # 3 groups based on terciles
xtile(x, probs = c(0.3, 0.7)) # 3 groups based on two quantiles
xtile(x, cutpoints = c(2, 3)) # 3 groups based on two cutpoints

Package 'statar'

Help Index

Elapsed dates (monthly, quarterly)

Description

Usage

Arguments

Details

Examples

Add rows corresponding to gaps in some variable

Description

Usage

Arguments

Examples

Check whether a data.frame is a panel

Description

Usage

Arguments

Value

Examples

Join two data frames together

Description

Usage

Arguments

Value

Examples

Count number of non missing observations

Description

Usage

Arguments

Examples

Weighted quantile of type 2 (similar to Stata _pctile)

Description

Usage

Arguments

Plot the mean of y over the mean of x within bins of x.

Description

Usage

Arguments

Value

Examples

A package for applied research

Description

Gives summary statistics (corresponds to Stata command summarize)

Description

Usage

Arguments

Value

Examples

Returns cross tabulation

Description

Usage

Arguments

Value

Examples

Create unique names within a list, a data.frame, or an environment

Description

Usage

Arguments

Examples

lead and lag with respect to a time variable

Description

Usage

Arguments

Examples

Winsorize a numeric vector

Description

Usage

Arguments

Examples

Bin variable in groups (similar to Stata xtile)

Description

Usage

Arguments

Value

Examples