Title: | Tools Inspired by 'Stata' to Manipulate Tabular Data |
---|---|
Description: | A set of tools inspired by 'Stata' to explore data.frames ('summarize', 'tabulate', 'xtile', 'pctile', 'binscatter', elapsed quarters/month, lead/lag). |
Authors: | Matthieu Gomez [aut, cre] |
Maintainer: | Matthieu Gomez <[email protected]> |
License: | GPL-2 |
Version: | 0.7.6 |
Built: | 2024-11-11 03:44:24 UTC |
Source: | https://github.com/matthieugomez/statar |
Elapsed dates (monthly, quarterly)
as.quarterly(x) is.quarterly(x) as.monthly(x) is.monthly(x)
as.quarterly(x) is.quarterly(x) as.monthly(x) is.monthly(x)
x |
a vector |
Monthly and quarterly dates are stored as integers, representing the number of elapsed calendar periods since 01/01/1970. As yearmonth
and yearqtr
the package zoo
, these dates are printed in a way that fits their frequency (YYY
qq
, YYY
mMM
). The only difference is that, monthly
, and quarterly
are integers, which removes issues due to floating points (particularly important when merging). This also allows to use arithmetic on perios, ie date
+ 1 adds one period rather than one day.
Methods to convert from and to Dates or POSIXlt are provided. In particular, you may use lubridate week
month
and year
to extract information from elapsed dates.
library(lubridate) library(dplyr) date <- mdy(c("04/03/1992", "01/04/1992", "03/15/1992")) datem <- as.monthly(date) is.monthly(datem) as.quarterly(date) as.character(datem) datem + 1 df <- tibble(datem) # filter(df, month(datem) == 1) seq(datem[1], datem[2]) as.Date(datem) as.POSIXlt(datem) as.POSIXct(datem) week(datem)
library(lubridate) library(dplyr) date <- mdy(c("04/03/1992", "01/04/1992", "03/15/1992")) datem <- as.monthly(date) is.monthly(datem) as.quarterly(date) as.character(datem) datem + 1 df <- tibble(datem) # filter(df, month(datem) == 1) seq(datem[1], datem[2]) as.Date(datem) as.POSIXlt(datem) as.POSIXct(datem) week(datem)
Add rows corresponding to gaps in some variable
fill_gap( x, ..., full = FALSE, roll = FALSE, rollends = if (roll == "nearest") c(TRUE, TRUE) else if (roll >= 0) c(FALSE, TRUE) else c(TRUE, FALSE) )
fill_gap( x, ..., full = FALSE, roll = FALSE, rollends = if (roll == "nearest") c(TRUE, TRUE) else if (roll >= 0) c(FALSE, TRUE) else c(TRUE, FALSE) )
x |
A data frame |
... |
a time variable |
full |
A boolean. When full = FALSE (default), the function creates rows corresponding to all missing times between the min and max of |
roll |
When roll is a positive number, values are carried forward. roll=TRUE is equivalent to roll=+Inf. When roll is a negative number, values are rolled backwards; i.e., next observation carried backwards (NOCB). Use -Inf for unlimited roll back. When roll is "nearest", the nearest value is used. Default to FALSE (no rolling) |
rollends |
A logical vector length 2 (a single logical is recycled). When rolling forward (e.g. roll=TRUE) if a value is past the last observation within each group defined by the join columns, rollends[2]=TRUE will roll the last value forwards. rollends[1]=TRUE will roll the first value backwards if the value is before it. If rollends=FALSE the value of i must fall in a gap in x but not after the end or before the beginning of the data, for that group defined by all but the last join column. When roll is a finite number, that limit is also applied when rolling the end |
library(dplyr) library(lubridate) df <- tibble( id = c(1, 1, 1, 1), datem = as.monthly(mdy(c("01/01/1992", "02/01/1992", "04/01/1992", "7/11/1992"))), value = c(4.1, 4.5, 3.3, 3.2) ) df %>% group_by(id) %>% fill_gap(datem) df %>% group_by(id) %>% fill_gap(datem, roll = 1) df %>% group_by(id) %>% fill_gap(datem, roll = "nearest") df %>% group_by(id) %>% fill_gap(datem, roll = "nearest", full = TRUE)
library(dplyr) library(lubridate) df <- tibble( id = c(1, 1, 1, 1), datem = as.monthly(mdy(c("01/01/1992", "02/01/1992", "04/01/1992", "7/11/1992"))), value = c(4.1, 4.5, 3.3, 3.2) ) df %>% group_by(id) %>% fill_gap(datem) df %>% group_by(id) %>% fill_gap(datem, roll = 1) df %>% group_by(id) %>% fill_gap(datem, roll = "nearest") df %>% group_by(id) %>% fill_gap(datem, roll = "nearest", full = TRUE)
Check whether a data.frame is a panel
is.panel(x, ...)
is.panel(x, ...)
x |
a data frame |
... |
a time variable |
The function is.panel
check that there are no duplicate combinations of the variables in ... and that no observation is missing for the last variable in ... (the time variable).
library(dplyr) df <- tibble( id1 = c(1, 1, 1, 2, 2), id2 = 1:5, year = c(1991, 1993, NA, 1992, 1992), value = c(4.1, 4.5, 3.3, 3.2, 5.2) ) df %>% group_by(id1) %>% is.panel(year) df1 <- df %>% filter(!is.na(year)) df1 %>% is.panel(year) df1 %>% group_by(id1) %>% is.panel(year) df1 %>% group_by(id1, id2) %>% is.panel(year)
library(dplyr) df <- tibble( id1 = c(1, 1, 1, 2, 2), id2 = 1:5, year = c(1991, 1993, NA, 1992, 1992), value = c(4.1, 4.5, 3.3, 3.2, 5.2) ) df %>% group_by(id1) %>% is.panel(year) df1 <- df %>% filter(!is.na(year)) df1 %>% is.panel(year) df1 %>% group_by(id1) %>% is.panel(year) df1 %>% group_by(id1, id2) %>% is.panel(year)
Join two data frames together
join( x, y, kind, on = intersect(names(x), names(y)), suffixes = c(".x", ".y"), check = m ~ m, gen = FALSE, inplace = FALSE, update = FALSE, type )
join( x, y, kind, on = intersect(names(x), names(y)), suffixes = c(".x", ".y"), check = m ~ m, gen = FALSE, inplace = FALSE, update = FALSE, type )
x |
The master data.frame |
y |
The using data.frame |
kind |
The kind of (SQL) join among "full" (default), "left", "right", "inner", "semi", "anti" and "cross". |
on |
Character vectors specifying variables to match on. Default to common names between x and y. |
suffixes |
A character vector of length 2 specifying suffix of overlapping columns. Defaut to ".x" and ".y". |
check |
A formula checking for the presence of duplicates. Specifying 1~m (resp m~1, 1~1) checks that joined variables uniquely identify observations in x (resp y, both). |
gen |
Name of new variable to mark result, or the boolean FALSE (default) if no such variable should be created. The variable equals 1 for rows in master only, 2 for rows in using only, 3 for matched rows. |
inplace |
A boolean. In case "kind"= "left" and RHS of check is 1, the merge can be one in-place. |
update |
A boolean. For common variables in x and y not specified in "on", replace missing observations by the non missing observations in y. |
type |
Deprecated |
A data.frame that joins rows in master and using datases. Importantly, if x or y are not keyed, the join may change their row orders.
library(dplyr) x <- data.frame(a = rep(1:2, each = 3), b=1:6) y <- data.frame(a = 0:1, bb = 10:11) join(x, y, kind = "full") join(x, y, kind = "left", gen = "_merge") join(x, y, kind = "right", gen = "_merge") join(x, y, kind = "inner", check = m~1) join(x, y, kind = "semi") join(x, y, kind = "anti") y <- rename(y, b = bb) join(x, y, kind = "full", on = "a") join(x, y, kind = "full", on = "a", suffixes = c("",".i")) y <- data.frame(a = 0:1, bb = 10:11) join(x, y, kind = "left", check = m~1) x <- data.frame(a = c(1,2), b=c(NA, 2)) y <- data.frame(a = c(1,2), b = 10:11) join(x, y, kind = "left", on = "a", update = TRUE) join(x, y, kind = "left", on = "a", check = m~1, update = TRUE)
library(dplyr) x <- data.frame(a = rep(1:2, each = 3), b=1:6) y <- data.frame(a = 0:1, bb = 10:11) join(x, y, kind = "full") join(x, y, kind = "left", gen = "_merge") join(x, y, kind = "right", gen = "_merge") join(x, y, kind = "inner", check = m~1) join(x, y, kind = "semi") join(x, y, kind = "anti") y <- rename(y, b = bb) join(x, y, kind = "full", on = "a") join(x, y, kind = "full", on = "a", suffixes = c("",".i")) y <- data.frame(a = 0:1, bb = 10:11) join(x, y, kind = "left", check = m~1) x <- data.frame(a = c(1,2), b=c(NA, 2)) y <- data.frame(a = c(1,2), b = 10:11) join(x, y, kind = "left", on = "a", update = TRUE) join(x, y, kind = "left", on = "a", check = m~1, update = TRUE)
Count number of non missing observations
n_narm(...)
n_narm(...)
... |
a sequence of vectors, matrices and data frames. |
n_narm(1:100, c(NA, 1:99))
n_narm(1:100, c(NA, 1:99))
Weighted quantile of type 2 (similar to Stata _pctile)
pctile(x, probs = c(0.25, 0.5, 0.75), wt = NULL, na.rm = FALSE)
pctile(x, probs = c(0.25, 0.5, 0.75), wt = NULL, na.rm = FALSE)
x |
A vector |
probs |
A vector of probabilities |
wt |
A weight vector |
na.rm |
Should missing values be returned? |
Plot the mean of y over the mean of x within bins of x.
stat_binmean( mapping = NULL, data = NULL, geom = "point", position = "identity", show.legend = NA, inherit.aes = TRUE, na.rm = FALSE, n = 20, ... )
stat_binmean( mapping = NULL, data = NULL, geom = "point", position = "identity", show.legend = NA, inherit.aes = TRUE, na.rm = FALSE, n = 20, ... )
mapping |
Set of aesthetic mappings created by |
data |
The data to be displayed in this layer. There are three options: If A A |
geom |
The geometric object to use to display the data, either as a
|
position |
Position adjustment, either as a string naming the adjustment
(e.g. |
show.legend |
logical. Should this layer be included in the legends?
|
inherit.aes |
If |
na.rm |
If |
n |
number of x-bins. Default to 20. Set to zero if you want to use distinct value of x for grouping. |
... |
Other arguments passed on to |
a data.frame with additional columns:
xtile |
bins for x |
x |
mean of x |
y |
mean of y |
library(ggplot2) g <- ggplot(iris, aes(x = Sepal.Width , y = Sepal.Length)) + stat_binmean(n = 10) g + stat_smooth(method = "lm", se = FALSE) ggplot(iris, aes(x = Sepal.Width , y = Sepal.Length, color = Species)) + stat_binmean(n = 10) ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, weight = Petal.Length)) + stat_binmean(n = 10)
library(ggplot2) g <- ggplot(iris, aes(x = Sepal.Width , y = Sepal.Length)) + stat_binmean(n = 10) g + stat_smooth(method = "lm", se = FALSE) ggplot(iris, aes(x = Sepal.Width , y = Sepal.Length, color = Species)) + stat_binmean(n = 10) ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, weight = Petal.Length)) + stat_binmean(n = 10)
Gives summary statistics (corresponds to Stata command summarize)
sum_up(df, ..., d = FALSE, wt = NULL)
sum_up(df, ..., d = FALSE, wt = NULL)
df |
a data.frame |
... |
Variables to include. Defaults to all non-grouping variables. See the select documentation. |
d |
Should detailed summary statistics be printed? |
wt |
Weights. Default to NULL. |
a data.frame
library(dplyr) N <- 100 df <- tibble( id = 1:N, v1 = sample(5, N, TRUE), v2 = sample(1e6, N, TRUE) ) sum_up(df) sum_up(df, v2, d = TRUE) sum_up(df, v2, wt = v1) df %>% group_by(v1) %>% sum_up(starts_with("v"))
library(dplyr) N <- 100 df <- tibble( id = 1:N, v1 = sample(5, N, TRUE), v2 = sample(1e6, N, TRUE) ) sum_up(df) sum_up(df, v2, d = TRUE) sum_up(df, v2, wt = v1) df %>% group_by(v1) %>% sum_up(starts_with("v"))
Returns cross tabulation
tab(x, ..., wt = NULL, na.rm = FALSE, sort = TRUE)
tab(x, ..., wt = NULL, na.rm = FALSE, sort = TRUE)
x |
a vector or a data.frame |
... |
Variable(s) to include. If length is two, a special cross tabulation table is printed although a long data.frame is always (invisibly) returned. |
wt |
Frequency weights. Default to NULL. |
na.rm |
Remove missing values. Default to FALSE |
sort |
Boolean. Default to TRUE |
a data.frame sorted by variables in ..., and with columns "Freq.", "Percent", and "Cum." for counts.
# setup library(dplyr) N <- 1e2 ; K = 10 df <- tibble( id = sample(c(NA,1:5), N/K, TRUE), v1 = sample(1:5, N/K, TRUE) ) # one-way tabulation df %>% tab(id) df %>% tab(id, wt = v1) # two-way tabulation df %>% tab(id, v1) df %>% filter(id >= 3) %>% tab(id)
# setup library(dplyr) N <- 1e2 ; K = 10 df <- tibble( id = sample(c(NA,1:5), N/K, TRUE), v1 = sample(1:5, N/K, TRUE) ) # one-way tabulation df %>% tab(id) df %>% tab(id, wt = v1) # two-way tabulation df %>% tab(id, v1) df %>% filter(id >= 3) %>% tab(id)
Create unique names within a list, a data.frame, or an environment
tempname(where = globalenv(), n = 1, prefix = ".temp", inherits = TRUE)
tempname(where = globalenv(), n = 1, prefix = ".temp", inherits = TRUE)
where |
A chracter vector, list or an environment |
n |
An integar that specifies length of the output |
prefix |
A character vector that specifies prefix for new name |
inherits |
Should the name unique also in the enclosing frames of the environment? |
tempname(c("temp1", "temp3"), 4) tempname(globalenv()) tempname(data.frame(temp = 1), n = 3)
tempname(c("temp1", "temp3"), 4) tempname(globalenv()) tempname(data.frame(temp = 1), n = 3)
lead and lag with respect to a time variable
tlead(x, n = 1L, time, default = NA) tlag(x, n = 1L, time, default = NA)
tlead(x, n = 1L, time, default = NA) tlag(x, n = 1L, time, default = NA)
x |
a vector of values |
n |
a positive integer of length 1, giving the number of positions to lead or lag by. When the package lubridate is loaded, it can be a period when using with time (see the lubridate function minutes, hours, days, weeks, months and years) |
time |
time variable |
default |
value used for non-existant rows. Defaults to |
date <- c(1989, 1991, 1992) value <- c(4.1, 4.5, 3.3) tlag(value, 1, time = date) # returns value in year - 1 library(lubridate) date <- as.monthly(mdy(c("01/04/1992", "03/15/1992", "04/03/1992"))) tlag(value, time = date) library(dplyr) df <- tibble( id = c(1, 2, 2), date = date, value = value ) df %>% group_by(id) %>% mutate(valuel = tlag(value, n = 1, time = date))
date <- c(1989, 1991, 1992) value <- c(4.1, 4.5, 3.3) tlag(value, 1, time = date) # returns value in year - 1 library(lubridate) date <- as.monthly(mdy(c("01/04/1992", "03/15/1992", "04/03/1992"))) tlag(value, time = date) library(dplyr) df <- tibble( id = c(1, 2, 2), date = date, value = value ) df %>% group_by(id) %>% mutate(valuel = tlag(value, n = 1, time = date))
Winsorize a numeric vector
winsorize( x, probs = NULL, cutpoints = NULL, replace = c(cutpoints[1], cutpoints[2]), verbose = TRUE ) winsorise( x, probs = NULL, cutpoints = NULL, replace = c(cutpoints[1], cutpoints[2]), verbose = TRUE )
winsorize( x, probs = NULL, cutpoints = NULL, replace = c(cutpoints[1], cutpoints[2]), verbose = TRUE ) winsorise( x, probs = NULL, cutpoints = NULL, replace = c(cutpoints[1], cutpoints[2]), verbose = TRUE )
x |
A vector of values |
probs |
A vector of probabilities that can be used instead of cutpoints. Quantiles are computed as the inverse of the empirical distribution function (type = 1) |
cutpoints |
Cutpoints under and above which are defined outliers. Default is (median - five times interquartile range, median + five times interquartile range). Compared to bottom and top percentile, this takes into account the whole distribution of the vector. |
replace |
Values by which outliers are replaced. Default to cutpoints. A frequent alternative is NA. |
verbose |
Boolean. Should the percentage of replaced values printed? |
v <- c(1:4, 99) winsorize(v) winsorize(v, replace = NA) winsorize(v, probs = c(0.01, 0.99)) winsorize(v, cutpoints = c(1, 50))
v <- c(1:4, 99) winsorize(v) winsorize(v, replace = NA) winsorize(v, probs = c(0.01, 0.99)) winsorize(v, cutpoints = c(1, 50))
Bin variable in groups (similar to Stata xtile)
xtile(x, n = NULL, probs = NULL, cutpoints = NULL, wt = NULL)
xtile(x, n = NULL, probs = NULL, cutpoints = NULL, wt = NULL)
x |
A vector |
n |
A numeric specifying number of quantiles. Can be used instead of cutpoints |
probs |
A vector of probabilities that an be used instead of cutpoints. Quantiles are computed as the inverse of the empirical distribution function (type = 1) |
cutpoints |
Cutpoints to use when |
wt |
A variable specifying weight in case the option n_quantiles is specified. |
An integer vector representing groups corresponding to cutpoints. Includes missing values when present in the original vector.
x <- c(NA, 1:10) xtile(x, n = 3) # 3 groups based on terciles xtile(x, probs = c(0.3, 0.7)) # 3 groups based on two quantiles xtile(x, cutpoints = c(2, 3)) # 3 groups based on two cutpoints
x <- c(NA, 1:10) xtile(x, n = 3) # 3 groups based on terciles xtile(x, probs = c(0.3, 0.7)) # 3 groups based on two quantiles xtile(x, cutpoints = c(2, 3)) # 3 groups based on two cutpoints