Find and Replace in R, Part 1: recode in the library car


R function to adjust p.values and show only thresholded correlation coefficients in matrix

# function
# takes data frame, calculates correlations, adjusts p.values, outputs p.values in matrix
# also provides thresholded matrix of correlation coefficients (r)

adj.cor = function(df, p.adjust = FALSE, p.adjust.method = “none”, threshold = 1){
cor_test = rcor.test(df, p.adjust = p.adjust, p.adjust.method = p.adjust.method)
r.mat = cor_test$cor.mat     # matrix of coefficient values
p.list = cor_test$p.values   # p.list will be 3 columns

# initiate empty matrix for the p values only
MAT = matrix(, nrow = 76, ncol = 76)
diag(MAT) = 0

# starting to convert p.list to p.values matrix
for(ind in 1:(length(p.list[,1]))){
var1 = p.list[ind, 1] # var1
var2 = p.list[ind, 2] # var2
p.value = p.list[ind, 3] # p value
MAT[var1, var2] = p.value
MAT[var2, var1] = p.value
rownames(MAT) = names(df)
colnames(MAT) = names(df)
# At this point, MAT has the p.values in a matrix

# subset only the coefficients with p values < 0.05 (or threshold)
subset = ifelse(MAT < threshold, r.mat, NA)
rownames(subset) = names(df)
colnames(subset) = names(df)

output = list(adj.p.values = MAT, threshold.r = subset)

### No Corrections ####
cor = adj.cor(data.frame, p.adjust = FALSE, p.adjust.method = “none”, threshold = 0.05)
pval = cor$adj.p.values  # shows the p.values in a matrix format
thresR = cor$threshold.r  # shows only the significant <0.05 (if you don’t want any thresholding, put threshold =1 above)
# plotting thresholded r’s in matrix with cor.plot from the psych package
cor.plot(thresR, show.legend = TRUE, main = “p < 0.05”, numbers = FALSE)

### Bonferroni ####
cor = adj.cor(data.frame, p.adjust = FALSE, p.adjust.method = “bonferroni”, threshold = 0.05)
pval = cor$adj.p.values  # shows the p.values in a matrix format
thresR = cor$threshold.r  # shows only the q value <0.05
cor.plot(thresR, show.legend = TRUE, main = “bonferroni correction”, numbers = FALSE)

### FDR – BH ####
cor = adj.cor(data.frame, p.adjust = FALSE, p.adjust.method = “BH”, threshold = 0.05)
pval = cor$adj.p.values  # shows the p.values in a matrix format
thresR = cor$threshold.r  # shows only the q value <0.05
cor.plot(thresR, show.legend = TRUE, main = “BH correction, numbers = FALSE)

Check out other useful R tips and tricks here!

What is the False Discovery Rate?

When correcting for the multiple testing problem, you’ve probably been familiar with the stringent Bonferroni correction.  You’ve also probably heard of the False Discovery Rate (FDR), but what is the main difference between the two corrections?

I was browsing around for a simple explanation and came across the totallab website (see below).

I quote:

[FDR] controls the number of false discoveries in those tests that result in a discovery (i.e. a significant result). Because of this, it is less conservative than the Bonferroni approach and has greater ability (i.e. power) to find truly significant results.

Another way to look at the difference is that a p-value of 0.05 implies that 5% of all tests will result in false positives. An FDR adjusted p-value (or q-value) of 0.05 implies that 5% of significant tests will result in false positives. The latter is clearly a far smaller quantity.

Selecting and visualizing only significant correlation coefficients in matrix

You have:
1) a matrix of correlation coefficients (e.g., matrix A)
2) a matrix of their p-values (e.g., matrix B)

You want to:
1) visualize the correlation coefficients in a correlogram
2) visualize the coefficients with only significant p-values

What to do:
output = corr.test(rawData)
names(output)  # to take a look at the available output statistics
A = output$r    # matrix A here contains the correlation coefficients
B = output$p   # matrix B here contains the corresponding p-values

# first, to visualize the entire matrix in a correlogram
corrgram(A)  # visualizing the correlation coefficients corrgram(B)  # visualizing the p-values

# But, you also want to visualize the correlation coefficients with significant p-values!
# to do that, you need to select only the matrix elements with significant p-values
# if it is above 0.05 (not significant), then replace with NAs)
sig_element = ifelse(B < 0.05, A, NA)

# can plot the new matrix
corrgram(sig_element)  # this displays only the correlation coefficients that have significant p-values

Note:  If you have NA’s in your matrix, sometimes corrgram might not be able to visualize your matrix. This may be due to the NA’s that are present in the diagonals.  Replace the NA’s in the diagonals with 1’s and you might be able fix that issue (e.g., diag(A) = 1)

Better Alternative for plotting:
plot with cor.plot also from the psych package – I realized that it is more flexible as it handles missing values (NAs) better than the corrgram().
# or
cor.plot(sig_element, show.legend = TRUE, main = “title”, numbers = TRUE, labels = names)

CAUTION:  Please be careful when calculating multiple correlation coefficients.  Please correct for multiple comparisons when appropriate!

For more information, please check out:

Check out other useful R tips and tricks here!

Missing Data? Try Multiple Imputation

It is common for researchers to exclude participants with missing data — but what are some ways to keep the participants and analyse the data in unbiased ways?

Data imputation involves replacing missing data with plausible values based on the Monte Carlo technique. Here, the missing values are replaced by several simulated versions.

1. you have missing data in a data set
2. missing data are simulated with different versions –> several simulated data sets
3. analyze data set (version #1) as it were a complete data set
4. repeat step 3 with other data set versions (e.g., #2, #3, …, #N). [for low rates of missing data, only 3-10 simulated data sets are needed.]
5. combine (average) the results to produce a single point estimates and confidence intervals (or p-values) that incorporate missing-data uncertainty.

Screen Shot 2014-02-12 at 11.06.53 PM

How do I generate imputations for the missing values?
The imputation model:
Impose a probability model on the complete data (observed and missing values).

Key points on what to include in the imputation model: [~ 30 min into the Amelia I video (link below)]
1. Include all the variables that you want to include in your analysis model
e.g., age, education, ideology, income … have to include all the variables you will need later in analysis stage

2. Include variables that are highly predictive of the variables you are going to analyse
e.g., Voter turnout analysis: ideology — e.g., include views on homelessness, abortion (predictors of variables you’re interested in)

3. Include variables that are highly predictive of the missingness of your data
e.g., income the predict missingness — throw it in model as well

“…Because you are throwing in a lot of variables in the imputation model than the variables you would be looking at the analysis stage and that’s OK!”

*Note: this method assumes MAR (Missing At Random)

– “imputation model should be compatible with the analyses to be performed on the imputed datasets… In general, any association that may be important in subsequent analyses should be in the imputation model” … that means include those relevant variables in the imputation model.
– On the other hand, you don’t necessarily have to examine those variables in your final analyses (unless it’s of interest pertaining to the outcome).

“When working with binary or ordered categorical variables, it’s often acceptable to impute under under a normality assumption and then round off the continuous imputed values to the nearest category. Variables whose distributions are heavily skewed may be transformed (e.g., logs) to approximate normality and then transformed back to their original scale after imputation.” – the multiple imputation FAQ page

Other notes:
– if you have ordinal variables – code them to as close to an interval scale as possible
– include any non-linear relationship in your model, e.g., age and age-squared in voter turnout, so include age-squared in your imputation model
– if you will look at any interactions, throw those terms in your imputation model as well.

Software for Multiple Imputation:
– R (Amelia II, mi, etc.)
– Strata (mi, ice, mim, etc.)
– …

There are many softwares to do multiple imputations, but if you use R, you can check out AmeliaII
– the user guide is very clear and helpful – can run through the example dataset and code.

Additional Resources:

video – Innovation in Amelia I (explanation of multiple imputation in general):

the multiple imputation FAQ page:

How to Calculate R-squared Change in R

In R, the anova.lm() does not give out R-squared change when you are comparing different regression models.
anova.lm() does provide you with the F Change, df1, df2, and Sig F Change in the output.
Sure, you can calculate the R-squared change yourself, but there’s a package for it!  The output is also more intuitive!

# first, remember to install and load the package.

# specify the regression models
model1 = lm(dataset$var1 ~ dataset$var2)
model2 = lm(dataset$var1 ~ dataset$var2 + dataset$var3)

# compare the two models!
# to get the R-squared change and other stats
lm.deltaR2(model1, model2)

You can check out this forum for more information:

Check if at least two or more booleans or conditions are True in R and Debugging in R

But, what do you do when you want to find out if at least two or more of the conditions are true?

You take advantage of the Boolean math!   Here’s how:

# for at least two of the conditions being true

if ( ((condA) + (condB) + (condC) + (condD)) >=2 ) {

# something happens here


For more, check out:

Now, I also have another post about Debugging in R, which uses the debug()