Color formatting of correlation table

Correlation

Correlation is a bivariate summary statistic: it describes the direction and magnitude of the association between two variables. Besides formatting with significance stars, color coding a correlation coefficient table can help you pick out patterns at a glance.

Table 1 presents a correlation matrix of yield and yield component traits (a blue \(\rightarrow\) red color profile represents increasing magnitude of positive correlation between traits). The following code is helpful if somebody hands you a correlation table with stars in it and asks you to prettify it. Note that the lower or upper half alone cannot be used to determine the discrete color values, so the full column is required.
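To give a flavor of the approach, here is a minimal sketch that colors a correlation matrix with a blue \(\rightarrow\) red background using kableExtra. The built-in mtcars columns stand in for the actual trait data, and the palette choices are illustrative only.

library(kableExtra) # also attaches the %>% pipe

# toy correlation matrix; mtcars columns stand in for the trait data
cor_mat <- round(cor(mtcars[, c("mpg", "disp", "hp", "wt")]), 2)

# blue -> red palette, indexed by where each coefficient falls in [-1, 1]
pal <- grDevices::colorRampPalette(c("blue", "white", "red"))(100)
bg <- pal[cut(cor_mat, breaks = seq(-1, 1, length.out = 101), include.lowest = TRUE)]

# wrap each cell in HTML carrying its background color
colored <- matrix(cell_spec(cor_mat, format = "html", background = bg, color = "white"),
                  nrow = nrow(cor_mat), dimnames = dimnames(cor_mat))

kbl(colored, escape = FALSE) %>%
  kable_styling()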

By Deependra Dhakal in R

September 19, 2020

Cluster dendrogram: An introduction and showcase

A cluster analysis is a classification problem. It can be approached in several ways, one of which is hierarchical agglomeration. The method allows for easy presentation of high dimensional data, more so when the number of observations fits comfortably into a visualization.

Here I deal with a case of clustering typically seen in agriculture and field research, where a researcher tests a large number of genotypes and wants to see them organized into distinguishable clusters using a dendrogram. The data concern observations of disease incidence in rice genotypes at various stages, from germinating seed to a crop nearing maturity. The following provides a descriptive summary of the observation variables.
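For orientation, a minimal sketch of the hierarchical agglomeration workflow in base R. The rice incidence data are not reproduced here, so scaled mtcars rows stand in for genotypes, and the cluster count of 4 is arbitrary.

# standardize variables so no single trait dominates the distances
incidence <- scale(mtcars)

d <- dist(incidence, method = "euclidean") # pairwise distance matrix
hc <- hclust(d, method = "ward.D2")        # Ward's agglomeration

plot(hc, cex = 0.6, main = "Cluster dendrogram")
clusters <- cutree(hc, k = 4)              # assign observations to 4 clusters
rect.hclust(hc, k = 4, border = 2:5)       # outline the clusters on the tree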

By Deependra Dhakal in R

September 7, 2020

Paste together multiple columns

# paste together dataframe columns by column index
# take the following df

df <- data.frame(my_number = letters[1:5], 
                 column_odd1 = rnorm(5), 
                 column_even1 = rnorm(5), 
                 column_odd2 = rnorm(5), 
                 column_even2 = rnorm(5), 
                 column_odd3 = rnorm(5), 
                 column_even3 = rnorm(5))

library(dplyr) # provides %>%, select() and bind_cols()

df %>% 
  select(1) %>% # keep the index column
  bind_cols(data.frame(setNames(lapply(list(c(2, 3), c(4, 5), c(6, 7)), function(i) 
    do.call(sprintf, c(fmt = "%0.3f (%0.3f)", # round at third place after decimal; use %s if columns were character type
                       df[i]))), c("new_column1", "new_column2", "new_column3"))))
##   my_number     new_column1     new_column2     new_column3
## 1         a -0.682 (-1.828) -1.304 (-0.113)  1.121 (-0.586)
## 2         b -1.178 (-0.809)   0.534 (0.524) -1.330 (-0.709)
## 3         c -0.432 (-2.787) -1.586 (-0.105)  1.449 (-0.533)
## 4         d  -1.258 (0.455) -0.071 (-0.786) -0.326 (-0.607)
## 5         e   0.329 (0.525)   0.011 (1.141)  0.043 (-0.228)
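An equivalent, arguably more readable route goes through purrr. This is a sketch under the assumption that the tidyverse packages are available; the column index pairs and new column names mirror the example above.

library(purrr)
library(dplyr)

pairs <- list(c(2, 3), c(4, 5), c(6, 7)) # column index pairs to combine

df %>% 
  select(1) %>% 
  bind_cols(
    pairs %>% 
      map(~ sprintf("%0.3f (%0.3f)", df[[.x[1]]], df[[.x[2]]])) %>% 
      set_names(c("new_column1", "new_column2", "new_column3")) %>% 
      as_tibble()
  )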

By Deependra Dhakal in R

August 7, 2020

Logistic Regression: Part II - Varietal adoption dataset

Binary classifier using a categorical predictor

Let’s say we have two variables: a survey response recording a farmer’s willingness to adopt an improved rice variety (YES/NO), and whether the farmer was earlier trained in agricultural input management (trained/untrained).

Read in the data and take a look at the first few rows.

library(dplyr) # provides %>%, select() and mutate_if()

rice_data <- readxl::read_xlsx(here::here("content", "blog", "data", "rice_variety_adoption.xlsx")) %>% 
  mutate_if(.predicate = is.character, as.factor)

rice_variety_adoption <- readxl::read_xlsx(here::here("content", "blog", "data", "rice_variety_adoption.xlsx")) %>%
  select(improved_variety_adoption, training) %>% 
  # convert character columns to factors, the type suitable for analysis
  mutate_if(is.character, as.factor)

head(rice_variety_adoption) # now we have data
## # A tibble: 6 × 2
##   improved_variety_adoption training
##   <fct>                     <fct>   
## 1 No                        No      
## 2 Yes                       No      
## 3 No                        No      
## 4 Yes                       No      
## 5 Yes                       No      
## 6 No                        No

As a basic descriptive step, construct one-way and two-way cross-tabulation summaries showing the count of each category. This is natural because logistic regression with a categorical predictor effectively models these cell counts, much as a non-parametric contingency-table analysis does.
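A minimal sketch of those tabulations in base R, using the rice_variety_adoption data frame read in above:

# one-way counts for each variable
table(rice_variety_adoption$improved_variety_adoption)
table(rice_variety_adoption$training)

# two-way cross tabulation: adoption status by training status
with(rice_variety_adoption, table(training, improved_variety_adoption))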

By Deependra Dhakal in R tidyverse

August 7, 2020

Logistic Regression: Part I - Fundamentals

Likelihood theory

Probit models were among the first non-linear models used to analyze non-normal data. In an early example of probit regression, Bliss (1934) describes an experiment in which nicotine is applied to aphids and the proportion killed is recorded. In an appendix to a paper Bliss wrote a year later (Bliss, 1935), Fisher (1935) outlines the use of maximum likelihood to obtain estimates of the probit model.
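To sketch the setup Fisher formalized (the notation here is mine, not from the original papers): if dose \(x_i\) is applied to \(n_i\) insects and \(y_i\) of them die, the probit model posits a kill probability \(p_i = \Phi(\beta_0 + \beta_1 x_i)\), where \(\Phi\) is the standard normal cumulative distribution function. Maximum likelihood chooses \(\beta_0, \beta_1\) to maximize the binomial log-likelihood

\[
\ell(\beta_0, \beta_1) = \sum_{i} \left[ y_i \log p_i + (n_i - y_i) \log (1 - p_i) \right].
\]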