Introduction
This is a blog post to expand on a talk I gave at the Manchester R Users Group on 21st February 2017. In it I give a brief overview of the tidyverse and its core concepts, before going on to discuss how the same concepts can be applied to bioinformatics analysis using Bioconductor classes and packages.
What is the tidyverse?
The tidyverse is a suite of tools primarily developed by Hadley Wickham and collaborators at RStudio. The suite of packages can be conveniently installed as follows:
install.packages('tidyverse')
There are many talks and resources that go into much greater depth than I do here including:
- Managing many models talk on YouTube by Hadley Wickham
- R for Data Science book and website
- Recorded tutorials and presentations from RStudio::conf 2017
- Webinars on RStudio website
- DataCamp courses on dplyr and the tidyverse
Key concepts in the tidyverse
Tidy data
The key concept of the tidyverse is that of tidy data. If you’re from a database background then think of this as normalised data. For example, the table below is NOT tidy data since the year variable is encoded as a column name:
library(tidyverse)
table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
To turn this into tidy data with one observation per row, we can use the tidyr package:
df <- gather(table4a, year, cases, -country)
df
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
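In tidyr 1.0.0 and later, gather is superseded by pivot_longer, which names the new columns explicitly. The following is a minimal sketch of the same reshaping, assuming a recent version of tidyr is installed. The rows come out in a different order from the gather output above (country by country rather than year by year), but the content is the same.

```r
library(tidyr)

# table4a ships with tidyr; the year values are stored as column names
df_long <- pivot_longer(
  table4a,
  cols = -country,      # reshape every column except country
  names_to = "year",    # old column names become values of `year`
  values_to = "cases"   # old cell values become values of `cases`
)
df_long
```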
Tibbles
Tibbles are an alternative to the data frame. The key differences are:
- Subsetting a tibble will always produce another tibble.
- Printing a tibble shows only the first few rows and the columns that fit on screen - no more pages of output…
dplyr::filter(df, country == 'Nigeria')
## # A tibble: 0 x 3
## # ... with 3 variables: country <chr>, year <chr>, cases <int>
dplyr::select(df, country)
## # A tibble: 6 x 1
## country
## <chr>
## 1 Afghanistan
## 2 Brazil
## 3 China
## 4 Afghanistan
## 5 Brazil
## 6 China
library(gapminder)
gapminder
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134
## 7 Afghanistan Asia 1982 39.854 12881816 978.0114
## 8 Afghanistan Asia 1987 40.822 13867957 852.3959
## 9 Afghanistan Asia 1992 41.674 16317921 649.3414
## 10 Afghanistan Asia 1997 41.763 22227415 635.3414
## # ... with 1,694 more rows
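The first difference is easiest to see with single-column subsetting. This is a small sketch using a made-up three-row table (the names base_df and tbl are just for illustration): a base data frame drops to a bare vector, whereas the tibble stays a tibble.

```r
library(tibble)

base_df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tbl <- as_tibble(base_df)

# Selecting a single column with [ drops a data frame to a bare vector...
class(base_df[, "x"])
# ...but a tibble stays a tibble
class(tbl[, "x"])

# Use [[ or $ when you do want the underlying vector
tbl[["x"]]
```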
dplyr and the pipe - %>%
The dplyr package provides an intuitive way of manipulating data frames that will come naturally to people from a database background who are used to SQL. Furthermore, the pipe operator allows function calls to be strung together sequentially rather than nested which improves readability. For example, the statements below:
# base r
gapminder[gapminder$country=='China', c('country', 'continent', 'year', 'lifeExp')]
# dplyr
dplyr::select(dplyr::filter(gapminder, country=='China'), country, continent, year, lifeExp)
Give the same output as this:
# pipe and dplyr
gapminder %>%
dplyr::filter(country=='China') %>%
dplyr::select(country, continent, year, lifeExp)
## # A tibble: 12 x 4
## country continent year lifeExp
## <fctr> <fctr> <int> <dbl>
## 1 China Asia 1952 44.00000
## 2 China Asia 1957 50.54896
## 3 China Asia 1962 44.50136
## 4 China Asia 1967 58.38112
## 5 China Asia 1972 63.11888
## 6 China Asia 1977 63.96736
## 7 China Asia 1982 65.52500
## 8 China Asia 1987 67.27400
## 9 China Asia 1992 68.69000
## 10 China Asia 1997 70.42600
## 11 China Asia 2002 72.02800
## 12 China Asia 2007 72.96100
The RStudio Data Wrangling cheatsheet provides a useful resource for getting to know dplyr.
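The pipe simply passes its left-hand side as the first argument of the function on its right, so x %>% f(y) is the same as f(x, y). A quick way to convince yourself is to build the nested and piped versions and compare them. The sketch below uses the built-in mtcars data rather than gapminder so that it only needs dplyr.

```r
library(dplyr)

# Nested: read from the inside out
nested <- select(filter(mtcars, cyl == 4), mpg, wt)

# Piped: read from top to bottom
piped <- mtcars %>%
  filter(cyl == 4) %>%
  select(mpg, wt)

identical(nested, piped)
```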
purrr and its map function
The purrr package provides a number of functions that are more consistent and programming-friendly than their base R equivalents. At first its map function seems like a reimplementation of lapply:
map(c(4, 9, 16), sqrt)
## [[1]]
## [1] 2
##
## [[2]]
## [1] 3
##
## [[3]]
## [1] 4
But map will always return a list, and there are other members of the map_ family which will always return a certain data type:
map_dbl(c(4, 9, 16), sqrt)
## [1] 2 3 4
map_chr(c(4, 9, 16), sqrt)
## [1] "2.000000" "3.000000" "4.000000"
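This predictability is the main practical gain over sapply, which picks its return type based on what it gets back. A small sketch of the difference, using an empty list and a deliberately mismatched type as illustrative inputs:

```r
library(purrr)

empty <- list()

# sapply() guesses its return type from the results, so empty input gives a list
class(sapply(empty, length))
# map_int() returns an integer vector whatever the input
class(map_int(empty, length))

# If the results do not fit the requested type, the typed variants raise an
# error rather than silently returning something else
res <- tryCatch(
  map_dbl(list("a", "b"), toupper),
  error = function(e) "error"
)
res
```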
List-cols in data frames
List-cols are a powerful concept that underpins the later part of this tutorial. If data frames are explained as a way to keep character and number vectors together, then list-cols in tibbles take this a step further and allow lists of any object to be kept together. For example, we can create nested data frames where we have a data frame within a data frame:
gm_nest <- gapminder %>% group_by(country, continent) %>% nest()
gm_nest
## # A tibble: 142 x 3
## country continent data
## <fctr> <fctr> <list>
## 1 Afghanistan Asia <tibble [12 x 4]>
## 2 Albania Europe <tibble [12 x 4]>
## 3 Algeria Africa <tibble [12 x 4]>
## 4 Angola Africa <tibble [12 x 4]>
## 5 Argentina Americas <tibble [12 x 4]>
## 6 Australia Oceania <tibble [12 x 4]>
## 7 Austria Europe <tibble [12 x 4]>
## 8 Bahrain Asia <tibble [12 x 4]>
## 9 Bangladesh Asia <tibble [12 x 4]>
## 10 Belgium Europe <tibble [12 x 4]>
## # ... with 132 more rows
Since the data column is a list column (list-col) we can access it as we would expect from a list. For example, to get the data for the second row, Albania:
gm_nest$data[[2]]
## # A tibble: 12 x 4
## year lifeExp pop gdpPercap
## <int> <dbl> <int> <dbl>
## 1 1952 55.230 1282697 1601.056
## 2 1957 59.280 1476505 1942.284
## 3 1962 64.820 1728137 2312.889
## 4 1967 66.220 1984060 2760.197
## 5 1972 67.690 2263554 3313.422
## 6 1977 68.930 2509048 3533.004
## 7 1982 70.420 2780097 3630.881
## 8 1987 72.000 3075321 3738.933
## 9 1992 71.581 3326498 2497.438
## 10 1997 72.950 3428038 3193.055
## 11 2002 75.651 3508512 4604.212
## 12 2007 76.423 3600523 5937.030
Using map to manipulate list-cols
Since this list column is just a list, we can use map to count the number of rows in each data frame in the list:
map_dbl(gm_nest[1:10,]$data, nrow)
## [1] 12 12 12 12 12 12 12 12 12 12
Rather than providing a ready-to-use function we can also define our own:
map_dbl(gm_nest[1:10,]$data, function(x) nrow(x))
## [1] 12 12 12 12 12 12 12 12 12 12
There is also a shorthand formula syntax for this where the period . represents the element of the list that is being processed:
map_dbl(gm_nest[1:10,]$data, ~nrow(.))
## [1] 12 12 12 12 12 12 12 12 12 12
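The three styles are interchangeable. Here is a self-contained sketch on a plain list of two built-in data frames (a stand-in for a list-col), using map_int since row counts are integers. In the formula form, .x is an alternative spelling of the period.

```r
library(purrr)

# A small stand-in for a list-col: a plain list of data frames
dfs <- list(head(mtcars, 3), head(iris, 5))

by_name    <- map_int(dfs, nrow)                 # existing function
by_closure <- map_int(dfs, function(x) nrow(x))  # anonymous function
by_formula <- map_int(dfs, ~ nrow(.x))           # formula shorthand

by_name
```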
Defining our own function allows us to reach into the data frame and perform an operation, such as summarising one of the columns:
map_dbl(gm_nest[1:10,]$data, ~round(mean(.$lifeExp),1))
## [1] 37.5 68.4 59.0 37.9 69.1 74.7 73.1 65.6 49.8 73.6
Importantly we can also do this at the level of the parent tibble and store the output in the data frame:
gm_nest %>% mutate(avg_lifeExp=map_dbl(data, ~round(mean(.$lifeExp),1)))
## # A tibble: 142 x 4
## country continent data avg_lifeExp
## <fctr> <fctr> <list> <dbl>
## 1 Afghanistan Asia <tibble [12 x 4]> 37.5
## 2 Albania Europe <tibble [12 x 4]> 68.4
## 3 Algeria Africa <tibble [12 x 4]> 59.0
## 4 Angola Africa <tibble [12 x 4]> 37.9
## 5 Argentina Americas <tibble [12 x 4]> 69.1
## 6 Australia Oceania <tibble [12 x 4]> 74.7
## 7 Austria Europe <tibble [12 x 4]> 73.1
## 8 Bahrain Asia <tibble [12 x 4]> 65.6
## 9 Bangladesh Asia <tibble [12 x 4]> 49.8
## 10 Belgium Europe <tibble [12 x 4]> 73.6
## # ... with 132 more rows
The power of list-cols
We could have summarised the gapminder data entirely in dplyr without worrying about list-cols and nested data frames:
gapminder %>%
group_by(country) %>%
summarise(avg_lifeExp=round(mean(lifeExp),1))
## # A tibble: 142 x 2
## country avg_lifeExp
## <fctr> <dbl>
## 1 Afghanistan 37.5
## 2 Albania 68.4
## 3 Algeria 59.0
## 4 Angola 37.9
## 5 Argentina 69.1
## 6 Australia 74.7
## 7 Austria 73.1
## 8 Bahrain 65.6
## 9 Bangladesh 49.8
## 10 Belgium 73.6
## # ... with 132 more rows
However, by using map rather than map_dbl we can create new list-cols, which can contain lists of any object that we want. In the example below we create a plot for each country, then display the first plot for Afghanistan:
gm_plots <- gm_nest %>%
mutate(plot=map(data, ~qplot(x=year, y=lifeExp, data=.)))
gm_plots
## # A tibble: 142 x 4
## country continent data plot
## <fctr> <fctr> <list> <list>
## 1 Afghanistan Asia <tibble [12 x 4]> <S3: gg>
## 2 Albania Europe <tibble [12 x 4]> <S3: gg>
## 3 Algeria Africa <tibble [12 x 4]> <S3: gg>
## 4 Angola Africa <tibble [12 x 4]> <S3: gg>
## 5 Argentina Americas <tibble [12 x 4]> <S3: gg>
## 6 Australia Oceania <tibble [12 x 4]> <S3: gg>
## 7 Austria Europe <tibble [12 x 4]> <S3: gg>
## 8 Bahrain Asia <tibble [12 x 4]> <S3: gg>
## 9 Bangladesh Asia <tibble [12 x 4]> <S3: gg>
## 10 Belgium Europe <tibble [12 x 4]> <S3: gg>
## # ... with 132 more rows
gm_plots$plot[[1]]
A complete example
In this example we combine the approaches described so far to:
- Nest the data
- Fit a linear regression model for each country
- Extract the slope and r2 value for each country
- Plot all countries together
gapminder_analysis <- gapminder %>%
group_by(country, continent) %>%
nest() %>%
mutate(model=map(data, ~lm(lifeExp ~ year, data=.)),
slope=map_dbl(model, ~coef(.)['year']),
r2=map_dbl(model, ~broom::glance(.)$`r.squared`))
gapminder_analysis
## # A tibble: 142 x 6
## country continent data model slope r2
## <fctr> <fctr> <list> <list> <dbl> <dbl>
## 1 Afghanistan Asia <tibble [12 x 4]> <S3: lm> 0.2753287 0.9477123
## 2 Albania Europe <tibble [12 x 4]> <S3: lm> 0.3346832 0.9105778
## 3 Algeria Africa <tibble [12 x 4]> <S3: lm> 0.5692797 0.9851172
## 4 Angola Africa <tibble [12 x 4]> <S3: lm> 0.2093399 0.8878146
## 5 Argentina Americas <tibble [12 x 4]> <S3: lm> 0.2317084 0.9955681
## 6 Australia Oceania <tibble [12 x 4]> <S3: lm> 0.2277238 0.9796477
## 7 Austria Europe <tibble [12 x 4]> <S3: lm> 0.2419923 0.9921340
## 8 Bahrain Asia <tibble [12 x 4]> <S3: lm> 0.4675077 0.9667398
## 9 Bangladesh Asia <tibble [12 x 4]> <S3: lm> 0.4981308 0.9893609
## 10 Belgium Europe <tibble [12 x 4]> <S3: lm> 0.2090846 0.9945406
## # ... with 132 more rows
ggplot(gapminder_analysis, aes(y=slope, colour=r2, x=continent)) +
geom_point(position=position_jitter(w=0.2)) +
theme_bw()
We can see that countries in Asia have had a faster increase in lifespan than those in Europe, whilst for Africa the picture is rather more mixed, with some countries increasing and others staying the same or reducing, and fitting more poorly to the linear regression model. Further analysis shows that the reasons for this include genocide and the HIV/AIDS epidemic.
This example demonstrates how powerful analyses can be carried out with very concise and non-repetitive code using tidyverse principles.
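If you want to try the nest, model and extract pattern without installing gapminder or broom, the sketch below applies it to the built-in mtcars data. The grouping variable (cyl) and the model (mpg ~ wt) are arbitrary choices for illustration, and base R's summary() is swapped in for broom::glance() to get the r2 value.

```r
library(dplyr)
library(tidyr)
library(purrr)

car_models <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%       # one row per cylinder count, with the data in a list-col
  ungroup() %>%
  mutate(
    model = map(data, ~ lm(mpg ~ wt, data = .x)),    # list-col of lm objects
    slope = map_dbl(model, ~ coef(.x)[["wt"]]),      # one number per model
    r2    = map_dbl(model, ~ summary(.x)$r.squared)
  )
car_models
```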
The tidyverse for Bioinformatics
What is Bioconductor?
Bioconductor is a suite of R packages and classes that allows biological data to be stored and analysed in efficient and consistent ways. Analyses such as RNA-seq can also be controlled effectively using the list-cols framework.
A typical RNA-seq workflow
RNA-sequencing is carried out to quantify the amount of a gene present in a set of samples. This can be done for all genes in the human genome simultaneously through the power of DNA sequencing. We then ask which genes are differentially expressed, i.e. are present in a different abundance in one set of samples (e.g. from a tumour) than another (e.g. normal). A typical RNA-seq experiment will have the following workflow:
- Align and count reads to form a SummarizedExperiment object
- Filter out genes with a low signal
- Specify a design formula to do the differential expression
- Specify a log2 fold change threshold to test for differentially expressed genes
- Plot the results
An RNA-seq analysis
First load the SummarizedExperiment object and explore it:
#adapted from https://f1000research.com/articles/4-1070/v2
library(airway)
library(DESeq2)
data(airway)
se <- airway
se
## class: RangedSummarizedExperiment
## dim: 64102 8
## metadata(1): ''
## assays(1): counts
## rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
## rowData names(0):
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
colData(se)
## DataFrame with 8 rows and 9 columns
## SampleName cell dex albut Run avgLength
## <factor> <factor> <factor> <factor> <factor> <integer>
## SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126
## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126
## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126
## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87
## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120
## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126
## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101
## SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98
## Experiment Sample BioSample
## <factor> <factor> <factor>
## SRR1039508 SRX384345 SRS508568 SAMN02422669
## SRR1039509 SRX384346 SRS508567 SAMN02422675
## SRR1039512 SRX384349 SRS508571 SAMN02422678
## SRR1039513 SRX384350 SRS508572 SAMN02422670
## SRR1039516 SRX384353 SRS508575 SAMN02422682
## SRR1039517 SRX384354 SRS508576 SAMN02422673
## SRR1039520 SRX384357 SRS508579 SAMN02422683
## SRR1039521 SRX384358 SRS508580 SAMN02422677
colnames(assay(se))
## [1] "SRR1039508" "SRR1039509" "SRR1039512" "SRR1039513" "SRR1039516"
## [6] "SRR1039517" "SRR1039520" "SRR1039521"
rownames(assay(se))[1:10]
## [1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419"
## [4] "ENSG00000000457" "ENSG00000000460" "ENSG00000000938"
## [7] "ENSG00000000971" "ENSG00000001036" "ENSG00000001084"
## [10] "ENSG00000001167"
assay(se)[1:5,1:5]
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003 679 448 873 408 1138
## ENSG00000000005 0 0 0 0 0
## ENSG00000000419 467 515 621 365 587
## ENSG00000000457 260 211 263 164 245
## ENSG00000000460 60 55 40 35 78
We can count up the number of reads per sample using the colSums function, or using purrr:
colSums(assay(se))
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517
## 20637971 18809481 25348649 15163415 24448408 30818215
## SRR1039520 SRR1039521
## 19126151 21164133
purrr::map_dbl(colnames(assay(se)), ~sum(assay(se)[,.]))
## [1] 20637971 18809481 25348649 15163415 24448408 30818215 19126151 21164133
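Note that the purrr version loses the sample names that colSums keeps. One way to retain them is to name the input vector with purrr::set_names before mapping. The sketch below shows this on a made-up 3 x 2 count matrix (counts_mat and its gene and sample labels are invented stand-ins for assay(se)).

```r
library(purrr)

# Toy count matrix standing in for assay(se): 3 genes x 2 samples
counts_mat <- matrix(
  c(10, 0, 5, 20, 1, 7),
  nrow = 3,
  dimnames = list(c("gene1", "gene2", "gene3"), c("sampleA", "sampleB"))
)

# set_names() names the vector after itself, and map_dbl() keeps those names
totals <- map_dbl(set_names(colnames(counts_mat)), ~ sum(counts_mat[, .x]))
totals
```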
We then create a DESeqDataSet object and set the design formula:
dds <- DESeqDataSet(se, design = ~ cell + dex)
Get rid of genes with few reads:
dds <- dds[ rowSums(counts(dds)) > 1, ]
Estimate the size factors, which normalise for differences in library size between samples:
dds <- estimateSizeFactors(dds)
Do a Principal Component Analysis to understand the data structure:
rld <- vst(dds, blind=FALSE)
pca_plot <- plotPCA(rld, intgroup = c("dex", "cell"))
pca_plot
Do the differential expression analysis and extract the results:
dds <- DESeq(dds, quiet = TRUE)
res <- results(dds, lfcThreshold=0, tidy = TRUE) %>% as_tibble()
res
## # A tibble: 29,391 x 7
## row baseMean log2FoldChange lfcSE stat
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG00000000003 708.6021697 0.38125397 0.10065597 3.7876937
## 2 ENSG00000000419 520.2979006 -0.20681259 0.11222180 -1.8428915
## 3 ENSG00000000457 237.1630368 -0.03792034 0.14345322 -0.2643394
## 4 ENSG00000000460 57.9326331 0.08816367 0.28716771 0.3070111
## 5 ENSG00000000938 0.3180984 1.37822703 3.49987280 0.3937935
## 6 ENSG00000000971 5817.3528677 -0.42640216 0.08831006 -4.8284666
## 7 ENSG00000001036 1282.1063855 0.24107123 0.08871987 2.7172180
## 8 ENSG00000001084 609.8920919 0.04761687 0.16665615 0.2857192
## 9 ENSG00000001167 369.3428078 0.50036451 0.12088513 4.1391733
## 10 ENSG00000001460 183.2376742 0.12389881 0.17991227 0.6886624
## # ... with 29,381 more rows, and 2 more variables: pvalue <dbl>,
## # padj <dbl>
Plot the results as an MA plot with abundance of the gene on the x-axis and fold change on the y-axis. Colour represents significance of the statistical test.
ma_plot <- ggplot(res, aes(log2(baseMean), log2FoldChange, colour=padj<0.1)) +
geom_point(size=rel(0.5), aes(text=row)) + theme_bw()
## Warning: Ignoring unknown aesthetics: text
ma_plot
We can look at the top differentially expressed gene to see that there is indeed a difference between treated and untreated:
topGene <- res %>% dplyr::arrange(padj) %>%
dplyr::slice(1) %>% dplyr::select(row) %>% unlist() %>% unname()
plotCounts(dds, gene=topGene, intgroup=c("dex"))
RNA-seq in the tidyverse
In the workflow there were a number of parameters that we might wish to vary:
- the design formula
- the log2 fold change threshold (lfcThreshold)
- the minimum read count
We can set up a control data frame using the tidyr::crossing function which contains 1 row per combination of parameters, and then add in the SummarizedExperiment object and a row number:
control_df <- tidyr::crossing(formula=c("~ cell + dex", "~ dex"),
lfcThreshold=c(0,1),
min_count=c(1,5))
control_df <- control_df %>% mutate(rn=row_number(),
se=map(rn, ~se))
control_df
## # A tibble: 8 x 5
## formula lfcThreshold min_count rn
## <chr> <dbl> <dbl> <int>
## 1 ~ cell + dex 0 1 1
## 2 ~ cell + dex 0 5 2
## 3 ~ cell + dex 1 1 3
## 4 ~ cell + dex 1 5 4
## 5 ~ dex 0 1 5
## 6 ~ dex 0 5 6
## 7 ~ dex 1 1 7
## 8 ~ dex 1 5 8
## # ... with 1 more variables: se <list>
Using map2 we can execute the DESeqDataSet function for each row taking the design formula and SummarizedExperiment object as inputs:
results_df <- control_df %>%
mutate(dds=map2(se, formula, ~DESeqDataSet(.x, design=as.formula(.y))))
results_df
## # A tibble: 8 x 6
## formula lfcThreshold min_count rn
## <chr> <dbl> <dbl> <int>
## 1 ~ cell + dex 0 1 1
## 2 ~ cell + dex 0 5 2
## 3 ~ cell + dex 1 1 3
## 4 ~ cell + dex 1 5 4
## 5 ~ dex 0 1 5
## 6 ~ dex 0 5 6
## 7 ~ dex 1 1 7
## 8 ~ dex 1 5 8
## # ... with 2 more variables: se <list>, dds <list>
At this point it is worth sanity checking that we have done what we wanted to do by comparing the input formula with that extracted from the DESeqDataSet:
results_df$formula
## [1] "~ cell + dex" "~ cell + dex" "~ cell + dex" "~ cell + dex"
## [5] "~ dex" "~ dex" "~ dex" "~ dex"
map_chr(results_df$dds, ~design(.) %>% as.character() %>% paste(collapse=' '))
## [1] "~ cell + dex" "~ cell + dex" "~ cell + dex" "~ cell + dex"
## [5] "~ dex" "~ dex" "~ dex" "~ dex"
Then finish off preparing the DESeqDataSet object by removing genes below a certain read count:
results_df <- results_df %>%
mutate(dds=map2(dds, min_count, ~.x[ rowSums(counts(.x)) > .y , ]),
dds=map(dds, estimateSizeFactors))
And then do the differential expression analysis itself and extract the results:
results_df <- results_df %>%
mutate(dds=map(dds, ~DESeq(., quiet=TRUE)),
res=map2(dds, lfcThreshold, ~results(.x, lfcThreshold=.y, tidy = TRUE) %>% as_tibble()))
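map2 covers steps that take two inputs per row. For steps that need three or more, purrr also provides pmap, which takes a list of columns and calls the function once per row. The sketch below shows the idea on a made-up simulation grid rather than the DESeq2 objects: the columns mu, sigma and n and the use of rnorm are purely illustrative.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical parameter grid: 2 x 2 x 2 = 8 combinations
control <- crossing(mu = c(0, 10), sigma = c(1, 2), n = c(5, 50))

sims <- control %>%
  mutate(
    # pmap() calls the function once per row, passing the columns in order
    draws = pmap(list(n, mu, sigma),
                 function(n, mu, sigma) rnorm(n, mean = mu, sd = sigma)),
    n_drawn = map_int(draws, length)
  )
sims
```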
Viewing the results
At this point we can add a plot object to the mix like we did for the gapminder example. We define a plotting function to make the analysis more readable:
#define a plot function
my_plot <- function(df, pthresh) {
ggplot(df, aes(log2(baseMean), log2FoldChange, colour=padj<pthresh)) +
geom_point(size=rel(0.5)) +
theme_bw()
}
#make the plots
plots_df <- results_df %>%
mutate(ma_plot=map2(res, rn, ~my_plot(.x, 0.1) + ggtitle(.y)))
plots_df
## # A tibble: 8 x 8
## formula lfcThreshold min_count rn
## <chr> <dbl> <dbl> <int>
## 1 ~ cell + dex 0 1 1
## 2 ~ cell + dex 0 5 2
## 3 ~ cell + dex 1 1 3
## 4 ~ cell + dex 1 5 4
## 5 ~ dex 0 1 5
## 6 ~ dex 0 5 6
## 7 ~ dex 1 1 7
## 8 ~ dex 1 5 8
## # ... with 4 more variables: se <list>, dds <list>, res <list>,
## # ma_plot <list>
We can then display the plots from the ma_plot list-col:
cowplot::plot_grid(plotlist=plots_df$ma_plot[1:4])
Or we can filter the data frame and just show some plots, for example those where the minimum read count was 1:
plots_df %>%
dplyr::filter(min_count==1) %>%
dplyr::select(ma_plot) %>%
.$ma_plot %>%
cowplot::plot_grid(plotlist=.)
Limitations
Although the list-col approach provides a useful framework for this type of analysis it does have some limitations:
- All of the data and results have to be held in the memory of one session.
- You can end up with multiple copies of data and very large data frames, since models and plots will often contain the same data within the R object.
- Operations on a normal list can be easily parallelised using parallel::parLapply for example. This is more difficult using the list-col approach although one way is to split the tibble into a list of tibbles, and then parallelise across this.
- It can also be hard to wrap your brain around the necessary levels of abstraction!
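The split-and-parallelise workaround mentioned above can be sketched as follows. This is a rough illustration under stated assumptions rather than a tested recipe for the DESeq2 workflow: it uses a made-up control table built with base R's expand.grid, a trivial multiplication in place of the expensive step, and a local cluster of two workers.

```r
library(parallel)

# Hypothetical control table: one row per parameter combination
control <- expand.grid(a = 1:3, b = c(10, 20))

# Split into a list of one-row data frames
rows <- split(control, seq_len(nrow(control)))

cl <- makeCluster(2)  # two local worker processes; adjust to your machine
pieces <- parLapply(cl, rows, function(row) {
  row$product <- row$a * row$b  # stand-in for an expensive step such as DESeq()
  row
})
stopCluster(cl)

# Bind the pieces back into a single table
result <- do.call(rbind, pieces)
result
```

With real tibbles that hold list-cols, dplyr::bind_rows is likely to be a better choice than rbind for the final step, and any packages the function needs must also be loaded on the workers.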
Conclusion
Storing and manipulating objects in a data frame is a powerful and useful framework for an analysis since it keeps related things together and avoids code repetition and bloat. Analyses are parameterised, which makes it very easy to add new values or change existing ones without copying and pasting big bits of code or having huge complicated functions. Although in this post it is applied to bioinformatics it could be applied to anything where work is done in specialised R object classes.
There is lots of work being done in the R community to extend these concepts, so keep an eye out for online presentations and tutorials!!