This guide will get you started exploring and visualizing your own data with the R programming language. It introduces you to the tidyverse, a collection of data science tools within R for transforming and visualizing data. This is not the only set of tools in R, but it's a powerful and popular approach for exploring data. At every step, you'll be analyzing a real dataset called gapminder.
Gapminder tracks economic and social indicators like life expectancy and the GDP per capita of countries over time. The experience you gain on this example will help you in analyzing your own data. You'll learn to draw specific insights and communicate them through informative visualizations with the ggplot2 package.
The first code you'll write is to load two R packages, which is done by writing library(packagename). R packages are tools that aren't built into the language, but were created later by other programmers. Each of them provides tools that you don't have to write yourself. The first package is gapminder, created by Jenny Bryan, which contains the dataset that you'll be analyzing. The second package is dplyr, created by Hadley Wickham, which provides step-by-step tools for transforming this data, such as filtering, sorting, and summarizing it.
Once the packages are loaded, you type gapminder on its own line to display the contents of the gapminder object, which is structured as a data frame. A data frame keeps rectangular data in rows and columns, similar to a spreadsheet or a table in a SQL database. Most data analyses in R, and everything you'll do in this guide, are centered around data frames.
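The two steps described so far, loading the packages and printing the dataset, might look like the short sketch below. It assumes both packages are already installed (for example with install.packages()), which the text above does not cover.

```r
# Load the two packages used throughout this guide.
# Assumption: both were installed beforehand, e.g. with
# install.packages(c("gapminder", "dplyr")).
library(gapminder)
library(dplyr)

# Typing the object's name on its own line prints it.
gapminder
```

Note that library() only makes a package's contents available; it is the bare name gapminder that prints the data.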
As the first line of the output shows, this is a special type of data frame called a tibble. R displays only the first ten rows so that you can get a glimpse of the data. That same first line also tells you the tibble has 1,704 rows, each of which we call an observation, and six columns, each of which we call a variable.
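If you would rather confirm those numbers than read them off the printed header, a few base R functions report them directly. This is just one way to check, not a required step of the guide.

```r
library(gapminder)

nrow(gapminder)   # number of observations (rows)
ncol(gapminder)   # number of variables (columns)
names(gapminder)  # the column names
class(gapminder)  # includes "tbl_df", the class that marks a tibble
```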
It's important in an analysis to understand what each observation, or row, represents. Here, each represents a unique pair of a country and a year. For example,
- the first observation represents country statistics for Afghanistan in 1952,
- the second for Afghanistan in 1957, and so on.
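The statement that each row is a unique country and year pair can be checked rather than taken on faith. The sketch below uses count() from dplyr, a function this part of the guide has not introduced yet; pair_counts is a name made up for this example.

```r
library(gapminder)
library(dplyr)

# Count how many rows share each country-year pair.
# pair_counts is a hypothetical name used only in this example.
pair_counts <- gapminder %>%
  count(country, year)

# If every observation is a unique pair, no count exceeds 1.
any(pair_counts$n > 1)

# The first two observations described above.
head(gapminder, 2)
```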
For each combination of a country and year, the dataset contains several variables, or columns, describing the country's demographics. You can see the continent (in this case, Asia), the life expectancy in years, the population, and the GDP per capita. The GDP per capita is the country's total economic output (Gross Domestic Product) divided by its population, and it's a common measure of how wealthy a country is.
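Because GDP per capita is total GDP divided by population, you can recover an approximate total GDP by multiplying the two columns back together. The sketch below uses mutate() from dplyr, which this part of the guide has not covered; the column name total_gdp is invented for illustration and is not part of the dataset.

```r
library(gapminder)
library(dplyr)

# Total GDP is not stored in the dataset. Recover it from the definition
# gdpPercap = total GDP / pop. The name total_gdp is made up for this example.
with_gdp <- gapminder %>%
  mutate(total_gdp = gdpPercap * pop)

head(with_gdp, 3)
```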
Each variable is of one consistent data type: some are numbers, like life expectancy and population, and some are categorical, like country and continent. Even with this small glimpse of the data, you can extract a few insights. For example, you can see that Afghanistan's life expectancy and population have both gone up from 1952 to 1997, but that its GDP per capita has wavered. In the rest of this guide, you'll learn to use R to draw many conclusions about the social and economic history of countries around the world.
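Both observations in this paragraph, the column types and the Afghanistan trend, can be reproduced with a few lines. This is a sketch that uses filter() and select() from dplyr ahead of their introduction; afghanistan is a name chosen for this example.

```r
library(gapminder)
library(dplyr)

# One class per column; a factor is R's categorical type.
sapply(gapminder, class)

# The rows shown in the printed glimpse: Afghanistan, 1952 through 1997.
# afghanistan is a hypothetical name used only in this example.
afghanistan <- gapminder %>%
  filter(country == "Afghanistan", year <= 1997)

afghanistan %>%
  select(year, lifeExp, pop, gdpPercap)

# diff() gives the change between consecutive rows; a mix of positive
# and negative values means the series moved in both directions.
diff(afghanistan$gdpPercap)
```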
Loading the gapminder and dplyr packages
Before you can work with the gapminder dataset, you'll need to load two R packages that contain the tools for working with it, then display the gapminder dataset so that you can see what it contains.
Exercise 1:
- Use the library() function to load the dplyr package and the gapminder package.
- Type gapminder, on its own line, to look at the gapminder dataset.
Part 1 ends here. Stay tuned for part 2.
Cheers!