Group Informatics: Social Network Analysis for Learning Analytics

Social Network Analysis Workshop for LASI 2017


Get R Installed on A Computer You Use, including These Core Components

Getting Started

Get started in RStudio on your own computer.

RStudio is a bit easier to install and manage locally than Python, primarily because it is a fully self-contained application that does not rely on your personal computer’s general configuration to the extent that Python does.

For one reason or another, it seems like students (besides weird superhuman computer engineering students) have a nightmarish impression of R. It’s really an amazing tool, with freely-available information and tutorials online. Once you get the basics down, R is happy to do almost all of the hard work for you.

On this page, you’ll learn how to install and use R, and RStudio — a wonderful, user-friendly workspace that is made specifically to work with R, and makes your life using R much more efficient and straightforward.

The steps that you can follow on this page are:

  1. Installing R
  2. Installing RStudio
  3. Exploring the RStudio environment
  4. Writing a simple script in RStudio
  5. Getting help in RStudio
  6. Loading data into RStudio
  7. Some basic functions for data analysis

Step 1. Install R

R is a free, independent, open-source language and environment for data analysis, statistics and graphing; the R language is a dialect of the S programming language. All of that is really great, because there are almost limitless resources available online to help you if you get stuck (individuals, companies and institutions have set up R resources to provide tutorials, data packages, etc.). The R Project is progressing due to the work of contributors who grow the available documentation and capabilities by adding new packages and functions.

You can download R at: http://www.r-project.org/

You’ll want to choose a CRAN Mirror close to your location, as instructed. Follow the steps at the link above to install R. At this point, you could start using R as is – however, I strongly recommend downloading RStudio to make your life easier.

Step 2. Install RStudio

Keep in mind that RStudio uses R – it is just a more user-friendly workspace that limits the code that you have to type in manually to make things happen. That means that you must install R for RStudio to work. Otherwise it’s just an empty shell.

Download and install the free open-source version of RStudio by visiting: www.rstudio.com

Step 3. Using RStudio – Exploring the Workspace

When you open RStudio, your beautiful workspace will appear, looking something like this:

image missing

Don’t panic. There’s a lot of text that you should read once (e.g. licensing, etc.), then not worry about. Let’s go through the different spaces in RStudio.

The CONSOLE WINDOW. This is that big space on the left, which initially contains all kinds of licensing text and some advice for getting help. Notice that there is a ‘>’ prompt at the bottom – that’s where you can type in commands (later on you’ll learn how to write a script instead of working in the console window). It’s also where you can see the outcome of your different data analyses (e.g. summaries of statistical tests, outcomes of calculations, etc.).

If the automated text gives you a headache, just go on up to Edit » Clear Console. Et voilà: a beautifully blank slate for you to start testing out.

So since we’re there, go ahead and try a few things. Assign some variables. You can do that by writing a variable name, followed by ‘<-’ or ‘=’, then inputting a value or function describing the variable. Press Enter, and the variable is stored in the Workspace – R remembers that you created that variable so that it can be used later.

For example, try the following:

Create a variable called ‘Dragon,’ which has a value of 4, as follows:

Dragon <- 4

Notice that when you press ‘Enter,’ the variable name and corresponding value are stored in the Workspace tab.

Now create a variable called ‘Unicorn,’ which has a value of 6:

Unicorn <- 6

Again, pressing enter stores the ‘Unicorn’ variable in the workspace. Now, calculate the multiplicative power of a Dragicorn by creating a new variable ‘Dragicorn’ that is the product of the Dragon and Unicorn variables:

Dragicorn <- Dragon*Unicorn

You’ll see that in the Workspace, ‘Dragicorn’ appears with a value of 24, as expected, because Dragicorns are very majestic and powerful as depicted below.

image missing

The main thing to realize is that when you enter a variable in the console and press ‘Enter,’ it is immediately stored in the Workspace where you can look through all existing variables and their value or structure (e.g., you could have vectors and data frames also loaded – you’ll learn that a little later on).

Also notice that when you’re working in the console window, each line of code you enter actually runs every time you press enter. You’ll also learn how to use scripts to write an entire story of code without having to run each line separately.
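As a quick recap of the console steps above, the sketch below repeats the three assignments and then lists what R has stored. The ‘Herd’ vector and the ls() call are additions for illustration only and are not part of the original walkthrough; ls() prints the same variable names that you see listed in the Workspace tab.

```r
# Assign values with '<-' (the '=' sign also works at the prompt)
Dragon  <- 4
Unicorn <- 6

# Build a new variable from existing ones
Dragicorn <- Dragon * Unicorn
print(Dragicorn)   # 24

# A vector holds several values under one name (hypothetical example)
Herd <- c(Dragon, Unicorn, Dragicorn)
print(Herd)

# List the names of everything currently stored in the workspace
print(ls())
```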

The WORKSPACE and HISTORY TABS. The Workspace tab (labeled ‘Environment’ in more recent versions of RStudio), as above, tells you which variables exist in your active RStudio environment. So that’s that. But there’s also a History tab up there. The History tab keeps a running log of each line of code that you run.

The FILES TAB. This is a file browser for the folders on your computer. Use it to find, open, rename or delete files, including your saved scripts and data files. It’s where Files go. Moving on.

The PLOTS TAB. If you run a line of graphing code (e.g. the plot() function) in RStudio without a line of code directing RStudio to open the graphs in a new window, they will appear here. Multiple graphics can be stored in the plot window simultaneously, which you can scroll through.

The PACKAGES TAB. This one is important. When you’re working in R or RStudio, you’re getting a lot of help from many past and current users who have created packages containing code to perform certain tests or analyses. Functions are contained in packages, and — depending on the function you want to use — you may need to install the package in which it exists. Or, you may just need to select it from the list of packages that are installed but not loaded in your workspace.
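Both actions described for the Packages tab can also be typed in the console. A minimal sketch: install.packages() downloads a package once per computer, and library() loads an installed package for the current session. The package name ‘igraph’ is only an assumed example (it is a network analysis package on CRAN); substitute whichever package you actually need. Those two lines are commented out so that the sketch runs without an internet connection.

```r
# install.packages("igraph")   # run once per computer; needs an internet connection
# library(igraph)              # run in every new session that uses the package

# The 'stats' package ships with R, so loading it needs no installation
library(stats)

# search() lists what is currently loaded; check that 'stats' is there
stats_loaded <- "package:stats" %in% search()
print(stats_loaded)
```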

The HELP TAB. This is where your helpful R documentation will come up when you need help using a function. You will learn how to get help a little further on down this page.

Step 4. Write a simple script

You can write and store all of your code right in the console window, but you really don’t want to. It’s difficult to format, almost impossible to follow, and generally just becomes a mess when you need to create a legible multi-line piece of code. So don’t do that. Instead, write a script.

A script is a text-only version of your code, which doesn’t run until you tell it to. You can save it, edit it just like you would a text document, add comments, format it, open it again in a new session, open someone else’s script and run it on your computer…all things that are way better than running code line by line in the console window.

So how do I create a script? First, you need to open a new space to create your script. Go to File » New » R Script. In your RStudio environment, you’ll see a new blank section appear — that’s where your script is going.

image missing

Try typing anything in the script window. Go ahead. Try it. Press enter. Notice that nothing happens. It works just like a text window until you tell RStudio that you want to run your code.

You can run the entire script at once using the shortcut: Ctrl + Shift + Enter. To run only the current line (or a highlighted selection), use Ctrl + Enter.

Since this might be your first introduction to working with scripts in RStudio, here are a few pointers:

  1. Include a descriptive header as a comment in EVERY script you write (use the # sign before each line if it is only a comment, and not a line of code that should be run).
  2. Comment thoroughly after each line of code to describe what you’re doing (you might be surprised how quickly you forget what you did).
  3. Format your code with spacing, indentation, etc. This makes a world of difference when you are trying to edit and troubleshoot.
  4. The text in your script window might not automatically wrap (which can be super obnoxious). Wrap text by: Tools » Options » Code Editing » Select “Soft-wrap R source files”.

That’s a lot of information. Let’s just look at an example – you can follow along by creating a simple script in your own RStudio environment.

Example: Creating an organized script

Here is an example script which simply calculates the miles traveled by multiplying the rate and time traveled. Create a script of your own in your script window.

image missing
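Since the screenshot is not available here, the following is a minimal sketch of what such a script could look like. The header layout, the variable names (‘Rate’, ‘Time’, ‘Miles’) and the numbers are all assumptions made up for illustration, not the contents of the original image.

```r
# Title:   Distance traveled
# Author:  (your name here)
# Purpose: Calculate miles traveled from a rate and a travel time

Rate <- 60      # rate of travel, in miles per hour (made-up value)
Time <- 2.5     # time traveled, in hours (made-up value)

# Distance is rate multiplied by time
Miles <- Rate * Time

# Show the result in the console
print(Miles)    # 150
```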

Once you’ve completed your simple script, press: CTRL + SHIFT + ENTER to run all of it simultaneously!

Some things to notice:

  1. While you were working on writing your script, nothing was stored in the workspace when you pressed ‘Enter’. That is because you weren’t actually running code. It’s only after you’ve told R to run the code that your variables are stored in the workspace.
  2. If your text is wrapping, you do not need to put a ‘#’ in front of the wrapped line. As long as it’s in green, the text is a comment.
  3. You can reassign variables at any time. Just remember they won’t be stored (and overwritten) until you’ve run the code again.

Congratulations on writing your first simple script!

Step 5. Ask R for help

One of the greatest things about R is that it’s free and open-source, which means that there is a TON of information about basically anything R-related online. Forums, documents, tutorials, etc. — Googling a certain function or method in R will almost always provide you with more information than you could want.

There is also, however, official R documentation built into the software that you can look to for help.

If you KNOW what function you want to use, type a question mark before the function name in the console window, then press enter to bring up the R documentation.

For example, if you type ?median into the console window, the R documentation will appear in the ‘Help’ tab of your RStudio environment. For the ‘median’ function help, the screen looks like this:

image missing

The components of the R help documents should be relatively self-explanatory. The description tells you what the function does, usage shows you how to use it, and arguments are the various components that can be included within the function. Beyond that, there is usually further information including examples for how to use the function.

But what if you DON’T know the exact function that you want to use?

If you’re not sure of a function’s exact name, type TWO question marks, followed by a word or fragment that you think appears in its name or description. When you press enter, a list of matching help topics will appear in the Help tab, and you can select the appropriate function. For example, running ??med in the console window will bring up a number of options, which you can look through to find and select ‘median.’
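The two help shortcuts described above look like this when typed out. As a small extra that is not covered in the text, args() prints a function’s arguments directly in the console, which can be handy when you only need a quick reminder; the ‘median_args’ variable is a hypothetical name used for illustration.

```r
# You know the exact function name: open its documentation
?median

# You only know part of the name: search the help pages
??med

# Quick reminder of a function's arguments, printed in the console
args(median)

# The same information as a vector of argument names
median_args <- names(formals(median))
print(median_args)
```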

Step 6. Loading data into RStudio

The easiest — and most common — file format to load data into RStudio is as a comma-separated value (.csv) file. For example, you can prepare and organize your data in Excel, then save the active worksheet with your data as a .csv file to load into your working environment.

To make your life easier, you will want to simplify your data frame as much as possible before saving as a .csv and loading into RStudio. Here are some things to think about:

Is there text information in the Excel worksheet that can be isolated in a separate metadata worksheet? You’ll want to remove excess text and metadata before you save as a .csv for RStudio. Are there units in header columns that can be moved to the metadata worksheet? Simplify header columns as much as possible — avoid spaces and punctuation.

Can I simplify the row or column names and include the full title in a metadata worksheet? Simplify all headers, rows and column names, to one descriptive word or abbreviation if possible. Make sure to have a metadata sheet available elsewhere to reference your abbreviated titles.

Are the rows or columns organized in a way that will be convenient when I load the data into RStudio? Do you need to transpose the data? Can you remove unnecessary spaces? Are the rows and columns aligned correctly?

Does the Excel worksheet contain titles or headings that I need to move? Move them. Save a separate metadata worksheet in Excel. Are there already statistical calculations performed in the Excel worksheet that I will not include in the data when I load it into RStudio? Remove unnecessary data.

Will the file name (FileName.csv) be convenient to work with in RStudio? Keep file names simple, one word if possible. Avoid spaces and punctuation in the file name. Alternatively, when you load the data into RStudio, you are given the option to rename the data frame; make it something simple at that point, if not before.

Here’s what you don’t want your data to look like when you load it into RStudio:

image missing

Instead, you want to SAVE ALL OF THAT IMPORTANT INFORMATION AS A METADATA FILE, but simplify for use as a .csv file in RStudio. For example, the following would be a great data frame to load into RStudio:

image missing

So now, you have a very clean, simple looking data set that is going to be totally dreamy when you load it into RStudio and try to work with it instead of being an absolute nightmare. Congratulations.

Once you have your data cleaned up and saved as a .csv file (save the active worksheet if using Excel), it is straightforward to load into RStudio. Just click ‘Import Dataset’ in your RStudio workspace:

image missing

A new preview window will appear where you can change the name of the data frame (if you do, remember to keep it simple) and change the separator (it should be ‘comma’ if you saved the file as a comma-separated value, or csv, file). Look over the preview and ensure that the data frame looks okay, then select ‘Import.’

Notice that the new data frame appears in a window (new tab, where you write scripts) and is loaded and stored in the workspace. Now you have something you can work with in RStudio, all loaded and ready for analysis.
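If you would rather load the file from a script than click through the import window, read.csv() does a similar job. The sketch below is only an illustration: so that it runs on any computer, it first writes a tiny made-up file to a temporary location. The file contents, the ‘csv_path’ variable and the ‘PuppyDemo’ name are all hypothetical.

```r
# Write a small made-up .csv file to a temporary location (illustration only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Day,Puppy1,Puppy2",
             "1,0.6,0.5",
             "2,0.7,0.6"), csv_path)

# header = TRUE tells R that the first row holds the column names
PuppyDemo <- read.csv(csv_path, header = TRUE)

# Check that the import looks right
str(PuppyDemo)
print(head(PuppyDemo))
```

With your own data, replace csv_path with the path to your file in quotes, for example "PuppyWeights.csv" if the file sits in your working directory.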

Step 7. Basic Functions for Data Analysis

So, you have your data loaded into RStudio. You know how to organize and put together a simple script. Let’s explore some basic, useful functions for data analysis in RStudio with a data frame.

Here, you will:

  1. Load a dataset into RStudio
  2. Find summary statistics of data in a data frame
  3. Perform basic calculations for a single column
  4. Calculate the mean and standard deviations for multiple columns
  5. Create a simple scatterplot
  6. Write a script to perform calculations and create a graph

This is a very basic introduction to several frequently used functions. To learn how to perform specific statistical tests (e.g. t-tests, regression, ANOVA, etc.) in RStudio, visit the Pick a Method pages for examples and instructional documents.

First, open the following .xls file so that you can follow along: image missing

The data are mock weights (pounds) for five puppies from a single litter in their first 10 days. Save the worksheet as a .csv file (PuppyWeights.csv), then load into RStudio as described above (make sure that RStudio knows that there are headings in your dataset).

Your RStudio environment should now contain the PuppyWeights data frame, and may look like this:

image missing

Let’s try to do a few things with this data, starting with finding some information about the data frame itself.

Explore your data in RStudio

If you just want to find some information about your data, try the following functions in the console window:

head(DatasetName) # Reports first 6 lines of data frame

tail(DatasetName) # Reports last 6 lines of data frame

View(DatasetName) # Reports entire data frame (CAPITALS MATTER!)

summary(DatasetName) # Reports summary statistics (min, quartiles, median, mean, max) for each column of data in the dataset

For example, using the head() function with the ‘PuppyWeights’ dataset yields:

image missing
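The screenshot is not available here, so the sketch below shows the same functions on a stand-in data frame. The weights are invented for illustration and are NOT the values from the workshop file; only two puppy columns are included to keep the example short.

```r
# Stand-in data: invented weights (pounds), not the workshop's actual file
PuppyWeights <- data.frame(
  Day    = 1:10,
  Puppy1 = c(0.6, 0.6, 0.7, 0.8, 0.9, 0.9, 1.0, 1.1, 1.2, 1.3),
  Puppy2 = c(0.5, 0.5, 0.6, 0.7, 0.8, 0.8, 0.9, 1.0, 1.1, 1.2)
)

first_rows <- head(PuppyWeights)   # first 6 rows
last_rows  <- tail(PuppyWeights)   # last 6 rows
print(first_rows)
print(last_rows)

# Summary statistics for every column
print(summary(PuppyWeights))

# View(PuppyWeights)   # opens a spreadsheet-style viewer; use in RStudio only
```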

Perform analysis on a single data column: the ‘$’ sign to call a column

Notice that each of the functions above acts on the entire data frame — because within the parentheses, you have not indicated any single column to isolate. To perform a function on a single column, use the dollar sign ($) between the dataset name and the column name to “call” the individual column.

For example, to find the range of weights for Puppy 2, you could use the range() function and indicate that you are interested in the ‘Puppy2’ column of the ‘PuppyWeights’ dataset. Like this:

image missing

Which tells you that the range of weights for the Puppy 2 data is from 0.5 to 1.2 pounds. You could similarly calculate the mean, standard deviation, variance, minimum, maximum or other summary statistics for individual columns.
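As a sketch of the ‘$’ notation, the code below calls range() and a few other summary functions on a single column. The ‘Puppy2’ values are invented stand-ins, chosen only so that they span the 0.5 to 1.2 pound range reported above; they are not the workshop’s actual data.

```r
# Stand-in data: invented values spanning 0.5 to 1.2 pounds
PuppyWeights <- data.frame(
  Day    = 1:10,
  Puppy2 = c(0.5, 0.5, 0.6, 0.7, 0.8, 0.8, 0.9, 1.0, 1.1, 1.2)
)

# DatasetName$ColumnName picks out one column as a vector
puppy2_range <- range(PuppyWeights$Puppy2)
print(puppy2_range)   # smallest and largest value

# Other summary statistics work the same way
puppy2_mean <- mean(PuppyWeights$Puppy2)
puppy2_sd   <- sd(PuppyWeights$Puppy2)
print(puppy2_mean)
print(puppy2_sd)
```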

Calculate summary statistics (mean, standard deviation) for multiple columns or rows using the apply() function

There are a number of ways to calculate summary statistics for multiple columns or rows in a data frame. Here, we will use the apply() function:

The apply() function

The apply() function lets you perform a function of your choice over a range of columns or rows in a data frame. Remember: if you forget how to use a function, type ‘?’ before the function name to bring up the R documentation in the Help tab.

The apply function works as follows:

apply(WhatData?, RowsorColumns?, WhichFunction?, …OptionalArguments)

Which still might take some explaining. So, in the first argument (shown as ‘WhatData?’ above), you need to tell the apply function which data you want to use. This can include a whole dataset, or just parts of it (for example, a range of columns or rows). The second argument indicates whether to perform the function over rows (insert 1 for rows) or columns (insert 2 for columns). The third argument is where you tell the apply function what you want to do…for example, calculate a mean, standard deviation, sum, etc.

Several examples using the ‘apply()’ function are shown below. Note that, for simplicity, these are typed directly into the console window. In practice, you’ll want to include them as part of your script instead.

Example 1: Find the maximum value of all columns in PuppyWeights. Notice that even the maximum value in the ‘Day’ column is reported.

image missing

Example 2: Find the maximum value of columns 2 – 6 (exclude the ‘Day’ column) in ‘PuppyWeights’ dataset.

image missing

Example 3: Find the mean puppy weight across ROWS for each day, excluding the values in the ‘Day’ column.

image missing

Example 4: Find the standard deviation (use the ‘sd’ function) across ROWS for each day, excluding values in the ‘Day’ column.

image missing

You may also be interested in the following functions: colMeans(), rowMeans(), aggregate()
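Because the screenshots for the four apply() examples are missing, here is a sketch of how each could be written, reconstructed from the example descriptions alone. The data frame is an invented, three-day stand-in for ‘PuppyWeights’ (made-up numbers, same column layout: ‘Day’ first, then five puppies), and the result names such as ‘max_all’ are hypothetical.

```r
# Stand-in data: invented weights for 3 days, not the workshop's actual file
PuppyWeights <- data.frame(
  Day    = 1:3,
  Puppy1 = c(0.6, 0.7, 0.8),
  Puppy2 = c(0.5, 0.6, 0.7),
  Puppy3 = c(0.7, 0.8, 0.9),
  Puppy4 = c(0.6, 0.6, 0.7),
  Puppy5 = c(0.6, 0.8, 0.9)
)

# Example 1: maximum of every column (2 = columns), 'Day' included
max_all <- apply(PuppyWeights, 2, max)

# Example 2: maximum of columns 2 to 6 only, which leaves out 'Day'
max_puppies <- apply(PuppyWeights[, 2:6], 2, max)

# Example 3: mean weight across each row (1 = rows), one value per day
day_means <- apply(PuppyWeights[, 2:6], 1, mean)

# Example 4: standard deviation across each row, one value per day
day_sds <- apply(PuppyWeights[, 2:6], 1, sd)

print(max_all)
print(max_puppies)
print(day_means)
print(day_sds)

# rowMeans() is a shortcut that gives the same numbers as Example 3
print(rowMeans(PuppyWeights[, 2:6]))
```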

Create a simple scatterplot

There are a number of ways to do most things in R, including making a scatterplot with continuous data. Here, you will learn to use the plot() function. Remember to use ?plot to read the R documentation if you get stuck.

The plot function works as follows:

plot(xvalues, yvalues, …optional commands)

The only required arguments are the values of the independent variable (first argument), and the corresponding values of the dependent variable (second argument). Beyond that, you can add axis labels (with xlab and ylab), a main title, a trendline, etc. See several examples below.

Example 1. Scatterplot with only x and y arguments.

image missing

image missing

Example 2. Scatterplot with many optional arguments (see if you can figure out what they do by exploring with your own code!)

image missing

image missing
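The code and plots for these two examples were images that are not available here. As a stand-in, the sketch below draws one bare scatterplot and one with optional arguments. The data values are invented, and the particular optional arguments (labels, title, symbol, color, trendline) are assumptions about what such an example might include, not a copy of the original Example 2.

```r
# Stand-in data: invented weights (pounds), not the workshop's actual file
PuppyWeights <- data.frame(
  Day    = 1:10,
  Puppy2 = c(0.5, 0.5, 0.6, 0.7, 0.8, 0.8, 0.9, 1.0, 1.1, 1.2)
)

# Example 1 style: only the x and y arguments
plot(PuppyWeights$Day, PuppyWeights$Puppy2)

# Example 2 style: optional arguments for labels, title, symbol and color
plot(PuppyWeights$Day, PuppyWeights$Puppy2,
     xlab = "Day",
     ylab = "Weight (pounds)",
     main = "Puppy 2 weight over the first 10 days",
     pch  = 19,        # solid circles
     col  = "blue")

# Add a straight trendline from a linear model fitted to the same data
trend <- lm(Puppy2 ~ Day, data = PuppyWeights)
abline(trend)
```

In RStudio both plots appear in the Plots tab, where you can use the arrows to flip between them.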

So now you’ve had a short introduction to some skills in RStudio. Since it’s all open access, you can look up any information you could possibly want about methods, examples and packages in R. If you are comfortable with what you’ve learned on this page, you should be fine following along with the various examples and documents in the Pick a Method pages to perform actual statistical analyses with your data.