Note: You can download all workshop materials here, or visit kateto.net/netscix2016.
This tutorial covers the basics of network analysis and visualization with the R package igraph (maintained by Gabor Csardi and Tamas Nepusz). The igraph library provides versatile options for descriptive network analysis and visualization in R, Python, and C/C++. This workshop will focus on the R implementation. You will need an R installation, and RStudio. You should also install the latest version of igraph for R:
install.packages("igraph")
Before we start working with networks, we will go through a quick introduction/reminder of some simple tasks and principles in R.
You can assign a value to an object using assign(), <-, or =.
x <- 3 # Assignment
x # Evaluate the expression and print result
y <- 4 # Assignment
y + 5 # Evaluation, y remains 4
z <- x + 17*y # Assignment
z # Evaluation
rm(z) # Remove z: deletes the object.
z # Error!
We can use the standard operators <, >, <=, >=, == (equality) and != (inequality). Comparisons return Boolean values: TRUE or FALSE (often abbreviated to just T and F).
2==2 # Equality
2!=2 # Inequality
x <= y # less than or equal: "<", ">", and ">=" also work
Special constants include:
# NA - missing or undefined data
5 + NA # When used in an expression, the result is generally NA
is.na(5+NA) # Check if missing
# NULL - an empty object, e.g. a null/empty list
10 + NULL # Using NULL in an expression returns an empty object (length zero)
is.null(NULL) # check if NULL
Inf and -Inf represent positive and negative infinity. They can be returned by mathematical operations like division of a number by zero:
5/0
is.finite(5/0) # Check if a number is finite (it is not).
NaN (Not a Number) - the result of an operation that cannot be reasonably defined, such as dividing zero by zero.
0/0
is.nan(0/0)
Vectors can be constructed by combining their elements with the important R function c().
v1 <- c(1, 5, 11, 33) # Numeric vector, length 4
v2 <- c("hello","world") # Character vector, length 2 (a vector of strings)
v3 <- c(TRUE, TRUE, FALSE) # Logical vector, same as c(T, T, F)
Combining different types of elements in one vector will coerce the elements to the least restrictive type:
v4 <- c(v1,v2,v3,"boo") # All elements turn into strings
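A quick way to see the coercion rule at work is to check the class of a few mixed vectors. The sketch below uses made-up values (the name `mixed` is only for illustration) and relies on base R's usual ordering, where logical values give way to numbers and numbers give way to strings.

```r
# Mixed-type vectors are coerced to the most flexible type present
class(c(TRUE, FALSE))      # "logical"
class(c(TRUE, 2))          # "numeric": TRUE becomes 1
class(c(TRUE, 2, "boo"))   # "character": everything becomes a string

mixed <- c(TRUE, 2, "boo")
mixed                      # "TRUE" "2" "boo"
```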
Other ways to create vectors include:
v <- 1:7 # same as c(1,2,3,4,5,6,7)
v <- rep(0, 77) # repeat zero 77 times: v is a vector of 77 zeroes
v <- rep(1:3, times=2) # Repeat 1,2,3 twice
v <- rep(1:10, each=2) # Repeat each element twice
v <- seq(10,20,2) # sequence: numbers between 10 and 20, in jumps of 2
v1 <- 1:5 # 1,2,3,4,5
v2 <- rep(1,5) # 1,1,1,1,1
Check the length of a vector:
length(v1)
length(v2)
Element-wise operations:
v1 + v2 # Element-wise addition
v1 + 1 # Add 1 to each element
v1 * 2 # Multiply each element by 2
v1 + c(1,7) # Careful: c(1,7) is shorter, so R recycles it and gives a warning
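Strictly speaking, R does not stop with an error when the two vectors differ in length: it recycles the shorter one, and warns only when the longer length is not a multiple of the shorter. A small sketch of that behavior (the name `res` is just for illustration):

```r
v1 <- 1:5
# Length 5 is not a multiple of length 2: c(1,7) is recycled with a warning
res <- suppressWarnings(v1 + c(1, 7))
res                 # 2 9 4 11 6
# Length 6 is a multiple of 2: recycled silently, no warning
1:6 + c(1, 7)       # 2 9 4 11 6 13
```

Because silent recycling can hide bugs, it is usually worth checking lengths before combining vectors.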
Mathematical operations:
sum(v1) # The sum of all elements
mean(v1) # The average of all elements
sd(v1) # The standard deviation
cor(v1,v1*5) # Correlation between v1 and v1*5
Logical operations:
v1 > 2 # Each element is compared to 2, returns logical vector
v1==v2 # Are corresponding elements equivalent, returns logical vector.
v1!=v2 # Are corresponding elements *not* equivalent? Same as !(v1==v2)
(v1>2) | (v2>0) # | is the boolean OR, returns a vector.
(v1>2) & (v2>0) # & is the boolean AND, returns a vector.
(v1>2) || (v2>0) # || is the scalar OR: it expects single values (an error for longer vectors in R >= 4.3)
(v1>2) && (v2>0) # && is the scalar AND, ditto
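Since `||` and `&&` are meant for single TRUE/FALSE values, one common pattern is to reduce each logical vector with `any()` or `all()` first. The following is a minimal sketch of that pattern, not the only way to do it:

```r
v1 <- 1:5
v2 <- rep(1, 5)
# Reduce each logical vector to a single TRUE/FALSE first
any(v1 > 2) || all(v2 > 0)   # TRUE
any(v1 > 2) && all(v2 > 1)   # FALSE: no element of v2 exceeds 1
# Typical use: a scalar condition inside if()
if (length(v1) > 0 && all(v1 > 0)) print("all positive")
```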
Vector elements:
v1[3] # third element of v1
v1[2:4] # elements 2, 3, 4 of v1
v1[c(1,3)] # elements 1 and 3 - note that your indexes are a vector
v1[c(T,T,F,F,F)] # elements 1 and 2 - only the ones that are TRUE
v1[v1>3] # v1>3 is a logical vector TRUE for elements >3
Note that the indexing in R starts from 1, a fact known to confuse and upset people used to languages that index from 0.
To add more elements to a vector, simply assign them values.
v1[6:10] <- 6:10
We can also directly assign the vector a length:
length(v1) <- 15 # the last 5 elements are added as missing data: NA
Factors are used to store categorical data.
eye.col.v <- c("brown", "green", "brown", "blue", "blue", "blue") #vector
eye.col.f <- factor(c("brown", "green", "brown", "blue", "blue", "blue")) #factor
eye.col.v
## [1] "brown" "green" "brown" "blue" "blue" "blue"
eye.col.f
## [1] brown green brown blue blue blue
## Levels: blue brown green
R will identify the different levels of the factor - i.e. all distinct values. The data is stored internally as integers - each number corresponding to a factor level.
levels(eye.col.f) # The levels (distinct values) of the factor (categorical var)
## [1] "blue" "brown" "green"
as.numeric(eye.col.f) # As numeric values: 1 is blue, 2 is brown, 3 is green
## [1] 2 3 2 1 1 1
as.numeric(eye.col.v) # The character vector can not be coerced to numeric
## Warning: NAs introduced by coercion
## [1] NA NA NA NA NA NA
as.character(eye.col.f)
## [1] "brown" "green" "brown" "blue" "blue" "blue"
as.character(eye.col.v)
## [1] "brown" "green" "brown" "blue" "blue" "blue"
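The integer storage matters most when factor labels look like numbers. In the sketch below (with invented values), `as.numeric()` on the factor returns the level codes, while going through `as.character()` first recovers the original numbers.

```r
f <- factor(c("10", "20", "10", "30"))
levels(f)                      # "10" "20" "30"
codes  <- as.numeric(f)        # 1 2 1 3: the internal level codes
values <- as.numeric(as.character(f))   # 10 20 10 30: the original values
```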
A matrix is a vector with dimensions:
m <- rep(1, 20) # A vector of 20 elements, all 1
dim(m) <- c(5,4) # Dimensions set to 5 & 4, so m is now a 5x4 matrix
Creating a matrix using matrix():
m <- matrix(data=1, nrow=5, ncol=4) # same matrix as above, 5x4, full of 1s
m <- matrix(1,5,4) # same matrix as above
dim(m) # What are the dimensions of m?
## [1] 5 4
Creating a matrix by combining vectors:
m <- cbind(1:5, 5:1, 5:9) # Bind 3 vectors as columns, 5x3 matrix
m <- rbind(1:5, 5:1, 5:9) # Bind 3 vectors as rows, 3x5 matrix
Selecting matrix elements:
m <- matrix(1:10,10,10)
m[2,3] # Matrix m, row 2, column 3 - a single cell
m[2,] # The whole second row of m as a vector
m[,2] # The whole second column of m as a vector
m[1:2,4:6] # submatrix: rows 1 and 2, columns 4, 5 and 6
m[-1,] # all rows *except* the first one
Other operations with matrices:
# Are elements in row 1 equivalent to corresponding elements from column 1:
m[1,]==m[,1]
# A logical matrix: TRUE for m elements >3, FALSE otherwise:
m>3
# Selects only TRUE elements - that is ones greater than 3:
m[m>3]
t(m) # Transpose m
m <- t(m) # Assign m the transposed m
m %*% t(m) # %*% does matrix multiplication
m * m # * does element-wise multiplication
Arrays are used when we have more than 2 dimensions. We can create them using the array() function:
a <- array(data=1:18,dim=c(3,3,2)) # 3d with dimensions 3x3x2
a <- array(1:18,c(3,3,2)) # the same array
Lists are collections of objects. A single list can contain all kinds of elements - character strings, numeric vectors, matrices, other lists, and so on. The elements of lists are often named for easier access.
l1 <- list(boo=v1,foo=v2,moo=v3,zoo="Animals!") # A list with four components
l2 <- list(v1,v2,v3,"Animals!")
Create an empty list:
l3 <- list()
l4 <- NULL
Accessing list elements:
l1["boo"] # Access boo with single brackets: this returns a list.
l1[["boo"]] # Access boo with double brackets: this returns the numeric vector
l1[[1]] # Returns the first component of the list, equivalent to above.
l1$boo # Named elements can be accessed with the $ operator, as with [[]]
Adding more elements to a list:
l3[[1]] <- 11 # add an element to the empty list l3
l4[[3]] <- c(22, 23) # add a vector as element 3 in the empty list l4.
Since we added element 3 to the list l4 above, elements 1 and 2 will be generated and empty (NULL).
l1[[5]] <- "More elements!" # The list l1 had 4 elements, we're adding a 5th here.
l1[[8]] <- 1:11
We added an 8th element, but not 6th and 7th to the list l1 above. Elements number 6 and 7 will be created empty (NULL).
l1$Something <- "A thing" # Adds a ninth element - "A thing", named "Something"
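To check which positions of a list were filled in as NULL, and to drop them if they are not wanted, something like the following can be used. This is a small sketch on a fresh list; the names `l` and `gaps` are only for illustration.

```r
l <- list()
l[[3]] <- "third"            # elements 1 and 2 are created as NULL
length(l)                    # 3
gaps <- sapply(l, is.null)   # TRUE TRUE FALSE
# Keep only the elements that are not NULL
l <- l[!gaps]
length(l)                    # 1
```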
The data frame is a special kind of list used for storing dataset tables. Think of rows as cases, columns as variables. Each column is a vector or factor.
Creating a dataframe:
dfr1 <- data.frame( ID=1:4,
FirstName=c("John","Jim","Jane","Jill"),
Female=c(F,F,T,T),
Age=c(22,33,44,55) )
dfr1$FirstName # Access the second column of dfr1.
## [1] John Jim Jane Jill
## Levels: Jane Jill Jim John
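Notice that the output above treats dfr1$FirstName as a categorical variable, that is, a factor rather than a character vector. This was the default behavior of data.frame() before R 4.0.0; in newer versions, strings stay character vectors unless you ask for factors. If you do end up with a factor, you can convert ‘FirstName’ back to a vector: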
dfr1$FirstName <- as.vector(dfr1$FirstName)
Alternatively, you can tell R you don’t like factors from the start using stringsAsFactors=FALSE (the default since R 4.0.0):
dfr2 <- data.frame(FirstName=c("John","Jim","Jane","Jill"), stringsAsFactors=F)
dfr2$FirstName # Success: not a factor.
## [1] "John" "Jim" "Jane" "Jill"
Access elements of the data frame:
dfr1[1,] # First row, all columns
dfr1[,1] # First column, all rows
dfr1$Age # Age column, all rows
dfr1[1:2,3:4] # Rows 1 and 2, columns 3 and 4 - the gender and age of John & Jim
dfr1[c(1,3),] # Rows 1 and 3, all columns
Find the names of everyone over the age of 30 in the data:
dfr1[dfr1$Age>30,2]
## [1] "Jim" "Jane" "Jill"
Find the average age of all females in the data:
mean ( dfr1[dfr1$Female==TRUE,4] )
## [1] 49.5
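Selecting columns by position (2, 4) works, but breaks quietly if the column order changes. As a sketch, the same two queries can be written with column names; the data frame is rebuilt here under the name `dfr` so the example runs on its own.

```r
dfr <- data.frame(ID = 1:4,
                  FirstName = c("John", "Jim", "Jane", "Jill"),
                  Female = c(FALSE, FALSE, TRUE, TRUE),
                  Age = c(22, 33, 44, 55),
                  stringsAsFactors = FALSE)
# Names of everyone over 30, selecting the column by name
over30 <- dfr[dfr$Age > 30, "FirstName"]   # "Jim" "Jane" "Jill"
# Average age of all females; Female is already a logical vector
mean_f <- mean(dfr$Age[dfr$Female])        # 49.5
# subset() evaluates conditions inside the data frame
subset(dfr, Female & Age > 30, select = c(FirstName, Age))
```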
The controls and loops in R are fairly straightforward (see below). They determine if a block of code will be executed, and how many times. Blocks of code in R are enclosed in curly brackets {}.
# if (condition) expr1 else expr2
x <- 5; y <- 10
if (x==0) y <- 0 else y <- y/x
y
## [1] 2
# for (variable in sequence) expr
ASum <- 0; AProd <- 1
for (i in 1:x)
{
ASum <- ASum + i
AProd <- AProd * i
}
ASum # equivalent to sum(1:x)
## [1] 15
AProd # equivalent to prod(1:x)
## [1] 120
# while (condition) expr
while (x > 0) {print(x); x <- x-1;}
# repeat expr, use break to exit the loop
repeat { print(x); x <- x+1; if (x>10) break}
In most R functions, you can use named colors, hex, or RGB values. In the simple base R plot below, x and y are the point coordinates, pch is the point symbol shape, cex is the point size, and col is the color. To see the parameters for plotting in base R, check out ?par.
plot(x=1:10, y=rep(5,10), pch=19, cex=3, col="dark red")
points(x=1:10, y=rep(6, 10), pch=19, cex=3, col="#557799")
points(x=1:10, y=rep(4, 10), pch=19, cex=3, col=rgb(.25, .5, .3))
You may notice that RGB here ranges from 0 to 1. While this is the R default, you can also set it to the 0-255 range using something like rgb(10, 100, 100, maxColorValue=255).
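To see how the two RGB scales and the hex notation relate, it can help to print the converted values. A short sketch using base R's `rgb()` and `col2rgb()`:

```r
rgb(10, 100, 100, maxColorValue = 255)   # "#0A6464"
rgb(1, 0, 0)                             # "#FF0000", default 0-1 scale
# Going the other way: hex string to 0-255 components
col2rgb("#557799")                       # red 85, green 119, blue 153
```

Note that hex colors need the leading `#`; without it R looks for a named color and fails.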
We can set the opacity/transparency of an element using the parameter alpha (range 0-1):
plot(x=1:5, y=rep(5,5), pch=19, cex=12, col=rgb(.25, .5, .3, alpha=.5), xlim=c(0,6))
If we have a hex color representation, we can set the transparency alpha using adjustcolor from package grDevices. For fun, let’s also set the plot background to gray using the par() function for graphical parameters.
par(bg="gray40")
col.tr <- grDevices::adjustcolor("#557799", alpha=0.7)
plot(x=1:5, y=rep(5,5), pch=19, cex=12, col=col.tr, xlim=c(0,6))
If you plan on using the built-in color names, here’s how to list all of them:
colors() # List all named colors
grep("blue", colors(), value=T) # Colors that have "blue" in the name
In many cases, we need a number of contrasting colors, or multiple shades of a color. R comes with some predefined palette functions that can generate those for us. For example:
pal1 <- heat.colors(5, alpha=1) # 5 colors from the heat palette, opaque
pal2 <- rainbow(5, alpha=.5) # 5 colors from the rainbow palette, transparent
plot(x=1:10, y=1:10, pch=19, cex=5, col=pal1)
plot(x=1:10, y=1:10, pch=19, cex=5, col=pal2)
We can also generate our own gradients using colorRampPalette. Note that colorRampPalette returns a function that we can use to generate as many colors from that palette as we need.
palf <- colorRampPalette(c("gray80", "dark red"))
plot(x=10:1, y=1:10, pch=19, cex=5, col=palf(10))
To add transparency to colorRampPalette, you need to use a parameter alpha=TRUE:
palf <- colorRampPalette(c(rgb(1,1,1, .2),rgb(.8,0,0, .7)), alpha=TRUE)
plot(x=10:1, y=1:10, pch=19, cex=5, col=palf(10))
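The point that `colorRampPalette` returns a function, not a vector of colors, is easy to miss. The sketch below inspects the result without plotting; the name `cols` is just for illustration.

```r
palf <- colorRampPalette(c("gray80", "darkred"))
is.function(palf)     # TRUE: palf is a palette generator
cols <- palf(4)       # ask it for four colors along the gradient
length(cols)          # 4
cols                  # hex strings, from light gray to dark red
length(palf(10))      # the same generator can return any number of steps
```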
While I generate many (and often very creative) errors in R, there are three simple things that will most often go wrong for me. Those include:
Capitalization. R is case sensitive - a graph vertex named “Jack” is not the same as one named “jack”. The function rowSums won’t work if spelled as rowsums or RowSums.
Object class. While many functions are willing to take anything you throw at them, some will still surprisingly require a character vector or a factor instead of a numeric vector, or a matrix instead of a data frame. Functions will also occasionally return results in unexpected formats.
Package namespaces. Occasionally problems will arise when different packages contain functions with the same name. R may warn you about this by saying something like “The following object(s) are masked from ‘package:igraph’” as you load a package. One way to deal with this is to call functions from a package explicitly using ::. For instance, if function blah() is present in packages A and B, you can call A::blah and B::blah. In other cases the problem is more complicated, and you may have to load packages in a certain order, or not use them together at all. For example (and pertinent to this workshop), igraph and Statnet packages cause some problems when loaded at the same time. It is best to detach one before loading the other.
library(igraph) # load a package
detach(package:igraph) # detach a package
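Masking can be reproduced without any extra packages. In this sketch we deliberately define our own `sd()` (a hypothetical stand-in for a clashing package function) and then reach the original through `stats::sd`.

```r
# A user-defined sd() in the workspace masks stats::sd()
sd <- function(x) "masked!"
masked_result <- sd(1:5)   # "masked!": our version is found first
real_result <- stats::sd(1:5)   # the real standard deviation, via ::
rm(sd)                     # remove our version; sd() is stats::sd again
sd(1:5)
```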
For more advanced troubleshooting, check out try(), tryCatch(), and debug().
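As a rough illustration of `tryCatch()`, the sketch below wraps `log()` in a hypothetical helper, `safe_log()`, that returns NA instead of warning or stopping. The helper name and the choice to return NA are assumptions made for this example only.

```r
safe_log <- function(x) {
  tryCatch(
    log(x),
    warning = function(w) NA_real_,  # e.g. log(-1) warns "NaNs produced"
    error   = function(e) NA_real_   # e.g. log("a") is an error
  )
}
safe_log(10)    # works as usual
safe_log(-1)    # NA instead of NaN plus a warning
safe_log("a")   # NA instead of stopping with an error
```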
rm(list = ls()) # Remove all the objects we created so far.
library(igraph) # Load the igraph package
The code below generates an undirected graph with three edges. The numbers are interpreted as vertex IDs, so the edges are 1--2, 2--3, 3--1.
g1 <- graph( edges=c(1,2, 2,3, 3, 1), n=3, directed=F )
plot(g1) # A simple plot of the network - we'll talk more about plots later
class(g1)
## [1] "igraph"
g1
## IGRAPH U--- 3 3 --
## + edges:
## [1] 1--2 2--3 1--3
# Now with 10 vertices, and directed by default:
g2 <- graph( edges=c(1,2, 2,3, 3, 1), n=10 )
plot(g2)
g2
## IGRAPH D--- 10 3 --
## + edges:
## [1] 1->2 2->3 3->1
g3 <- graph( c("John", "Jim", "Jim", "Jill", "Jill", "John")) # named vertices
# When the edge list has vertex names, the number of nodes is not needed
plot(g3)
g3
## IGRAPH DN-- 3 3 --
## + attr: name (v/c)
## + edges (vertex names):
## [1] John->Jim Jim ->Jill Jill->John
g4 <- graph( c("John", "Jim", "Jim", "Jack", "Jim", "Jack", "John", "John"),
isolates=c("Jesse", "Janis", "Jennifer", "Justin") )
# In named graphs we can specify isolates by providing a list of their names.
plot(g4, edge.arrow.size=.5, vertex.color="gold", vertex.size=15,
vertex.frame.color="gray", vertex.label.color="black",
vertex.label.cex=0.8, vertex.label.dist=2, edge.curved=0.2)
Small graphs can also be generated with a description of this kind: - for undirected tie, +- or -+ for directed ties pointing left & right, ++ for a symmetric tie, and “:” for sets of vertices.
plot(graph_from_literal(a---b, b---c)) # the number of dashes doesn't matter
plot(graph_from_literal(a--+b, b+--c))
plot(graph_from_literal(a+-+b, b+-+c))
plot(graph_from_literal(a:b:c---c:d:e))
gl <- graph_from_literal(a-b-c-d-e-f, a-g-h-b, h-e:f:i, j)
plot(gl)
Access vertices and edges:
E(g4) # The edges of the object
## + 4/4 edges (vertex names):
## [1] John->Jim Jim ->Jack Jim ->Jack John->John
V(g4) # The vertices of the object
## + 7/7 vertices, named:
## [1] John Jim Jack Jesse Janis Jennifer Justin
You can also examine the network matrix directly:
g4[]
## 7 x 7 sparse Matrix of class "dgCMatrix"
## John Jim Jack Jesse Janis Jennifer Justin
## John 1 1 . . . . .
## Jim . . 2 . . . .
## Jack . . . . . . .
## Jesse . . . . . . .
## Janis . . . . . . .
## Jennifer . . . . . . .
## Justin . . . . . . .
g4[1,]
## John Jim Jack Jesse Janis Jennifer Justin
## 1 1 0 0 0 0 0
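If an ordinary (dense) R matrix is more convenient than the sparse one, `as_adjacency_matrix()` with `sparse=FALSE` should give one. The sketch below uses a small stand-in graph `g` rather than g4, built with `make_graph()`.

```r
library(igraph)
# Small directed graph with a repeated Jim -> Jack edge
g <- make_graph(c("John", "Jim", "Jim", "Jack", "Jim", "Jack"))
m <- as_adjacency_matrix(g, sparse = FALSE)
is.matrix(m)       # TRUE: a regular base R matrix
m["Jim", "Jack"]   # 2: cells count the edges between the two vertices
m
```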
Add attributes to the network, vertices, or edges:
V(g4)$name # automatically generated when we created the network.
## [1] "John" "Jim" "Jack" "Jesse" "Janis" "Jennifer"
## [7] "Justin"
V(g4)$gender <- c("male", "male", "male", "male", "female", "female", "male")
E(g4)$type <- "email" # Edge attribute, assign "email" to all edges
E(g4)$weight <- 10 # Edge weight, setting all existing edges to 10
Examine attributes:
edge_attr(g4)
## $type
## [1] "email" "email" "email" "email"
##
## $weight
## [1] 10 10 10 10
vertex_attr(g4)
## $name
## [1] "John" "Jim" "Jack" "Jesse" "Janis" "Jennifer"
## [7] "Justin"
##
## $gender
## [1] "male" "male" "male" "male" "female" "female" "male"
graph_attr(g4)
## named list()
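Once attributes are set, they can be used to pick out vertices. The following sketch works on a small stand-in graph `g` (not g4) and shows two equivalent ways to filter by the `gender` attribute.

```r
library(igraph)
g <- make_graph(c("John", "Jim"), isolates = c("Janis", "Jennifer"))
V(g)$gender <- c("male", "male", "female", "female")
# Attribute names can be used directly inside V(g)[...]
females <- V(g)[gender == "female"]$name   # "Janis" "Jennifer"
# The same idea with plain vector indexing
males <- V(g)$name[V(g)$gender == "male"]  # "John" "Jim"
```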
Another way to set attributes (you can similarly use set_edge_attr(), set_vertex_attr(), etc.):
g4 <- set_graph_attr(g4, "name", "Email Network")
g4 <- set_graph_attr(g4, "something", "A thing")
graph_attr_names(g4)
## [1] "name" "something"
graph_attr(g4, "name")
## [1] "Email Network"
graph_attr(g4)
## $name
## [1] "Email Network"
##
## $something
## [1] "A thing"
g4 <- delete_graph_attr(g4, "something")
graph_attr(g4)
## $name
## [1] "Email Network"
plot(g4, edge.arrow.size=.5, vertex.label.color="black", vertex.label.dist=1.5,
vertex.color=c( "pink", "skyblue")[1+(V(g4)$gender=="male")] )
The graph g4 has two edges going from Jim to Jack, and a loop from John to himself. We can simplify our graph to remove loops & multiple edges between the same nodes. Use edge.attr.comb to indicate how edge attributes are to be combined - possible options include sum, mean, prod (product), min, max, first/last (selects the first/last edge’s attribute). Option “ignore” says the attribute should be disregarded and dropped.
g4s <- simplify( g4, remove.multiple = T, remove.loops = F,
edge.attr.comb=c(weight="sum", type="ignore") )
plot(g4s, vertex.label.dist=1.5)
g4s
## IGRAPH DNW- 7 3 -- Email Network
## + attr: name (g/c), name (v/c), gender (v/c), weight (e/n)
## + edges (vertex names):
## [1] John->John John->Jim Jim ->Jack
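To check what simplify() did to the weights, we can compare the graph before and after. This sketch rebuilds the four edges in a stand-in graph `g`, flags duplicates and loops with `which_multiple()` and `which_loop()`, and this time removes the loop as well.

```r
library(igraph)
g <- make_graph(c("John","Jim", "Jim","Jack", "Jim","Jack", "John","John"))
E(g)$weight <- 10
which_multiple(g)  # FALSE FALSE TRUE FALSE: the second Jim->Jack edge
which_loop(g)      # FALSE FALSE FALSE TRUE: John->John
gs <- simplify(g, remove.multiple = TRUE, remove.loops = TRUE,
               edge.attr.comb = list(weight = "sum"))
ecount(gs)         # 2 edges remain
E(gs)$weight       # 10 20: the duplicate edges' weights were summed
```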
The description of an igraph object starts with up to four letters:
D or U, for a directed or undirected graph
N for a named graph (where nodes have a name attribute)
W for a weighted graph (where edges have a weight attribute)
B for a bipartite (two-mode) graph (where nodes have a type attribute)
The two numbers that follow (7 3) refer to the number of nodes and edges in the graph. The description also lists node & edge attributes, for example:
(g/c) - graph-level character attribute
(v/c) - vertex-level character attribute
(e/n) - edge-level numeric attribute
Empty graph
eg <- make_empty_graph(40)
plot(eg, vertex.size=10, vertex.label=NA)
Full graph
fg <- make_full_graph(40)
plot(fg, vertex.size=10, vertex.label=NA)
Simple star graph
st <- make_star(40)
plot(st, vertex.size=10, vertex.label=NA)
Tree graph
tr <- make_tree(40, children = 3, mode = "undirected")
plot(tr, vertex.size=10, vertex.label=NA)
Ring graph
rn <- make_ring(40)
plot(rn, vertex.size=10, vertex.label=NA)
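A quick sanity check on these generators is to count their edges with `ecount()`. The sketch below repeats the calls above with `n` set to 40; the expected counts follow from the definition of each structure.

```r
library(igraph)
n <- 40
ecount(make_empty_graph(n))   # 0: no edges at all
ecount(make_full_graph(n))    # 780 = n*(n-1)/2, every pair connected
ecount(make_star(n))          # 39: the center linked to every other node
ecount(make_tree(n, children = 3, mode = "undirected"))  # 39 = n-1
ecount(make_ring(n))          # 40: one edge per node
```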
Erdos-Renyi random graph model
(‘n’ is number of nodes, ‘m’ is the number of edges).
er <- sample_gnm(n=100, m=40)
plot(er, vertex.size=6, vertex.label=NA)
Watts-Strogatz small-world model
Creates a lattice (with dim dimensions and size nodes across dimension) and rewires edges randomly with probability p. The neighborhood in which edges are connected is nei. You can allow loops and multiple edges.
sw <- sample_smallworld(dim=2, size=10, nei=1, p=0.1)
plot(sw, vertex.size=6, vertex.label=NA, layout=layout_in_circle)
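To connect the parameters to the resulting graph, we can count nodes and edges. In this sketch the seed is arbitrary and only there for reproducibility. The edge count assumes the starting lattice wraps around at its borders, so every node starts with four neighbors when nei=1 in two dimensions; rewiring moves edge endpoints but does not add or remove edges.

```r
library(igraph)
set.seed(42)        # arbitrary seed, for reproducibility only
sw <- sample_smallworld(dim = 2, size = 10, nei = 1, p = 0.1)
vcount(sw)          # 100 = size^dim
ecount(sw)          # 200: two edges per node in the starting lattice
mean(degree(sw))    # 4: the average degree is unchanged by rewiring
```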
Barabasi-Albert preferential attachment model for scale-free graphs
(n
is number of nodes,