Analysing the social Star Wars network in The Attack of the Clones with R

2016-08-07

This is a free adaptation of two (very) clever analyses made by others:

The Star Wars Social Network by Evelina Gabasov in which program F# was mostly used to analyse the Star wars social networks
Analyzing networks of characters in ‘Love Actually’ by David Robinson in which R was used to analyse the links between the characters of the movie Love Actually.

The aim here is to try and reproduce Evelina’s analysis using R only, using David’s contribution plus several tweaks I found here and there on the internet. The R code and data are available on my GitHub page.

Disclaimer: The original blog posts are awesome and full of relevant details, check them out! My objective here was to teach myself how to manipulate data using trendy R packages and do some network analyses. Some comments below have been copied and pasted from these blogs, the credits entirely go to the authors Evelina and David. Last but not least, my code comes with mistakes probably.

Read and format data

First, read in data. I found the movie script in doc format here, which I converted in txt format for convenience. Then, apply various treatments to have the data ready for analysis. I use the old school way for modifying the original dataframe. Piping would have made the code more readable, but I do not feel confident with this approach yet.

# load convenient packages
library(dplyr)
library(stringr)
library(tidyr)

# read file line by line 
raw <- readLines("attack-of-the-clones.txt")

# create data frame
lines <- data_frame(raw = raw) 

# get rid of leading and trailing white spaces
# http://stackoverflow.com/questions/2261079/how-to-trim-leading-and-trailing-whitespace-in-r
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
lines <- mutate(lines,raw=trim(raw))

# get rid of the empty lines
lines2 <- filter(lines, raw != "")

# detect scenes: begin by EXT. or INT.
lines3 <-  mutate(lines2, is_scene = str_detect(raw, "T."),scene = cumsum(is_scene)) 

# drop lines that start with EXT. or INT.
lines4 <- filter(lines3,!is_scene)

# distinguish characters from what they say
lines5 <- separate(lines4, raw, c("speaker", "dialogue"), sep = ":", fill = "left",extra='drop')

# read in aliases (from Evelina's post)
aliases <- read.table('aliases.csv',sep=',',header=T,colClasses = "character")
aliases$Alias

##  [1] "BEN"           "SEE-THREEPIO"  "THREEPIO"      "ARTOO-DETOO"  
##  [5] "ARTOO"         "PALPATINE"     "DARTH SIDIOUS" "BAIL"         
##  [9] "MACE"          "WINDU"         "MACE-WINDU"    "NUTE"         
## [13] "AUNT BERU"     "DOOKU"         "BOBA"          "JANGO"        
## [17] "PANAKA"        "NUTE"          "KI-ADI"        "BIBBLE"       
## [21] "BIB"           "CHEWIE"        "VADER"

aliases$Name

##  [1] "OBI-WAN"        "C-3PO"          "C-3PO"          "R2-D2"         
##  [5] "R2-D2"          "EMPEROR"        "EMPEROR"        "BAIL ORGANA"   
##  [9] "MACE WINDU"     "MACE WINDU"     "MACE WINDU"     "NUTE GUNRAY"   
## [13] "BERU"           "COUNT DOOKU"    "BOBA FETT"      "JANGO FETT"    
## [17] "CAPTAIN PANAKA" "NUTE GUNRAY"    "KI-ADI-MUNDI"   "SIO BIBBLE"    
## [21] "BIB FORTUNA"    "CHEWBACCA"      "DARTH VADER"

# assign unique name to characters
# http://stackoverflow.com/questions/28593265/is-there-a-function-like-switch-which-works-inside-of-dplyrmutate
multipleReplace <- function(x, what, by) {
  stopifnot(length(what)==length(by))               
  ind <- match(x, what)
  ifelse(is.na(ind),x,by[ind])
}
lines6 <- mutate(lines5,speaker=multipleReplace(speaker,what=aliases$Alias,by=aliases$Name))

# read in actual names (from Evelina's post)
actual.names <- read.csv('characters.csv',header=F,colClasses = "character")
actual.names <- c(as.matrix(actual.names))
# filter out non-characters
lines7 <- filter(lines6,speaker %in% actual.names)

# group by scene
lines8 <- group_by(lines7, scene, line = cumsum(!is.na(speaker))) 

lines9 <- summarize(lines8, speaker = speaker[1], dialogue = str_c(dialogue, collapse = " "))

# Count the lines-per-scene-per-character
# Turn the result into a binary speaker-by-scene matrix
by_speaker_scene <- count(lines9, scene, speaker)
by_speaker_scene

## # A tibble: 447 x 3
## # Groups:   scene [321]
##    scene    speaker     n
##    <int>      <chr> <int>
##  1    26      PADME     1
##  2    27      PADME     1
##  3    29      PADME     1
##  4    48      PADME     1
##  5    50      PADME     2
##  6    66 MACE WINDU     1
##  7    67 MACE WINDU     1
##  8    69       YODA     1
##  9    70 MACE WINDU     1
## 10    74       YODA     1
## # ... with 437 more rows

library(reshape2)
speaker_scene_matrix <-acast(by_speaker_scene , speaker ~ scene, fun.aggregate = length)
dim(speaker_scene_matrix)

## [1]  19 321

Analyses

Hierarchical clustering

norm <- speaker_scene_matrix / rowSums(speaker_scene_matrix)
h <- hclust(dist(norm, method = "manhattan"))
plot(h)

Timeline

Use tree to give an ordering that puts similar characters close together

ordering <- h$labels[h$order]
ordering

##  [1] "MACE WINDU"  "YODA"        "SHMI"        "QUI-GON"     "PLO KOON"   
##  [6] "LAMA SU"     "OBI-WAN"     "BAIL ORGANA" "JAR JAR"     "POGGLE"     
## [11] "ANAKIN"      "PADME"       "CLIEGG"      "BERU"        "OWEN"       
## [16] "SIO BIBBLE"  "RUWEE"       "JOBAL"       "SOLA"

This ordering can be used to make other graphs more informative. For instance, we can visualize a timeline of all scenes:

scenes <-  filter(by_speaker_scene, n() > 1) # scenes with > 1 character
scenes2 <- ungroup(scenes)
scenes3 <- mutate(scenes2, scene = as.numeric(factor(scene)),
           character = factor(speaker, levels = ordering))
library(ggplot2)
ggplot(scenes3, aes(scene, character)) +
    geom_point() +
    geom_path(aes(group = scene))

Create a cooccurence matrix (see here) containing how many times two characters share scenes

cooccur <- speaker_scene_matrix %*% t(speaker_scene_matrix)
heatmap(cooccur)

Graphical representation of the network

Here the nodes represent characters in the movies. The characters are connected by a link if they both speak in the same scene. And the more the characters speak together, the thicker the link between them.

library(igraph)
g <- graph.adjacency(cooccur, weighted = TRUE, mode = "undirected", diag = FALSE)
plot(g, edge.width = E(g)$weight)

Compute standard network features, degree and betweeness.

degree(g)

##      ANAKIN BAIL ORGANA        BERU      CLIEGG     JAR JAR       JOBAL 
##          12           1           4           4           4           4 
##     LAMA SU  MACE WINDU     OBI-WAN        OWEN       PADME    PLO KOON 
##           1           5           6           4          12           0 
##      POGGLE     QUI-GON       RUWEE        SHMI  SIO BIBBLE        SOLA 
##           1           1           4           1           0           4 
##        YODA 
##           4

betweenness(g)

##      ANAKIN BAIL ORGANA        BERU      CLIEGG     JAR JAR       JOBAL 
##   42.600000    0.000000    1.750000    0.500000   22.000000    0.000000 
##     LAMA SU  MACE WINDU     OBI-WAN        OWEN       PADME    PLO KOON 
##    0.000000   18.366667   15.000000    5.250000   55.133333    0.000000 
##      POGGLE     QUI-GON       RUWEE        SHMI  SIO BIBBLE        SOLA 
##    0.000000    0.000000    0.700000    0.000000    0.000000    5.000000 
##        YODA 
##    3.366667

To get a nicer representation of the network, see here and the formating from igraph to d3Network. Below is the code you’d need:

library(d3Network)
library(networkD3)
sg <- simplify(g)
df <- get.edgelist(g, names=TRUE)
df <- as.data.frame(df)
colnames(df) <- c('source', 'target')
df$value <- rep(1, nrow(df))
# get communities
fc <- fastgreedy.community(g)
com <- membership(fc)
node.info <- data.frame(name=names(com), group=as.vector(com))
links <- data.frame(source=match(df$source, node.info$name)-1,target=match(df$target, node.info$name)-1,value=df$value)

forceNetwork(Links = links, Nodes = node.info,Source = "source", Target = "target",Value = "value", NodeID = "name",Group = "group", opacity = 1, opacityNoHover=1)

The nodes represent characters in the movies. The characters are connected by a link if they both speak in the same scene. The colors are for groups obtained by some algorithms.

Evolutionary biology & comparative genomics

I’m an associate professor scientist working at the interface of animal ecology, statistical modeling and social sciences.