Network Analysis of #Badminton Tweets on Twitter

Aki Kapoor
Feb 13, 2021

Pre-requisites

Twitter is a widely used social network. I chose one sport, badminton, to analyze as a network. Tweets were collected by searching for #Badminton, and the resulting data was saved as a comma-separated values (.csv) file. The file was stored on the local system, and its path is used to build and analyze the network.

Packages used

The following packages were used to build and analyze the network.

  1. The tidyverse :- An opinionated collection of R packages designed for data science.
  2. igraph :- Used to plot the graph and analyze the resulting network.
  3. CINNA :- Used to check for the giant component in the network.
  4. ERGM :- Exponential-family random graph models (ERGMs) are built to explain the global structure of a network while allowing inference on tie prediction at the micro level.
  5. Tidygraph :- Used, in this context, to provide a tidy framework for manipulating the data frames.

Libraries used

The following libraries were loaded to build and analyze the network.

  1. Tidygraph
  2. Igraph
  3. Ggraph
  4. Dplyr
  5. CINNA

Building the network

The comma-separated values file named badminton_tweets was read and stored as a dataframe named badminton_dataframe. The file is read from its path, so if the same file is used on a different system, the path must be changed accordingly. After the dataframe was created, it was inspected by entering its name. A quick look showed that some of the tweets were not in the language used for the analysis; they were in Chinese. This might be because badminton is a popular sport in Asia.

To get a high-level view of the dataframe, the glimpse() function was used. The data had 1700 rows and 91 columns. Some rows might have contained missing values, so the filter() function was used to remove them, and glimpse() was run again on the result. However, the number of rows did not change, because the column being analyzed, mentions_screen_name, contains no missing values.
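The analysis itself was done in R with glimpse() and filter(). As a small standalone illustration of the same check, here is a Python sketch that uses only the standard library. The three-row sample is invented for the example and stands in for the real 1700-row file; the helper names read_rows and drop_missing are hypothetical.

```python
import csv
import io

# Invented stand-in for badminton_tweets.csv (the real file has 91 columns).
SAMPLE_CSV = """screen_name,mentions_screen_name,text
alice,bob,Great rally today #badminton
carol,dave,Doubles practice #badminton
erin,alice,Watching the finals #badminton
"""


def read_rows(csv_text):
    """Parse CSV text into a list of dicts, one per tweet."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def drop_missing(rows, column):
    """Keep only rows whose value in `column` is present and not 'NA'."""
    return [r for r in rows if r.get(column) not in (None, "", "NA")]


rows = read_rows(SAMPLE_CSV)
filtered = drop_missing(rows, "mentions_screen_name")

# Same row count before and after: the column has no missing values.
print(len(rows), len(filtered))
```

If the two counts match, the filter removed nothing, which is the situation described above.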

Columns selected to build the network

The columns chosen to build the network were screen_name and mentions_screen_name. People who put #badminton in their tweets and mention another screen name may be friends or badminton partners of that person, so there could be a strong relation between them. Earlier, reply_to_use_id and retweet_count were considered, but they did not contain sufficient data, so screen_name and mentions_screen_name were finally selected as the two columns that relate to each other.

Creating a new dataframe

It is never a good idea to make changes to the original dataframe, because the original data might be required later in the analysis; once it has been modified, re-reading the original file is the only way to get it back. For instance, reply_to_use_id was first considered as the column to analyze and was later replaced by mentions_screen_name because it had too little data. A new dataframe named badminton_dataframe_temp was therefore created, containing only two columns: “screen_name” and “mentions_screen_name”.
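The idea of working on a copy can be sketched in a few lines of plain Python (the original work used an R dataframe). The rows below are made up, and select_columns is a hypothetical helper, not a function from any of the packages listed earlier.

```python
def select_columns(rows, columns):
    """Return new row dicts holding only `columns`; the input is left untouched."""
    return [{c: r[c] for c in columns} for r in rows]


# Made-up rows standing in for the full dataframe.
original = [
    {"screen_name": "alice", "mentions_screen_name": "bob", "retweet_count": "0"},
    {"screen_name": "carol", "mentions_screen_name": "dave", "retweet_count": "2"},
]

temp = select_columns(original, ["screen_name", "mentions_screen_name"])

print(temp)
print(original[0].keys())  # the original rows still have every column
```

Because temp is built from fresh dictionaries, switching to a different column later only means calling select_columns again on original.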

Nodes and Links

A network is simply a number of points (or ‘nodes’) that are connected by links. Generally in social network analysis, the nodes are people and the links are any social connection between them. In this case, the values in screen_name and mentions_screen_name are the nodes, and each row that pairs them is a link.

Creating link

The connection between the two nodes is established and named “links”. The two columns are renamed to “To” and “From” so that it is easier to follow each link.

Creating Edge List

An edge list is the set of values that contains the edges (i.e. links) and the vertices (i.e. nodes). The edge list was created, and the nodes and links were inspected.
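As a rough illustration of what an edge list holds, the Python sketch below turns a few invented rows into (From, To) pairs and collects the distinct nodes. Treating screen_name as "From" and mentions_screen_name as "To" is an assumption made for the example; the function names are hypothetical.

```python
def build_edge_list(rows):
    """Turn each row into a (From, To) pair: the tweeter and the account mentioned."""
    return [(r["screen_name"], r["mentions_screen_name"]) for r in rows]


def node_list(edges):
    """Collect every distinct endpoint that appears in the edge list."""
    nodes = set()
    for src, dst in edges:
        nodes.update((src, dst))
    return sorted(nodes)


# Invented rows with the two selected columns.
rows = [
    {"screen_name": "alice", "mentions_screen_name": "bob"},
    {"screen_name": "carol", "mentions_screen_name": "bob"},
    {"screen_name": "dave", "mentions_screen_name": "erin"},
]

edges = build_edge_list(rows)
nodes = node_list(edges)

print(edges)
print(nodes)
```

Each row becomes one link, and a screen name that appears in several rows is still a single node.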

Choosing a graph

Since both node columns are qualitative, making a g2graph doesn’t make sense, so igraph was chosen to represent the network. The data was again inspected as an edge list to see the relations.

Structure of the network without labels:

Analysis

  • The graph is undirected. If one screen_name mentions another, they have some relation between them: they may be team partners or opponents. This is treated as a mutual relationship, hence the graph is undirected.
  • The graph is not connected. A connected graph is one in which there is a walk from every vertex to every other vertex.
  • The network has a low edge density, i.e. it is sparse. A sparse network is one in which the number of links is far below the maximum number of possible links. The edge density of the network is 0.0013.
  • The network has many large connected components, but no single giant component that covers most of the network.
  • Even the largest single component, although it contains many nodes, does not cover the whole network. This means that many nodes in the network have no link to it.
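The quantities in the list above (connectedness, edge density, component sizes) were computed with R packages in the analysis. To make the definitions concrete, here is a minimal pure-Python sketch on a five-node toy graph. The data and function names are invented for illustration, and the graph is assumed to be simple and undirected.

```python
def adjacency(edges):
    """Undirected adjacency sets; duplicate edges collapse, self-loops are ignored."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set())
        adj.setdefault(b, set())
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def edge_density(adj):
    """Number of links divided by the maximum possible, n * (n - 1) / 2."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    return m / (n * (n - 1) / 2) if n > 1 else 0.0


def components(adj):
    """Connected components found by depth-first search, largest first."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        comps.append(comp)
    return sorted(comps, key=len, reverse=True)


# Toy graph: one chain of three accounts and one separate pair.
edges = [("alice", "bob"), ("bob", "carol"), ("dave", "erin")]
adj = adjacency(edges)
comps = components(adj)

print(edge_density(adj))        # 3 links out of 10 possible = 0.3
print(len(comps) == 1)          # False: the graph is not connected
print([len(c) for c in comps])  # [3, 2]
```

A graph is connected exactly when this search returns a single component; the largest component only counts as "giant" if it holds most of the nodes.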

The clustering coefficient (transitivity) comes out at 0.001. The clustering coefficient is the probability that two neighbours of a vertex are themselves adjacent. There are very few triangles in the network, hence the transitivity is close to 0.
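To make the definition concrete, the sketch below computes the global clustering coefficient as closed triples divided by all connected triples. It is a plain-Python illustration on a toy graph, not the code used for the 0.001 figure, and the names are hypothetical.

```python
from itertools import combinations


def adjacency(edges):
    """Undirected adjacency sets for a simple graph."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def transitivity(adj):
    """Share of neighbour pairs (connected triples) that are themselves linked."""
    triples = 0
    closed = 0
    for nbrs in adj.values():
        for u, v in combinations(nbrs, 2):
            triples += 1
            if v in adj[u]:
                closed += 1
    return closed / triples if triples else 0.0


# Toy graph: one triangle (a, b, c) plus a pendant node d attached to c.
triangle_plus_tail = adjacency([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
# Star graph: a hub with three spokes and no triangles at all.
star = adjacency([("hub", "x"), ("hub", "y"), ("hub", "z")])

print(transitivity(triangle_plus_tail))  # 3 closed triples out of 5 = 0.6
print(transitivity(star))                # 0.0
```

A mention network in which most accounts are tied to a hub without being tied to each other behaves like the star graph, which is one way a value close to 0 can arise.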

  • The motif count is NA. Motifs are subgraph patterns that repeat themselves in the network. Since the subgraphs are not connected, the result is NA.
  • The neighbours of a node are the nodes adjacent to it. Many nodes have more than one neighbour (even 5 or 6 in many cases). However, there are also a lot of nodes with only one neighbour.
  • The degree, which is the number of links a node has, varies from node to node.
  • The average degree of the network is 2.13, which means that, on average, each node has about two links.
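Neighbours, degree and average degree can all be read off the same adjacency structure. The following plain-Python sketch shows this on invented data; it assumes a simple undirected graph and is not the code behind the 2.13 figure.

```python
def neighbours(edges):
    """Map each node to the set of nodes it shares a link with."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def degree_table(adj):
    """Degree of a node = number of neighbours in a simple undirected graph."""
    return {node: len(nbrs) for node, nbrs in adj.items()}


def average_degree(degrees):
    """Sum of degrees divided by the number of nodes (equals 2 * links / nodes)."""
    return sum(degrees.values()) / len(degrees) if degrees else 0.0


edges = [("alice", "bob"), ("bob", "carol"), ("dave", "erin")]
adj = neighbours(edges)
degrees = degree_table(adj)

print(sorted(adj["bob"]))       # ['alice', 'carol']
print(degrees["bob"])           # 2
print(average_degree(degrees))  # 6 / 5 = 1.2
```

For a simple undirected graph, the average degree equals the edge density multiplied by (n - 1), where n is the number of nodes. Under that assumption, the reported values of 2.13 and 0.0013 would be consistent with roughly 1,600 to 1,700 nodes, although the node count is not stated in the analysis.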

Comparison with the other network: exponential random graph models (ERGM)

The goal of using an ERGM is to perform regression analysis on complex networks. ERGMs use both node features and network topology to model the statistical relations present in network data. The estimated parameters are positive, which means the link probability increases with increasing feature values or with matching attribute values.

It can be observed that the model is a good fit.
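To see why a positive parameter raises the link probability: in an ERGM, the conditional log-odds of a link between two nodes is the sum of each coefficient multiplied by its change statistic (how much the corresponding network statistic changes when that link is added). The Python sketch below illustrates that relationship. The coefficient values and the "matching attribute" term are invented for the example and are not the estimates from this analysis.

```python
import math


def tie_probability(coefficients, change_stats):
    """Conditional link probability: logistic of sum(coefficient * change statistic)."""
    log_odds = sum(c * d for c, d in zip(coefficients, change_stats))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients: a baseline "edges" term and a "matching attribute" term.
coefficients = [-3.0, 1.2]

# Adding a link always changes the edge count by 1; the second entry is 1 only
# when the two nodes share the attribute value.
p_no_match = tie_probability(coefficients, [1, 0])
p_match = tie_probability(coefficients, [1, 1])

print(p_no_match < p_match)  # True: the positive coefficient raises the probability
```

With all coefficients at zero the probability is 0.5, so the sign of each coefficient indicates whether its term makes a link more or less likely than that baseline, holding the rest of the network fixed.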

Aki Kapoor

Masters in Applied data science, University of Canterbury, New Zealand. Data scientist who loves to play with the data and make sense from it.