A recent study by Vosoughi et al. (2018) revealed that the spread of news in online social media exhibit different statistical properties given the veracity of the news. They showed, using various metrics made on the cascade of transmission of information, that false news travel farther and faster than true ones, suggesting that the intrinsic value of the information shared on social media may impact the way this information travel. The goal of the tools developed in this package is to test and compare hypothesis to explain the processes behind these differences. More precisely, we think that different level of sensibilities among the people participating in those cascades is an important factor explaining the difference observed by Vosoughi et al. and we aim at modeling them.
To illustrate that, we used a model developed by Ruck et al. (2017) to explore changes in the popularity of words in the English vocabulary. We added to this model the possibility for the agents to recognize the content of the information and a sensibility toward this information. Thus, an agent can be actively willing to share only one kind of information, or they can be neutral toward it, and share equally any kind of information. Then, the probability of an agent to share one kind of information will depend on 1/ the sensibility of the agent toward this kind of information and 2/ the distribution of the other kind of information the agent has access to.
We show here that this model can reproduce most of the properties observed by Vosoughi et al (2018). Moreover, and given the availability of the data used in the original paper, we applied Approximate Bayesian Computation to find what mixture of sensibilities among the actual social media actors has the highest probability to reproduce the difference observed.
Stable versions of the spreadrt package should be, at some point, directly installed from CRAN (using the command install.packages("nameofthepackage")
), whilst the development version can be installed from the github repository using the following command (the function requires the devtools package):
devtools::install_github("simoncarrignon/spreadrt")
Notice that so far this package is, and will long remains, a development version only.
Given the dataset available in data(allca)
, we can reproduce the key elements presented by Vosoughi et al. (2018).
par(mfrow=c(1,3),mar=c(5,5,2,1))
for( m in c("size","depth","breadth")){
plotCCFD(allca[,m],main=paste("cascade",m),xlab=m)
pointsCCFD(mixedca[,m],col="orange")
pointsCCFD(falseca[,m],col="red")
pointsCCFD(trueca[,m],col="green")
legend("bottomleft",legend=c("true","false","mixed","all"),col=c("green","red","orange","black"),pch=20)
}
reproduction of Vosoughi results.
We can check how the three measure are correlated:
par(mfrow=c(1,3),mar=c(5,5,1,1))
plot(allca$size ~ allca$breadth,log="xy",ylab="cascades size",xlab="cascade breadth")
plot(allca$size ~ allca$depth,log="xy",ylab="cascades size",xlab="cascade depth")
plot(allca$depth ~ allca$breadth,log="xy",ylab="cascades depth",xlab="cascade breadth")
We can clearly see that the main driver of the size of the cascades is its breadth, ie the number of simultaneous retweets.
The dataset (data(allca)
) gives for each cascade the exact date of apparition of this cascade. A cascade is a single tweet, which will be retweeted a certain amount of time (or not, in which case the size of the cascade is \(1\)). Different cascades can spread the same kind of information ie the same rumor.
Using that, we can see how many new cascades appear every month/days/hour…
par(mfrow=c(1,2))
par(mar=c(5,5,1,1))
alldate=as.POSIXct(allca$date)
allca$hours=strptime(alldate,format="%Y-%m-%d %H", tz="GMT") #trunc by hours
allca$days=strptime(alldate,format="%Y-%m-%d", tz="GMT") #trunc by days
allca$years=strptime(alldate,format="%Y", tz="GMT") #trunc by year
plot(sort(unique(allca$days)),as.numeric(table(sort(allca$days-min(allca$days)))),type="l",log="y",ylim=c(1,2000),ylab="new cascades",xlab="day",cex=.8)
plot(sort(unique(allca$hours)),as.numeric(table(sort(allca$hours-min(allca$hours)))),type="l",log="y",ylim=c(1,1000),ylab="new cascades",xlab="hours",cex=.8)
plot(sort(unique(allca$years)),as.numeric(table(sort(allca$years-min(allca$years)))),type="l",ylab="new cascades",xlab="years",cex=.8)
Given that we also know the id of the rumor carried by every cascade, we can look when was the first time that a unique rumor appears.
Knowing that, it’s possible to calculate the probability that a new rumor appears every month/day/hours. We can in turn generate some rate of introduction of new cascade.
par(mfrow=c(1,2),mar=c(5,5,1,1))
rateintro=sort(tapply(allca$days-min(allca$days),allca$rumor_id,min)) #for each rumor we assign the date of the first cascade that's spread it.
alldayz=seq(min(rateintro),max(rateintro),3600*24) #generate all possible days, will be used to avoid skipping the day whith not new rumors in order to to take into account the day where number of rumor = 0
freqD=as.numeric(table(factor(rateintro,levels=alldayz)))
plot(as.POSIXct(alldayz,origin=min(allca$days),tz="GMT"),freqD,type="l",xlab="days",ylab="new rumors",cex=.8)
plot(density(freqD,bw=1,from=0),xlab="Proba of new rumors during one day",main="")
Number of new rumor for each day of the dataset (left), probability of apparition of a new rumor during one day at any given day (right)
par(mfrow=c(1,2),mar=c(5,5,1,1))
##repeat all the same than befor but using the hourly split
rateintro=sort(tapply(allca$hours-min(allca$hours),allca$rumor_id,min))
allhourz=seq(min(rateintro),max(rateintro),3600) #generate all possible hours
freqH=as.numeric(table(factor(rateintro,levels=allhourz)))
plot(as.POSIXct(allhourz,origin=min(allca$hours),tz="GMT"),freqH,type="l",xlab="hours",ylab="new rumors",cex=.8)
plot(density(freqH,bw=1,from=0),xlab="Proba of new rumor during one hour",main="")
Number of new rumor for each hour of the dataset (left), probability of apparition of a new rumor during one hour at any given hour (right)
Ruck et al. (2017) have proposed a variation of the random copy model by Bentley et al 2002 to study change in vocabulary in the English litterature. A version of this model is available in this package using the function randomCascades
. The simplest way to use it is to run it with the default arguments
onerun=randomCascades()
summary(onerun)
## rank U size
## Min. : 1 Min. : 1 Min. : 1.0
## 1st Qu.:11721 1st Qu.:11721 1st Qu.: 2.0
## Median :23442 Median :23442 Median : 9.0
## Mean :23442 Mean :23442 Mean : 106.7
## 3rd Qu.:35162 3rd Qu.:35162 3rd Qu.: 43.0
## Max. :46882 Max. :46882 Max. :5232.0
Thus onerun
countains the final result in the form of a 3-columns table where the colon correspon to the rank, the size and the id of the last retweeter for each cascades.
It’s easy to plot the CCFD of the size of each cascades from this table by using the funciton plotCCFD()
par(mfrow=c(1,1))
plotCCFD(onerun$size,xlab="cascade size",main="cascade size for default model")
It’s also easy to explore the impact of the different parameters on this model. Let’s try with less number of time step and less agents
shorter=randomCascades(t_steps = 10,Nmin = 10000)
plotCCFD(shorter$size,xlab="cascades size",main="Cascade size for shorter model",col="grey")
One can also compare the result of the simulation to the real cascades observed in the dataset as shown previously.
data(allca)
plotCCFD(allca$size,xlab="cascades size",col="black")
pointsCCFD(shorter$size,col="grey")
legend("bottomleft",legend=c("data","simulation (short)"),col=c("black","grey"),pch=20)
compare with data
One can explore more in depth the model. For example the impact of \(\mu\) on the CCFD given the length of the simulations
mu=10^seq(-5,-1,length.out=20) #a sequence of 20 different mu logarithmically spaced from 10^-5 to 10^-1
step=seq(10,90,length.out=4) #a sequence of different number of step
cols=heat.colors(length(mu)) #creat colors for each mu
par(mfrow=c(2,2),mar=c(5,5,3,1))
for(s in step){
plotCCFD(1,1,type="n",ylim=c(0.01,100),xlim=c(1,10000),main=paste("time=",round(s)))
sapply(1:length(mu),function(i)
{
pointsCCFD(randomCascades(Nmin=1000,tau=1,t_steps=s,mu=mu[i])$size,col=alpha(cols[i],.8),type="l",lwd=2,cex=1)
}
)
}
or the number of agents
par(mfrow=c(2,2))
nagent=c(100,1000,10000,20000) #a sequence of different number of step
for(n in nagent){
plotCCFD(1,1,type="n",ylim=c(0.0001,100),xlim=c(1,100000),main=paste("Num of Agents=",n))
sapply(1:length(mu),function(i)
{
pointsCCFD(randomCascades(Nmin=n,tau=1,t_steps=s,mu=mu[i])$size,col=alpha(cols[i],.8),type="l",lwd=2,cex=1)
}
)
}
We have shown how to simply compare the output of a simulation with the data from data(allca)
in .
If we compare this with the quick exploration of the simulation we did in the previous section, it appears clearly that this model will hardly be able to reproduce the distributions observed in dataset.
Nonetheless, the model implemented here doest not make a difference between a cascade and the rumor propagate by its cascade: the number of time a unique information is shared is exactly equal to the size of the cascade.
Or the distribution of the size in the original dataset are drawn counting each cascade sizes, as Vosoughi et al. considere it. For them each cascade is a different entity with a different size. You can have multiple cascades spreading the same rumor and you will cont them as two separate cascades of different sizes.
This is somehow counter-intuitive if we want to consider the spread of the information carried by the cascade. This information correspond to different rumors that one could expect to be the unit on which the metrics should be considered.
To express that, the graph of the section 1 have to be redraw this time by summing the size of all the cascades carrying the same rumors. In this case, the size will be directly the number of time a rumor has been shared, without regarding if this rumours has been share via different cascades, which is how the spread of information is implemented in the randomCascades
model.
allru=c()
allru$size=tapply(allca$size,allca$rumor_id,sum)
trueru=c()
trueru$size=tapply(trueca$size,trueca$rumor_id,sum)
falseru=c()
falseru$size=tapply(falseca$size,falseca$rumor_id,sum)
mixedru=c()
mixedru$size=tapply(mixedca$size,mixedca$rumor_id,sum)
Once we recounted the metrics by grouping all the cascades carrying the same rumor as one elements we can then plot the same graph than before
par(mfrow=c(1,1),mar=c(5,5,1,1))
plotCCFD(allru$size,main="rumor size",cex=1,xlab="rumor size")
pointsCCFD(trueru$size,col="green",cex=1)
pointsCCFD(falseru$size,col="red",cex=1)
legend("bottomleft",legend=c("true","false","all"),col=c("green","red","black"),pch=20)
In this case, the curves of the \(CCFD\) looks much more like the curves we observed in the exploration of the model.
plotCCFD(allru$size,main="rumor size",cex=1,xlab="rumor size")
othersimulation=randomCascades(t_steps = 90,Nmin = 10000,tau=12)
pointsCCFD(othersimulation$size,col="grey",cex=1)
legend("bottomleft",legend=c("data","simulation"),col=c("black","grey"))
We then propose a slightly modified version of this model with two goals:
Keep tracks of the different dimensions (size/depth/breadth) of the cascade of retweets as presented in the dataset
Integrate the notion of the content of the tweet that can affect the probability that some will retweet it
We propose the model ?cascades3D
to do so.
cascades3D
Generally speaking cascades3D
can be use in a similar way that randomCascades
. On notable difference is that in order to have all the metrics from we need to keep the whole structure of the tree of retweet. We provide with the function cascades3D
the possibility to return only the summary of the metrics or the list of all the trees generated during the simulation.
As for randomCascades
, cascades3D()
is given with a set of default parameters that allows a quick try:
simu=cascades3D()
## [1] "timestep 1 round 0 last round 0"
## [1] "round 200 ncascades: 50 actives: 50"
## [1] "timestep 2 round 200 last round 0"
## [1] "round 400 ncascades: 50 actives: 50"
summary(simu)
## U seeder rumor size
## Min. :0.0000 Min. : 1.00 Min. : 1.00 Min. : 2.00
## 1st Qu.:0.1389 1st Qu.: 53.25 1st Qu.: 2.25 1st Qu.: 4.25
## Median :0.4444 Median :112.50 Median : 5.00 Median : 7.00
## Mean :0.4400 Mean :104.48 Mean : 4.96 Mean : 8.14
## 3rd Qu.:0.6667 3rd Qu.:150.75 3rd Qu.: 7.00 3rd Qu.:10.75
## Max. :1.0000 Max. :196.00 Max. :10.00 Max. :22.00
## breadth depth
## Min. : 1.00 Min. :2.0
## 1st Qu.: 3.00 1st Qu.:3.0
## Median : 4.00 Median :3.0
## Mean : 4.62 Mean :2.8
## 3rd Qu.: 6.00 3rd Qu.:3.0
## Max. :11.00 Max. :3.0
The default behavior of cacades3D is to return a dataframe where each line represent a cascade generated during the simulation, and each column provide different iformation abnout the cascade:
rumor: id of the rumor associated with the cascade
U: utility associated with the rumor of the cascade (if it’s true or false)
ID: unique ID representing the cascade, it’s buid by the concatenation of the ID of the seed and the rumor involved in this cascade
size: total size of the cascade ie the number of unique agent involved in it
depth: depth of the cascade ie the longest path between the root to a leaf
breadth : maximum number of reweet at the same level of the cascade.
As for the previous model, it’s easy to plot the CCFD of the size of each cascades from this table by using the funciton plotCCFD()
plotCCFD(simu$size,cex=1)
we can check what happen if we wait more than the default time step
longer=cascades3D(time=20)
plotCCFD(longer$size,col="grey")
One can also compare the result of the simulation to the real cascades observed in the dataset as shown previously.
data(allca)
plotCCFD(allca$size,col="red",cex=1)
pointsCCFD(longer$size,cex=1,col="grey")
pointsCCFD(simu$size,cex=1)
legend("bottomleft",legend=c("data","simu","longer simu"),pch=20,col=c("red","black","grey"))
compare 3D with data
But, we this model, the other metrics can be studied and compared:
par(mfrow=c(1,3))
for( m in c("size","depth","breadth")){
plotCCFD(allca[,m],col="red",cex=1,main=m)
pointsCCFD(longer[,m],cex=1,col="grey")
pointsCCFD(simu[,m],cex=1)
legend("bottomleft",legend=c("data","simu","longer simu"),pch=20,col=c("red","black","grey"))
}
Not only this new model wants to reproduce the twitter infrastructure with more accuracy, but we also want to be able to test hyptothesis on how people can be affected by the content of the tweet.
To do so we introduce two new parameter : \(\beta\) and \(U\)
U : is the utility of a tweet, ie a way to qualify its content. \(\beta\) : is sensibility of the agents toward the content of the tweet.
Both parameter will interact to determine the probability \(p_{a}(i)\) of an agent \(a\) to retweet a tweet \(i\). this interaction is defined as follow:
\[p_{a}(i)= \frac{e^{ \beta_{a} * U_i}}{\sum^n_j\left(e^{\beta_{a} * U_i}\right)}\]
this function is given by:
probaSelection
## function(u,beta){
## p_i=exp(u * beta)
## p_i=p_i/sum(p_i)
## return(p_i)
## }
## <environment: namespace:spreadrt>
U will be distributed between -1 and 1, to allow the representation of a polarised information. \(\beta\) will be distributed between \(-\infty\) and \(\infty\), represanting the propensity of an agent to retweet eclusively one side or the other of the information spectrum, on to be neutral with regard to the content ( when \(\beta=0\))
Let’s see how the probability for different agent with different \(\beta \in [100,100]\) to selected a tweet \(i\) with utility \(u_i\), is changing when the number of tweet available increasing . We assume that utility of the tweet available is distributed uniformely between \([-1,1]\).
par(mfrow=c(2,2),mar=c(5,5,1,1))
beta=10^seq(-2,2,length.out=50) #generate 50 agents with $\beta \in {-100,-10,0,10,100}$
beta=sort(c(sort(-1*beta),0,beta))
cols=alpha(topo.colors(length(beta)),.8)
names(cols)=beta
rac=sapply(seq(5,20,length.out=4),function(n){
u=seq(-1,1,length.out=n) #ditribute utility of the _n_ tweet amongs [-1,1]
plot(u,probaSelection(u,beta[1]),ylim=c(0,1),type="n",main=paste("Number of tweets=",n),ylab=expression(P[i]),xlab=expression(U[i]))
sapply(beta,function(b)lines(u,probaSelection(u,b),lwd=1,col=cols[as.character(b)]))
legend("center",legend=c(expression(-10^2),expression(-10^1),0,expression(10^1),expression(10^2)),lwd=3,col=cols[c(1,25,59,75,101)],title=expression(beta))
})
Given this we can now refine the previous analyses to separate the metrics given the content (utility) of the tweets. The utility is given by the column
U
in the output of the function.
summary(simu$U)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1389 0.4444 0.4400 0.6667 1.0000
By default the utility is attributed to the initial rumors and distributed between \([0,1]\) as follow utility = seq(0,1, length.out = 10)
. During the simulation, the cascades will randomly get the utility of the rumor they propagate. At the end, the utility of all cascades should be evenly distributed between \([0,1]\).
r hist(simu$U,xlab="U")
To see how the previous metrics and this utility intereact lets just give a color of the cascade:
### generate colors for each utility
utilities = seq(0,1, length.out = 10) #all possible utility as defined in the default function `cascades3D`
cols=heat.colors(length(utilities))
names(cols)=utilities
ucol_l=cols[as.character(longer$U)]
ucol_s=cols[as.character(simu$U)]
ucol_*
are now vectors with a color representing the utility of the cascades, then let’s take the previous graph from fig~
par(mfrow=c(1,3))
for( m in c("size","depth","breadth")){
plotCCFD(allca[,m],,cex=1,main=m)
pointsCCFD(longer[,m],cex=1.5,col=ucol_l,pch=17)
#pointsCCFD(simu[,m],cex=1.5,col=ucol_s,pch=18)
legend("bottomleft",legend=c(paste0("U=",round(utilities,digit=2)),"lon simu","short simu","data"),col=c(cols,1,1,1),pch=c(rep(20,length(utilities)),17,18,20),cex=.8)
}
The function getUCols
allows to quickly generate this vector of colors.
In the default function the \(\beta\) for all the agents is the same and equal to \(0\). Which means, as we can see in the graph~ that the utility doesn’t affect the way they select the tweet to retweet. Let’s then create a polarized population with two kind of \(\beta\)-person.
n=200 #default number of agents in cascades3D
popbeta=c(rep(0,n/2),rep(100,n/2))
simu=cascades3D(betadistrib=popbeta,time=20) #we use the longer time step
par(mfrow=c(1,3))
for( m in c("size","depth","breadth")){
plotCCFD(allca[,m],cex=1,main=m,xlim=c(1,max(simu[,m],allca[,m])))
pointsCCFD(simu[,m],cex=1.5,col=spreadrt::getUCols(simu$U)[order(simu[,m])],pch=20)
legend("bottomleft",legend=c(paste0("U=",round(utilities,digit=2)),"data"),col=c(cols,1),pch=c(rep(20,length(utilities)),20),cex=.8)
}
We can see now that not only the shape of the curve changes dramatically, but know the cascades are cleary sorted given their utility. In this case, as the population is split between 0-\(\beta\) and 100-\(\beta\) agents, the cascades carrying rumor with \(U=1\) are the one with the biggest metrics as they are the only on that have a chance to be retweeted by anyone, regardless the \(\beta\) of the indivudual, wheras the cascades carrying rumor with \(U=0\) will only be retweeted by 0-\(\beta\) individuals.
Lets now have a population size split in 3 subpopulation lets generate three kind of rumors where \(U\in \{-1,0,1\}\):
n=300 #we change the number of agents to easily divide by 3
popbeta=c(rep(-100,n/3),rep(0,n/3),rep(100,n/3))
r=30 #number of rumors
utilities=c(rep(-1,r/3),rep(0,r/3),rep(1,r/3))
simu=cascades3D(betadistrib=popbeta,N=n,R=r,utility=utilities,time=10)
par(mfrow=c(1,3))
for( m in c("size","depth","breadth")){
plotCCFD(allca[,m],cex=1,main=m,xlab=m)
pointsCCFD(simu[,m],cex=1.5,col=getUCols(simu$U),pch=20)
legend("bottomleft",legend=c(paste0("U=",round(unique(utilities),digit=2)),"data"),col=c(getUCols(unique(utilities)),1),pch=c(rep(20,length(unique(utilities))),20))
}
In those simulation the number of cascade is small compared to what appear in the data set $ncascade_{data}=$126301 vs $ncascade_{data}=$50, due to the fact that by default no new cascades appears in the simulation. This can be change by setting up the parameter mu_c > 0
and setting a bigger number of initial cascades IC
. To allow cascades to grow bigger, we also need to have more agent able to retweet it. But as the model keep track of all the cascades, allowing all the agents to retweet all the cascades regardless their apparition time, this can take long time and is not realistic. It is the equivalent of saying that at any time people will look at all the tweets that ever happen before retweeting something. To avoid that we set a captl
parameters that limit the total number of tweet an agent can access to, a Nmax
parameter that limits the number of agents that retweet at any given time and a dtime
that defines a number of timestep after which if not retweeted, a cascades cannot be accessed anymore by the agents.
n=3000 #we change the number of agents to easily divide by 3
popbeta=c(rep(-100,n/3),rep(0,n/3),rep(100,n/3))
r=30 #number of rumors
simu=cascades3D(betadistrib=popbeta,N=n,R=r,utility=utilities,time=20,mu_c=.02,IC=1000,captl=.1,dtime=3,Nmax=.1,log=F)
With more cascade we can now start to see there distribution separatly, as done by Vosoughi et al:
simu.t=simu[simu$U==1,]
simu.f=simu[simu$U==-1,]
simu.m=simu[simu$U==0,]
We separate here in three as in the dataset true
/false
/mixed
, though in this case the utility of the message should be seen more as different kind
of message that maybe retweet by different kind of people, regardless the information carried is true or false.
par(mfrow=c(1,3))
for( m in c("size","depth","breadth")){
plotCCFD(simu[,m],,cex=1,main=m)
pointsCCFD(simu.t[,m],cex=1.5,col="yellow",pch=20)
pointsCCFD(simu.f[,m],cex=1.5,col="red",pch=20)
pointsCCFD(simu.m[,m],cex=1.5,col="orange",pch=20)
legend("bottomleft",legend=c(paste0("U=",round(unique(utilities),digit=2)),"all"),col=c(getUCols(unique(utilities)),1),pch=c(rep(20,length(unique(utilities))),20))
}
As one can expect given the equation we provided, when the population is split in equal sub-group, the \(-1\) and \(1\) utility behave the same way, whereas the message with \(0\) utility is left behind.
Insted of looking only at the summary statistics of the cascades of the simulation, one may be interested to look entire structures of the cascades.
This is possible by setting to FALSE
the summary
argument in cascades3D
. Whem done, instead of returning a data.frame
the function return a list with 3 elements:
simu=cascades3D(summary = FALSE)
## [1] "timestep 1 round 0 last round 0"
## [1] "round 200 ncascades: 50 actives: 50"
## [1] "timestep 2 round 200 last round 0"
## [1] "round 400 ncascades: 50 actives: 50"
summary(simu)
## Length Class Mode
## agents 200 -none- list
## summary 3 data.frame list
## cascades 50 -none- list
simu$agents
: number of retweets done by each agents and in names(simu$agents)
the beta-value for each agents
simu$summary
: data.frame
with information about the cascades col
summary(simu$summary)
## U seeder rumor
## Min. :0.0000 Min. : 3.00 Min. : 1.0
## 1st Qu.:0.3333 1st Qu.: 45.50 1st Qu.: 4.0
## Median :0.5556 Median : 80.00 Median : 6.0
## Mean :0.5222 Mean : 90.62 Mean : 5.7
## 3rd Qu.:0.7778 3rd Qu.:137.25 3rd Qu.: 8.0
## Max. :1.0000 Max. :196.00 Max. :10.0
* __U__: the utility of the rumors spreaded by the cascade
* __seeder__: id of the agent that tweeted first the cascade
* __rumor__: id of the rumor associated with the cascade
simu$cascades
is the list of all cascades created during the simulation each element of this list represent the edglist of a cascades:simu$cascades[[1]] #show the edglesit of the first cascades generated
## [,1] [,2] [,3]
## [1,] 71 71 0
## [2,] 71 96 40
## [3,] 71 198 75
## [4,] 71 145 115
## [5,] 71 102 122
## [6,] 71 146 136
## [7,] 71 35 182
## [8,] 102 38 237
## [9,] 145 136 245
## [10,] 145 113 277
## [11,] 102 126 305
To keep space we avoid as much as possible to give name to the elements generated during the simulation. But let see what is stored in the edgeliste:
## id=sample.int(length(simu$cascades),1) #we select a random cascade
## exedge=simu$cascades[[id]] #show the edglesit of the cascade
colnames(exedge)= c("nodeA","nodeB","timestep")
print(exedge)
## nodeA nodeB timestep
## [1,] 128 128 0
## [2,] 128 129 2
## [3,] 128 62 24
## [4,] 128 73 114
## [5,] 128 50 119
## [6,] 128 8 132
## [7,] 128 40 172
## [8,] 129 77 228
## [9,] 62 191 243
## [10,] 8 112 272
## [11,] 129 10 273
## [12,] 8 101 310
## [13,] 40 195 340
## [14,] 40 115 358
Each line represent a link between nodeA
to nodeB
create a the given timestep
, the first row correspond to the ID that tweeted first the message that will be propagate by this cascade.
Knowing that we could say in our case:
“the agent 129 retweeted 128 at the time step 2 in the 13th cascade started by agent 128”.
As the orden between simu$cascades
and simu$summary
is the same, it is easy to get the utility of a given cascade by using its id. On can just check: simu$summary$U[id]
= 1
The cascades in the list of cascades are represented as edge list. Edge list are a common way to represent networks, and can then easily manipulated by any program able to work with network (ie Gephi, networkx,…)
He we will give some example using the R package IGraph
### IGRAPH & Network Visualization
par(mfrow=c(1,1),oma=rep(0,4),mar=rep(0,4))
### Be carefull the the version you are installing is > 1.0 as so changes has been made after the versions < 1.0. One solution is to use the development version using devtools::install <- github("igraph/rigraph")
library(igraph)
ig=graph_from_edgelist(exedge[,c(1,2)]) #transform the cascade we picked up into a igraph graph
# note that graph_from_edgelist doesn't allow to use a third column as the edge lenght,
plot(ig,layout=layout_as_tree) #Layout take a function as arguement that will take care of how to set up the node of thr graph. Here we want it to be a tree
If the function
graph_from_edgelist
found integer to designet the node of a edge, it will automatically generate missing node and put them as disconnected one node graph. To avoid that we have to change the number as characters. This can be done using the function formatEdgeList
### IGRAPH & Network Visualization
par(mfrow=c(1,1),oma=rep(0,4),mar=rep(0,4))
par(mfrow=c(1,1),oma=rep(0,4),mar=rep(0,4))
igedge = formatEdgeList(exedge)
ig = graph_from_edgelist(igedge)
plot(ig,layout=layout_as_tree,cex=.2)
It is then easy to visually check the metrics for this cascade
measurements = c(
getDepth(exedge), #the depth of the cascade _ie_ the longest path from the root to a leaf
getMaxBreadth(exedge), #the bread of the cascade ie the maximum number of node at a same unique level of the tree
nrow(exedge) #the size of the cascade _ie_ the total number of node in the tree
)
names(measurements)=c("depth","breadth","size")
measurements
## depth breadth size
## 3 7 14
We can then easily print all the network of all cascades of a simulation, and color it given the value of U.
### visualize all cascades and utility
par(mfrow=c(1,1),oma=rep(0,4),mar=rep(0,4))
ca=simu$cascades[1:9] #only the 9 first cascades
cid=simu$summary[1:9,]
cclrs=getUCols(cid$U)
allsubgraph=lapply(1:length(ca),function(j)graph_from_edgelist(formatEdgeList(ca[[j]],j)))
par(mfrow=c(3,3),mar=c(0,0,0,0),oma=c(0,0,0,0))
nada=lapply(order(cid$U),function(g){plot(allsubgraph[[g]],layout=layout_as_tree,vertex.color=cclrs[as.character(cid$U[g])],vertex.label=NA);box()})
Or follow the evolution of the network of one (or all) cascade(s).
giftree
#this should generate a video but doesn't work
maxround=max(sapply(ca,function(ic)max(ic[,3])))
for(t in seq(1,maxround,length.out=100)){
tmp=lapply(ca,function(cc)if(length(cc[cc[,3]<t,])==0) null else cc[cc[,3]<t,])
allsubgraph=lapply(1:length(tmp),function(j)graph_from_edgelist(formatEdgeList(tmp[[j]],j)))
par(mfrow=c(3,3),mar=c(0,0,0,0),oma=c(0,0,0,0))
for(g in order(cid$U)){
plot(allsubgraph[[g]],layout=layout_as_tree,vertex.color=cclrs[as.character(cid$U[g])],vertex.label=NA,vertex.size=1,edge.width=.5,edge.arrow.size=.5,edge.arrow.width=.2)
box()
}
}
One can explore more in depth the model. For example the impact of mu on the CCFD given the length of the simulations
mu=10^seq(-2,0,length.out=10) #a sequence of 20 different mu logarithmically spaced from 10^-5 to 10^-1
step=seq(1,10,length.out=4) #a sequence of different number of step
cols=heat.colors(length(mu)) #creat colors for each mu
par(mfrow=c(2,2),mar=c(5,5,3,1))
for(s in step){
plotCCFD(1,1,type="n",ylim=c(0.01,100),xlim=c(1,10000),main=paste("time=",round(s)))
sapply(1:length(mu),function(i)
{
pointsCCFD(cascades3D(time=s,N=1000,Nmax=.2,R=50,IC=10,utility=runif(50,-1,1),betadistrib=rep(0,1000),mu_c=mu[i])$size,col=alpha(cols[i],.8),type="l",lwd=2,cex=1)
}
)
}
or the number of agents
par(mfrow=c(2,2))
nagent=c(100,200,300) #a sequence of different number of step
#for(n in nagent){
# plotCCFD(1,1,type="n",ylim=c(0.0001,100),xlim=c(1,100000),main=paste("Num of Agents=",n))
# sapply(1:length(mu),function(i)
# {
# pointsCCFD(cascades3D(N=n,Nmax=.2,R=50,IC=10,utility=runif(50,-1,1),betadistrib=rep(0,n),time=s,mu_c=mu[i])$size,col=alpha(cols[i],.8),type="l",lwd=2,cex=1)
# }
# )
#}
To save space the use of Baeysian Computation is describe in another vignette here