I have printed out this version and will hand it in. Mutilate at will, tell me what I did wrong, what to do next, etc.
=== Abstract ===
I use a model of Wikipedia to attempt to explain its growth.
Unfortunately, while the model does have explanatory power, I
am unable to explain many of the coefficients.
=== Introduction to Wikipedia ===
Wikipedia, in a nutshell, is an online, multilingual encyclopedia
which can be edited by anyone with an internet connection. It was
begun on January 15, 2001 as an experiment to determine whether a less
formal encyclopedia (compared to the more formal Nupedia) could be
developed in an "open source" manner (see ''Britannica or Nupedia? The Future of
Free Encyclopedias'', http://www.kuro5hin.org/story/2001/7/25/103136/121).
The most distinctive aspect of it is that almost every page on the site
has an ''edit this page'' link (exceptions being pages like the front page
that are especially prone to vandalism). If you click on this link,
you are taken to a page where you can edit the article and make any
changes that you want. New articles are created by following a link
to an article that has no text.
Of course, this also means that vandalism is very easy. Hence,
detecting and undoing vandalism must be correspondingly easy. There
are two major features that help with this. The first is that a complete
record of every edit and every version of any article is kept and made
available. As part of this, it is very easy to go to an article,
choose one of the older versions and make it the current version,
thereby removing the subsequent vandalism (called a revert). The
other feature is that each person who is logged in gets a watchlist that
shows when articles they are interested in change. They can then see
the exact words that have been changed in the article. This allows
edits to Wikipedia to be carefully examined, and reverted if they are
vandalism, without incurring a large time cost.
Wikipedia currently has over 350,000 articles and there are 10 languages
with more than 10,000 articles. It is gaining hundreds of new articles a
day and there are around 10,000 edits every day. These are impressive figures
for an encyclopedia that depends entirely on volunteer effort. The fact
that the entire database of edits is downloadable makes examining Wikipedia
further very interesting.
=== Coase's Penguin ===
The only mention of Wikipedia in a journal that I have found is the
paper ''Coase's Penguin, or, Linux and the Nature of the Firm''
(Yochai Benkler, Yale Law Journal, Volume 112, Number 3, December
2002). This paper examines several instances of creation of freely
available informational and cultural works that anyone can contribute
to, called peer production. The paper concludes that a major factor
in helping these works get created is that transaction costs have been
cut substantially compared to firm production or market production.
In the context of Wikipedia, the relevant cost that has been cut is
determining who is best to work on a given encyclopedia article. Each
person who uses Wikipedia has a very good idea of their individual
cost and the benefit of improving a particular article. If their
individual cost is less than their individual benefit, then the
individual can make the improvement. Wikipedia has access to far more
individuals than a firm, so it is much more likely that a low cost,
high benefit individual can be found. Also, the firm will not be able
to costlessly determine the best individual in the firm. Trying to
replicate Wikipedia with a market would involve either contracting
less suitable individuals or trying to contract thousands of people
for small amounts of work. The search costs and the contracting costs
involved with the latter would be huge. So, the peer production that
occurs in Wikipedia may very well be the most efficient way to produce
an encyclopedia since the transaction costs of producing it with a
firm or by a market are substantially greater.
=== Effect of edits and authors, Costs and Benefits ===
An edit that improves an article has two effects:
#It increases the overall quality of the encyclopedia.
#It increases the quality of the article.
The first effect is expected to increase visitors and hence edits.
The second effect is expected to decrease the number of visitors who are
capable of improving the article. An edit that is done by a different
author is expected to have even more of an effect, since it will bring
new ideas and perspectives.
So, as an article gets closer to the perfect article, the benefit of
an additional edit will decrease. The cost will still be similar, or may
even go up as the number of people capable of improving the article
decreases and the amount of rewriting work increases.
The effect on the encyclopedia should be that more quality articles
will bring in more people to read and potentially edit articles. This
effect is in the opposite direction of the effect on the article.
=== Data gathered ===
The entire Wikipedia download was first run through a fast
preprocessing program to remove the information that I was not
interested in. The only information that was
left for each edit was article title, author name, article checksum,
number of links in article, edit date/time, and flags for the type of
the edit and article (such as name-space, redirect, minor edit). The
main thing that this removed was the article text, which greatly
reduced the amount of data that needed to be dealt with.
The next processes continued to remove extra information that was not
needed. First, all non-articles were removed. The definition of an
article is the standard Wikipedia definition: an article has at least
one internal link, is not a redirect and is in the main name-space (that
is, it is not an image, a talk page or otherwise). Next, reverts were
removed. Any sequence of consecutive versions A, B, C of an article where
A and C had the same checksum and length, and B and C had different
authors, was considered a revert, and changes B and C were not counted
for any subsequent statistics. The number of reverts was tracked by
month and encyclopedia.
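The revert rule described above can be sketched as follows. This is a minimal illustration, not the program actually used for the paper: the `Edit` record and the `drop_reverts` function are hypothetical names, and the handling of overlapping A, B, C patterns (each window over the original history is checked independently) is an assumption, since the text does not specify it.

```python
from typing import List, NamedTuple, Tuple


class Edit(NamedTuple):
    """One version of an article (hypothetical record layout)."""
    author: str
    checksum: str
    length: int


def drop_reverts(history: List[Edit]) -> Tuple[List[Edit], int]:
    """Return the edits that remain after removing reverts, plus the revert count.

    A window A, B, C of consecutive versions is a revert when A and C have
    the same checksum and length and B and C have different authors.
    Both B and C are then excluded from later statistics.
    """
    excluded = set()
    reverts = 0
    for i in range(len(history) - 2):
        a, b, c = history[i], history[i + 1], history[i + 2]
        same_content = a.checksum == c.checksum and a.length == c.length
        if same_content and b.author != c.author:
            excluded.update((i + 1, i + 2))
            reverts += 1
    kept = [edit for i, edit in enumerate(history) if i not in excluded]
    return kept, reverts


# Toy history: a vandal blanks the page and another author restores it.
history = [
    Edit("alice", "c1", 100),
    Edit("vandal", "c2", 5),
    Edit("bob", "c1", 100),
    Edit("carol", "c3", 120),
]
kept, reverts = drop_reverts(history)
print([edit.author for edit in kept], reverts)
```

Comparing the length as well as the checksum guards against checksum collisions, which matches the two-part test given in the text.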
The last processing step was to get the data into a form suitable for
OLS. For each month, the total number of articles in
various categories was calculated (an example category would be
articles that had 2 to 4 authors and 6 to 10 edits, which would be
written as AE:2to4_6to10). Also the number of bot edits was
calculated (any edit by a user listed on Wikipedia:Bots) so that this
could be figured into the calculation and disregarded.
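The author/edit categories can be assigned with a small lookup. The sketch below is an assumption about how the labelling might be done, using the bucket boundaries implied by the category names that appear in the tables (for example AE:0to1_6toplus); the function and table names are hypothetical.

```python
# Author buckets as (low, high, label); high=None means "or more".
AUTHOR_BUCKETS = [
    (0, 1, "0to1"),
    (2, 4, "2to4"),
    (5, 10, "5to10"),
    (11, None, "11toplus"),
]

# Edit buckets depend on the author bucket, because sparse cells were combined.
EDIT_BUCKETS = {
    "0to1": [(0, 1, "0to1"), (2, 3, "2to3"), (4, 5, "4to5"), (6, None, "6toplus")],
    "2to4": [(2, 3, "2to3"), (4, 5, "4to5"), (6, 10, "6to10"), (11, None, "11toplus")],
    "5to10": [(4, 5, "4to5"), (6, 10, "6to10"), (11, 20, "11to20"), (21, None, "21toplus")],
    "11toplus": [(11, 20, "11to20"), (21, None, "21toplus")],
}


def _label(value, buckets):
    """Return the label of the bucket that contains value."""
    for low, high, label in buckets:
        if value >= low and (high is None or value <= high):
            return label
    raise ValueError(f"no bucket for value {value}")


def category(authors, edits):
    """Map an article's author and edit counts to its AE category name."""
    author_label = _label(authors, AUTHOR_BUCKETS)
    edit_label = _label(edits, EDIT_BUCKETS[author_label])
    return f"AE:{author_label}_{edit_label}"


print(category(3, 7))
print(category(1, 9))
```

An impossible combination, such as more authors than edits, raises `ValueError` rather than being silently assigned to a category.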
This produced data with the following summary statistics:
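{| class="wikitable"
|+ Summary Statistics for 456 data points over 38 encyclopedias
! Variable !! Mean !! Median !! Standard Deviation !! Minimum !! Maximum !! Sum
|-
| total_delta || 842.27 || 126.5 || 2450.28 || 0 || 38843 || 384073
|-
| edits_delta || 5255.31 || 367 || 15561.04 || 0 || 116898 || 2396423
|-
| total || 7276.39 || 405 || 23599.69 || 1 || 186355 || 3318032
|-
| edits || 35521.20 || 1314 || 138666.29 || 1 || 1297188 || 16197666
|-
| reverts || 37.32 || 0 || 183.98 || 0 || 1760 || 17020
|-
| bot_count || 119.10 || 0 || 1688.20 || 0 || 34882 || 54310
|-
| bot_create || 84.61 || 0 || 1503.90 || 0 || 31854 || 38584
|-
| bot_total || 203.71 || 0 || 2345.65 || 0 || 34887 || 92894
|-
| AE:0to1_0to1 || 1933.11 || 167.5 || 5460.36 || 0 || 51221 || 881500
|-
| AE:0to1_2to3 || 477.05 || 53 || 1337.89 || 0 || 10171 || 217536
|-
| AE:0to1_4to5 || 72.27 || 5 || 228.82 || 0 || 1918 || 32954
|-
| AE:0to1_6toplus || 36.55 || 3 || 118.61 || 0 || 1038 || 16667
|-
| AE:2to4_2to3 || 2195.05 || 66 || 8154.28 || 0 || 59328 || 1000944
|-
| AE:2to4_4to5 || 843.95 || 21.5 || 2639.70 || 0 || 23103 || 384843
|-
| AE:2to4_6to10 || 392.58 || 12 || 1306.26 || 0 || 12028 || 179018
|-
| AE:2to4_11toplus || 70.83 || 2 || 251.12 || 0 || 2356 || 32298
|-
| AE:5to10_4to5 || 88.81 || 0 || 294.84 || 0 || 2005 || 40499
|-
| AE:5to10_6to10 || 544.21 || 1 || 1934.14 || 0 || 15512 || 248159
|-
| AE:5to10_11to20 || 296.54 || 1 || 1188.86 || 0 || 10697 || 135221
|-
| AE:5to10_21toplus || 44.92 || 1 || 188.50 || 0 || 1808 || 20484
|-
| AE:11toplus_11to20 || 105.10 || 0 || 484.99 || 0 || 4279 || 47925
|-
| AE:11toplus_21toplus || 175.40 || 1 || 949.83 || 0 || 9942 || 79984
|}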
Here are the averages in a table, ordered by the number of authors
going down and the number of edits going right. Note that the
categories were chosen to make sure that each one had
a reasonable number of articles, and some were combined to ensure this
(for example, AE:0to1_6toplus is a combined category).
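{| class="wikitable"
|+ Average Number of Articles for Categories
! Authors \ Edits !! 0to1 !! 2to3 !! 4to5 !! 6to10 !! 11to20 !! 21plus
|-
! 0to1
| 1933.1140 || 477.0526 || 72.2675 || colspan="3" | 36.5504
|-
! 2to4
| || 2195.0526 || 843.9539 || 392.5833 || colspan="2" | 70.8289
|-
! 5to10
| || || 88.8136 || 544.2083 || 296.5373 || 44.9211
|-
! 11plus
| || || || || 105.0987 || 175.4035
|}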
=== Equation and Coefficients ===
Δtotal = β<sub>0</sub> + β<sub>1</sub>·bot_create + β<sub>2</sub>·AE:0to1_0to1 + β<sub>3</sub>·AE:0to1_2to3 + β<sub>4</sub>·AE:0to1_4to5 + β<sub>5</sub>·AE:0to1_6toplus + β<sub>6</sub>·AE:2to4_2to3 + β<sub>7</sub>·AE:2to4_4to5 + β<sub>8</sub>·AE:2to4_6to10 + β<sub>9</sub>·AE:2to4_11toplus + β<sub>10</sub>·AE:5to10_4to5 + β<sub>11</sub>·AE:5to10_6to10 + β<sub>12</sub>·AE:5to10_11to20 + β<sub>13</sub>·AE:5to10_21toplus + β<sub>14</sub>·AE:11toplus_11to20 + β<sub>15</sub>·AE:11toplus_21toplus
On the left is Δtotal. This is the change in the total number
of articles in a month for a given encyclopedia. The intercept is
expected to be positive, since the data is only on encyclopedias that
have actually been started, and to get started they need to go from zero
pages to some pages, even though all the variables are zero. The next
variable, bot_create, is the number of articles that were created that
month by computer programs, referred to as bots. Its coefficient should
be close to one, since one bot-created article in the month results in
one more article. (It might possibly spur on human authors,
but that is unlikely in a month's time.)
The rest of the variables depend on the
structure of the articles. These variables are expected to have
coefficients that increase as the number of authors and edits
increases, since based on the model of Wikipedia growth, articles with
more authors and/or more edits are expected to be of higher quality,
and so should draw in more readers, some of whom then proceed to
create new articles. On the other hand, as the article ages, more of
the links that it has to other articles will be to already existing
articles, so I would expect that there will be some decrease in the
value of the coefficients as the number of authors and edits
increases.
Δedits = β<sub>0</sub> + β<sub>1</sub>·bot_total + β<sub>2</sub>·AE:0to1_0to1 + β<sub>3</sub>·AE:0to1_2to3 + β<sub>4</sub>·AE:0to1_4to5 + β<sub>5</sub>·AE:0to1_6toplus + β<sub>6</sub>·AE:2to4_2to3 + β<sub>7</sub>·AE:2to4_4to5 + β<sub>8</sub>·AE:2to4_6to10 + β<sub>9</sub>·AE:2to4_11toplus + β<sub>10</sub>·AE:5to10_4to5 + β<sub>11</sub>·AE:5to10_6to10 + β<sub>12</sub>·AE:5to10_11to20 + β<sub>13</sub>·AE:5to10_21toplus + β<sub>14</sub>·AE:11toplus_11to20 + β<sub>15</sub>·AE:11toplus_21toplus
The second equation tries to predict Δedits, the number
of new edits done in a month. The only different variable is that
bot_total, the number of edits done by bots, is used instead of
bot_create. This should have a coefficient of one since one edit by a
bot should create approximately one edit in that month (plus or minus
any discouragement or encouragement of human editors). The
coefficients on the article categories should be somewhat similar to
the ones in the Δtotal equation since some of the same effects
are occurring. Of course, the coefficients should be greater in
magnitude than the ones in the Δtotal equation since you only have
to create an article once, but you have to edit it multiple times to
get it to become a high quality article.
In general, I would expect the coefficients to be positive except
when one of two things is happening. Both depend on the fact that new
authors are joining and old authors are leaving. If the current mix of
articles decreases the number of new authors entering, then it is
actually having a negative effect on the number of new edits done.
So, if the current mix of articles is of poor quality, more potential
authors might get discouraged with the poor quality of Wikipedia, and
never join. On the other hand, poor quality might just cause them to start
editing. The way to tell would be that the low quality articles would
possibly cause more edits to be done and fewer new articles to be
created. The other possible cause of negative coefficients is high
quality articles. These would tend to discourage new authors, since they
will find no improvements to make.
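Both equations are linear in the coefficients, so they can be estimated by ordinary least squares. The following is a minimal pure-Python sketch of that estimation, not the program actually used for the paper. The names `ols` and `invert` are hypothetical, and the four toy observations are made up purely to exercise the code; they are not Wikipedia data.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    k = len(matrix)
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(k)]
        for i, row in enumerate(matrix)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix (perfectly collinear regressors?)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def ols(rows, y):
    """Fit y = b0 + b1*x1 + ... and return (coefficients, standard errors, R squared)."""
    X = [[1.0] + [float(v) for v in row] for row in rows]  # intercept column first
    n, k = len(X), len(X[0])
    xtx = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)] for i in range(k)]
    xty = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [y[r] - sum(X[r][j] * beta[j] for j in range(k)) for r in range(n)]
    ssr = sum(e * e for e in resid)
    mean_y = sum(y) / n
    sst = sum((v - mean_y) ** 2 for v in y)
    sigma2 = ssr / (n - k)  # unbiased estimate of the residual variance
    se = [(sigma2 * inv[i][i]) ** 0.5 for i in range(k)]
    return beta, se, 1.0 - ssr / sst


# Toy data with a single regressor, only to show the call.
beta, se, r2 = ols([[0], [1], [2], [3]], [1.0, 3.0, 4.0, 8.0])
print(beta, se, r2)
```

For the equations above, each row would hold the sixteen right-hand-side values for one encyclopedia-month (without the intercept, which `ols` adds), and `y` would hold the matching Δtotal or Δedits values.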
=== The Regression ===
Both equations were regressed on the data. Below is the Δtotals result:
R<sup>2</sup> = 0.8664
{| class="wikitable"
|+ ΔTotals Regression Results
! Variable !! Coefficient !! Standard Error
|-
| Intercept || 177.9947 || 52.6005
|-
| bot_create || 1.0538 || 0.0337
|-
| AE:0to1_0to1 || 0.0248 || 0.0283
|-
| AE:0to1_2to3 || 2.2715 || 0.5569
|-
| AE:0to1_4to5 || -14.2698 || 4.4896
|-
| AE:0to1_6toplus || 4.7155 || 6.7709
|-
| AE:2to4_2to3 || 0.0592 || 0.0392
|-
| AE:2to4_4to5 || 0.0302 || 0.3884
|-
| AE:2to4_6to10 || 0.5713 || 1.1788
|-
| AE:2to4_11toplus || 1.3863 || 5.0708
|-
| AE:5to10_4to5 || 5.0926 || 2.8107
|-
| AE:5to10_6to10 || -0.6224 || 0.9268
|-
| AE:5to10_11to20 || 2.4462 || 1.5104
|-
| AE:5to10_21toplus || -21.7014 || 7.8333
|-
| AE:11toplus_11to20 || -4.5929 || 1.5457
|-
| AE:11toplus_21toplus || 2.5271 || 0.8357
|}
Well, it has a reasonably high R<sup>2</sup>, the intercept is
positive and the value for bot_create is close to one. Other than
that, I have to say the values of the coefficients surprise me and I
have no good story to explain them. The only two structural coefficients
that are significant at a 95% confidence level and positive are
AE:0to1_2to3 (one author, 2 to 3 edits) and AE:11toplus_21toplus (11 or
more authors, 21 or more edits). It is possible that
the former demonstrates some kind of new article with lots of empty
links, and the latter demonstrates the high quality encyclopedia
attraction effect, but it's also possible that the estimates are simply
biased by something else. Some other ones that are significant and
negative, such as AE:5to10_21toplus and AE:11toplus_11to20, do not
follow a pattern that I can see.
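The significance statements above can be checked directly from the reported coefficients and standard errors. The sketch below uses a few rows copied from the ΔTotals results. The critical value of 1.96 is an assumption: it is the normal approximation to the t distribution, which should be close with 456 observations and 16 estimated coefficients. The function names are hypothetical.

```python
# (coefficient, standard error) pairs copied from the ΔTotals regression results.
RESULTS = {
    "AE:0to1_2to3": (2.2715, 0.5569),
    "AE:0to1_4to5": (-14.2698, 4.4896),
    "AE:5to10_4to5": (5.0926, 2.8107),
    "AE:5to10_21toplus": (-21.7014, 7.8333),
    "AE:11toplus_11to20": (-4.5929, 1.5457),
    "AE:11toplus_21toplus": (2.5271, 0.8357),
}

Z_95 = 1.96  # assumption: normal approximation for a two-sided 95% test


def t_statistic(coef, se):
    """Ratio of the estimate to its standard error."""
    return coef / se


def confidence_interval(coef, se, z=Z_95):
    """Approximate 95% confidence interval around the estimate."""
    return coef - z * se, coef + z * se


def is_significant(coef, se, z=Z_95):
    """True when the interval excludes zero, i.e. |t| exceeds the critical value."""
    return abs(t_statistic(coef, se)) > z


for name, (coef, se) in RESULTS.items():
    low, high = confidence_interval(coef, se)
    flag = "significant" if is_significant(coef, se) else "not significant"
    print(f"{name}: t = {t_statistic(coef, se):.2f}, CI = [{low:.4f}, {high:.4f}], {flag}")
```

A coefficient is significant at this level exactly when its interval does not contain zero, which is also how an interval table for the other regression can be read.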
Below are the structural coefficients arranged in a table:
Well, the Δedits regression has an even higher R<sup>2</sup>, a positive
intercept, and the right value on the coefficient for bot_total. On the
other hand, I can't think of a good explanation for the coefficients on
AE:5to10_4to5 (+), AE:5to10_6to10 (-), AE:5to10_11to20 (+),
AE:5to10_21toplus (-), and AE:11toplus_11to20 (-). Also, the value on
AE:5to10_21toplus seems much lower than I would expect. I am quite
suspicious that some of the coefficients are picking up an omitted
variable bias since they seem inexplicable. My best guess for a
candidate is that some kind of large-encyclopedia effect is affecting the
higher edit and author counts.
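{| class="wikitable"
|+ 95% confidence intervals for ΔEdits coefficients
! Authors \ Edits !! 0to1 !! 2to3 !! 4to5 !! 6to10 !! 11to20 !! 21plus
|-
! 0to1
| -0.2363 to 0.1388 || 2.0017 to 9.5829 || -68.5579 to -7.3759 || colspan="3" | -23.7075 to 69.0244
|-
! 2to4
| || -0.1165 to 0.4505 || -1.3767 to 3.9628 || -7.9265 to 8.2607 || colspan="2" | -21.4096 to 48.4426
|-
! 5to10
| || || 35.6722 to 72.5160 || -21.8929 to -9.6498 || 29.1537 to 49.2346 || -181.4473 to -71.4612
|-
! 11plus
| || || || || -25.8172 to -4.5761 || -1.5569 to 10.0016
|}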
=== Conclusions ===
Something odd is happening with the data. The model seems to explain
quite a bit of the variation, but I would not have expected
the signs on the coefficients that I have seen. The aggregate data that
I am using does not give sufficient insight to explain them well. I
suspect that I will have to examine the article-level data very
carefully, working more closely with individual articles, to explain
the values that I am getting.
*[http://www.honors.montana.edu/~jjc/new_stats4.txt the data]
*[http://www.honors.montana.edu/~jjc/wikipedia_programs.tar.gz the programs used]
Jrincayc/Wikipedia Growth Paper
Looks like it is going to be a fascinating paper. I was just wondering how you were going to operationalize “good edits” and “good articles” in your data collection.
:I wonder that myself. (That happens to be my major problem with the paper.) User:Jrincayc 05:53, 11 Dec 2003 (UTC)
''Production of Wikipedia Content''
The first thing I thought of was to use standard production theory and treat “contributions” as a variable input.
Contributions could be defined as some combination of “length of article” plus “number of edits” I guess. Some sort of Total Product curve, Average product curve, and Marginal product curve could be created to indicate the areas of increasing returns to contributions, decreasing returns to contribution, optimum level of contribution, etc. But I don’t know if that really helps you to define “improvement in an article”. I look forward to further installments on this very interesting topic.
:As the number of edits/size of article increases, the onset of diminishing returns to contributions (per article) is complicated by positive network externalities at the systemic level. User:Mydogategodshat 17:45, 12 Dec 2003 (UTC)
::Yes, so somehow any useful model is going to have to take into account some approximation of both the effect on the article and the effect on the encyclopedia. How to separate the effects is going to be hard. User:Jrincayc 15:37, 14 Dec 2003 (UTC)
== Model V3 Proposal ==
My next idea for a model (the second version was the one that was used in the handed-in paper) is to work at the article level. For each article, try to predict the number of edits done to the article. Variables to predict from will be number of months since last edit, number of previous edits, number of previous authors, some author/edit interaction terms, and various encyclopedia size statistics (total articles, total edits, articles with more than twenty authors ...). This will hopefully be able to tell the encyclopedia's effect on the article, and compare that to the article's effect on the article. This might be able to tease out some of the two separate effects. User:Jrincayc 15:37, 14 Dec 2003 (UTC)
:Just one comment: Using a single regression equation with "number of edits" as the dependent variable may entail validity problems. In particular, can "number of edits" really act as a proxy for "quality of article"? Maybe, but there are many very POV articles that receive heavy editing. I suggest you regress the independent variables against "article size" as well as "number of edits", that is, do the procedure twice. Neither article size nor # of edits is a really good proxy for article quality, but if you regress against each of them, you will be able to compare.
:OK, another comment now that I think of it: I understand you used OLS. What do you think about using a stepwise regression? This might be useful given that some of your independent variables will have very high explanatory power (such as "number of previous edits"), and some will have very low. It would also be useful in checking your interaction terms.
:User:Mydogategodshat 06:58, 12 Feb 2004 (UTC)