

Jrincayc/Wikipedia Growth Paper



I have printed out this version and will hand it in. Mutilate at will, tell me what I did wrong, what to do next, etc.

=== Abstract ===
I use a model of Wikipedia to attempt to explain its growth. Unfortunately, while the model does have explanatory power, I am unable to explain many of the coefficients.

=== Introduction to Wikipedia ===
Wikipedia, in a nutshell, is an online, multilingual encyclopedia that can be edited by anyone with an internet connection. It was begun on January 15, 2001 as an experiment to determine whether a less formal encyclopedia (compared to the more formal Nupedia) could be developed in an `open source' manner (see Britannica or Nupedia? The Future of Free Encyclopedias, http://www.kuro5hin.org/story/2001/7/25/103136/121 ). Its most distinctive aspect is that almost every page on the site has an "edit this page" link (the exceptions being pages, like the front page, that are especially prone to vandalism). If you click on this link, you are taken to a page where you can edit the article and make any changes that you want. New articles are created by following a link to an article that has no text.
Of course, this also means that vandalism is very easy. Hence, detecting and undoing vandalism must be correspondingly easy. Two major features help with this. The first is that a complete record of every edit and every version of every article is kept and made available. As part of this, it is very easy to go to an article, choose one of the older versions, and make it the current version, thereby removing the subsequent vandalism (this is called a revert). The other feature is that each logged-in user gets a watchlist that shows when articles they are interested in change. They can then see the exact words that have been changed in the article. This allows edits to Wikipedia to be carefully examined, and reverted if they are vandalism, without incurring a large time cost.
Wikipedia currently has over 350,000 articles, and there are 10 languages with more than 10,000 articles each. It is gaining hundreds of new articles a day, and there are around 10,000 edits every day. These are impressive figures for an encyclopedia that depends entirely on volunteer effort. The fact that the entire database of edits is downloadable makes Wikipedia especially interesting to examine further.

=== Coase's Penguin ===
The only mention of Wikipedia in a journal that I have found is the paper ''Coase's Penguin, or, Linux and the Nature of the Firm'' (Yochai Benkler, Yale Law Journal, Volume 112, Number 3, December 2002). This paper examines several instances of peer production: the creation of freely available informational and cultural works that anyone can contribute to. The paper concludes that a major factor in helping these works get created is that transaction costs have been cut substantially compared to firm production or market production.
In the context of Wikipedia, the relevant cost that has been cut is that of determining who is best placed to work on a given encyclopedia article. Each person who uses Wikipedia has a very good idea of their individual cost and benefit of improving a particular article. If their individual cost is less than their individual benefit, then the individual can make the improvement. Wikipedia has access to far more individuals than a firm, so it is much more likely that a low-cost, high-benefit individual can be found. Also, a firm cannot costlessly determine which of its members is the best individual for the job. Trying to replicate Wikipedia with a market would involve either contracting less optimal individuals or trying to contract thousands of people for small amounts of work. The search costs and the contracting costs involved with the latter would be huge.
So, the peer production that occurs in Wikipedia may very well be the most efficient way to produce an encyclopedia, since the transaction costs of producing it with a firm or by a market are substantially greater.

=== Effect of edits and authors, Costs and Benefits ===
An edit that improves an article has two effects:
#Increases the overall quality of the encyclopedia
#Increases the quality of the article
The first effect is expected to increase visitors and hence edits. The second effect is expected to decrease the number of visitors who are capable of improving the article. An edit that is done by a different author is expected to have even more of an effect, since it will bring new ideas and perspectives.
So, as an article gets closer to the perfect article, the benefit of an additional edit will decrease. The cost will still be similar, or may even go up as the number of people capable of improving the article decreases and the amount of rewriting work increases. The effect on the encyclopedia should be that more quality articles will bring in more people to read and potentially edit articles. This effect is in the opposite direction of the effect on the article.

=== Data gathered ===
The first thing that was done with the entire Wikipedia download was to run it through a fast preprocessing program to remove the information that I was not interested in. The only information kept for each edit was the article title, author name, article checksum, number of links in the article, edit date/time, and flags for the type of the edit and article (such as name-space, redirect, minor edit). The main thing that this removed was the article text, which greatly reduced the amount of data that needed to be dealt with. The subsequent steps continued to remove information that was not needed. First, all non-articles were removed.
The definition of an article is the standard Wikipedia definition: an article has at least one internal link, is not a redirect, and is in the main name-space (that is, it is not an image, a talk page, or otherwise). Next, reverts were removed. Any article history that had the pattern A, B, C, where A and C had the same checksum and length and B and C had different authors, was considered a revert, and changes B and C were not counted in any subsequent statistics. The number of reverts was tracked by month and encyclopedia.
The last processing step was to get the data into a form suitable for ordinary least squares (OLS) regression. For each month, the total number of articles in various categories was calculated (an example category would be articles that had 2 to 4 authors and 6 to 10 edits, which would be written as AE:2to4_6to10). The number of bot edits (any edit by a user listed on Wikipedia:Bots) was also calculated, so that bot activity could be accounted for separately in the calculation. This produced data with the following summary statistics:
Summary Statistics for 456 data points over 38 encyclopedias
Variable Mean Median StdDev Minimum Maximum Sum
total_delta 842.27 126.5 2450.28 0 38843 384073
edits_delta 5255.31 367 15561.04 0 116898 2396423
total 7276.39 405 23599.69 1 186355 3318032
edits 35521.20 1314 138666.29 1 1297188 16197666
reverts 37.32 0 183.98 0 1760 17020
bot_count 119.10 0 1688.20 0 34882 54310
bot_create 84.61 0 1503.90 0 31854 38584
bot_total 203.71 0 2345.65 0 34887 92894
AE:0to1_0to1 1933.11 167.5 5460.36 0 51221 881500
AE:0to1_2to3 477.05 53 1337.89 0 10171 217536
AE:0to1_4to5 72.27 5 228.82 0 1918 32954
AE:0to1_6toplus 36.55 3 118.61 0 1038 16667
AE:2to4_2to3 2195.05 66 8154.28 0 59328 1000944
AE:2to4_4to5 843.95 21.5 2639.70 0 23103 384843
AE:2to4_6to10 392.58 12 1306.26 0 12028 179018
AE:2to4_11toplus 70.83 2 251.12 0 2356 32298
AE:5to10_4to5 88.81 0 294.84 0 2005 40499
AE:5to10_6to10 544.21 1 1934.14 0 15512 248159
AE:5to10_11to20 296.54 1 1188.86 0 10697 135221
AE:5to10_21toplus 44.92 1 188.50 0 1808 20484
AE:11toplus_11to20 105.10 0 484.99 0 4279 47925
AE:11toplus_21toplus 175.40 1 949.83 0 9942 79984
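The revert rule described in the data-gathering step can be sketched in Python. This is only an illustration under stated assumptions, not the preprocessing program that was actually used: the `Revision` record and the `drop_reverts` function are hypothetical names, and the choice to keep comparing later edits against the surviving version A after a revert is an assumption.

```python
from typing import NamedTuple


class Revision(NamedTuple):
    """One edit of an article (hypothetical record layout)."""
    author: str
    checksum: str
    length: int


def drop_reverts(history):
    """Remove B and C from every A, B, C pattern that looks like a revert.

    A is the last revision kept so far. B and C are dropped when C has the
    same checksum and length as A and was written by a different author
    than B. Returns the kept revisions and the number of reverts found.
    """
    kept = []
    reverts = 0
    i = 0
    while i < len(history):
        if kept and i + 1 < len(history):
            a, b, c = kept[-1], history[i], history[i + 1]
            same_text = (a.checksum, a.length) == (c.checksum, c.length)
            if same_text and b.author != c.author:
                reverts += 1
                i += 2  # skip both the reverted edit B and the revert C
                continue
        kept.append(history[i])
        i += 1
    return kept, reverts


history = [
    Revision("alice", "c1", 100),
    Revision("vandal", "c2", 5),
    Revision("bob", "c1", 100),    # restores alice's version
    Revision("carol", "c3", 120),
]
kept, reverts = drop_reverts(history)
```

In this toy history the vandal's edit and bob's restoring edit are both dropped, so only the edits by alice and carol count toward the statistics.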
Here are the averages in a table, with the number of authors increasing down the rows and the number of edits increasing across the columns. Note that the categories were chosen so that each one had a reasonable number of articles, and some were combined to ensure this (for example, AE:0to1_6toplus is a combined category).
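{| class="wikitable"
|+ Average Number of Articles for Categories (rows: authors, columns: edits)
! Authors \ Edits !! 0to1 !! 2to3 !! 4to5 !! 6to10 !! 11to20 !! 21plus
|-
! 0to1
| 1933.1140 || 477.0526 || 72.2675 || 36.5504 (6plus) || ||
|-
! 2to4
| || 2195.0526 || 843.9539 || 392.5833 || 70.8289 (11plus) ||
|-
! 5to10
| || || 88.8136 || 544.2083 || 296.5373 || 44.9211
|-
! 11plus
| || || || || 105.0987 || 175.4035
|}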
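The category names can be derived mechanically from an article's author and edit counts. The sketch below is an assumption-laden illustration rather than the program used for the paper: the function name `category` and the bucket tables are hypothetical, with boundaries read off the category names listed in the summary statistics, and it assumes that every author contributed at least one edit.

```python
from collections import Counter

# (upper bound, label) pairs; None marks the open-ended last bucket.
AUTHOR_BUCKETS = [(1, "0to1"), (4, "2to4"), (10, "5to10"), (None, "11toplus")]

# The edit buckets differ per author bucket because sparse cells were combined.
EDIT_BUCKETS = {
    "0to1": [(1, "0to1"), (3, "2to3"), (5, "4to5"), (None, "6toplus")],
    "2to4": [(3, "2to3"), (5, "4to5"), (10, "6to10"), (None, "11toplus")],
    "5to10": [(5, "4to5"), (10, "6to10"), (20, "11to20"), (None, "21toplus")],
    "11toplus": [(20, "11to20"), (None, "21toplus")],
}


def _bucket(value, buckets):
    """Return the label of the first bucket whose upper bound covers value."""
    for upper, label in buckets:
        if upper is None or value <= upper:
            return label
    raise ValueError("bucket list must end with an open-ended bucket")


def category(authors, edits):
    """Map (distinct authors, edits) to a label such as 'AE:2to4_6to10'."""
    if authors < 0 or edits < authors:
        # Assumption: each author made at least one edit.
        raise ValueError("expected 0 <= authors <= edits")
    author_label = _bucket(authors, AUTHOR_BUCKETS)
    edit_label = _bucket(edits, EDIT_BUCKETS[author_label])
    return f"AE:{author_label}_{edit_label}"


# Counting articles per category for one month of one encyclopedia
# (the (authors, edits) pairs here are made up).
articles = [(1, 1), (1, 9), (3, 7), (5, 5), (12, 50)]
counts = Counter(category(a, e) for a, e in articles)
```

One row of the regression data would then hold these counts for a single encyclopedia and month.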
=== Equation and Coefficients ===
<math>
\begin{aligned}
\Delta\text{total} ={} & \beta_0 + \beta_1\,\text{bot\_create} \\
& + \beta_2\,\text{AE:0to1\_0to1} + \beta_3\,\text{AE:0to1\_2to3} + \beta_4\,\text{AE:0to1\_4to5} + \beta_5\,\text{AE:0to1\_6toplus} \\
& + \beta_6\,\text{AE:2to4\_2to3} + \beta_7\,\text{AE:2to4\_4to5} + \beta_8\,\text{AE:2to4\_6to10} + \beta_9\,\text{AE:2to4\_11toplus} \\
& + \beta_{10}\,\text{AE:5to10\_4to5} + \beta_{11}\,\text{AE:5to10\_6to10} + \beta_{12}\,\text{AE:5to10\_11to20} + \beta_{13}\,\text{AE:5to10\_21toplus} \\
& + \beta_{14}\,\text{AE:11toplus\_11to20} + \beta_{15}\,\text{AE:11toplus\_21toplus}
\end{aligned}
</math>
On the left is Δtotal, the change in the total number of articles in a month for a given encyclopedia. The intercept is expected to be positive, since the data cover only encyclopedias that have actually been started, and to get started they need to go from zero pages to some pages even though all the variables are zero. The next variable, bot_create, is the number of articles that were created that month by computer programs, referred to as bots. Its coefficient should be close to one, since one bot-created article in the month will result in one more article being created. (It might possibly spur on human authors, but that is unlikely in a month's time.)
The rest of the variables depend on the structure of the articles. Their coefficients are expected to increase as the number of authors and edits increases, since, based on the model of Wikipedia growth, articles with more authors and/or more edits are expected to be of higher quality, and so should draw in more readers, some of whom then proceed to create new articles. On the other hand, as an article ages, more of its links to other articles will point to already existing articles, so I would expect some decrease in the value of the coefficients as the number of authors and edits increases.
<math>
\begin{aligned}
\Delta\text{edits} ={} & \beta_0 + \beta_1\,\text{bot\_total} \\
& + \beta_2\,\text{AE:0to1\_0to1} + \beta_3\,\text{AE:0to1\_2to3} + \beta_4\,\text{AE:0to1\_4to5} + \beta_5\,\text{AE:0to1\_6toplus} \\
& + \beta_6\,\text{AE:2to4\_2to3} + \beta_7\,\text{AE:2to4\_4to5} + \beta_8\,\text{AE:2to4\_6to10} + \beta_9\,\text{AE:2to4\_11toplus} \\
& + \beta_{10}\,\text{AE:5to10\_4to5} + \beta_{11}\,\text{AE:5to10\_6to10} + \beta_{12}\,\text{AE:5to10\_11to20} + \beta_{13}\,\text{AE:5to10\_21toplus} \\
& + \beta_{14}\,\text{AE:11toplus\_11to20} + \beta_{15}\,\text{AE:11toplus\_21toplus}
\end{aligned}
</math>
The second equation tries to predict Δedits, the number of new edits done in a month. The only different variable is that bot_total, the number of edits done by bots, is used instead of bot_create. It should have a coefficient of one, since one edit by a bot should add approximately one edit in that month (plus or minus any discouragement or encouragement of human editors). The coefficients on the article categories should be somewhat similar to the ones in the Δtotal equation, since some of the same effects are occurring. Of course, the coefficients should be greater in magnitude than the ones in the Δtotal equation, since you only have to create an article once, but you have to edit it multiple times to make it a high quality article.
In general, I would expect the coefficients to be positive except when one of two things is happening. Both depend on the fact that new authors are joining and old authors are leaving. If the current mix of articles decreases the number of new authors entering, then it is actually having a negative effect on the number of new edits done. So, if the current mix of articles is of poor quality, more potential authors might get discouraged by the poor quality of Wikipedia and never join. On the other hand, poor quality might just cause them to start editing. The way to tell would be that low quality articles would cause more edits to be done and fewer new articles to be created. The other possible cause of negative coefficients is high quality articles. These would tend to discourage new authors, since they will find no improvements to make.
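Both equations are linear in their coefficients, so they can be estimated by ordinary least squares. The following is a minimal sketch, assuming nothing about the software actually used for the paper: the function `ols` is a hypothetical helper that solves the normal equations by Gauss-Jordan elimination, and the toy data are synthetic, with one bot variable and one category variable standing in for the fifteen regressors.

```python
def ols(rows, y):
    """Least-squares coefficients for y ~ rows.

    Each row must already contain a leading 1.0 for the intercept.
    Solves the normal equations (X'X) b = X'y by Gauss-Jordan elimination.
    """
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    aug = [xtx[i] + [xty[i]] for i in range(k)]  # augmented matrix [X'X | X'y]
    for col in range(k):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("regressors are collinear; X'X is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Synthetic example: delta = 100 + 1.0 * bot_create + 0.5 * category_count.
# The data are noiseless, so the fit should recover roughly 100, 1 and 0.5.
toy_rows = [[1.0, 0, 10], [1.0, 5, 20], [1.0, 0, 40], [1.0, 20, 5], [1.0, 3, 7]]
toy_y = [100 + bot + 0.5 * cat for _, bot, cat in toy_rows]
beta = ols(toy_rows, toy_y)
```

For the real data each row would hold the intercept, the bot variable and the fifteen category counts for one encyclopedia-month. A statistics package would also report the standard errors, which this sketch omits.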
=== The Regression ===
Both equations were regressed on the data. Below is the Δtotal result: R² = 0.8664
ΔTotals Regression Results
Variable Coefficient StdError
Intercept 177.9947 52.6005
bot_create 1.0538 0.0337
AE:0to1_0to1 0.0248 0.0283
AE:0to1_2to3 2.2715 0.5569
AE:0to1_4to5 -14.2698 4.4896
AE:0to1_6toplus 4.7155 6.7709
AE:2to4_2to3 0.0592 0.0392
AE:2to4_4to5 0.0302 0.3884
AE:2to4_6to10 0.5713 1.1788
AE:2to4_11toplus 1.3863 5.0708
AE:5to10_4to5 5.0926 2.8107
AE:5to10_6to10 -0.6224 0.9268
AE:5to10_11to20 2.4462 1.5104
AE:5to10_21toplus -21.7014 7.8333
AE:11toplus_11to20 -4.5929 1.5457
AE:11toplus_21toplus 2.5271 0.8357
Well, it has a reasonably high R², the intercept is positive, and the value for bot_create is close to one. Other than that, I have to say the values of the coefficients surprise me and I have no good story to explain them. The only two structural coefficients that are positive and significant at the 95% confidence level are those for at most one author with 2 to 3 edits (AE:0to1_2to3) and for 11 or more authors with 21 or more edits (AE:11toplus_21toplus). It is possible that the former reflects some kind of new article with lots of empty links, and the latter reflects the high-quality encyclopedia attraction effect, but it is also possible that the estimates are simply biased by something else. Others that are significant and negative, such as AE:5to10_21toplus and AE:11toplus_11to20, do not follow a pattern that I can see. Below are the 95% confidence intervals for the structural coefficients, arranged in a table:
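{| class="wikitable"
|+ 95% confidence intervals for ΔTotals coefficients (rows: authors, columns: edits)
! Authors \ Edits !! 0to1 !! 2to3 !! 4to5 !! 6to10 !! 11to20 !! 21plus
|-
! 0to1
| [-0.0308, 0.0803] || [1.1769, 3.3661] || [-23.0935, -5.4460] || [-8.5918, 18.0228] (6plus) || ||
|-
! 2to4
| || [-0.0178, 0.1361] || [-0.7331, 0.7935] || [-1.7455, 2.8881] || [-8.5797, 11.3523] (11plus) ||
|-
! 5to10
| || || [-0.4315, 10.6167] || [-2.4439, 1.1991] || [-0.5223, 5.4147] || [-37.0967, -6.3062]
|-
! 11plus
| || || || || [-7.6308, -1.5551] || [0.8846, 4.1697]
|}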
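The interval bounds follow from the regression output as coefficient ± t × standard error. The sketch below illustrates that arithmetic under one stated assumption: the critical value 1.9654 is the two-sided 95% point of Student's t with 456 − 16 = 440 degrees of freedom, which is not given in the paper but reproduces the tabulated bounds to within rounding. The function names are hypothetical.

```python
# Assumed critical value: two-sided 95% Student's t with 456 - 16 = 440 degrees of freedom.
T_CRIT_95 = 1.9654


def confidence_interval(coef, std_err, t_crit=T_CRIT_95):
    """Return (lower, upper) bounds of the interval coef +/- t_crit * std_err."""
    half_width = t_crit * std_err
    return coef - half_width, coef + half_width


def is_significant(coef, std_err, t_crit=T_CRIT_95):
    """A coefficient is significant at this level if its interval excludes zero."""
    lower, upper = confidence_interval(coef, std_err, t_crit)
    return lower > 0 or upper < 0


# AE:0to1_2to3 in the ΔTotals regression: coefficient 2.2715, standard error 0.5569.
low, high = confidence_interval(2.2715, 0.5569)
```

A coefficient whose interval lies entirely on one side of zero, such as AE:0to1_2to3, is significant at the 95% level, while one whose interval straddles zero, such as AE:0to1_0to1, is not.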
The Δedits regression yielded similarly puzzling results, presented below: R² = 0.9597
ΔEdits Regression Results
Variable Coefficient StdError
Intercept 530.5434 183.8839
bot_total 0.9010 0.0858
AE:0to1_0to1 -0.0487 0.0954
AE:0to1_2to3 5.7923 1.9287
AE:0to1_4to5 -37.9669 15.5650
AE:0to1_6toplus 22.6584 23.5915
AE:2to4_2to3 0.1670 0.1442
AE:2to4_4to5 1.2930 1.3584
AE:2to4_6to10 0.1671 4.1181
AE:2to4_11toplus 13.5165 17.7708
AE:5to10_4to5 54.0941 9.3732
AE:5to10_6to10 -15.7714 3.1147
AE:5to10_11to20 39.1942 5.1087
AE:5to10_21toplus -126.4542 27.9810
AE:11toplus_11to20 -15.1967 5.4038
AE:11toplus_21toplus 4.2223 2.9406
Well, it has an even higher R², a positive intercept, and a coefficient on bot_total that is close to one, as expected. On the other hand, I can't think of a good explanation for the coefficients on AE:5to10_4to5 (+), AE:5to10_6to10 (-), AE:5to10_11to20 (+), AE:5to10_21toplus (-), and AE:11toplus_11to20 (-). Also, the value on AE:5to10_21toplus seems much lower than I would expect. I am quite suspicious that some of the coefficients are picking up an omitted variable bias, since they seem inexplicable. My best guess for a candidate is that some kind of large-encyclopedia effect is affecting the categories with higher edit and author counts.
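{| class="wikitable"
|+ 95% confidence intervals for ΔEdits coefficients (rows: authors, columns: edits)
! Authors \ Edits !! 0to1 !! 2to3 !! 4to5 !! 6to10 !! 11to20 !! 21plus
|-
! 0to1
| [-0.2363, 0.1388] || [2.0017, 9.5829] || [-68.5579, -7.3759] || [-23.7075, 69.0244] (6plus) || ||
|-
! 2to4
| || [-0.1165, 0.4505] || [-1.3767, 3.9628] || [-7.9265, 8.2607] || [-21.4096, 48.4426] (11plus) ||
|-
! 5to10
| || || [35.6722, 72.5160] || [-21.8929, -9.6498] || [29.1537, 49.2346] || [-181.4473, -71.4612]
|-
! 11plus
| || || || || [-25.8172, -4.5761] || [-1.5569, 10.0016]
|}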
=== Conclusions ===
Something odd is happening with the data. The model seems to explain quite a bit of the variation, but I would not have expected the signs on the coefficients that I have seen. The aggregate data that I am using do not give sufficient insight to support a good explanation. I suspect that I will have to examine the article-level data very carefully, working more closely with individual articles, to explain the values that I am getting.
*[http://www.honors.montana.edu/~jjc/new_stats4.txt the data]
*[http://www.honors.montana.edu/~jjc/wikipedia_programs.tar.gz the programs used]

Jrincayc/Wikipedia Growth Paper



Looks like it is going to be a fascinating paper. I was just wondering how you were going to operationalize “good edits” and “good articles” in your data collection.
:I wonder that myself. (That happens to be my major problem with the paper.) User:Jrincayc 05:53, 11 Dec 2003 (UTC)

''Production of Wikipedia Content''
The first thing I thought of was to use standard production theory and treat “contributions” as a variable input. Contributions could be defined as some combination of “length of article” plus “number of edits”, I guess. Some sort of total product curve, average product curve, and marginal product curve could be created to indicate the areas of increasing returns to contributions, decreasing returns to contributions, the optimum level of contribution, etc. But I don’t know if that really helps you to define “improvement in an article”. I look forward to further installments on this very interesting topic.
:As the number of edits/size of article increases, the onset of diminishing returns to contributions (per article) is complicated by positive network externalities at the systemic level. User:Mydogategodshat 17:45, 12 Dec 2003 (UTC)
::Yes, so somehow any useful model is going to have to take into account some approximation of both the effect on the article and the effect on the encyclopedia. Separating the effects is going to be hard. User:Jrincayc 15:37, 14 Dec 2003 (UTC)

== Model V3 Proposal ==
My next idea for a model (the second version was the one that was used in the handed-in paper) is to work at the article level. For each article, try to predict the number of edits done to the article. The predictor variables will be the number of months since the last edit, the number of previous edits, the number of previous authors, some author/edit interaction terms, and various encyclopedia size statistics (total articles, total edits, articles with more than twenty authors ...). This will hopefully be able to identify the encyclopedia's effect on the article and compare it to the article's effect on itself. This might be able to tease apart the two separate effects. User:Jrincayc 15:37, 14 Dec 2003 (UTC)
:Just one comment: Using a single regression equation with "number of edits" as the dependent variable may entail validity problems. In particular, can "number of edits" really act as a proxy for "quality of article"? Maybe, but there are many very POV articles that receive heavy editing. I suggest you regress "article size" as well as "number of edits" on the independent variables, that is, do the procedure twice. Neither article size nor number of edits is a really good proxy for article quality, but if you regress each of them, you will be able to compare.
:OK, another comment now that I think of it: I understand you used OLS. What do you think about using a stepwise regression? This might be useful given that some of your independent variables will have very high explanatory power (such as "number of previous edits"), and some will have very low. It would also be useful in checking your interaction terms.
:User:Mydogategodshat 06:58, 12 Feb 2004 (UTC)


These materials are based on Wikipedia and licensed under the GNU FDL


