Wikibench
=========

Wikibench contains three software modules:

1. wikiloader: parses archived Wikipedia XML dumps and inserts the data
   into the database
2. tracebench: pre-processes a Wikipedia trace so that it can be used
   by the load injector
3. wikijector: the load injector

Wikiloader can be used independently from tracebench and wikijector.
Tracebench and wikijector, however, must be used according to the
workflow described below.

Up-to-date software, documentation, datasets and publications about
WikiBench can always be found at http://www.wikibench.eu/

/*******************************************************************/

wikiloader

Prerequisites

1. The Expat library for XML parsing. It can be downloaded from
   http://sourceforge.net/projects/expat/
   Some Linux distributions provide this package in their software
   repository.

Characteristics

The following features of the program (also referred to as the dumper
in this file) are optional and can be activated with the parameters
described in the next sections:

1. Deleting existing records from the tables into which data will be
   inserted, in order to avoid primary key conflicts.
2. Temporarily disabling indexing for the tables into which data will
   be inserted, in order to speed up the process. Indexing is enabled
   again after the insertion process is over.
3. Logging the number of pages processed to a file in order to recover
   from failures. The user can choose the interval at which the number
   of processed pages is written to the log file.
4. Restarting a failed insertion process. If logging was enabled and
   the user has the last logged number of processed pages, it can be
   used to restart the process from the point of failure.

Run and compile

The package contains a script, "wikiloader.sh", that compiles the
dumper and starts running it. The script reports how long the dumper
ran, measured in seconds. A typical invocation of the dumper looks
like this:

./wikiloader.sh [parameters]

Execution parameters

The dumper supports the following parameters:

./wikiloader.sh -h          ; displays a list of the possible
                            ; parameters of the script
./wikiloader.sh -d          ; deletes existing data from the tables:
                            ; page, revision, text
./wikiloader.sh -i          ; turns indexing off before inserting data
                            ; into the tables, and turns it back on
                            ; when the whole dumping process finishes
./wikiloader.sh -l filepath page_interval
                            ; turns on logging; filepath is the path of
                            ; the log file, and page_interval specifies
                            ; after how many processed pages the current
                            ; page_id is written to the log file
./wikiloader.sh -r page_id  ; restarts page insertion from page_id
                            ; after a failure
./wikiloader.sh             ; default

The dumper can be run with any combination of the above parameters,
with the condition that if an option takes more than one argument
(such as "-l"), the arguments must be given in the order shown above.
For example, "./wikiloader.sh -d -l filepath page_interval" works
correctly: it deletes the data in the tables and, during the insert
statements, logs the page count to the file. In contrast,
"./wikiloader.sh -d -l page_interval filepath" results in an error.

The default run does not delete the data in the tables, does not
remove indexes from the tables, does not log page ids to a file, and
cannot be used to restart the dumping from a certain page id.

Upon execution a console dialog is launched and you are asked to
provide the following additional parameters: the database name, the
username and the path to the dump file.
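To make the logging and restart features concrete, a possible
failure-recovery sequence is sketched below. The log path, the interval
of 1000 pages and the page id 154000 are hypothetical values, and
combining "-l" with "-r" in this particular order is an assumption;
adapt everything to your own run:

./wikiloader.sh -d -i -l /tmp/wikiloader.log 1000
                            ; hypothetical first run: clean tables,
                            ; indexing disabled during the inserts,
                            ; progress logged every 1000 pages
tail -n 1 /tmp/wikiloader.log
                            ; after a failure, read the last logged
                            ; page id from the log file
./wikiloader.sh -i -l /tmp/wikiloader.log 1000 -r 154000
                            ; restart the insertion from that page id
                            ; (here assumed to be 154000); "-d" is left
                            ; out so the already inserted data is kept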
Note that after the sources are compiled and the script begins
execution, MySQL will prompt you for the password associated with the
given user.

/*******************************************************************/

Workflow for tracebench and wikijector

Prerequisites

1. Sun Java 1.5 or 1.6
2. The Java-MySQL connector, which can be found in the package
   libmysql-java. It is advised to add the path to mysql.jar to the
   CLASSPATH.
3. HttpComponents-Client and HttpComponents-Core (can be downloaded
   from http://hc.apache.org/downloads.cgi)
4. Ant

Step 1: Preparing the traces

Using the script sort_trace.sh (which can be found in the "scripts"
folder) you can prepare your traces for use with WikiBench. The script
sorts the traces chronologically and deletes the line numbers at the
beginning of each line. It can sort data from multiple .gz archives and
can also select only the traces from a day specified by the user.

Script usage:

./sort_trace.sh <traces folder>
./sort_trace.sh <traces folder> <result file>
./sort_trace.sh <traces folder> <result file> <date>

The user can specify a result file in which the sorted traces will be
saved. The name of this file must have a ".txt" extension, even though
the result of the script will be a .bz2 archive. The script processes
all the .gz archives in the traces folder. If a specific date is given,
only the traces from that date will appear in the result file.

Step 2: tracebench

After the traces are sorted they can be piped into tracebench. To run
tracebench it is first necessary to build TraceBench.jar:

ant build -Dconnector=<path to the MySQL connector jar>

java -jar build/lib/TraceBench.jar <reduction percentage> '<MySQL URI>' <sampling method> <start timestamp>

Parameters:

- reduction percentage: an integer between 0 and 100. If it is 0, the
  tool runs faster, since it only removes unwanted trace lines without
  doing any further sampling.
- MySQL URI: a standard MySQL URI for the MediaWiki database. For more
  information please check:
  http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-configuration-properties.html
  Example: jdbc:mysql://localhost/wikidb?user=root&password=pass
- sampling method:
  plsampling - page-level sampling, which depends entirely on the page
  title: the number of pages in the traces is reduced by removing
  selected page names from the trace completely. There are many ways of
  requesting a page and of requesting related services concerning a
  page name, so some very rare request types may be missed by this
  parser. For the benchmark this does not influence the results, given
  the extremely small percentage of such rare requests.
  sampling - some pages and files are considered static, even though
  they are not. Obvious static files are the skin-1.5/ files, the
  /images/ files and the css files. The x most popular wiki pages are
  also considered and sampled like static files, because we would not
  want to risk such files being removed by page-level sampling (which
  would reduce the size considerably!). These pages include the
  Main_Page and the css and javascript 'pages'.
- start timestamp: the time stamp of the traces. In later traces,
  date_ts is used.

tracebench uses standard input and output. Its output can be archived
or piped directly into WikiBench.
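Putting Steps 1 and 2 together, a complete trace-preparation run could
look like the sketch below. All names and values are hypothetical
placeholders: the date and timestamp formats, the 10% reduction, the
database URI and the assumption that the sorted archive ends up named
sorted.txt.bz2 should all be replaced with your own values:

./sort_trace.sh traces/ sorted.txt 2010-01-23

bunzip2 -c sorted.txt.bz2 | java -jar build/lib/TraceBench.jar 10 'jdbc:mysql://localhost/wikidb?user=root&password=pass' plsampling 2010-01-23 > sampled_trace.txt

The first command sorts all the .gz archives in traces/ and keeps only
the traces of the chosen day; the second pipes the sorted trace through
tracebench with a 10% reduction and page-level sampling. Instead of
being written to a file, the tracebench output can also be piped
straight into the wikijector controller, as shown in Step 3 below.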
Step 3: wikijector

wikijector can run in two modes: controller and worker. The controller
must be started first, with the number of workers specified as an input
parameter. After that, the declared number of workers can be started
one by one.

Basic usage:

Run controller:                 ./wikijector.sh controller <number of workers>
Run controller in verbose mode: ./wikijector.sh vcontroller <number of workers>
Run worker:                     ./wikijector.sh worker

When wikijector runs, the URL of a wiki page is used to launch an HTTP
request. This URL is based on the traces sorted and sampled in the
previous steps. It is composed from a relative path which, in the
sorted traces, has the form /wiki/Main_Page, where "wiki" is the name
of the Wikipedia installation found in the folder "/var/www/".
wikijector rewrites this URL to "/mediawiki/index.php/Main_Page" before
issuing the request. If you are using your own MediaWiki installation,
be sure to modify the source file wikibench/threads/FetchThread.java
and replace "/mediawiki" with the name of your MediaWiki installation.
This applies only to content pages; for css and script files you also
need to replace "/w/" with the name of your MediaWiki installation.

Full tracebench - wikijector workflow:

a. Run the controller:

If you are running this software from the WikibenchServer virtual
machine, there is no need to build or run tracebench, since the machine
already contains a large variety of sampled traces from the interval
23-31 January 2010. These sampled traces are in the folder
/home/wikiuser/samples.

gunzip -c plsampling.xxx.gz | ./wikijector.sh controller <number of workers>

If you are running this software from your own machine, we recommend
first building tracebench:

ant build -Dconnector=<path to the MySQL connector jar>

and then piping its output into the wikijector controller:

bunzip2 -c <sorted trace archive> | java -jar build/lib/TraceBench.jar <reduction percentage> '<MySQL URI>' <sampling method> <start timestamp> | ./wikijector.sh controller <number of workers>

b. Run the workers:

./wikijector.sh worker
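As one final hypothetical end-to-end picture, when using the
pre-sampled traces on the virtual machine: start the controller with an
assumed worker count of two, then start the two workers one by one,
normally from separate shells or machines:

gunzip -c plsampling.xxx.gz | ./wikijector.sh controller 2

./wikijector.sh worker
./wikijector.sh worker

The worker count of 2 and its position on the command line are
assumptions based on the description above; substitute the values that
match your own deployment.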