Full-population web crawl of .gov.uk: URLs, links, and third-level domain graph
===============================================================================

Contact: tom.nicholls@oii.ox.ac.uk

This dataset is the result of a full-population crawl of .gov.uk, aiming to capture a full picture of the scope of public-facing government activity online and the links between different government bodies. Data were collected between 2014-04-14 and 2014-10-29.

Contents:
---------

nodes-all-reduced.tsv: A filtered list of pages in .gov.uk.

links-all-reduced.tsv: A filtered list of links from pages in .gov.uk to other web pages (note that inlinks to .gov.uk pages were not collected, but outlinks to non-.gov.uk pages were).

3ld.graphml: A GraphML file containing a binarised link graph at third-level-domain level, along with attribute data for each node, including registrant data and webometric calculations.

Config: A directory with key configuration files for Heritrix, documenting how the crawl was conducted.

Crawl-processing-pipeline: A directory with code for processing the original WARC files into the data found in this release. This documents the data cleaning and processing stages. Note that using this code would require the original WARC files and a number of external libraries (contact the researcher for more information).

Description:
------------

The data consist of a file of individual URLs fetched during the crawl, and a further file containing pairs of URLs reflecting the HTML links between them. In addition, a GraphML file is provided for a version of the data reduced to third-level domains, with accompanying attribute data for the publishing government organisations and webometric statistics calculated from the third-level-domain link network.

The crawl was carried out with Heritrix, the Internet Archive's web crawler. A list of all registered domains in .gov.uk (and their www.x.gov.uk equivalents) was used as the set of start seeds. Sites outside .gov.uk were excluded. robots.txt files were respected, with the consequence that some .gov.uk sites (and some parts of other .gov.uk sites) were not fetched. Certain other areas were manually excluded, particularly crawler traps (e.g. calendars that will serve an infinite number of pages into the past and future, and websites that return different URLs for each browser session) and the contents of certain large peripheral databases, such as online local authority library catalogues. The full set of regular expressions used to filter the fetched URLs is included in the archive.

On completion of the crawl, the page URLs and link data were extracted from the output WARC files. The page URLs were manually examined and re-filtered to handle various broken web servers and to reduce duplication where multiple views onto the same content were served (for example, where a site was presented at both http://organisation.gov.uk/ and http://www.organisation.gov.uk/ without HTTP redirection between the two). Finally, the link list was filtered against the URL list to remove bogus links, and both lists were map/reduced to a single set of files.

Also included in this release is a derived dataset more useful for high-level work: a GraphML file containing all the link and page information reduced to third-level-domain level (so darlington.gov.uk is treated as a single node, not a large set of pages), with the links binarised to present/not present between each pair of nodes. Each graph node also carries attribute data, including the name of the registering organisation and webometric measures such as PageRank, indegree, and betweenness centrality.
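The derived graph can be read with standard GraphML tooling. A minimal sketch using Python and networkx follows; the attribute key names are whatever is declared in the file's <key> elements, so the sketch prints them rather than assuming them:

    import networkx as nx

    # Load the binarised third-level-domain link graph.
    g = nx.read_graphml("3ld.graphml")
    print(g.number_of_nodes(), "domains,", g.number_of_edges(), "links")

    # Inspect the attribute data attached to one node; the attribute
    # names used are those declared in the <key> elements of 3ld.graphml.
    domain, attrs = next(iter(g.nodes(data=True)))
    print(domain, attrs)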
Page content
------------

The full text of all web pages was also captured as part of the crawl, to allow supervised classification and corpus-linguistic approaches to be applied to governments' activities online. These data are not part of this release, as the textual content is copyright-encumbered. Please contact Tom Nicholls at the address above for more information.
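As an illustration of the webometric measures shipped as node attributes, comparable statistics can be recomputed from the graph itself. A minimal sketch with networkx, noting that parameter choices here (such as the PageRank damping factor) are assumptions and may not match those used to produce the distributed attribute values:

    import networkx as nx

    g = nx.read_graphml("3ld.graphml")

    # Recompute link-analysis statistics comparable to the shipped
    # node attributes. alpha=0.85 is the conventional PageRank damping
    # factor, assumed here rather than taken from the release.
    pagerank = nx.pagerank(g, alpha=0.85)
    betweenness = nx.betweenness_centrality(g)
    indegree = dict(g.in_degree()) if g.is_directed() else dict(g.degree())

    # Print the ten domains with the highest PageRank.
    for domain in sorted(pagerank, key=pagerank.get, reverse=True)[:10]:
        print(domain, pagerank[domain], indegree[domain], betweenness[domain])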