Full-population web crawl of .gov.uk: URLs, links, and third-level domain graph
===============================================================================

Contact: tom.nicholls@oii.ox.ac.uk

This dataset is the result of a full-population crawl of .gov.uk, aiming to capture a full picture of the scope of public-facing government activity online and the links between different government bodies. Data were collected between 2014-04-14 and 2014-10-29.

Contents:
---------

nodes-all-reduced.tsv: A filtered list of pages in .gov.uk.

links-all-reduced.tsv: A filtered list of links from pages in .gov.uk to other web pages (note that inlinks to .gov.uk pages were not collected, but outlinks to non-.gov.uk pages were).

3ld.graphml: A GraphML file containing a binarised link graph at third-level-domain level, along with attribute data for each node, including registrant data and webometric calculations.

Config: A directory with key configuration files for Heritrix, documenting how the crawl was conducted.

Crawl-processing-pipeline: A directory with code for processing the original WARC files into the data found in this release. This documents the data cleaning and processing stages. Note that using this code would require the original WARC files and a number of external libraries (contact the researcher for more information).

Description:
------------

The data consist of a file of individual URLs fetched during the crawl, and a further file containing pairs of URLs reflecting the HTML links between them. In addition, a GraphML file is provided for a version of the data reduced to third-level domains, with accompanying attribute data for the publishing government organisations and webometric statistics calculated from the third-level-domain link network.

The crawl was carried out with Heritrix, the Internet Archive's web crawler. A list of all registered domains in .gov.uk (and their www.x.gov.uk equivalents) was used as the set of start seeds. Sites outside .gov.uk were excluded. robots.txt files were respected, with the consequence that some .gov.uk sites (and some parts of other .gov.uk sites) were not fetched. Certain other areas were manually excluded, particularly crawler traps (e.g. calendars that will serve an infinite number of pages into the past and future, and websites that return different URLs for each browser session) and the contents of certain large peripheral databases, such as online local authority library catalogues. The full set of regular expressions used to filter the fetched URLs is included in the archive.

On completion of the crawl, the page URLs and link data were extracted from the output WARC files. The page URLs were manually examined and re-filtered to handle various broken web servers and to reduce duplication where multiple views onto the same content were served (for example, where a site was presented at both http://organisation.gov.uk/ and http://www.organisation.gov.uk/ without HTTP redirection between the two). Finally, the link list was filtered against the URL list to remove bogus links, and both lists were map/reduced to a single set of files.

Also included in this release is a derived dataset more useful for high-level work: a GraphML file containing all the link and page information reduced to third-level-domain level (so darlington.gov.uk is treated as a single node, not a large set of pages), with the links binarised to present/not present between each pair of nodes. Each graph node also carries attribute data, including the name of the registering organisation and webometric measures such as PageRank, indegree, and betweenness centrality.
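The derived graph can be read with standard GraphML tooling. A minimal sketch using Python and networkx follows; the attribute key names are whatever is declared in the file's <key> elements, so the sketch prints them rather than assuming them:

    import networkx as nx

    # Load the binarised third-level-domain link graph.
    g = nx.read_graphml("3ld.graphml")
    print(g.number_of_nodes(), "domains,", g.number_of_edges(), "links")

    # Inspect the attribute data attached to one node; the attribute
    # names used are those declared in the <key> elements of 3ld.graphml.
    domain, attrs = next(iter(g.nodes(data=True)))
    print(domain, attrs)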
Page content
------------

The full text of all web pages was also captured as part of the crawl, to allow supervised classification and corpus-linguistic approaches to be applied to governments' activities online. These data are not part of this release, as the textual content is copyright-encumbered. Please contact Tom Nicholls at the address above for more information.
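As an illustration of the webometric measures shipped as node attributes, comparable statistics can be recomputed from the graph itself. A minimal sketch with networkx, noting that parameter choices here (such as the PageRank damping factor) are assumptions and may not match those used to produce the distributed attribute values:

    import networkx as nx

    g = nx.read_graphml("3ld.graphml")

    # Recompute link-analysis statistics comparable to the shipped
    # node attributes. alpha=0.85 is the conventional PageRank damping
    # factor, assumed here rather than taken from the release.
    pagerank = nx.pagerank(g, alpha=0.85)
    betweenness = nx.betweenness_centrality(g)
    indegree = dict(g.in_degree()) if g.is_directed() else dict(g.degree())

    # Print the ten domains with the highest PageRank.
    for domain in sorted(pagerank, key=pagerank.get, reverse=True)[:10]:
        print(domain, pagerank[domain], indegree[domain], betweenness[domain])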