Page 399 - Contributed Paper Session (CPS) - Volume 4
CPS2449 Louisa Nolan et al.
Figure 2: location of road traffic sensors in England (red dots) and of
the ports analysed (blue pins)
2.2 Understanding the characteristics of high-growth companies using
novel data sources
The goal of this project was to use novel data sources to explore the
characteristics of firms with high growth. We used four data sources: the UK’s
statistical business register (the Inter-Departmental Business Register, IDBR); a
high-growth flag constructed by the Department for Business, Energy and
Industrial Strategy (BEIS) from HMRC VAT data; a dataset from GlassAI, a
start-up that shared a random sample of data from 30,000 UK company websites,
including company descriptions, sectors, mentions, news articles, job adverts
and bios; and geolocations of UK retail clusters from the Ordnance Survey.
The IDBR, high-growth flag and GlassAI data were linked, giving a total
sample of 5,500 companies, of which 8.6% were high-growth.
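The linkage step can be illustrated with a minimal sketch. It assumes the three sources share a common firm identifier (here a hypothetical key such as "A1") and that only firms present in all three are kept; the records and field names are invented for illustration and are not the project's data or code.

```python
# Hypothetical records: the identifiers and field names are assumptions.
idbr = {
    "A1": {"employees": 12, "sector": "retail"},
    "A2": {"employees": 250, "sector": "manufacturing"},
    "A3": {"employees": 40, "sector": "software"},
    "A4": {"employees": 7, "sector": "retail"},
}
high_growth_flag = {"A1": False, "A2": True, "A3": False}
website_sample = {
    "A1": {"description": "family-run shop"},
    "A2": {"description": "precision engineering"},
    "A3": {"description": "cloud analytics"},
    "A5": {"description": "consultancy"},
}


def link_sources(register, flags, websites):
    """Inner join: keep only firms that appear in all three sources."""
    common = register.keys() & flags.keys() & websites.keys()
    return {
        ref: {**register[ref], **websites[ref], "high_growth": flags[ref]}
        for ref in sorted(common)
    }


def high_growth_share(linked):
    """Fraction of linked firms that carry the high-growth flag."""
    if not linked:
        return 0.0
    return sum(rec["high_growth"] for rec in linked.values()) / len(linked)


linked = link_sources(idbr, high_growth_flag, website_sample)
share = high_growth_share(linked)
print(len(linked), f"{share:.1%}")
```

Under this assumption, a firm missing from any one source drops out of the linked sample, which is one way a linked sample can end up much smaller than the largest input.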
Supervised classification with a gradient boosted classifier (GBC) was used
to identify the features of high-growth firms, first from the IDBR data
alone, and then from the IDBR data linked to the GlassAI data.
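To make the idea of a gradient boosted classifier concrete, the following is a small self-contained sketch, not the model used in the project. It assumes binary labels, one-split trees (stumps) fitted to the negative gradient of the log loss, and a crude importance measure that counts how often each feature is chosen for a split. The toy features (employee count and firm age) and all function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Best single-feature threshold split (least squared error on residuals).

    Assumes at least one feature takes more than one distinct value.
    """
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lmean, rmean = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - lmean) ** 2 for r in left)
                   + sum((r - rmean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, lmean, rmean)
    return best[1:]  # (feature index, threshold, left value, right value)


def fit_gbc(X, y, n_rounds=20, learning_rate=0.5):
    """Boost stumps on the log-loss gradient; y holds 0/1 labels of both classes."""
    p = sum(y) / len(y)
    base = math.log(p / (1 - p))  # initial score: log-odds of the positive class
    scores = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of the log loss with respect to the current score.
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, thr, lval, rval = fit_stump(X, residuals)
        stumps.append((j, thr, lval, rval))
        scores = [s + learning_rate * (lval if row[j] <= thr else rval)
                  for row, s in zip(X, scores)]
    return {"base": base, "rate": learning_rate, "stumps": stumps}


def predict_proba(model, row):
    """Predicted probability that a firm is high-growth."""
    score = model["base"] + model["rate"] * sum(
        lval if row[j] <= thr else rval
        for j, thr, lval, rval in model["stumps"])
    return sigmoid(score)


def split_counts(model):
    """Crude feature importance: number of stumps that split on each feature."""
    counts = {}
    for j, _, _, _ in model["stumps"]:
        counts[j] = counts.get(j, 0) + 1
    return counts


# Toy data: columns are [employees, firm age in years]; 1 = high-growth.
X = [[10, 2], [200, 3], [50, 1], [15, 20], [300, 15], [60, 30], [25, 4], [120, 25]]
y = [1, 1, 1, 0, 0, 0, 1, 0]

model = fit_gbc(X, y)
predictions = [int(predict_proba(model, row) > 0.5) for row in X]
importance = split_counts(model)
print(predictions, importance)
```

In practice a library implementation with deeper trees, held-out evaluation and a more robust importance measure would be preferable; with a high-growth share below 10%, class imbalance would also need attention.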
Spatial analysis was carried out to investigate whether high growth is
related to location within retail clusters, which may be seen as a broad
proxy for density of economic activity. Finally, topic analysis was carried
out on the GlassAI textual data.
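One simple way to frame the spatial question is to flag each firm as inside or outside a retail cluster and compare the high-growth share in the two groups. The sketch below assumes, purely for illustration, that each cluster is represented by a centroid and a fixed radius; real cluster boundaries are more likely to be polygons, and the coordinates, radius and function names here are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def in_any_cluster(lat, lon, clusters, radius_km=1.0):
    """True if the point lies within radius_km of any cluster centroid."""
    return any(haversine_km(lat, lon, clat, clon) <= radius_km
               for clat, clon in clusters)


def growth_rate_by_location(firms, clusters, radius_km=1.0):
    """High-growth share for firms inside (True) and outside (False) clusters."""
    counts = {True: [0, 0], False: [0, 0]}  # [high-growth firms, all firms]
    for lat, lon, is_high_growth in firms:
        inside = in_any_cluster(lat, lon, clusters, radius_km)
        counts[inside][0] += int(is_high_growth)
        counts[inside][1] += 1
    return {inside: (hg / total if total else None)
            for inside, (hg, total) in counts.items()}


# Illustrative cluster centroids (latitude, longitude).
clusters = [(51.5074, -0.1278), (53.4808, -2.2426)]

# Illustrative firms: (latitude, longitude, high-growth flag).
firms = [
    (51.5080, -0.1280, True),
    (51.5100, -0.1300, False),
    (53.4810, -2.2430, True),
    (52.0000, -1.0000, False),
    (54.0000, -1.5000, False),
    (50.8000, -1.1000, True),
]

rates = growth_rate_by_location(firms, clusters)
print(rates)
```

A comparison of this kind is only descriptive; a formal test, and controls for sector and firm size, would be needed before reading the difference as an effect of location.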
2.3 Optimus - a tool to turn free text into hierarchical datasets
Many datasets contain variables that consist of short free-text descriptions
of items or products. Optimus is a tool developed with the Department for
Environment, Food and Rural Affairs (DEFRA) to understand shipping
manifests of ferry journeys. The manifests are short, messy text descriptions of
cargo on lorries boarding ferries. The huge variation in detail, scale of
description and how items are recorded (such as incorrect spellings or
388 | I S I W S C 2 0 1 9