I wanted to do something in MySQL with a large-N dataset. Raw data at the level of individual imports and exports can be acquired from the British government here. Data are for August 2024.
The first problem, in fact, the hardest part was actually loading the data into MySQL. Firstly, the data are published as a text file in one long meaningless string, as shown below.

Fortunately, I was able to figure out what these meant using the ‘Bulk data set technical specifications 2023’ excel file provided by the government. The trick is that the long string can be broken down into meaningful data depending on the positions of characters within the string. Once you know where the various substrings begin and end, then you can tell MySQL what to do in order to get the data entered correctly into your table. So for exports, I entered he following code, which creates new columns for all the variables in the dataset and then adds the data into them.
CREATE SCHEMA trade;
USE trade;
CREATE TABLE `exports` (
`code` text)
;
LOAD DATA INFILE 'C:/ProgramData/MySQL/MySQL Server 8.0/Uploads/exports 2408b.csv'
INTO TABLE exports
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
;
SELECT *
FROM exports
;
SELECT
SUBSTRING(code,1,6) AS PERREF,
SUBSTRING(code,7,1) AS TYPE,
SUBSTRING(code,8,6) AS MONTHAC,
SUBSTRING(code,14,8) AS COMCODE,
SUBSTRING(code,22,5) AS SITC,
SUBSTRING(code,27,3) AS COD_SEQ,
SUBSTRING(code,30,2) AS COD_ALPHA,
SUBSTRING(code,32,3) AS PORT_SEQ,
SUBSTRING(code,35,3) AS PORT_CODE,
SUBSTRING(code,38,3) AS COO_SEQ,
SUBSTRING(code,41,2) AS COO_ALPHA,
SUBSTRING(code,43,2) AS MODE_OF_TRANSPORT,
SUBSTRING(code,45,12) AS STAT_VALUE,
SUBSTRING(code,57,12) AS NET_MASS,
SUBSTRING(code,69,12) AS SUPP_UNIT,
SUBSTRING(code,81,1) AS SUPPRESSION,
SUBSTRING(code,82,3) AS FLOW,
SUBSTRING(code,85,1) AS REC_TYPE
FROM exports
;
ALTER TABLE exports
ADD PERREF VARCHAR(255),
ADD TYPE VARCHAR(255),
ADD MONTHAC VARCHAR(255),
ADD COMCODE VARCHAR(255),
ADD SITC VARCHAR(255),
ADD COD_SEQ VARCHAR(255),
ADD COD_ALPHA VARCHAR(255),
ADD PORT_SEQ VARCHAR(255),
ADD PORT_CODE VARCHAR(255),
ADD COO_SEQ VARCHAR(255),
ADD COO_ALPHA VARCHAR(255),
ADD MODE_OF_TRANSPORT VARCHAR(255),
ADD STAT_VALUE VARCHAR(255),
ADD NET_MASS VARCHAR(255),
ADD SUPP_UNIT VARCHAR(255),
ADD SUPPRESSION VARCHAR(255),
ADD FLOW VARCHAR(255),
ADD REC_TYPE VARCHAR(255)
;
UPDATE exports
SET PERREF = SUBSTRING(code,1,6),
TYPE = SUBSTRING(code,7,1),
MONTHAC = SUBSTRING(code,8,6),
COMCODE = SUBSTRING(code,14,8),
SITC = SUBSTRING(code,22,5),
COD_SEQ = SUBSTRING(code,27,3),
COD_ALPHA = SUBSTRING(code,30,2),
PORT_SEQ = SUBSTRING(code,32,3),
PORT_CODE = SUBSTRING(code,35,3),
COO_SEQ = SUBSTRING(code,38,3),
COO_ALPHA = SUBSTRING(code,41,2),
MODE_OF_TRANSPORT = SUBSTRING(code,43,2),
STAT_VALUE = SUBSTRING(code,45,12),
NET_MASS = SUBSTRING(code,57,12),
SUPP_UNIT = SUBSTRING(code,69,12),
SUPPRESSION = SUBSTRING(code,81,1),
FLOW = SUBSTRING(code,82,3),
REC_TYPE = SUBSTRING(code,85,1)
;
Then I had to trim some unnecessary leading zeros which was handled with this code:
UPDATE exports
SET STAT_VALUE = TRIM(LEADING '0' FROM STAT_VALUE),
NET_MASS = TRIM(LEADING '0' FROM NET_MASS),
SUPP_UNIT = TRIM(LEADING '0' FROM SUPP_UNIT)
;
This led to a further problem where by the variables STAT_VALUE (value of the export) and NET_MASS were treated as VARCHAR rather than integers which created problems for me later on when I wanted to sort my data. This was handled by the following code. Part of the problem was there were rows with non-numeric data which had to be turned into Null scores.
SELECT *
FROM exports
WHERE STAT_VALUE REGEXP '[^0-9]'
OR NET_MASS REGEXP '[^0-9]'
;
SELECT COUNT(*)
FROM exports
WHERE STAT_VALUE IS NULL
OR STAT_VALUE = ''
OR STAT_VALUE REGEXP '[^0-9]'
;
SELECT COUNT(*)
FROM exports
WHERE NET_MASS IS NULL
OR NET_MASS = ''
OR NET_MASS REGEXP '[^0-9]'
;
UPDATE exports
SET STAT_VALUE = NULL
WHERE STAT_VALUE IS NULL
OR STAT_VALUE = ''
OR STAT_VALUE REGEXP '[^0-9]'
;
UPDATE exports
SET NET_MASS = NULL
WHERE NET_MASS IS NULL
OR NET_MASS = ''
OR NET_MASS REGEXP '[^0-9]'
;
ALTER TABLE exports
MODIFY STAT_VALUE INT,
MODIFY NET_MASS INT
;
The next challenge was to turn the SITC column into something meaningful. SITC stands for Standard International Trade Classification. Each commodity is assigned a 5-digit code, for example the code 00111 stands for ‘Purebred Bovine Breeding Animals’. The first number is the broadest category with each subsequent number providing more detail. So for the code 00111 we have:
0 – ‘Food And Live Animals’
00 – ‘Live Animals Other Than Fish; Crustaceans; Molluscs And Aquatic Invertebrates Of Division 03’
001 – ‘Live Animals Other Than Animals Of Division 03’
0011 – ‘Bovine Animals; Live’
00111 – ‘Purebred Bovine Breeding Animals’
Using this resource, I was able to create tables for each level of code in MySQL. I also used the following code to create new columns, breaking up the SITC code in the original dataset to extract codes at 1, 2, 3, 4 and 5 digit-level:
SELECT SITC AS sitc5,
LEFT(SITC, 4) AS sitc4,
LEFT(SITC, 3) AS sitc3,
LEFT(SITC, 2) AS sitc2,
LEFT(SITC, 1) AS sitc1
FROM imports
;
ALTER TABLE imports
ADD COLUMN sitc4 CHAR(4),
ADD COLUMN sitc3 CHAR(3),
ADD COLUMN sitc2 CHAR(2),
ADD COLUMN sitc1 CHAR(1)
;
UPDATE imports
SET sitc4 = LEFT(SITC, 4),
sitc3 = LEFT(SITC, 3),
sitc2 = LEFT(SITC, 2),
sitc1 = LEFT(SITC, 1)
;
With these columns created I could then start to produce aggregations using the GROUP BY function. But first the above process was repeated for the imports data set.
Now it’s time to start running some queries.
First I wanted to know what the total sum of the value of all imports and exports was for August 2024. This was done using the following query:
SELECT 'imports' AS source, SUM(STAT_VALUE) AS im
FROM imports
UNION ALL
SELECT 'exports', SUM(STAT_VALUE) AS ex
FROM exports
;
This gives us the following:
Imports | £51,481,774,357 |
Exports | £31,103,421,517 |
This query then gives us the balance of trade:
SELECT
(SELECT SUM(STAT_VALUE) FROM exports) - (SELECT SUM(STAT_VALUE) FROM imports) AS difference
;
Which comes up as:
Balance of trade | -£20,378,352,840 |
Next, I wanted to breakdown the value of exports by 1-digit SITC code – the broadest category, in order to see what we were exporting. This required joining the exports table to another table with the labels for each category:
SELECT description, # name of category
SUM(STAT_VALUE) AS total_stat_value # value of exports
FROM exports AS e
JOIN sitc1 AS s
ON e.sitc1 = s.code
GROUP BY description
ORDER BY total_stat_value DESC
;
This returns the following:
MACHINERY AND TRANSPORT EQUIPMENT | £11,093,577,167 |
CHEMICALS AND RELATED PRODUCTS; N.E.S. | £4,602,101,550 |
COMMODITIES AND TRANSACTIONS NOT CLASSIFIED ELSEWHERE IN THE SITC | £4,252,280,267 |
MANUFACTURED GOODS CLASSIFIED CHIEFLY BY MATERIAL | £3,257,213,669 |
MISCELLANEOUS MANUFACTURED ARTICLES | £3,209,819,894 |
MINERAL FUELS; LUBRICANTS AND RELATED MATERIALS | £2,400,327,185 |
FOOD AND LIVE ANIMALS | £1,304,361,811 |
CRUDE MATERIALS; INEDIBLE; EXCEPT FUELS | £705,225,463 |
BEVERAGES AND TOBACCO | £219,102,751 |
ANIMAL AND VEGETABLE OILS; FATS AND WAXES | £39,727,747 |
So, where are all these going to? The following query provides the total value of exports as broken down by country of destination, with just the top ten provided. Note, a table was created to source the names of countries from since all that was provided in the exports data were two-letter country codes. The query performs an inner join in order to bring the country names in.
SELECT Country,
SUM(STAT_VALUE) AS total_stat_value
FROM exports AS e
JOIN countries AS c
ON e.COD_ALPHA = c.Code
GROUP BY Country
ORDER BY total_stat_value DESC
LIMIT 10
;
The top ten countries for exports were:
United States of America | £4,134,401,111 |
Germany | £2,571,978,825 |
Netherlands | £2,158,561,463 |
Ireland | £1,794,609,416 |
Switzerland | £1,556,118,536 |
France | £1,530,179,036 |
Belgium | £1,489,460,491 |
India | £1,091,373,767 |
China | £1,010,313,249 |
Hong Kong | £739,050,474 |
Finally, I wanted to do some more specific queries. This query tells me specifically the value of all imports and exports of ketchup.
SELECT 'imports' AS source, SUM(STAT_VALUE) AS im
FROM imports
WHERE SITC = 09842 # ketchup
UNION ALL
SELECT 'exports', SUM(STAT_VALUE) AS ex
FROM exports
WHERE SITC = 09842
;
It appears we import much more ketchup than we export:
Imports of ketchup | £17,816,935 |
Exports of ketchup | £2,840,002 |
Now the data are entered and we’ve learnt a little bit about them, the next step is to try and visualise them using Tableau. Check out Part 2 of this project of mine.