library(tidyverse)
library(broom)
Introduction to Data Analysis in R
Using NHS Antibiotic Prescribing Data
Introduction
We’ll learn the fundamentals of data analysis using R and the tidyverse. Rather than working with artificial examples, we’ll use real data from the NHS: records of antibiotic prescriptions dispensed by GP practices across England.
Antibiotic resistance is one of the greatest threats to global health, and understanding prescribing patterns helps the NHS target interventions where they’re needed most. By the end of this tutorial, you’ll have learned how to read, explore, transform, summarise, and visualise data. These are skills that you can apply to any dataset you’ll encounter in your career.
The data comes from the NHSBSA Open Data Portal, which publishes the English Prescribing Dataset (EPD) monthly. Before starting the analysis, it’s worth understanding how we got the data because sourcing and preparing data is where most real-world analysis projects spend the majority of their time.
Note: For teaching speed and simplicity, this notebook deliberately uses one month of data. To change this, edit the YAML params at the top of the .Rmd or .qmd source (not the rendered HTML):
params:
use_one_month: true
sample_month: 202511
Set use_one_month: false to run all months. The repository only includes the latest‑month teaching extracts. If you want the full 12‑month data, you must download it locally using scripts/download_antibiotic_prescriptions.R.
Let’s start by loading the packages we need.
The tidyverse is a collection of R packages designed for data science. It includes readr for reading data, dplyr for manipulation, ggplot2 for visualisation, and several others. Loading tidyverse gives us access to all of them.
We also load broom for producing tidy summary tables from model output.
1. Where the Data Comes From
In real life, data rarely arrives clean and ready to use. This section outlines how we sourced the data for this analysis, including some of the challenges involved. If you’re interested in a career in data analysis, this is an important section to follow, because it reflects what the job actually looks like.
Problem: 7 GB per month
The NHSBSA publishes the English Prescribing Dataset monthly. Each month’s file contains every prescription dispensed in England. This includes roughly 10 million rows covering every drug and every GP practice. A single month comes in at 7 GB. We only need antibiotics (BNF section 5.1), which are about 3% of the total. Downloading 7 GB to keep 3% is wasteful, especially if we wish to do it many times over to look at trends over time.
Solution: server-side filtering via API
The NHSBSA portal provides a CKAN API that accepts SQL queries. Instead of downloading everything, we send a query like:
SELECT YEAR_MONTH, PRACTICE_CODE, BNF_CODE, ITEMS, NIC, ...
FROM EPD_SNOMED_202511
WHERE BNF_CODE LIKE '0501%'
This runs on the server and returns only the antibiotic rows: about 200,000 rows per month instead of 10 million. The download script (scripts/download_antibiotic_prescriptions.R) automates this for 12 months, saving each month to its own file so that if the connection drops, we don’t lose progress.
The download script also fetches each practice’s total prescribing volume. This is the count of all prescription items (not just antibiotics) per practice per month. This is done with a second server-side query that groups and sums the full dataset without transferring individual rows. We use this later to compare prescribing patterns across different provider types (GP practices, walk-in centres, urgent care). For deeper analysis of GP practices specifically, we use registered patient numbers as the denominator instead. This gives us a per-capita rate that better reflects how much prescribing each practice does relative to its population.
Issue: the schema change
When we first ran the download for the latest available full year (December 2024 to November 2025), nine months downloaded successfully, but three months silently failed. December 2024, January 2025, and February 2025 returned server errors with no useful message.
After investigation, we discovered the cause: the NHSBSA changed their column names in March 2025. Before March, the antibiotic code column was called BNF_CODE. From March onwards, it became BNF_PRESENTATION_CODE. Our query asked for BNF_PRESENTATION_CODE — a column that didn’t exist in the older months.
| Column | Before March 2025 | From March 2025 |
|---|---|---|
| Drug code | BNF_CODE | BNF_PRESENTATION_CODE |
| Chemical | BNF_CHEMICAL_SUBSTANCE | BNF_CHEMICAL_SUBSTANCE_CODE |
| Daily dose | ADQUSAGE | ADQ_USAGE |
This was relatively easy to fix once identified: detect which schema a particular month uses and adjust the query accordingly.
Lesson: data is messy
This scenario is typical of working with real data. Data providers change column names, restructure formats, introduce inconsistencies between time periods, and rarely announce these changes prominently. The API returned an opaque “Internal Server Error” rather than a helpful “column not found”. A pipeline of processes no doubt underpins this data to make it available for public consumption, and the issues here are relatively minor compared with what you may meet on more complex projects.
Debugging this required:
- Noticing the problem — three months were missing from a 12-month download
- Isolating the cause — testing individual months to find which ones failed
- Investigating the API — querying the column metadata for old vs new months
- Building a fix — writing schema-detection logic that handles both formats
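The last step can be sketched as a small helper. This is illustrative only: the two column names come from the table above, but the function and the idea of listing a month’s columns first (e.g. via the CKAN datastore metadata) are assumptions, not the script’s actual code.

```r
# Sketch of schema detection: given the columns a month's table actually
# exposes, pick the right drug-code column for the SQL query.
# Only the two column names are real; the function is illustrative.
pick_bnf_column <- function(available_cols) {
  if ("BNF_PRESENTATION_CODE" %in% available_cols) {
    "BNF_PRESENTATION_CODE"  # schema from March 2025 onwards
  } else {
    "BNF_CODE"               # schema before March 2025
  }
}

pick_bnf_column(c("YEAR_MONTH", "BNF_CODE", "ITEMS"))               # "BNF_CODE"
pick_bnf_column(c("YEAR_MONTH", "BNF_PRESENTATION_CODE", "ITEMS"))  # "BNF_PRESENTATION_CODE"
```

The same pattern (inspect metadata, then branch) generalises to any upstream schema change.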
This is a pattern you’ll encounter repeatedly in data work. The analysis code below is the visible part. The invisible part includes sourcing, cleaning, validating, and debugging the data, and this is where most of the effort goes. Experienced analysts expect this.
Rule of thumb: In most data projects, more time is spent getting the data right than performing the analysis itself.
Demographic context data
The prescribing data tells us what was prescribed, but not why rates might vary between practices. To make fair comparisons later, we also download demographic data about each practice’s patient population from two additional sources:
- Deprivation scores from the Public Health England Fingertips API — a population-weighted Index of Multiple Deprivation (IMD) score for each practice, reflecting the socioeconomic circumstances of its patients.
- Patient list sizes and age profiles from NHS Digital’s “Patients Registered at a GP Practice” publication — the number of registered patients broken down by five-year age bands.
These are fetched by scripts/download_demographics.R and cached locally. We’ll use them in sections 7 and 8 to explore whether deprivation and age explain some of the variation in prescribing rates.
Lesson: no single dataset tells the whole story
Real-world analysis almost always requires stitching together data from multiple sources. Prescribing records alone can tell us how much was prescribed, but understanding why requires context. Deprivation indices, population demographics and organisational classifications are each held by a different provider, in a different format, with their own quirks. The ability to identify what additional data you need, find it, and link it reliably to your primary dataset is a valuable skill in data analysis.
2. Reading Data
The first step in any analysis is getting your data into suitable software such as R. Our data is stored in a CSV (comma-separated values) file, a common format for tabular data that you can also open in spreadsheet software such as Excel.
Default is one month: This document runs in one-month mode by default. See the note above for the YAML params to change this.
Aggregated input: This version uses a pre-aggregated file at practice + chemical + month level for easier sharing. For the tutorial we use a latest available month extract to keep file sizes small.
# Read the aggregated antibiotic prescribing data (practice + chemical + month)
data_path <- "data/processed/teaching/antibiotics_practice_chemical_latest.csv"
if (!file.exists(data_path)) {
stop(sprintf("Missing %s. Generate it via scripts/download_antibiotic_prescriptions.R.", data_path))
}
antibiotics <- read_csv(data_path) %>%
filter(!is.na(practice_code), practice_code != "-") %>%
filter_to_month()
# Load lookup table for human-readable chemical substance names
chemical_names <- read_csv("data/processed/lookups/bnf_chemical_substance_lookup.csv")
# Look at the first few rows (formatted for decimal alignment)
antibiotics %>%
head(6) %>%
mutate(across(c(items, quantity, adqusage, nic), ~round(., 2))) %>%
knitr::kable(
format = "html",
digits = 2,
align = c("r", "l", "l", "r", "r", "r", "r"),
table.attr = 'style="font-variant-numeric: tabular-nums;"'
  )
| year_month | practice_code | bnf_chemical_substance | items | quantity | adqusage | nic |
|---|---|---|---|---|---|---|
| 202511 | 00L998 | 0501030I0 | 1 | 6 | 6.00 | 0.55 |
| 202511 | 00N998 | 0501012G0 | 1 | 20 | 10.00 | 1.46 |
| 202511 | 00N998 | 0501013B0 | 4 | 121 | 48.67 | 5.00 |
| 202511 | 00N998 | 0501015P0 | 1 | 22 | 0.00 | 11.88 |
| 202511 | 00N998 | 0501030I0 | 1 | 8 | 8.00 | 0.73 |
| 202511 | 00N998 | 0501050B0 | 1 | 14 | 14.00 | 2.28 |
The read_csv() function from the readr package is fast and makes sensible guesses about column types. The head() function shows us the first six rows, giving us a quick preview of what we’re working with.
We also remove rows with missing/placeholder practice codes (e.g. -). Our primary focus is comparing GP surgeries, so we need identifiable practices. In a different context we might consider imputation strategies, but here we keep the analysis grounded in real, traceable organisations.
3. Understanding Your Data
Before diving into analysis, we need to understand what we have. How many rows? How many columns? What does each column represent? Skipping this step is a common mistake as you can’t analyse data you don’t understand.
We saw previously how messy the journey from source to CSV can be. But having clean data isn’t enough: you also need to understand what it represents. In our case, that means knowing what a GP practice is, how BNF codes classify drugs, and that each row captures prescriptions for one practice in one month for one drug formulation. Without that domain context, it’s almost impossible to build the mental map you need to decide how to proceed.
# How big is the dataset?
dim(antibiotics)
[1] 139596      7
# What are the column names?
names(antibiotics)
[1] "year_month"             "practice_code"          "bnf_chemical_substance"
[4] "items" "quantity" "adqusage"
[7] "nic"
# Get a compact summary of the data structure
glimpse(antibiotics)
Rows: 139,596
Columns: 7
$ year_month <int> 202511, 202511, 202511, 202511, 202511, 202511,…
$ practice_code <chr> "00L998", "00N998", "00N998", "00N998", "00N998…
$ bnf_chemical_substance <chr> "0501030I0", "0501012G0", "0501013B0", "0501015…
$ items <dbl> 1, 1, 4, 1, 1, 1, 1, 1, 1, 2, 1, 15, 21, 91, 4,…
$ quantity <dbl> 6, 20, 121, 22, 8, 14, 6, 15, 6, 6, 210, 676, 5…
$ adqusage <dbl> 6.00000, 10.00000, 48.66667, 0.00000, 8.00000, …
$ nic <dbl> 0.55, 1.46, 5.00, 11.88, 0.73, 2.28, 4.07, 0.87…
# Statistical summary of each column
summary(antibiotics)
  year_month     practice_code      bnf_chemical_substance     items
Min. :202511 Length:139596 Length:139596 Min. : 1.00
1st Qu.:202511 Class :character Class :character 1st Qu.: 2.00
Median :202511 Mode :character Mode :character Median : 7.00
Mean :202511 Mean : 18.42
3rd Qu.:202511 3rd Qu.: 21.00
Max. :202511 Max. :1101.00
quantity adqusage nic
Min. : 1.0 Min. : 0.00 Min. : 0.05
1st Qu.: 45.0 1st Qu.: 10.00 1st Qu.: 14.98
Median : 119.0 Median : 54.75 Median : 44.20
Mean : 248.9 Mean : 181.52 Mean : 102.73
3rd Qu.: 304.0 3rd Qu.: 198.50 3rd Qu.: 107.30
Max. :7639.0 Max. :11175.33 Max. :8562.54
Let’s understand what each column means in context:
| Column | Meaning |
|---|---|
| year_month | Time period in YYYYMM format (e.g., 202511 = November 2025) |
| practice_code | Unique identifier for each GP practice |
| bnf_chemical_substance | Code identifying the drug’s active ingredient |
| quantity | Number of physical units dispensed (tablets, capsules, etc.) |
| items | Number of prescription items issued |
| adqusage | Average Daily Quantities — a standardised measure of dose |
| nic | Net Ingredient Cost in pounds (£) |
Key insight: Each row represents all prescriptions of ONE chemical substance, from ONE practice, in ONE month. This data is already aggregated and we can’t see individual patient prescriptions.
What is a BNF code?
The bnf_chemical_substance column uses codes from the British National Formulary (BNF), the standard classification system for drugs prescribed in the NHS. Understanding this hierarchy is essential, because it’s how we identify antibiotics in the data.
The BNF organises all drugs into numbered chapters, each covering a body system or therapeutic area:
- Chapter 1 — Gastro-Intestinal System
- Chapter 2 — Cardiovascular System
- Chapter 3 — Respiratory System
- Chapter 4 — Central Nervous System
- Chapter 5 — Infections
Chapter 5 (Infections) is further divided into sections:
- 5.1 — Antibacterial Drugs (antibiotics — our focus)
- 5.2 — Antifungal Drugs
- 5.3 — Antiviral Drugs
And section 5.1 breaks down into paragraphs representing antibiotic families:
- 5.1.1 — Penicillins (e.g. amoxicillin, flucloxacillin)
- 5.1.2 — Cephalosporins (e.g. cefalexin)
- 5.1.3 — Tetracyclines (e.g. doxycycline)
- 5.1.5 — Macrolides (e.g. clarithromycin, azithromycin)
This hierarchy is encoded in BNF codes. In this aggregated dataset we keep the chemical substance code (e.g. 0501013B0 for amoxicillin) rather than the full 15-character presentation code.
- 05 — Chapter 5: Infections
- 0501 — Section 5.1: Antibacterial Drugs
- 050101 — Paragraph 5.1.1: Penicillins
- 0501013B0 — Chemical substance (amoxicillin)
- 0501013B0AAAAAA — Specific product, strength, and form
Key point: This is why filtering with str_starts(bnf_chemical_substance, "0501") captures every antibiotic in our dataset. Any code starting with “0501” belongs to BNF section 5.1 — Antibacterial Drugs.
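As a quick check of the prefix logic, here is a minimal example with a toy vector of codes. The first code is the real amoxicillin code from the text; the other two are made-up codes that merely carry section 5.2 and chapter 4 prefixes.

```r
library(stringr)

# Toy vector: one section 5.1 code and two codes from other BNF areas
codes <- c("0501013B0",  # section 5.1: antibacterial (amoxicillin)
           "0502000C0",  # a section 5.2 prefix (antifungal) - illustrative
           "0401000K0")  # a chapter 4 prefix (CNS) - illustrative
str_starts(codes, "0501")
# [1]  TRUE FALSE FALSE
```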
Lesson: understand the domain, not just the data
It’s tempting to jump straight into code, but time spent understanding what the data represents pays for itself many times over. Knowing that a GP practice is a primary care clinic, that BNF codes encode a drug hierarchy, and that each row is an aggregation rather than an individual prescription: these aren’t mere technical details; they’re the foundation every later decision rests on. Without that context, you can write perfectly valid code that answers the wrong question. The best analysts invest as much effort in learning the problem domain as they do in learning the tools.
4. Data Transformation
Raw data rarely comes in the exact form we need. We often need to create new variables, filter to relevant subsets, or reshape the data. The dplyr package provides intuitive functions for these tasks.
Selecting Columns with select()
When you only need certain columns, select() keeps your data tidy and reduces clutter.
# Keep only the columns we need for a prescribing analysis
antibiotics %>%
select(practice_code, bnf_chemical_substance, items, quantity, adqusage) %>%
head() %>%
mutate(across(where(is.numeric), ~round(., 1))) %>%
knitr::kable(format.args = list(big.mark = ","))
| practice_code | bnf_chemical_substance | items | quantity | adqusage |
|---|---|---|---|---|
| 00L998 | 0501030I0 | 1 | 6 | 6.0 |
| 00N998 | 0501012G0 | 1 | 20 | 10.0 |
| 00N998 | 0501013B0 | 4 | 121 | 48.7 |
| 00N998 | 0501015P0 | 1 | 22 | 0.0 |
| 00N998 | 0501030I0 | 1 | 8 | 8.0 |
| 00N998 | 0501050B0 | 1 | 14 | 14.0 |
The Pipe Operator
The %>% pipe comes from the tidyverse (specifically the magrittr package). It takes the output of one function and passes it as input to the next. Read it as “then”. R 4.1+ also provides a built-in native pipe |> which works similarly.
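A minimal side-by-side comparison of the three styles:

```r
library(magrittr)  # provides %>% (also loaded as part of the tidyverse)

# Three equivalent ways to write "take 10, then square-root it, then round":
round(sqrt(10), 2)          # nested calls, read inside-out
10 %>% sqrt() %>% round(2)  # magrittr pipe, read left to right
10 |> sqrt() |> round(2)    # native pipe (R 4.1+)
# all three return 3.16
```

Pipes shine when a pipeline has many steps, because the order you read matches the order of execution.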
New Variables with mutate()
Let’s calculate the average quantity per prescription item. This tells us how many physical units (tablets, capsules, etc.) are dispensed per prescription.
# Add a new column for quantity per item
antibiotics <- antibiotics %>%
mutate(quantity_per_item = quantity / items)
# Check it worked
antibiotics %>%
select(practice_code, bnf_chemical_substance, items, quantity, quantity_per_item) %>%
head(10) %>%
knitr::kable(digits = c(0, 0, 0, 0, 1))
| practice_code | bnf_chemical_substance | items | quantity | quantity_per_item |
|---|---|---|---|---|
| 00L998 | 0501030I0 | 1 | 6 | 6.0 |
| 00N998 | 0501012G0 | 1 | 20 | 20.0 |
| 00N998 | 0501013B0 | 4 | 121 | 30.2 |
| 00N998 | 0501015P0 | 1 | 22 | 22.0 |
| 00N998 | 0501030I0 | 1 | 8 | 8.0 |
| 00N998 | 0501050B0 | 1 | 14 | 14.0 |
| 00N998 | 0501130R0 | 1 | 6 | 6.0 |
| 00P998 | 0501013B0 | 1 | 15 | 15.0 |
| 00P998 | 0501021L0 | 1 | 6 | 6.0 |
| 00P998 | 0501030I0 | 2 | 6 | 3.0 |
Filtering Rows with filter()
Perhaps we only want to look at a specific time period or a subset of practices.
# How many rows in November 2025?
antibiotics %>%
filter(year_month == 202511) %>%
nrow()
[1] 139596
# How many rows have large quantities (over 100 units per item)?
antibiotics %>%
filter(quantity_per_item > 100) %>%
nrow()
[1] 5996
5. Summarising Data
A powerful pattern in data analysis is split-apply-combine: split the data into groups, apply a calculation to each group, and combine the results. In dplyr, we use group_by() and summarise().
Summarising by Drug Type
Which antibiotics are prescribed most frequently?
# Total prescriptions by chemical substance
by_chemical <- antibiotics %>%
group_by(bnf_chemical_substance) %>%
summarise(
total_items = sum(items, na.rm = TRUE),
total_quantity = sum(quantity, na.rm = TRUE),
n_records = n()
) %>%
left_join(chemical_names, by = "bnf_chemical_substance") %>%
arrange(desc(total_items))
# Show top 10
by_chemical %>%
head(10) %>%
mutate(
total_items = format(total_items, big.mark = ","),
total_quantity = format(round(total_quantity), big.mark = ","),
n_records = format(n_records, big.mark = ",")
) %>%
select(bnf_chemical_substance_name, total_items,
total_quantity, n_records) %>%
knitr::kable(
col.names = c("Chemical Substance", "Total Items", "Total Quantity",
"Records"),
align = c("l", "r", "r", "r")
)
| Chemical Substance | Total Items | Total Quantity | Records |
|---|---|---|---|
| Amoxicillin | 625,733 | 5,726,206 | 7,337 |
| Doxycycline hyclate | 334,022 | 1,260,655 | 7,361 |
| Nitrofurantoin | 297,930 | 1,259,418 | 7,156 |
| Flucloxacillin sodium | 279,086 | 4,097,336 | 7,309 |
| Phenoxymethylpenicillin (Penicillin V) | 181,801 | 6,496,624 | 6,919 |
| Clarithromycin | 142,694 | 1,533,259 | 7,051 |
| Trimethoprim | 127,759 | 2,300,926 | 6,879 |
| Co-amoxiclav (Amoxicillin/clavulanic acid) | 96,145 | 1,810,309 | 6,982 |
| Azithromycin | 75,667 | 869,453 | 6,152 |
| Cefalexin | 74,237 | 1,804,366 | 6,533 |
Handling Missing Values
The na.rm = TRUE argument tells R to ignore missing values when calculating sums and means. Without it, a single missing value would make the entire result NA.
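A two-line demonstration of the behaviour:

```r
x <- c(5, 10, NA, 20)
sum(x)               # NA: one missing value poisons the whole sum
sum(x, na.rm = TRUE) # 35: the NA is dropped before summing
```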
Summarising by Practice
Which practices have the highest antibiotic prescribing rate? Raw totals can be misleading because larger practices naturally prescribe more, so we normalise by the number of registered patients.
# Load patient list sizes
practice_demographics <- read_csv("data/processed/lookups/practice_demographics.csv",
show_col_types = FALSE)
# Total prescriptions by practice, normalised per 1,000 patients
by_practice <- antibiotics %>%
group_by(practice_code) %>%
summarise(
total_items = sum(items, na.rm = TRUE),
n_different_drugs = n_distinct(bnf_chemical_substance)
) %>%
inner_join(practice_demographics %>% select(practice_code, list_size),
by = "practice_code") %>%
filter(list_size >= 100) %>%
mutate(items_per_1000 = round((total_items / list_size) * 1000)) %>%
arrange(desc(items_per_1000))
# Show top 10 practices
by_practice %>%
head(10) %>%
mutate(
items_per_1000 = format(items_per_1000, big.mark = ","),
total_items = format(total_items, big.mark = ","),
list_size = format(list_size, big.mark = ",")
) %>%
select(practice_code, list_size, total_items, items_per_1000, n_different_drugs) %>%
knitr::kable(
col.names = c("Practice Code", "Registered Patients", "Total Items",
"Items per 1,000 Patients", "No. Different Drugs"),
align = c("l", "r", "r", "r", "r")
)
| Practice Code | Registered Patients | Total Items | Items per 1,000 Patients | No. Different Drugs |
|---|---|---|---|---|
| Y00054 | 116 | 51 | 440 | 10 |
| Y05857 | 136 | 50 | 368 | 13 |
| Y02625 | 1,225 | 335 | 273 | 21 |
| L84029 | 3,297 | 856 | 260 | 24 |
| Y06345 | 1,448 | 371 | 256 | 21 |
| Y02751 | 11,228 | 2,416 | 215 | 23 |
| Y06113 | 460 | 95 | 207 | 13 |
| Y04225 | 1,191 | 184 | 154 | 19 |
| Y08411 | 306 | 46 | 150 | 10 |
| E87711 | 285 | 41 | 144 | 10 |
The n_distinct() function counts unique values. In this case it’s how many different antibiotic types each practice prescribed.
Note the very high rates at the top of this table; we’ll explore them further shortly.
6. Visualisation
Now let’s turn those numbers into charts. We’ll use ggplot2, which builds plots layer by layer using a “grammar of graphics”.
Top 10 Antibiotics (Bar Chart)
Which antibiotics are prescribed most often?
ggplot(by_chemical %>% head(10),
aes(x = reorder(bnf_chemical_substance_name, total_items),
y = total_items)) +
geom_col(fill = "#F57C00", colour = "white") +
coord_flip() +
scale_y_continuous(labels = scales::comma) +
labs(
title = "Top 10 Antibiotics by Prescription Volume",
x = NULL,
y = "Total Items Prescribed"
) +
theme_minimal()
The bar chart makes it easy to compare volumes. Using coord_flip() turns vertical bars horizontal, making labels easier to read.
Prescribing Rates (Overview)
So far we’ve been looking at raw prescription data. To compare practices fairly, we can use a rate. We’ll start with the number of antibiotic items per 1,000 total prescriptions. This measures the share of a practice’s prescribing that is antibiotics; it needs no external data and is just a simple proportion from the prescribing data itself.
# Load practice-level monthly data
practice_data_summary <- read_csv("data/processed/teaching/antibiotics_practice_summary_latest.csv",
show_col_types = FALSE) %>%
filter_to_month() %>%
filter(practice_name != "UNIDENTIFIED DOCTORS")
# Summarise antibiotic items per practice
practice_rates <- practice_data_summary %>%
group_by(practice_code, practice_name) %>%
summarise(
total_items = sum(total_all_items, na.rm = TRUE),
abx_items = sum(antibiotic_items, na.rm = TRUE),
.groups = "drop"
) %>%
filter(abx_items >= 100) %>%
mutate(abx_rate_per_1000 = (abx_items / total_items) * 1000)
cat(sprintf("Practices with 100+ antibiotic items: %s\n",
            format(nrow(practice_rates), big.mark = ",")))
Practices with 100+ antibiotic items: 6,261
Prescribing Rate Distribution
ggplot(practice_rates, aes(x = abx_rate_per_1000)) +
geom_histogram(binwidth = 5, fill = "#1976D2", colour = "white") +
labs(
title = "Distribution of Antibiotic Prescribing Rates Across Practices",
x = "Antibiotic Items per 1,000 Prescriptions",
y = "Number of Practices"
) +
theme_minimal()The histogram reveals a roughly bell-shaped distribution with a long right tail. There are a handful of practices with prescribing rates two or three times higher than the majority. Are these practices genuinely over-prescribing, or is something else going on?
Antibiotic Items vs Practice Size
Let’s now check whether practice size (registered patients) relates to antibiotic volume. If larger practices prescribe proportionally more antibiotics, we’d expect a strong positive relationship. This would confirm patient list size as a sensible denominator for later analysis.
ggplot(practice_rates %>%
inner_join(practice_demographics %>% select(practice_code, list_size),
by = "practice_code"),
aes(x = list_size, y = abx_items)) +
geom_point(alpha = 0.3, colour = "#1976D2") +
geom_smooth(method = "lm", se = FALSE, colour = "#C62828", linetype = "dashed") +
scale_x_continuous(labels = scales::comma) +
scale_y_continuous(labels = scales::comma) +
labs(
title = "Larger Practices Prescribe More Antibiotics",
subtitle = "Strong positive relationship confirms patient list size as a sensible denominator",
x = "Registered Patients",
y = "Total Antibiotic Items (analysis month)"
) +
theme_minimal()The strong linear relationship confirms that practices with more patients prescribe proportionally more antibiotics. This means we can meaningfully normalise by list size when we need per-capita rates later.
Classifying Provider Type
The NHS Organisation Data Service (ODS) assigns each prescribing organisation a role code. We can use this to distinguish GP practices from walk-in centres, out-of-hours services, and other settings.
# Load ODS prescriber data (headerless CSV; col 1 = practice code, col 26 = role code)
ods_raw_practice_roles <- read_csv("data/raw/epraccur.csv", col_names = FALSE, show_col_types = FALSE)
# Regex used to identify non-traditional GP services by name
non_gp_pattern <- "WALK.?IN|GP.?LED|EXTENDED ACCESS|URGENT CARE|FEDERAT|FEDERATION|NETWORK|GPA\\b"
ods_roles_practice <- ods_raw_practice_roles %>%
select(practice_code = X1, role_code = X26) %>%
mutate(
primary_role = sub("\\|.*", "", role_code),
provider_type = case_when(
primary_role == "RO76" ~ "GP Practice",
primary_role == "RO80" ~ "Out of Hours",
primary_role == "RO87" ~ "Walk-in Centre",
primary_role == "RO259" ~ "Urgent & Emergency Care",
primary_role == "RO321" ~ "Extended Access / Hub",
TRUE ~ "Other"
)
) %>%
select(practice_code, provider_type) %>%
distinct(practice_code, .keep_all = TRUE)
# Join provider type to practice rates
practice_rates <- practice_rates %>%
left_join(ods_roles_practice, by = "practice_code") %>%
mutate(
provider_type = if_else(is.na(provider_type), "Other", provider_type),
provider_type = if_else(provider_type == "GP Practice" &
grepl(non_gp_pattern, toupper(practice_name)),
"Other",
provider_type)
)
# How many of each type?
practice_rates %>%
count(provider_type, sort = TRUE) %>%
knitr::kable(col.names = c("Provider Type", "Count"), align = c("l", "r"))
| Provider Type | Count |
|---|---|
| GP Practice | 5821 |
| Other | 164 |
| Out of Hours | 125 |
| Extended Access / Hub | 72 |
| Urgent & Emergency Care | 52 |
| Walk-in Centre | 27 |
Prescribing Rates by Provider Type
ggplot(practice_rates, aes(x = reorder(provider_type, abx_rate_per_1000, FUN = median),
y = abx_rate_per_1000)) +
geom_boxplot(fill = "#E3F2FD", colour = "grey40") +
coord_flip() +
labs(
title = "Antibiotic Prescribing Rates by Provider Type",
subtitle = "Walk-in centres and urgent care have much higher rates than GP practices",
x = NULL,
y = "Antibiotic Items per 1,000 Prescriptions"
) +
theme_minimal()
The outliers are now much more easily explained. Walk-in centres, urgent care hubs, and out-of-hours services see mostly acute illness (infections, injuries), so antibiotics make up a much larger share of their prescriptions. What looked like alarming outliers are entirely expected once you account for provider type.
A Key Lesson
Context matters in data analysis. What initially looks like a striking result may become easily explainable with more exploration and a tighter approach.
7. Simple Statistical Analysis
Visualisation helps us spot patterns, but statistics help us quantify them. In Section 6, we saw that provider type explains the most extreme outliers. Now let’s filter to GP practices only and ask what explains the remaining variation.
gp_practices_analysis <- practice_rates %>%
filter(provider_type == "GP Practice")
cat(sprintf("GP practices for analysis: %d\n", nrow(gp_practices_analysis)))
GP practices for analysis: 5821
Adding Demographic Context
One plausible explanation is age profile. Elderly patients are more susceptible to respiratory and urinary tract infections, so practices with older populations might reasonably prescribe more antibiotics. We can explore that first, then check deprivation (IMD).
# Join to GP practices (add list_size to recompute rate per 1,000 patients)
gp_with_imd <- gp_practices_analysis %>%
left_join(practice_demographics %>% select(practice_code, imd_score, list_size, pct_over_65),
by = "practice_code") %>%
filter(!is.na(imd_score)) %>%
filter(!is.na(list_size), list_size >= 100) %>%
mutate(
abx_rate_per_1000 = (abx_items / list_size) * 1000,
log_rate = log(abx_rate_per_1000)
)
cat(sprintf("Dropped for small/unknown list sizes: %d\n",
            nrow(gp_practices_analysis) - nrow(gp_with_imd)))
Dropped for small/unknown list sizes: 16
cat(sprintf("GP practices with IMD data: %d\n", nrow(gp_with_imd)))
GP practices with IMD data: 5805
Correlation
Correlation measures the strength and direction of a linear relationship between two variables. It ranges from -1 (perfect negative relationship) to +1 (perfect positive relationship), with 0 meaning no linear relationship.
Age Profile vs Prescribing Rate
age_correlation <- cor(gp_with_imd$pct_over_65,
gp_with_imd$abx_rate_per_1000,
use = "complete.obs")
print(paste("Correlation between % over 65 and prescribing rate:",
            round(age_correlation, 3)))
[1] "Correlation between % over 65 and prescribing rate: 0.5"
ggplot(gp_with_imd, aes(x = pct_over_65, y = log_rate)) +
geom_point(alpha = 0.3, colour = "#1976D2") +
geom_smooth(method = "lm", se = TRUE, colour = "#C62828", linetype = "dashed") +
labs(
title = "Does Age Profile Affect Antibiotic Prescribing?",
subtitle = paste("Correlation:", round(age_correlation, 3)),
x = "Patients Aged 65+ (%)",
y = "log(Antibiotic Items per 1,000 Patients)"
) +
theme_minimal()The percentage of patients over 65 alone explains about 25% of the variation in prescribing rates. The pattern makes clinical sense.
Deprivation vs Prescribing Rate
Age profile is a strong predictor. What about deprivation? The Index of Multiple Deprivation (IMD) measures how deprived an area is; higher scores mean greater deprivation.
imd_correlation <- cor(gp_with_imd$imd_score,
gp_with_imd$abx_rate_per_1000,
use = "complete.obs")
print(paste("Correlation between IMD score and prescribing rate:",
            round(imd_correlation, 3)))
[1] "Correlation between IMD score and prescribing rate: 0.052"
ggplot(gp_with_imd, aes(x = imd_score, y = log_rate)) +
geom_point(alpha = 0.3, colour = "#1976D2") +
geom_smooth(method = "lm", se = TRUE, colour = "#C62828", linetype = "dashed") +
labs(
title = "Does Local Deprivation Affect Antibiotic Prescribing?",
subtitle = paste("Correlation:", round(imd_correlation, 3)),
x = "IMD Score (higher = more deprived)",
y = "log(Antibiotic Items per 1,000 Patients)"
) +
theme_minimal()There is a weak positive correlation. Practices in more deprived areas tend to prescribe slightly more antibiotics per patient. This is consistent with higher infection burden in disadvantaged populations. However the relationship is weak and IMD alone explains a tiny fraction of variation.
But even the strongest single predictor leaves most of the variation unexplained. What if we combined multiple predictors into a single model? We can do that using regression-based methods, which is what we’ll do in the next section.
8. Log-Rate Linear Model
So far, we’ve explored the data using summaries and visualisation. Now we’ll take it further by building a linear regression model on the log of the prescribing rate. Taking logs stabilises the variance and reduces the influence of high-rate outliers. There are other choices we could consider here, for example negative binomial or Poisson regression, but for simplicity we will restrict ourselves to a simple linear model for the purpose of this tutorial.
The model predicts a practice’s log prescribing rate from deprivation and age profile.
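The shape of that fit can be sketched now. This is a minimal illustration on simulated data, reusing the variable names created in Section 7 (`log_rate`, `imd_score`, `pct_over_65`); the coefficients below are invented, and the real fit uses the practice-level table assembled in the rest of this section.

```r
library(broom)

# Simulate practice-level data with the same variable names as Section 7
# (all numbers here are made up for illustration)
set.seed(42)
toy <- data.frame(
  imd_score   = runif(300, min = 5, max = 50),
  pct_over_65 = runif(300, min = 5, max = 30)
)
toy$log_rate <- 3 + 0.005 * toy$imd_score + 0.03 * toy$pct_over_65 +
  rnorm(300, sd = 0.2)

# Fit the log-rate model and tidy the output with broom
model <- lm(log_rate ~ imd_score + pct_over_65, data = toy)
tidy(model)               # one row per coefficient: estimate, std.error, p.value
glance(model)$r.squared   # share of variation explained
```

`broom::tidy()` is why we loaded broom at the start: it turns the awkward `summary(lm(...))` printout into a regular data frame we can filter and plot like any other.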
8.1 Loading Practice-Level Data
In sections 2–7, we worked with raw prescription records (one row per drug per practice per month). Now we’ll reuse the pre-aggregated file that summarises each practice’s total prescribing activity per month, including both antibiotic and non-antibiotic items.
# Reuse practice-level monthly data loaded in Section 5
head(practice_data_summary)# A tibble: 6 × 11
practice_code icb_code year_month practice_name antibiotic_items
<chr> <chr> <int> <chr> <dbl>
1 A81001 QHM 202511 THE DENSHAM SURGERY 129
2 A81002 QHM 202511 QUEENS PARK MEDICAL CENTRE 896
3 A81004 QHM 202511 ACKLAM MEDICAL CENTRE 484
4 A81005 QHM 202511 SPRINGWOOD SURGERY 310
5 A81006 QHM 202511 TENNANT STREET MEDICAL PRA… 712
6 A81007 QHM 202511 BANKHOUSE SURGERY 366
# ℹ 6 more variables: antibiotic_quantity <dbl>, antibiotic_nic <dbl>,
# total_all_items <dbl>, abx_rate_per_1000 <dbl>, cost_per_item <dbl>,
# cost_per_quantity <dbl>
This file has one row per practice per month, with columns for total items, antibiotic items, and costs. Because we’re working with a single month, we’ll use that month directly rather than building a 12-month summary.
# Single-month practice summary
practice_summary_one_month <- practice_data_summary %>%
group_by(practice_code, practice_name) %>%
summarise(
total_items_1m = sum(total_all_items, na.rm = TRUE),
abx_items_1m = sum(antibiotic_items, na.rm = TRUE),
.groups = "drop"
) %>%
filter(abx_items_1m >= 100) # Exclude tiny practices with unstable rates
cat(sprintf("Practices with 100+ antibiotic items: %d
", nrow(practice_summary_one_month)))Practices with 100+ antibiotic items: 6261
cat(sprintf("Analysis month: %d
", unique(practice_data_summary$year_month)))Analysis month: 202511
Why Filter to 100+ Items?
Very small practices produce unreliable rates: a handful of prescriptions either way would swing the rate dramatically. The 100-item threshold keeps most practices while removing the most unstable small denominators.
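A quick illustration with two invented practices shows the effect: one extra prescription moves a tiny practice's rate a hundred times further than a large one's.

```r
# Hypothetical practices (invented numbers, not from the dataset)
rate_per_1000 <- function(items, list_size) items / list_size * 1000

# Small practice: 8 antibiotic items, 400 patients
rate_per_1000(8, 400)       # 20 items per 1,000
rate_per_1000(9, 400)       # 22.5 -- one extra item shifts the rate by 2.5

# Large practice: 800 items, 40,000 patients
rate_per_1000(800, 40000)   # 20 items per 1,000
rate_per_1000(801, 40000)   # 20.025 -- the same change barely registers
```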
8.2 Filtering to GP Practices
As in Section 7, we need to filter to traditional GP practices before modelling. The prescribing data includes walk-in centres, out-of-hours services, and other non-GP settings that have fundamentally different prescribing profiles. We reuse the same three-layer filter from earlier: ODS role code, keyword matching, and manual exclusion list.
# Reuse ODS roles from Section 6 to filter to GP practices
practice_summary_one_month <- practice_summary_one_month %>%
left_join(ods_roles_practice, by = "practice_code") %>%
mutate(provider_type = if_else(is.na(provider_type), "Other", provider_type)) %>%
filter(provider_type == "GP Practice",
!grepl(non_gp_pattern, toupper(practice_name))) %>%
select(-provider_type)
cat(sprintf("GP practices for modelling: %d
", nrow(practice_summary_one_month)))GP practices for modelling: 5821
8.3 Adding Demographic Context
Prescribing rates don’t exist in a vacuum. A practice in a deprived area with an elderly population will naturally see more infections than one in an affluent area with young professionals. To make fair comparisons, we need to account for these differences.
# Reuse practice demographics (created by scripts/download_demographics.R)
# Preview the columns we'll use in the model
practice_demographics %>%
select(practice_code, list_size, imd_score, pct_over_65, pct_under_5) %>%
head()# A tibble: 6 × 5
practice_code list_size imd_score pct_over_65 pct_under_5
<chr> <dbl> <dbl> <dbl> <dbl>
1 A81001 3890 32.4 11.3 2.35
2 A81002 18738 31.9 11.6 2.21
3 A81004 11343 28.0 10.2 2.74
4 A81005 7651 15.4 15.9 1.67
5 A81006 14651 35.6 10.1 2.30
6 A81007 10614 34.4 9.51 2.39
The demographic variables are:
| Variable | Meaning |
|---|---|
| list_size | Number of registered patients at the practice — used to compute prescribing rates per 1,000 patients |
| imd_score | Index of Multiple Deprivation — higher values mean more deprived populations, which are associated with higher infection rates |
| pct_over_65 | Percentage of registered patients aged 65+ — elderly patients have more respiratory and urinary tract infections |
| pct_under_5 | Percentage of registered patients aged under 5 — young children have frequent acute illness |
# Join demographics to GP practices
gp_practices_model <- practice_summary_one_month %>%
left_join(
practice_demographics %>% select(practice_code, list_size, imd_score, pct_over_65, pct_under_5),
by = "practice_code"
) %>%
filter(!is.na(imd_score), !is.na(pct_over_65)) %>%
filter(!is.na(list_size), list_size >= 500) %>%
mutate(rate_per_1000 = (abx_items_1m / list_size) * 1000,
log_rate = log(rate_per_1000))
n_dropped <- nrow(practice_summary_one_month) - nrow(gp_practices_model)
cat(sprintf("GP practices with demographics: %d
", nrow(gp_practices_model)))GP practices with demographics: 5805
cat(sprintf("Dropped (missing demographics): %d
", n_dropped))Dropped (missing demographics): 16
It’s reassuring that we don’t lose many records when matching to the demographic data.
8.4 Fitting the Log-Rate Model
We model the log of the prescribing rate so that extreme values have less influence.
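A tiny illustration of what taking logs does, using invented rates:

```r
# Invented rates per 1,000 patients, with one extreme outlier
rates <- c(50, 100, 200, 1600)
rates / min(rates)   # 1, 2, 4, 32: the outlier is 32x the smallest value
log(rates)           # ~3.9, 4.6, 5.3, 7.4: the gap shrinks dramatically
diff(log(rates))     # equal ratios become equal differences (log(2) between doublings)
```

On the log scale a doubling always moves the outcome by the same amount, so a single extreme practice no longer dominates the fit.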
# Fit the log-rate linear model
model_log_rate <- lm(
log_rate ~ imd_score + pct_over_65 + pct_under_5,
data = gp_practices_model
)
summary(model_log_rate)
Call:
lm(formula = log_rate ~ imd_score + pct_over_65 + pct_under_5,
data = gp_practices_model)
Residuals:
Min 1Q Median 3Q Max
-1.66205 -0.13776 0.00401 0.14080 2.07539
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.3609629 0.0215429 109.59 <2e-16 ***
imd_score 0.0077401 0.0002963 26.13 <2e-16 ***
pct_over_65 0.0708426 0.0010104 70.11 <2e-16 ***
pct_under_5 0.1697899 0.0060962 27.85 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2303 on 5801 degrees of freedom
Multiple R-squared: 0.4595, Adjusted R-squared: 0.4592
F-statistic: 1644 on 3 and 5801 DF, p-value: < 2.2e-16
8.5 Interpreting the Coefficients
We’ll use broom::tidy() to get a clean coefficient table.
# Tidy coefficient table
tidy(model_log_rate, conf.int = TRUE)# A tibble: 4 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 2.36 0.0215 110. 0 2.32 2.40
2 imd_score 0.00774 0.000296 26.1 2.40e-142 0.00716 0.00832
3 pct_over_65 0.0708 0.00101 70.1 0 0.0689 0.0728
4 pct_under_5 0.170 0.00610 27.9 2.45e-160 0.158 0.182
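Because the model is on the log scale, each coefficient corresponds to a multiplicative change in the rate: exp(coefficient) is the factor by which the expected rate changes for a one-unit increase in the predictor. Plugging in the estimates from the table above:

```r
# Multiplicative interpretation of the log-scale coefficients reported above
b_imd     <- 0.00774
b_over_65 <- 0.0708
b_under_5 <- 0.170

exp(10 * b_imd)   # ~1.08: a 10-point rise in IMD score -> roughly 8% higher rate
exp(b_over_65)    # ~1.07: one extra percentage point of over-65s -> ~7% higher rate
exp(b_under_5)    # ~1.19: one extra percentage point of under-5s -> ~19% higher rate
```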
8.6 Goodness of Fit
model_summary <- summary(model_log_rate)
cat(sprintf("R-squared: %.3f (%.1f%% of variation explained)
",
model_summary$r.squared, model_summary$r.squared * 100))R-squared: 0.460 (46.0% of variation explained)
cat(sprintf("Adjusted R-squared: %.3f
", model_summary$adj.r.squared))Adjusted R-squared: 0.459
cat(sprintf("F-statistic: %.1f (p < 2.2e-16)
",
model_summary$fstatistic[1]))F-statistic: 1643.9 (p < 2.2e-16)
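Adjusted R-squared applies a small penalty for each predictor, which is why it sits just below the raw R-squared here: with thousands of observations and only three predictors the penalty is tiny. A quick hand calculation using the figures reported above confirms this:

```r
# Adjusted R-squared from first principles, using values from summary() above
r2 <- 0.4595   # multiple R-squared
n  <- 5805     # observations
p  <- 3        # predictors
1 - (1 - r2) * (n - 1) / (n - p - 1)   # ~0.4592, matching the summary output
```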
8.7 Identifying Outlier Practices
With the model fitted, we can ask: which GP practices prescribe significantly more or fewer antibiotics than their demographics would predict? These are the practices worth investigating further — not because high prescribing is necessarily wrong, but because unexplained deviation from expectation is a prompt for inquiry.
We use standardised residuals to measure how far each practice sits from the model’s prediction, in units of standard deviation.
# Classify each practice by its standardised residual
gp_practices_model <- gp_practices_model %>%
mutate(
fitted_log_rate = fitted(model_log_rate),
std_residual = rstandard(model_log_rate),
residual_flag = case_when(
std_residual >= 3 ~ "Extreme high (>3 SD)",
std_residual >= 2 ~ "High (2-3 SD)",
std_residual <= -3 ~ "Extreme low (<-3 SD)",
std_residual <= -2 ~ "Low (2-3 SD)",
TRUE ~ "Expected"
)
)
n_high_2sd <- sum(gp_practices_model$std_residual >= 2, na.rm = TRUE)
n_high_3sd <- sum(gp_practices_model$std_residual >= 3, na.rm = TRUE)
n_low_2sd <- sum(gp_practices_model$std_residual <= -2, na.rm = TRUE)
cat(sprintf("Practices prescribing MORE than expected:\n"))Practices prescribing MORE than expected:
cat(sprintf(" > 2 SD above: %d practices\n", n_high_2sd)) > 2 SD above: 118 practices
cat(sprintf(" > 3 SD above: %d practices\n", n_high_3sd)) > 3 SD above: 15 practices
cat(sprintf("Practices prescribing LESS than expected:\n"))Practices prescribing LESS than expected:
cat(sprintf(" > 2 SD below: %d practices\n", n_low_2sd)) > 2 SD below: 150 practices
- A residual is the difference between actual and predicted log rate: actual − predicted.
- A positive residual means the practice prescribes more than its demographics suggest.
- Standardised residuals divide by the residual standard error, putting everything on a common scale.
- Values beyond ±2 are unusual (~5% of practices); beyond ±3 are very unusual (~0.3%).
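If rstandard() feels like a black box, this self-contained check on simulated data shows the formula it applies: the raw residual divided by the residual standard error, with a leverage adjustment for each point.

```r
# Verify the standardised-residual formula on toy data
set.seed(42)
toy <- data.frame(x = 1:20)
toy$y <- 2 + 0.5 * toy$x + rnorm(20)
fit <- lm(y ~ x, data = toy)

# residual / (residual SE, adjusted for that point's leverage)
manual <- resid(fit) / (sigma(fit) * sqrt(1 - hatvalues(fit)))
all.equal(unname(manual), unname(rstandard(fit)))   # TRUE
```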
residual_colors <- c(
"Expected" = "#2E7D32",
"High (2-3 SD)" = "#F57C00",
"Extreme high (>3 SD)" = "#C62828",
"Low (2-3 SD)" = "#42A5F5",
"Extreme low (<-3 SD)" = "#1565C0"
)
sigma <- model_summary$sigma
ggplot(gp_practices_model, aes(x = fitted_log_rate, y = log_rate)) +
# 1:1 reference line (perfect prediction)
geom_abline(intercept = 0, slope = 1,
colour = "grey40", linetype = "dashed", linewidth = 0.8) +
# +/- 2 SD bands
geom_abline(intercept = 2 * sigma, slope = 1,
colour = "#F57C00", linetype = "dotted", linewidth = 0.5) +
geom_abline(intercept = -2 * sigma, slope = 1,
colour = "#42A5F5", linetype = "dotted", linewidth = 0.5) +
# +/- 3 SD bands
geom_abline(intercept = 3 * sigma, slope = 1,
colour = "#C62828", linetype = "dotted", linewidth = 0.5) +
geom_abline(intercept = -3 * sigma, slope = 1,
colour = "#1565C0", linetype = "dotted", linewidth = 0.5) +
# Points coloured by residual category
geom_point(aes(colour = residual_flag), alpha = 0.5, size = 2) +
scale_colour_manual(values = residual_colors, drop = FALSE) +
labs(
title = "Which GP Practices Prescribe More Antibiotics Than Expected?",
subtitle = sprintf(
"Linear model (R\u00b2 = %.2f) accounts for deprivation and age profile",
model_summary$r.squared
),
x = "Model-predicted log rate (per 1,000 patients)",
y = "Actual log prescribing rate (per 1,000 patients)",
colour = "Residual\ncategory"
) +
theme_minimal(base_size = 14)Practices above the dashed line prescribe more than the model predicts given their demographics; those below prescribe less. The dotted bands mark the ±2 and ±3 standard deviation thresholds. A practice beyond ±3 SD is worth closer attention — though a large residual tells you something is unusual, not that something is wrong.
9. Practical Considerations
Dealing with Large Data
For teaching we worked with a single month to keep file sizes small and the workflow fast. In real‑world analysis you should aim to use the full dataset and build infrastructure to handle it:
- Seek to use all the data for robust conclusions.
- Consider more compute (or GPU‑backed workflows) for heavier modelling.
- Data work takes time: even this tutorial, despite several simplifications, involved substantial preparation, and in practice there are many more avenues to explore.
Reproducibility
Analysis can become much more useful if others can reproduce it. Beyond the environment, reproducibility means being able to rebuild results and trust that changes don’t introduce bugs:
- Use version control (git) so changes are tracked and reversible.
- Document the environment and packages others need to run your work successfully.
- Run tests or validation checks to confirm code still works as expected.
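For example, R offers a built-in way to record the environment, and the renv package (an assumption here: any dependency-locking tool works) goes further by pinning exact versions:

```r
# Built into R: record the R version and loaded packages in your report
sessionInfo()

# Package-based dependency locking with renv (must be installed separately):
# install.packages("renv")
# renv::init()       # create a project-local package library
# renv::snapshot()   # write exact package versions to renv.lock
```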
Using AI (with oversight)
AI can speed up coding and drafting, but statistical choices still need human judgement:
- AI can be very useful for boilerplate and exploration.
- It helps with documentation and knowledge capture.
- It can generate tests and validation checks.
- Architectural consistency can become an issue without strong human ownership.
- Model choices, thresholds, and transformations require careful human review.
Conclusion
In this tutorial, you’ve learned the fundamental workflow of data analysis:
- Read data into R using read_csv()
- Understand your data with dim(), names(), summary(), and glimpse()
- Transform data using select(), mutate(), and filter()
- Summarise data with group_by() and summarise()
- Visualise patterns with histograms, bar charts, and boxplots
- Analyse relationships with correlation and scatter plots
- Model continuous outcomes with linear regression and interpret residuals as outlier flags
These skills should transfer to any dataset.
Beyond the technical skills, five analytical lessons emerged along the way:
- Most of the work is data preparation. Sourcing, cleaning, and linking data took more effort than the modelling itself.
- Understand the domain, not just the data. Knowing what a GP practice is, how BNF codes work, and what each row represents shaped every decision we made.
- No single dataset tells the whole story. We needed prescribing records, demographics, and organisational classifications to draw meaningful conclusions.
- Apparent outliers may be easily explainable. The most extreme prescribers turned out to be walk-in centres.
- Model results are prompts for inquiry, not conclusions. A large residual tells you something is unusual, not that something is wrong.
Next Steps
This analysis used a simple linear model with three demographic predictors and achieved a modest R², meaning most of the variation in prescribing rates remains unexplained. There are two directions to improve on this.
Bring in more data. At present the model only reflects deprivation and age, so it misses other plausible drivers of prescribing. Any additional sources should go through the same pipeline: explicit sourcing, careful cleaning, and defensible joining, following the workflow used in sections 1–4.
Use a more appropriate model. We modelled the log rate with lm(), which treats the outcome as an unbounded continuous value and ignores the fact that the underlying data are counts. In reality, prescribing rates are counts divided by a population. Generalised linear models (GLMs) such as Poisson or negative binomial regression are designed for count data and would be a natural next step.
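As a concrete sketch of that next step, here is the shape such a model could take. To keep the example runnable on its own, it uses simulated data standing in for gp_practices_model (the column names mirror ours, but all values and coefficients are invented); with the real data you would drop the simulation and reuse the existing data frame. The offset(log(list_size)) term is what turns a count model into a rate model.

```r
# Sketch only: Poisson and negative binomial rate models on SIMULATED data
set.seed(1)
sim <- data.frame(
  list_size   = sample(2000:20000, 500, replace = TRUE),
  imd_score   = runif(500, 5, 45),
  pct_over_65 = runif(500, 5, 30)
)
mu <- with(sim, list_size / 1000 * exp(2 + 0.008 * imd_score + 0.05 * pct_over_65))
sim$abx_items <- rnbinom(500, mu = mu, size = 5)   # overdispersed counts

# Poisson GLM: offset(log(list_size)) models the rate, not the raw count
pois_fit <- glm(abx_items ~ imd_score + pct_over_65 + offset(log(list_size)),
                family = poisson, data = sim)

# Negative binomial relaxes the Poisson assumption that mean = variance
# (MASS:: namespace call avoids masking dplyr::select via library(MASS))
nb_fit <- MASS::glm.nb(abx_items ~ imd_score + pct_over_65 + offset(log(list_size)),
                       data = sim)
```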
Useful Links
- ESPAUR Report (UKHSA / GOV.UK) — official surveillance of antibiotic use and resistance in England
- BNF 5.1: Antibacterial Drugs (OpenPrescribing) — explore national antibiotic prescribing trends by drug and practice
- Prescribing Data: BNF Codes (Bennett Institute) — how the 15-character BNF code hierarchy works
Data source: English Prescribing Dataset (EPD), NHSBSA Open Data Portal