1. Introduction

This report summarizes how MicroTrace clustered pathogen whole-genome sequencing (WGS) samples. Clusters are defined from pairwise SNP (single-nucleotide polymorphism) distances, which helps identify genetically related isolates and supports outbreak detection and infection control.
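
To make "pairwise SNP distance" concrete, the sketch below counts the positions at which aligned sequences differ and stores the counts in a symmetric matrix. It is an illustration only and not MicroTrace's implementation: the `snp_distance` helper and the toy sequences are hypothetical, and it assumes equal-length, gap-free alignments with no ambiguous bases.

```r
# Hypothetical helper: count positions at which two aligned sequences differ.
# Assumes equal-length, gap-free alignments; this is not MicroTrace's own code.
snp_distance <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Toy aligned sequences (made up for illustration).
seqs <- c(S1 = "ACGTACGT", S2 = "ACGTACGA", S3 = "TCGTACGA")

# Build the symmetric pairwise distance matrix.
n <- length(seqs)
dist_mat <- matrix(0L, n, n, dimnames = list(names(seqs), names(seqs)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    dist_mat[i, j] <- snp_distance(seqs[[i]], seqs[[j]])
  }
}
dist_mat
```

In practice the distances would come from a variant-calling or alignment pipeline rather than from raw string comparison.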

2. Input Data

We read in the cluster assignments produced by MicroTrace. Each row holds a sample and its cluster, along with optional metadata (here collection date, ward, and patient ID).

library(readr)  # provides read_csv()
clusters <- read_csv("data/cluster_assignments.csv")

## Rows: 10 Columns: 5
## ── Column specification ───────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): Sample, Ward, Patient_ID
## dbl  (1): Cluster
## date (1): Collection_Date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(clusters)

## # A tibble: 6 × 5
##   Sample   Cluster Collection_Date Ward   Patient_ID
##   <chr>      <dbl> <date>          <chr>  <chr>     
## 1 Sample_1       1 2023-01-01      Ward_A P001      
## 2 Sample_2       1 2023-01-02      Ward_A P002      
## 3 Sample_3       1 2023-01-03      Ward_A P003      
## 4 Sample_4       1 2023-01-04      Ward_A P004      
## 5 Sample_5       1 2023-01-05      Ward_A P005      
## 6 Sample_6       2 2023-01-06      Ward_B P006
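
Because the metadata columns are optional, downstream code may want to check which ones are present before using them. The following is a minimal sketch, assuming `Sample` and `Cluster` are the only required columns (an assumption; the report does not state this) and using a made-up `clusters_demo` table in place of the real file.

```r
# Toy stand-in for the cluster assignment table (values are made up).
clusters_demo <- data.frame(
  Sample = c("Sample_1", "Sample_2"),
  Cluster = c(1, 1),
  Ward = c("Ward_A", "Ward_A")
)

# Assumed split between required and optional columns; adjust to the real file.
required_cols <- c("Sample", "Cluster")
optional_cols <- c("Collection_Date", "Ward", "Patient_ID")

# Stop early if a required column is absent.
missing_required <- setdiff(required_cols, names(clusters_demo))
if (length(missing_required) > 0) {
  stop("Missing required columns: ", paste(missing_required, collapse = ", "))
}

# Record which optional metadata columns are available for later summaries.
present_optional <- intersect(optional_cols, names(clusters_demo))
present_optional
```

Summaries that depend on a given column (for example the date range) can then be skipped when that column is not in `present_optional`.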

3. Cluster Summary Statistics

We count the samples in each SNP-defined cluster and report, for each cluster, the hospital wards represented and the range of collection dates.

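library(dplyr)  # provides %>%, group_by(), summarise(), n()

# One row per cluster: sample count, wards represented, and collection-date range
summary_table <- clusters %>%
  group_by(Cluster) %>%
  summarise(
    n_samples = n(),
    wards = paste(unique(Ward), collapse = ", "),
    dates = paste(min(Collection_Date), max(Collection_Date), sep = " to ")
  )
summary_table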

## # A tibble: 2 × 4
##   Cluster n_samples wards  dates                   
##     <dbl>     <int> <chr>  <chr>                   
## 1       1         5 Ward_A 2023-01-01 to 2023-01-05
## 2       2         5 Ward_B 2023-01-06 to 2023-01-10

4. SNP Distance Distribution

4.1 Histogram
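
A histogram of this kind could be drawn from a pairwise distance matrix roughly as follows. This is a sketch under stated assumptions: the `dist_mat` values are made up, and the one-bin-per-SNP-count choice is illustrative rather than what MicroTrace does.

```r
# Toy symmetric SNP distance matrix (values are made up).
dist_mat <- matrix(
  c(0, 1, 2, 9,
    1, 0, 1, 8,
    2, 1, 0, 7,
    9, 8, 7, 0),
  nrow = 4, byrow = TRUE
)

# Keep each pair once: the upper triangle excludes the diagonal and duplicates.
pairwise <- dist_mat[upper.tri(dist_mat)]

# Histogram of pairwise distances; breaks chosen to give one bin per SNP count.
h <- hist(
  pairwise,
  breaks = seq(-0.5, max(pairwise) + 0.5, by = 1),
  main = "Pairwise SNP distances",
  xlab = "SNP distance"
)
```

A clear gap between a low-distance group and a high-distance group, as in this toy data, is the kind of pattern that can suggest a sensible clustering threshold.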

4.2 Density Plot
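
A kernel density estimate gives a smoothed view of the same pairwise distances. The sketch below uses base R's `density()` with its default bandwidth on a made-up vector of distances; the values and the choice to start the curve at zero are assumptions for illustration.

```r
# Toy pairwise SNP distances, one value per sample pair (made up).
pairwise <- c(1, 2, 1, 9, 8, 7)

# Kernel density estimate; SNP distances cannot be negative, so start at 0.
dens <- density(pairwise, from = 0)

# Plot the smoothed curve and mark the observed distances along the x-axis.
plot(dens, main = "Density of pairwise SNP distances", xlab = "SNP distance")
rug(pairwise)
```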

5. Dendrogram

The following dendrogram was generated by MicroTrace. Red dashed lines indicate the SNP threshold used to define clusters.
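
A dendrogram with a threshold line can be approximated in base R as shown below. This is a sketch, not MicroTrace's method: the distance values, the single-linkage choice, and the `snp_threshold` of 5 are all assumptions made for illustration.

```r
# Toy symmetric SNP distance matrix for four samples (values are made up).
samples <- paste0("Sample_", 1:4)
dist_mat <- matrix(
  c(0, 1, 9, 8,
    1, 0, 9, 8,
    9, 9, 0, 2,
    8, 8, 2, 0),
  nrow = 4, byrow = TRUE, dimnames = list(samples, samples)
)

# Assumed threshold and linkage; the values MicroTrace actually uses may differ.
snp_threshold <- 5
tree <- hclust(as.dist(dist_mat), method = "single")

# Draw the dendrogram and mark the threshold with a red dashed line.
plot(tree, main = "SNP distance dendrogram", xlab = "", sub = "",
     ylab = "SNP distance")
abline(h = snp_threshold, col = "red", lty = 2)

# Cutting the tree at the threshold yields the cluster assignments.
cluster_id <- cutree(tree, h = snp_threshold)
cluster_id
```

With single linkage, two samples end up in the same cluster whenever a chain of pairs, each within the threshold, connects them. Here Sample_1 and Sample_2 form one cluster and Sample_3 and Sample_4 form another.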

6. Intra-cluster SNP Distance Summary

We summarize the intra-cluster SNP distances for each cluster, including mean, standard deviation, and min/max distances.

intra_stats <- read_csv("data/intra_cluster_stats.csv")

## Rows: 2 Columns: 6
## ── Column specification ───────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): Cluster, Size, Mean, SD, Min, Max
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

intra_stats

## # A tibble: 2 × 6
##   Cluster  Size  Mean    SD   Min   Max
##     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1       1     5   1.7 0.675     1     3
## 2       2     5   2   1.05      1     4
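
For readers who want to recompute such a table, the sketch below derives the same columns from a distance matrix and a vector of cluster assignments. The `intra_cluster_stats` helper and the toy data are hypothetical, and whether MicroTrace uses the sample standard deviation (as `sd()` does) is an assumption.

```r
# Toy distance matrix: between-cluster pairs are 10 SNPs apart (made up).
dist_mat <- matrix(10, 6, 6)
diag(dist_mat) <- 0
dist_mat[1, 2] <- 1; dist_mat[1, 3] <- 2; dist_mat[2, 3] <- 3
dist_mat[4, 5] <- 2; dist_mat[4, 6] <- 4; dist_mat[5, 6] <- 3
# Mirror the upper triangle so the matrix is symmetric.
dist_mat[lower.tri(dist_mat)] <- t(dist_mat)[lower.tri(dist_mat)]

# Toy cluster assignment for the six samples.
cluster_id <- c(1, 1, 1, 2, 2, 2)

# Hypothetical helper: summarise within-cluster pairwise distances.
intra_cluster_stats <- function(dist_mat, cluster_id) {
  rows <- lapply(sort(unique(cluster_id)), function(k) {
    idx <- which(cluster_id == k)
    sub <- dist_mat[idx, idx, drop = FALSE]
    d <- sub[upper.tri(sub)]  # each within-cluster pair counted once
    data.frame(
      Cluster = k, Size = length(idx),
      Mean = mean(d), SD = sd(d), Min = min(d), Max = max(d)
    )
  })
  do.call(rbind, rows)
}

stats_demo <- intra_cluster_stats(dist_mat, cluster_id)
stats_demo
```

Clusters with fewer than three samples need special handling in a sketch like this: a single pair gives an `NA` standard deviation, and a singleton cluster has no pairs at all.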

7. Conclusion

MicroTrace enables rapid and reproducible outbreak cluster detection from SNP distance matrices. This HTML report provides a clear summary of potential outbreak groupings, supporting infection prevention and genomic epidemiology workflows.

MicroTrace Outbreak Clustering Report

Kaitao Lai

2025-07-02