Reproducibility
All datasets are publicly archived, so every analysis can be independently verified and extended.
Scale
Thousands of samples across tissues, diseases, and cohorts, far more than a single lab could generate.
Open Science
FAIR data principles enable cross-study meta-analyses and collaborative discovery.
Generalisability
Multi-cohort validation increases confidence that findings are not dataset-specific artefacts.
Workflow 01
Bulk RNA-seq Meta-Analysis
Large-scale transcriptomic discovery using harmonised public datasets from NCBI GEO & SRA
Why public bulk RNA-seq data?
Public repositories (GEO, SRA, ArrayExpress) host 100,000+ RNA-seq samples across diseases, tissues, and treatment conditions. Meta-analysing these datasets can reveal robust, cross-cohort differential expression signals that a single study is often too small to detect. Pooling evidence across cohorts increases statistical power and makes findings more likely to generalise.
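To make the gain in statistical power concrete, here is a minimal sketch of Fisher's method, one classical way to combine per-cohort p-values for a single gene. The function name and the three p-values are hypothetical, chosen only for illustration. The method assumes the cohorts are independent and it ignores the direction of the effect, so in practice it is usually paired with a check that fold changes agree in sign.

```python
import math

def fisher_combined_pvalue(pvalues):
    """Combine independent p-values for one gene with Fisher's method."""
    if not pvalues:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(pvalues)
    # Test statistic X = -2 * sum(ln p_i); under the null hypothesis it
    # follows a chi-square distribution with 2k degrees of freedom.
    half_x = -sum(math.log(p) for p in pvalues)
    # For even degrees of freedom the chi-square survival function is
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return min(1.0, math.exp(-half_x) * total)

# Hypothetical p-values for the same gene in three independent cohorts.
cohort_pvalues = [0.04, 0.03, 0.06]
combined = fisher_combined_pvalue(cohort_pvalues)
print(f"combined p = {combined:.4f}")
```

In this toy case the combined p-value is smaller than any of the individual ones: three moderately significant cohorts give stronger evidence together than each does alone.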
1. Data Discovery: GEO · SRA · ArrayExpress
2. Quality Control: FastQC · MultiQC · Trimmomatic
3. Quantification: STAR · Salmon · featureCounts
4. Normalisation: DESeq2 · edgeR · limma-voom
5. Meta-Analysis: MetaVolcanoR · RankProd
6. Biomarker Panel: DEGs · Pathway Enrichment
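The normalisation step in the pipeline above can be illustrated with the median-of-ratios idea behind DESeq2's size factors. This is a simplified sketch in plain Python, not DESeq2's implementation; the function name `size_factors` and the toy count matrix are invented for the example.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of raw counts (one per sample).
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        if any(c <= 0 for c in gene):
            continue  # genes with a zero count have no usable geometric mean
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for j, c in enumerate(gene):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no gene has non-zero counts in every sample")
    # Each sample's factor is the median ratio to the per-gene geometric mean.
    return [math.exp(median(r)) for r in log_ratios]

# Toy matrix (hypothetical): sample 2 was sequenced twice as deeply as sample 1.
raw_counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 3],  # skipped when estimating factors because of the zero
]
factors = size_factors(raw_counts)
normalised = [[c / f for c, f in zip(gene, factors)] for gene in raw_counts]
print(factors)
```

Dividing each sample by its size factor puts the two samples on a common scale, so the first three genes end up with equal normalised counts in both samples.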
Workflow 02
Single-Cell Harmonised Framework
Cross-cohort single-cell atlas construction and comparative cell-state analysis
Why public scRNA-seq data?
Public single-cell datasets from CellxGene, GEO, and the Human Cell Atlas allow cross-cohort atlas construction at a scale that is out of reach for a single lab. Integrating multiple studies reduces batch-specific noise, reveals cell states that are conserved across diseases, and gives rare populations enough cells to be detected with greater statistical confidence.
1. Data Retrieval: CellxGene · GEO · HCA
2. QC & Filtering: Scrublet · DoubletFinder
3. Batch Integration: Harmony · scVI · BBKNN
4. Clustering & UMAP: Leiden · UMAP · t-SNE
5. Cell Annotation: CellTypist · ScType · Manual
6. Comparative Analysis: DEGs · Trajectory · Regulons
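To give a feel for what the batch integration step is trying to achieve, the sketch below applies the simplest possible correction: subtracting each study's mean expression for one gene. This is deliberately naive and is not how Harmony, scVI, or BBKNN work; the function name, study labels, and values are all hypothetical.

```python
from collections import defaultdict

def center_by_batch(values, batches):
    """Subtract each batch's mean from its own cells (one gene at a time)."""
    sums = defaultdict(float)
    sizes = defaultdict(int)
    for v, b in zip(values, batches):
        sums[b] += v
        sizes[b] += 1
    means = {b: sums[b] / sizes[b] for b in sums}
    return [v - means[b] for v, b in zip(values, batches)]

# Hypothetical log-expression of one gene in six cells from two studies.
# study_B is shifted upwards by a constant technical offset.
expr = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]
batch = ["study_A"] * 3 + ["study_B"] * 3
corrected = center_by_batch(expr, batch)
print(corrected)
```

Mean-centering removes the constant offset between the two studies, but it would also erase real biology whenever the studies differ in cell-type composition. That limitation is one reason dedicated integration methods are used in practice.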
Workflow 03
ML / DL in Genomics
Machine and deep learning for biomarker discovery, disease classification, and precision medicine
Why ML/DL on public omics data?
High-dimensional omics data (10,000+ features per sample) is difficult to model with classical gene-by-gene statistics alone. Machine learning can capture non-linear, multi-feature patterns that predict disease status, prognosis, and drug response. Public datasets provide the large, diverse sample sizes needed to train models that generalise, and attribution methods such as SHAP help relate predictions back to the underlying biology.
1. Multi-omics Input: TCGA · GTEx · GEO
2. Feature Engineering: Selection · Scaling · PCA
3. Model Training: XGBoost · RF · DNN · CNN
4. Validation: Cross-validation · Hold-out
5. Interpretability: SHAP · LIME · Attention
6. Biomarker Panel: Diagnosis · Prognosis · Targets
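The validation step above can be sketched with k-fold cross-validation around a very simple nearest-centroid classifier. This is a toy illustration in plain Python, standing in for the XGBoost, random forest, or neural network models named in the pipeline; the function names, the two-feature samples, and the labels are all made up. The point it illustrates is that anything learned from the data (here, the class centroids) is fitted on the training folds only.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle sample indices and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cross_validate(X, y, k=3, seed=0):
    """Return the held-out accuracy of a nearest-centroid model per fold."""
    accuracies = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        # Fit on training samples only, so no test information leaks in.
        centroids = {
            label: centroid([X[i] for i in train_idx if y[i] == label])
            for label in {y[i] for i in train_idx}
        }
        correct = 0
        for i in test_idx:
            pred = min(centroids, key=lambda lab: sq_dist(X[i], centroids[lab]))
            correct += pred == y[i]
        accuracies.append(correct / len(test_idx))
    return accuracies

# Hypothetical, well-separated toy data: two features per sample.
X = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
y = ["control"] * 3 + ["disease"] * 3
fold_accuracies = cross_validate(X, y, k=3)
print(fold_accuracies)
```

The folds here are not stratified by class. With real, imbalanced cohorts a stratified split, and ideally an independent hold-out cohort, would give a more trustworthy estimate.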