Research Overview – Jubayer Hossain

🔬

Reproducibility

All datasets are publicly archived — every analysis can be independently verified and extended.

📊

Scale

Thousands of samples across tissues, diseases, and cohorts — far beyond single-lab capacity.

💡

Open Science

FAIR data principles enable cross-study meta-analyses and collaborative discovery.

⚙️

Generalisability

Multi-cohort validation increases confidence that findings are not dataset-specific artefacts.

Workflow 01

Bulk RNA-Seq Meta-Analysis

Large-scale transcriptomic discovery using harmonised public datasets from NCBI GEO & SRA

Why public bulk RNA-seq data? Public repositories (GEO, SRA, ArrayExpress) host 100,000+ RNA-seq samples across diseases, tissues, and treatment conditions. Meta-analysing these datasets surfaces differential expression signals that replicate across cohorts and that a single study is rarely powered to detect, which increases statistical power and supports biological generalisability.

🗄️

Data Discovery

GEO · SRA · ArrayExpress

→

🧹

Quality Control

FastQC · MultiQC · Trimmomatic

→

📐

Quantification

STAR · Salmon · featureCounts

→

⚖️

Normalisation

DESeq2 · edgeR · limma-voom

→

📊

Meta-Analysis

MetaVolcanoR · RankProd

→

🎯

Biomarker Panel

DEGs · Pathway Enrichment

GEO · SRA · ArrayExpress · DESeq2 · edgeR · limma · MetaVolcanoR · GSEA · R · Python
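
As a rough illustration of the Meta-Analysis step in this workflow, the sketch below pools per-cohort log2 fold changes for a single gene using a fixed-effect, inverse-variance weighted model. It is a simplified stand-in and not the procedure implemented by MetaVolcanoR or RankProd. The function name `combine_fixed_effect` and the example values are hypothetical, and the p-value relies on a normal approximation.

```python
import math


def combine_fixed_effect(log2_fold_changes, standard_errors):
    """Pool per-cohort effect sizes with inverse-variance weights.

    Returns (pooled_effect, pooled_standard_error, two_sided_p_value).
    """
    if not log2_fold_changes or len(log2_fold_changes) != len(standard_errors):
        raise ValueError("need one standard error per effect size")
    if any(se <= 0 for se in standard_errors):
        raise ValueError("standard errors must be positive")

    # More precise cohorts (smaller standard error) receive larger weights.
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)

    pooled = sum(w * x for w, x in zip(weights, log2_fold_changes)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)

    # Two-sided p-value under a standard normal approximation.
    z = pooled / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled, pooled_se, p_value


# Hypothetical values for one gene measured in three cohorts.
effect, se, p = combine_fixed_effect([1.2, 0.8, 1.0], [0.4, 0.5, 0.3])
print(f"pooled log2FC={effect:.3f}, SE={se:.3f}, p={p:.2e}")
```

A fixed-effect model assumes every cohort shares one true effect. When cohorts differ in tissue, platform, or patient population, a random-effects model is usually the safer choice.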

Workflow 02

Single-Cell Harmonised Framework

Cross-cohort single-cell atlas construction and comparative cell-state analysis

Why public scRNA-seq data? Public single-cell datasets from CellxGene, GEO, and the Human Cell Atlas allow cross-cohort atlas construction at a scale impossible for a single lab. Integrating multiple studies with batch correction reduces study-specific technical variation, reveals cell states conserved across diseases, and gives more statistical power to detect rare populations.

🗄️

Data Retrieval

CellxGene · GEO · HCA

→

🧹

QC & Filtering

Scrublet · DoubletFinder

→

🔗

Batch Integration

Harmony · scVI · BBKNN

→

🗺️

Clustering & UMAP

Leiden · UMAP · t-SNE

→

🏷️

Cell Annotation

CellTypist · ScType · Manual

→

🎯

Comparative Analysis

DEGs · Trajectory · Regulons

CellxGene · HCA · GEO · Seurat · Scanpy · Harmony · scVI · pySCENIC · Python · R
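
The QC & Filtering step in this workflow lists doublet detectors. A common companion filter, sketched below in plain Python, drops cells with too few detected genes or too high a mitochondrial read fraction. This is an illustration only: the function name `passes_qc`, the thresholds (200 genes, 20% mitochondrial reads), and the toy cells are assumptions, and mitochondrial genes are assumed to carry the human `MT-` symbol prefix.

```python
def passes_qc(gene_counts, min_genes=200, max_mito_fraction=0.2):
    """Return True if one cell passes basic quality thresholds.

    gene_counts maps gene symbol -> raw count for a single cell.
    """
    total = sum(gene_counts.values())
    if total == 0:
        return False  # an empty droplet carries no usable signal

    # Number of genes with at least one read in this cell.
    detected = sum(1 for count in gene_counts.values() if count > 0)

    # Share of reads from mitochondrial genes (assumed "MT-" prefix).
    mito = sum(
        count for gene, count in gene_counts.items()
        if gene.upper().startswith("MT-")
    )
    return detected >= min_genes and mito / total <= max_mito_fraction


# Toy cells with only three genes, so min_genes is lowered for the demo.
cells = {
    "healthy": {"MT-CO1": 1, "ACTB": 10, "GAPDH": 9},
    "stressed": {"MT-CO1": 15, "ACTB": 5, "GAPDH": 5},
}
kept = [name for name, counts in cells.items() if passes_qc(counts, min_genes=3)]
print(kept)
```

Suitable thresholds vary by tissue and protocol, so they are normally chosen after inspecting the per-dataset distributions.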

Workflow 03

ML / DL in Genomics

Machine and deep learning for biomarker discovery, disease classification, and precision medicine

Why ML/DL on public omics data? High-dimensional omics data (10,000+ features per sample) strains classical statistical methods, particularly when features far outnumber samples. Machine learning can capture non-linear, multi-feature patterns that predict disease status, prognosis, and drug response. Public datasets provide the large, diverse sample sizes required to train generalisable models, and SHAP attributes each prediction to individual features to support biological interpretation.

🗄️

Multi-omics Input

TCGA · GTEx · GEO

→

⚙️

Feature Engineering

Selection · Scaling · PCA

→

🤖

Model Training

XGBoost · RF · DNN · CNN

→

✅

Validation

Cross-validation · Hold-out

→

🔍

Interpretability

SHAP · LIME · Attention

→

🎯

Biomarker Panel

Diagnosis · Prognosis · Targets

TCGA · GTEx · GEO · scikit-learn · XGBoost · PyTorch · SHAP · Python
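
The Validation step in this workflow mentions cross-validation. A minimal sketch of how k-fold train/test index splits can be generated, using only the standard library, is shown below. The function name `k_fold_indices` and its parameters are hypothetical; in practice a library routine such as scikit-learn's `KFold` or `StratifiedKFold` would normally be used instead.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")

    # Shuffle once with a fixed seed so the splits are reproducible.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Deal shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]

    for held_out in range(k):
        test = folds[held_out]
        train = [
            index
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for index in fold
        ]
        yield train, test


# Each sample appears in exactly one test fold.
for fold_number, (train, test) in enumerate(k_fold_indices(10, k=5)):
    print(fold_number, len(train), len(test))
```

Feature selection, scaling, and PCA should be fitted on the training indices of each fold only and then applied to the held-out samples; fitting them on the full dataset leaks information and inflates performance estimates.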