Computational Pipeline for Cancer Neoantigen Discovery

NYU Grossman School of Medicine (2025)

During my summer at NYU Grossman, I contributed to the development of a computational pipeline to identify cancer-relevant neoantigen candidates from mutation-driven peptide changes. I focused on building reproducible data-processing workflows, generating peptide libraries, and producing outputs optimized for downstream immunology-focused filtering and prioritization. I’m still actively working on and extending this project.

Poster

Download poster (PDF)

Abstract

Neoantigens—peptides uniquely presented on tumor cells—are emerging as targets for next-generation immunotherapies, including T-cell receptor (TCR) and Peptide-Centric CAR-T cell therapies. APOBEC3A, a cytidine deaminase frequently upregulated in cancer, introduces C>T mutations that can give rise to novel neoantigens. We developed a computational pipeline to simulate APOBEC3A-like mutations across 35,608 coding transcripts and identify high-confidence 9-mer peptides suitable for immune targeting.

From 69,247 mutation sites, we generated 58,591 unique mutant peptides. We used 77 breast cancer immunopeptidomics datasets to detect their presence (TesorAI) narrowing this to 2,530; DepMap CERES scores prioritized 322 from essential genes. NetMHCpan 4.1 predicted strong HLA binding for 69 peptides, and NetMHCStabPan further refined this to 13 stable candidates.

We also began extending this pipeline to transposable elements (TEs), which are frequent APOBEC substrates and may harbor overlooked neoantigen sources. This work provides a framework for identifying mutation-derived peptides with strong MHC presentation potential—critical for expanding the scope of CAR-T cell therapies beyond shared antigens.

Tools & Methods

Languages: Python, Bash/Shell
Python stack: pandas, NumPy, Biopython (FASTA/sequence parsing), matplotlib (quick QC plots)
Data handling: large tabular joins/filters, deduplication, mutation-site indexing, reproducible file outputs (CSV/TSV/FASTA)
HPC workflows: job scheduling (Slurm), batch runs, environment management (conda/modules), large file I/O on shared storage
Peptide generation logic: overlapping 9-mer windows around mutation sites; WT vs mutant comparisons; peptide library construction
Immunology-facing outputs: formatting for downstream prediction & prioritization tools (binding/presentation/stability scoring)
Integration / datasets: immunopeptidomics screening (TesorAI workflow outputs), essentiality prioritization (DepMap CERES), HLA binding prediction (NetMHCpan / NetMHCStabPan)
Reproducibility: parameterized scripts, consistent naming/versioned outputs, traceable mapping (mutation → protein → peptide)

Contact

If you have questions about this project, feel free to reach out to me here: Kyle.Kylenguyen@nyulangone.org.