Tool Development

The Informatics team is developing and releasing a series of tools covering informatics environments, genomics and proteomics tools, and knowledge discovery tools.


fabric: flexible architecture for building research informatics collaborations

FABRIC is an environment that offers a service-oriented research toolbox that investigators, clinicians, and patient advocates can use to access a wide array of data repositories integrated with customizable query tools. The environment is scalable: its key elements (e.g., Galaxy, REDCap, R, HealthFacts, Office 365) can be replicated and made available simultaneously to multiple users with varying requirements for data and for its manipulation, analysis, and reporting. FABRIC is a secure research cloud platform that readily interconnects clinicians, researchers, and the patient community across the national CTR network without disruptive alterations to research workflows or to the way patients find information. Its release is upcoming!

hive: high-performance integrated virtual environment platform

HIVE is a cloud-based environment optimized for the storage and analysis of extra-large data, such as biomedical data, clinical data, next-generation sequencing (NGS) data, mass spectrometry files, confocal microscopy images, post-market surveillance data, medical recall data, and many others. HIVE provides secure web access for authorized users to deposit, retrieve, annotate and compute on Big Data, and analyze the outcomes using web user interfaces.

This tool can be accessed through the Biochemistry and Molecular Medicine Department in the GW School of Medicine and Health Sciences. Click here.  



pathostat: statistical microbiome analysis package

PathoStat is a statistical package that performs statistical microbiome analysis on metagenomics results from sequencing data samples. This tool can be accessed through a GitHub repository and is being integrated into Galaxy, an open source, web-based platform for data-intensive biomedical research that will be one of the primary tools plugged into the FABRIC environment. It is also available as an R Shiny app.



Telescope is a software package for identifying human endogenous retroviruses, characterizing their expression patterns, and mapping them to the human genome. Telescope is still under development.


PathoScope is a complete bioinformatics framework for rapidly and accurately quantifying the proportions of reads from individual microbial strains present in metagenomic sequencing data from environmental or clinical samples. This tool can be accessed through a GitHub repository.
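To make the idea of quantifying proportions from ambiguously mapped reads concrete, here is a minimal sketch in Python. It is an illustration only and not PathoScope's implementation: it assumes each read comes with a hypothetical dictionary of per-genome alignment likelihoods, and it uses a plain expectation-maximization loop to share ambiguous reads among candidate genomes. The function name and input layout are assumptions made for this example.

```python
def estimate_proportions(read_scores, n_iter=50):
    """Estimate genome proportions from reads that may map to several genomes.

    read_scores is a hypothetical input format: one dict per read, mapping a
    genome name to that read's alignment likelihood against the genome.
    """
    genomes = sorted({g for scores in read_scores for g in scores})
    if not genomes:
        return {}
    # Start from equal proportions for every genome seen in the alignments.
    pi = {g: 1.0 / len(genomes) for g in genomes}
    for _ in range(n_iter):
        totals = {g: 0.0 for g in genomes}
        for scores in read_scores:
            # E-step: split each read among its genomes, weighted by the
            # current proportions and the read's alignment likelihoods.
            weights = {g: pi[g] * s for g, s in scores.items()}
            norm = sum(weights.values())
            if norm == 0:
                continue
            for g, w in weights.items():
                totals[g] += w / norm
        assigned = sum(totals.values())
        if assigned == 0:
            break
        # M-step: the new proportions are the normalized fractional counts.
        pi = {g: totals[g] / assigned for g in genomes}
    return pi
```

With three reads unique to genome A, one read unique to genome B, and one read that aligns equally well to both, the estimate settles at roughly 0.75 for A and 0.25 for B, because the ambiguous read is credited mostly to the genome that already has more support.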

Key publications

Francis, O.E., Bendall, M., Manimaran, S., Hong, C., Clement, N.L., Castro-Nallar, E., Snell, Q., Schaalje, G.B., Clement, M.J., Crandall, K.A. and Johnson, W.E., 2013. Pathoscope: species identification and strain attribution with unassembled sequencing data. Genome Research, 23(10), pp.1721-1729.

Hong, C., Manimaran, S., Shen, Y., Perez-Rogers, J.F., Byrd, A.L., Castro-Nallar, E., Crandall, K.A. and Johnson, W.E., 2014. PathoScope 2.0: a complete computational framework for strain identification in environmental or clinical sequencing samples. Microbiome, 2(1), p.33.

Byrd, A.L., Perez-Rogers, J.F., Manimaran, S., Castro-Nallar, E., Toma, I., McCaffrey, T., Siegel, M., Benson, G., Crandall, K.A. and Johnson, W.E., 2014. Clinical PathoScope: rapid alignment and filtration for accurate pathogen identification in clinical samples using unassembled sequencing data. BMC Bioinformatics, 15(1), p.262.




HaPhPipe is a bioinformatics pipeline for taking targeted amplicon sequence data from viral studies and performing 1) quality control, 2) contig assembly, 3) consensus sequence creation, and 4) haplotype calling for downstream population genetic and molecular evolutionary studies. HaPhPipe is still under development.
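As an illustration of what the consensus sequence step involves, the following Python sketch builds a consensus by majority vote over reads that are already aligned to the same coordinates. It is a simplified example under stated assumptions, not HaPhPipe's code: the function name, the gap character, and the `min_depth` parameter are hypothetical, and real pipelines also weigh base qualities and handle insertions.

```python
from collections import Counter


def consensus(aligned_reads, min_depth=1, gap="-"):
    """Majority-vote consensus over equal-length, pre-aligned reads.

    Positions covered by fewer than min_depth non-gap bases are reported as N.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")
    result = []
    # zip(*reads) walks the alignment one column (position) at a time.
    for column in zip(*aligned_reads):
        bases = [base for base in column if base != gap]
        if len(bases) < min_depth:
            result.append("N")
            continue
        # Take the most frequent base at this position.
        result.append(Counter(bases).most_common(1)[0][0])
    return "".join(result)
```

Calling `consensus(["ACGT", "ACGA", "AC-T"])` returns `"ACGT"`: the third read has a gap at position three, and the disagreement at position four is resolved in favor of the base seen in two of the three reads.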


healthfacts database

HealthFacts is a relational database of EMR data (from Cerner) covering 414,435,400 patient encounters that took place in 300 hospitals and clinics across the U.S. It provides fine-grained data on diagnoses, lab and microbiology tests, procedures, and medications. HealthFacts will be one of the first data sources hosted on the FABRIC platform.

r shiny for healthfacts

This is an exploratory preview tool that allows users to upload a subset of HealthFacts data and explore its features and their distributions. It can also be used to refine cohort selection. This tool will be accessible through FABRIC and as an R Shiny app.



children like mine

This tool is under development to enable clinicians and parents to explore the aggregated experiences of children with the same health issues as their own patients/children.  Users begin by searching on multiple aspects of their child’s case, such as diagnosis and drug treatment.  The application then provides visualized data of similar children, such as age and gender distributions, and outcome information, such as mortality. This tool will be accessible through FABRIC and as an App through its Web interface.
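The search-then-summarize flow described above can be sketched in a few lines of Python. This is a hypothetical illustration rather than the application's code: the record fields (`diagnosis`, `drugs`, `gender`, `age_years`, `died`) and the function name are assumptions made for the example, and any records passed in are expected to be de-identified.

```python
from collections import Counter


def similar_children_summary(records, diagnosis, drug=None):
    """Filter records by diagnosis (and optionally drug), then aggregate.

    Each record is a dict with hypothetical keys: diagnosis, drugs (a list),
    gender, age_years, and died (a boolean).
    """
    cohort = [
        r for r in records
        if r["diagnosis"] == diagnosis and (drug is None or drug in r["drugs"])
    ]
    n = len(cohort)
    return {
        "n": n,
        # Distributions that a front end could render as charts.
        "gender_counts": dict(Counter(r["gender"] for r in cohort)),
        "age_counts": dict(Counter(r["age_years"] for r in cohort)),
        # Outcome information; None when no similar children were found.
        "mortality_rate": (sum(r["died"] for r in cohort) / n) if n else None,
    }
```

Returning only aggregated counts and rates, never individual rows, matches the stated goal of exploring the aggregated experiences of similar children.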


nlp tools

The team is developing a variety of text mining tools and modules. These are shared, along with the Downloads, Documentation, Bug Reports, Projects, cNLP EcoSystem Projects and Datasets, through a Web site hosted on Amazon services.

predictive models

The team is developing a variety of machine learning modules, including code and executables. Some of these are shared through a GitHub repository, and others are described in publications.



biomuta and bioxpress

BioMuta is a database of cancer-associated single-nucleotide variations, while BioXpress is a database of cancer-associated differentially expressed genes and microRNAs. These tools can be accessed through the Biochemistry and Molecular Medicine Department in the GW School of Medicine and Health Sciences. Click here.