July 22, 2015
A study published in The American Journal of Human Genetics describes a concept-recognition procedure that analyses the frequencies of Human Phenotype Ontology (HPO) disease annotations in over five million PubMed abstracts. The authors performed the analysis in five major steps, quoted below:
“(1) Bio-LarK was used to analyse the PubMed-MEDLINE 2014 corpus, which resulted in a total of 5,136,645 abstracts annotated with MeSH terms and phenotypic features.
(2) For each of 3,145 resulting diseases, the frequency and specificity of HPO terms found in the abstract were used for inferring phenotypic annotations.
(3) These annotations were used for producing disease models for each of the diseases.
(4) Medical validation of the annotations was performed on the basis of disease, phenotype, and SNP annotations in GWAS Central for phenotype sharing in common disease.
(5) Validation with OMIM, Orphanet, and DO was used for assessing phenotype sharing between rare and common diseases linked to the same locus.”
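Step (2) is the core of the procedure: an HPO term is attached to a disease when it appears often in that disease's abstracts (frequency) and rarely in the abstracts of other diseases (specificity). The summary above does not give the authors' exact scoring formula, so the Python sketch below is only an illustration under stated assumptions. It swaps in a TF-IDF-style score, and the function name `infer_annotations`, the `min_score` threshold, and the identifiers `D1`, `D2`, `HP:A`, `HP:B` and `HP:C` are hypothetical placeholders rather than real MeSH or HPO entries.

```python
import math
from collections import Counter, defaultdict


def infer_annotations(abstracts, min_score=0.1):
    """Score HPO terms per disease with a TF-IDF-style heuristic.

    abstracts: iterable of (disease_id, hpo_terms) pairs, one per abstract.
    Returns {disease_id: [(hpo_term, score), ...]}, highest score first.
    This is an assumed scoring scheme, not the one used in the paper.
    """
    abstracts_per_disease = Counter()
    term_counts = defaultdict(Counter)     # disease -> term -> abstract count
    diseases_per_term = defaultdict(set)   # term -> diseases it occurs with

    for disease, terms in abstracts:
        abstracts_per_disease[disease] += 1
        for term in set(terms):            # count each term once per abstract
            term_counts[disease][term] += 1
            diseases_per_term[term].add(disease)

    n_diseases = len(abstracts_per_disease)
    annotations = {}
    for disease, n_abstracts in abstracts_per_disease.items():
        scored = []
        for term, count in term_counts[disease].items():
            # Frequency: share of this disease's abstracts mentioning the term.
            frequency = count / n_abstracts
            # Specificity: zero when the term occurs with every disease.
            specificity = math.log(n_diseases / len(diseases_per_term[term]))
            score = frequency * specificity
            if score >= min_score:
                scored.append((term, score))
        annotations[disease] = sorted(scored, key=lambda p: (-p[1], p[0]))
    return annotations


# Toy corpus with placeholder identifiers (not real MeSH or HPO IDs).
toy_corpus = [
    ("D1", {"HP:A", "HP:C"}),
    ("D1", {"HP:A"}),
    ("D2", {"HP:B", "HP:C"}),
]
toy_annotations = infer_annotations(toy_corpus)
```

In the toy corpus, `HP:C` occurs with both diseases, so its specificity is zero and it is dropped, while `HP:A` and `HP:B` are each kept for the single disease they occur with.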
The authors state that, by using this procedure, the HPO has been able to compile “250,000 phenotypic annotations for over 10,000 rare and common diseases.” They believe that rare-disease phenotypes will prove useful for evaluating and comparing the phenotypic overlap between Mendelian and common diseases. This, they emphasise, is especially important when common and rare diseases share risk alleles or show phenotypic overlap because they are linked to the same genomic locus.