Co-reporter:Yu Wang;Yanzhi Guo;Xuemei Pu;Menglong Li
RSC Advances (2011-Present) 2017 vol. 7(Issue 31) pp:18937-18945
Publication Date(Web):2017/03/28
DOI:10.1039/C6RA27161H
Molecular recognition features (MoRFs) are relatively short segments (10–70 residues) within intrinsically disordered regions (IDRs) that can undergo disorder-to-order transitions during binding to partner proteins. Since MoRFs play key roles in important biological processes such as signaling and regulation, identifying them is crucial for a full understanding of the functional aspects of the IDRs. However, given the relative sparseness of MoRFs in protein sequences, the accuracy of the available MoRF predictors is often inadequate for practical usage, which leaves a significant need and room for improvement. In this work, we developed a novel sequence-based predictor for MoRFs using a support vector machine (SVM) algorithm. First, we constructed a comprehensive dataset of annotated MoRFs covering a wide range of lengths, between 10 and 70 residues. Our method first used the flanking regions to define the negative samples. Then, amino acid composition (AAC) and two previously unexplored features, namely composition, transition and distribution (CTD) and the K nearest neighbors (KNN) score, were used to characterize the sequence information of MoRFs. Finally, using five-fold cross-validation, an overall accuracy of 75.75% was achieved through feature evaluation and optimization. When applied to an independent test set of 110 proteins, the method also yielded a promising accuracy of 64.98%. Additionally, in external validation on the negative samples, our method still shows performance comparable to that of other existing methods. We believe that this study will be useful in elucidating the mechanism of MoRFs and facilitating hypothesis-driven experimental design and validation.
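As a rough illustration of the simplest of the three feature types named above, amino acid composition (AAC) can be sketched as the frequency of each of the 20 standard residues in a segment. The function name, the alphabetical residue order and the choice to ignore non-standard residues are assumptions made for this sketch; the abstract does not describe the authors' implementation.

```python
from collections import Counter

# The 20 standard residues, ordered alphabetically by one-letter code (an assumed ordering).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a 20-element list of residue frequencies for one sequence segment."""
    counts = Counter(sequence.upper())
    # Assumption: residues outside the standard 20 (e.g. 'X') are ignored.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]
```

A vector like this, computed for each candidate segment, is the kind of fixed-length input that a classifier such as an SVM can consume.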
Co-reporter:Yongning Yang, Fanfan Xie, Bing Yan, Yi Li, Junmei Xu, Yuan Liu, Zhining Wen, Menglong Li
Chemometrics and Intelligent Laboratory Systems 2017 Volume 170 pp:
Publication Date(Web):15 November 2017
DOI:10.1016/j.chemolab.2017.08.012
•We propose a Raman spectrum-based model for multi-classification of tumor subtypes.
•Predictive accuracy for each tumor subtype is higher than 0.85.
•This study will be helpful for diagnosing salivary gland tumors preoperatively.
Pleomorphic adenoma (PA), Warthin's tumor (WT) and mucoepidermoid carcinoma (MEC) are three common subtypes of salivary gland tumors that occur in the parotid gland. Accurately diagnosing the subtypes of parotid tumors plays a vital role in surgical treatment. Unfortunately, current studies mainly focus on the binary classification of parotid tumors, and their preoperative multi-classification remains underexplored. To broaden the application area of the predictive models and facilitate clinical preoperative diagnosis, we propose a multi-classification model, constructed by combining the variable combination population analysis (VCPA) algorithm with partial least squares regression (PLSR), to simultaneously discriminate the three subtypes of parotid tumors as well as normal parotid gland tissue based on the Raman spectra of the tissue samples. In addition, we investigated the impact of generating Raman spectra from different sampling locations on the reliability of the predictive models. For the validation set, the overall accuracy in predicting the subtypes of parotid tumors and the normal parotid gland tissue was 0.867. Similarly, the accuracies achieved by the models constructed with the Raman spectra from two different sampling locations were 0.877 and 0.883, respectively, indicating that the sampling location has only a minor influence on the predictive models. Our findings can be helpful for establishing a method for rapid preoperative diagnosis of salivary gland tumors in the clinic. Moreover, the characteristic wavenumbers used in model construction were highly associated with variations in the structures and contents of nucleic acids, collagen, proteins, lipids and DNA/RNA in gland tissue, which revealed the main differences among the three types of parotid tumors and can be conducive to a better understanding of their molecular mechanisms.
Co-reporter:Yiming Wu, Qifan Kuang, Yongcheng Dong, Ziyan Huang, Yan Li, Yizhou Li, Menglong Li
Chemometrics and Intelligent Laboratory Systems 2016 Volume 156() pp:224-230
Publication Date(Web):15 August 2016
DOI:10.1016/j.chemolab.2016.05.012
Thanks to high-throughput sequencing technologies, many single nucleotide variants (SNVs) among individuals have been detected. SNVs in gene coding regions are known to possibly disrupt protein functions. For this reason, many efforts have been devoted to sorting deleterious SNVs from benign ones. In general, the features used in past studies can be categorized into codon level, peptide level and protein level. While those at the peptide level have been in widespread use, few works have carried out a comprehensive analysis that combines information from all three levels. In the present work, we incorporated both codon and protein level information with peptide level information to predict disease-related SNVs. Taking advantage of the combined multi-level features, our method exhibited competitive performance against seven well-known classifiers. Additionally, by incorporating selective pressure scores and protein–protein interaction (PPI) information, we found that functionally important proteins were protected through a pressure-resistant mechanism during evolution. Although critical proteins were clearly associated with more deleterious SNVs, these pathogenic SNVs tended to be under higher selective pressure than the benign variants. These results support ongoing research on the relation between genotype and phenotype.
Co-reporter:Jiesi Luo, Wenling Li, Zhongyu Liu, Yanzhi Guo, Xuemei Pu and Menglong Li
Analyst 2015 vol. 140(Issue 9) pp:3048-3056
Publication Date(Web):13 Mar 2015
DOI:10.1039/C5AN00311C
Many Gram-negative bacteria use the type I secretion system (T1SS) to translocate a wide range of substrates (type I secreted RTX proteins, T1SRPs) from the cytoplasm across the inner and outer membrane in one step to the extracellular space. Since T1SRPs play an important role in pathogen–host interactions, identifying them is crucial for a full understanding of the pathogenic mechanism of T1SS. However, experimental identification is often time-consuming and expensive. In the post-genomic era, it becomes imperative to predict new T1SRPs using information from the amino acid sequence alone when new proteins are being identified in a high-throughput mode. In this study, we report a two-level method for the first attempt to identify T1SRPs using sequence-derived features and the random forest (RF) algorithm. At the full-length sequence level, the results show that the unique feature of T1SRPs is the presence of variable numbers of the calcium-binding RTX repeats. These RTX repeats have a strong predictive power and so T1SRPs can be well distinguished from non-T1SRPs. At another level, different from that of the secretion signal, we find that a sequence segment located at the last 20–30 C-terminal amino acids may contain important signal information for T1SRP secretion because obvious differences were shown between the corresponding positions of T1SRPs and non-T1SRPs in terms of amino acid and secondary structure compositions. Using five-fold cross-validation, overall accuracies of 97% at the full-length sequence level and 89% at the secretion signal level were achieved through feature evaluation and optimization. Benchmarking on an independent dataset, our method could correctly predict 63 and 66 of 74 T1SRPs at the full-length sequence and secretion signal levels, respectively. We believe that this study will be useful in elucidating the secretion mechanism of T1SS and facilitating hypothesis-driven experimental design and validation.
Co-reporter:Xiuchan Xiao, Xiaojun Zeng, Yuan Yuan, Nan Gao, Yanzhi Guo, Xuemei Pu and Menglong Li
Physical Chemistry Chemical Physics 2015 vol. 17(Issue 4) pp:2512-2522
Publication Date(Web):21 Nov 2014
DOI:10.1039/C4CP04528A
G protein coupled receptors (GPCRs) play a crucial role in regulating signal recognition and transduction through their activation. The conformation transition in the activation pathway is of particular importance for their function. However, it has been poorly elucidated due to experimental difficulties in determining the conformations and the time limitation of conventional molecular dynamics (CMD) simulation. Thus, in this work, we employed a targeted molecular dynamic (TMD) simulation to study the activation process from an inactive structure to a fully active one for β2 adrenergic receptor (β2AR). As a reference, 110 ns CMD simulations on wild β2AR and its D130N mutant were also carried out. TMD results show that there is at least an intermediate conformation cluster in the activation process, evidenced by the principal component analysis and the structural and dynamic differences of some important motifs. It is noteworthy that the activation of the ligand binding site lags the G-protein binding site, displaying uncoupled correlation. Comparisons between the CMD and TMD results show that the D130N mutation significantly speeds up ICL2 and key ionic lock to enter into the intermediate state, which to some extent facilitates the activation involved in the NPxxY, DRY region and the separation between TM3 and TM6. However, the contribution from the D130N mutation to the activation of the ligand binding site could not be observed within the scale of 110 ns time. These observations could provide novel insights into previous studies for better understanding of the activation mechanism for β2AR.
Co-reporter:Xin Xu, Qifan Kuang, Yongqing Zhang, Huijun Wang, Zhining Wen and Menglong Li
Analytical Methods 2015 vol. 7(Issue 10) pp:4111-4122
Publication Date(Web):10 Apr 2015
DOI:10.1039/C5AY00699F
Recently, age-related changes in functional connectivity have gained more attention for investigating functional changes across development. In this study, we examine functional connectivity, as derived from resting state functional magnetic resonance imaging (R-fMRI), in 90 cortical and subcortical regions in two healthy groups in young adulthood (ages 18–28 years) and late adulthood (ages 63–73 years). Comparing the processes for constructing a functional network, we found that the network in young adulthood was more easily fully connected than that in late adulthood, indicating that the central regions, frontal lobe, parietal lobe and limbic lobe possibly occupy more resources in late adulthood. We confirmed that the brain in both young adulthood and late adulthood had a “small-world” organization, and that there was a further loss of small-world characteristics in late adulthood. Additionally, we found that late adulthood exhibited a more social-like organization of the brain functional network than young adulthood. Furthermore, the connectivity density showed a general decrease in most brain areas, but only the temporal lobe and occipital lobe showed a decrease in connectivity strength in late adulthood. Conversely, the parietal lobe showed an increase in connectivity density in late adulthood. Our study provides additional support for elucidating the functional changes of the brain across development, and characterizing these changes will lead to a better understanding of the cognitive decline that occurs with advancing age.
Co-reporter:Rong Li, Yongcheng Dong, Qifan Kuang, Yiming Wu, Yizhou Li, Min Zhu, Menglong Li
Chemometrics and Intelligent Laboratory Systems 2015 Volume 144() pp:71-79
Publication Date(Web):15 May 2015
DOI:10.1016/j.chemolab.2015.03.013
•Four novel drug features are proposed for the prediction of drug–ADR associations.
•A novel matrix-completion method called inductive matrix completion (IMC) was applied to predict ADRs.
•IMC performs outstandingly in predicting ADRs for both well-known drugs and less-characterized ones.
•The cosine similarity between drugs based on the drug–target adjacency matrix proved to be prominent for IMC.
Correctly and efficiently identifying associations between drugs and adverse drug reactions (ADRs) is critically important for drug development and clinical safety. Because of their low costs and high performance, many statistical and machine learning methods have recently been implemented to identify these associations. Most existing computer-aided methods for predicting ADRs rely mainly on known drug–ADR associations and achieve the expected performance on overall data sets. However, they fail to predict ADRs for less-characterized drugs because of insufficient prior knowledge. To solve this problem, we present a novel method with new drug features. In this paper, we first applied a novel matrix-completion method called inductive matrix completion (IMC) to predict ADRs by combining features for drugs and ADRs. Then, similarities between drugs were calculated in different ways based on drug–target interactions. Finally, comprehensive validations were carried out to compare the new approach with four other typical approaches on various drug features. Comparison of approaches and features showed that, whether evaluated by tenfold cross-validation or by prospective validation, IMC consistently performed well on both types of drugs, well-known or less studied. Moreover, the cosine similarity of drugs was prominent for IMC. Therefore, our method excels at predicting ADRs for less-characterized drugs.
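The cosine similarity between drugs mentioned above can be sketched from a binary drug–target adjacency matrix, in which each row is a drug and each column a target. This is only a minimal illustration: the function names, the toy matrix and the convention of returning 0 for a drug with no known targets are assumptions, and the sketch does not reproduce the IMC model itself.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Assumption: a drug with no known targets gets similarity 0 to everything.
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def drug_similarity_matrix(adjacency):
    """Pairwise cosine similarity between the rows (drugs) of a drug-target matrix."""
    n = len(adjacency)
    return [[cosine_similarity(adjacency[i], adjacency[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical toy data: 3 drugs x 3 targets, 1 = known interaction.
toy_adjacency = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
toy_similarity = drug_similarity_matrix(toy_adjacency)
```

In the toy matrix, the first two drugs share one target and are therefore partially similar, while the third drug shares none and has similarity 0 to both.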
Co-reporter:Yu Wang;Yanzhi Guo;Qifan Kuang;Xuemei Pu
Journal of Computer-Aided Molecular Design 2015 Volume 29( Issue 4) pp:349-360
Publication Date(Web):2015 April
DOI:10.1007/s10822-014-9827-y
The assessment of binding affinity between ligands and target proteins plays an essential role in the drug discovery and design process. As an alternative to widely used scoring approaches, machine learning methods have also been proposed for fast prediction of binding affinity with promising results, but most of them were developed as all-purpose models that disregard the specific functions of different protein families, even though proteins from different functional families always have different structures and physicochemical features. In this study, we proposed a random forest method to predict protein–ligand binding affinity based on a comprehensive feature set covering protein sequence, binding pocket, ligand structure and intermolecular interaction. Feature processing and compression were carried out separately for each protein family dataset; the results indicate that different features contribute to different models, so an individual representation for each protein family is necessary. Three family-specific models were constructed for three important protein target families: HIV-1 protease, trypsin and carbonic anhydrase. For comparison, two generic models covering diverse protein families were also built. The evaluation results show that the models built on family-specific datasets outperform those built on the generic datasets: on the test sets, the Pearson correlation coefficients (Rp) are 0.740, 0.874 and 0.735 and the Spearman correlation coefficients (Rs) are 0.697, 0.853 and 0.723 for HIV-1 protease, trypsin and carbonic anhydrase, respectively. Comparisons with other methods further demonstrate that individual representation and model construction for each protein family is a more reasonable way to predict the affinity for one particular protein family.
Co-reporter:Li Chen, Yongzhi Zhang, Chaohong Lin, Wen Yang, Yan Meng, Yong Guo, Menglong Li and Dan Xiao
Journal of Materials Chemistry A 2014 vol. 2(Issue 25) pp:9684-9690
Publication Date(Web):21 Mar 2014
DOI:10.1039/C4TA00501E
A facile, economical and effective method to produce hierarchically porous nitrogen-rich carbon (HPNC) from wheat straw has been reported. Acid pretreatment is introduced before KOH activation, and plays the role of promoting the formation of thinner pore walls. Without any N-doping, the N content is as high as 5.13%. The HPNC when used as an anode for Li-ion batteries exhibits a superior specific capacity of 1470 mA h g−1 at 0.037 A g−1, and possesses an ultrahigh rate capability of 344 mA h g−1 at 18.5 A g−1. Even at an extremely high current density of 37 A g−1, the reversible capacity is still as high as 198 mA h g−1.
Co-reporter:Yuelong Wang, Runyu Jing, Yongpan Hua, Yuanyuan Fu, Xu Dai, Liqiu Huang and Menglong Li
Analytical Methods 2014 vol. 6(Issue 17) pp:6832-6840
Publication Date(Web):17 Jun 2014
DOI:10.1039/C4AY01240B
Multi-family enzymes are of great importance in life, disease and other domains. However, in enzyme classification, the information on multi-family enzymes is usually removed from the dataset because of the limitations of traditional single-label prediction methods. In order to predict multiple classes of multi-family enzymes, we adopted two multi-label learning algorithms, namely RAkEL-RF and MLKNN, and two types of protein descriptors, namely CTD and PseAAC, to generate four predictors: RAkEL-RF-CTD, RAkEL-RF-PseAAC, MLKNN-CTD and MLKNN-PseAAC. When the four predictors were tested on the training set with 10-fold cross validation, the overall success rates reached 97.99%, 96.07%, 96.01% and 95.31%, respectively. For the independent test set, the corresponding rates reached 97.57%, 95.03%, 95.9% and 93.9%, respectively. The extremely small difference between the two sets for each predictor, together with the relatively high accuracy, demonstrates the prediction capability and robustness of our predictors. In addition, three of seven pairs of homologous enzymes with different functions and eighteen of twenty-three distantly related enzymes with a similar family were correctly classified by the RAkEL-RF-CTD predictor. These results indicate the extensive applicability of our predictors.
Co-reporter:Yanping Jiang, Yizhou Li, Qifan Kuang, Ling Ye, Yiming Wu, Lijun Yang and Menglong Li
Analytical Methods 2014 vol. 6(Issue 8) pp:2692-2698
Publication Date(Web):21 Jan 2014
DOI:10.1039/C3AY42101E
Adverse drug reactions (ADRs) are one of the main issues restraining the development and clinical applications of new drugs. Owing to complicated molecular mechanisms of ADRs, various experimental and computational methods have been employed to detect them. It has been reported that a number of ADRs are induced by a series of actions triggered by drugs or their reactive metabolites that bind to therapeutic targets or other proteins involved in drug metabolism. The identification of these ADR-related proteins (ADRRPs) is an available avenue to explore adverse reactions of drugs. In this study, the human protein–protein interaction (PPI) network was constructed as a powerful tool for studying the molecular mechanisms of ADRs. Based on such a network, five network topological properties were calculated to characterize proteins quantitatively. Then a random forest model for ADRRP prediction was built which was dependent on these properties. The prediction model yielded a satisfactory result with a sensitivity of 87.3%, a specificity of 86.1% and an overall accuracy of 86.8%. Finally, text mining was applied to verify the predictions. Some of the predicted ADRRPs have been proved to be involved in regulating ADRs by experimental studies. The results suggested that the genome-wide human interaction network provides us with an effective channel for discovering ADRRPs.
Co-reporter:Jiesi Luo;Yanzhi Guo;Yun Zhong;Duo Ma
Journal of Computer-Aided Molecular Design 2014 Volume 28( Issue 6) pp:619-629
Publication Date(Web):2014 June
DOI:10.1007/s10822-014-9746-y
Protein–protein interactions (PPIs) play crucial roles in diverse cellular processes. There are different types of PPIs, distinguished by composition, affinity and whether the association is permanent or transient. Analyzing the diversity of PPIs at the atomic level is crucial for uncovering the key features governing the interactions involved. Systematic physico-chemical and conformational studies were carried out on interfaces involved in different PPIs, including crystal packing, weak transient heterodimers, weak transient homodimers, strong transient heterodimers and homodimers. The comparative analysis shows that the interfaces tend to be larger, less planar, and more tightly packed as the interaction strength increases. Meanwhile, the strong interactions undergo greater conformational changes than the weak ones, involving main chains as well as side chains. Finally, using 18 features derived from our analysis, we developed a support vector regression model to predict the binding affinity with promising results, which further demonstrates the reliability of our studies. We believe this study will be of great help in understanding the mechanisms of diverse PPIs more thoroughly.
Co-reporter:Juan Zhang, Lifang Zhang, Gang Yang, Di Wu, Lina Jiang, Liqiu Huang, Zhining Wen, Menglong Li
Chemometrics and Intelligent Laboratory Systems 2013 Volume 126() pp:100-107
Publication Date(Web):15 July 2013
DOI:10.1016/j.chemolab.2013.05.004
•Heterogeneity of clinical samples is an obstacle to identifying disease-related genes.
•We propose nonnegative matrix factorization for gene expression data deconvolution.
•The deconvoluted gene expression profile differs more between clinical conditions.
•More disease-related genes are found by using the deconvoluted gene expression profile.
•This study will be of great benefit to clinical research.
DNA microarray technology is now widely used in clinical research for generating gene expression profiles from biological samples. Based on the gene expression data, identifying differentially expressed genes (DEGs) between two groups of phenotypes or distinct biological conditions is one of the crucial steps in discovering disease biomarkers. However, clinical samples usually contain multiple cell types. This heterogeneous cell population significantly affects the gene expression patterns and masks the biological difference between the two groups of compared samples. Using the mixed gene expression profile of multiple cell types, instead of that of the cell type of interest, to identify DEGs seriously decreases the sensitivity of discovering disease-related genes. Therefore, we proposed using nonnegative matrix factorization (NMF), an unsupervised learning method that has been successfully applied in bioinformatics research, to extract the actual gene expression profile of the cell type of interest from the mixed profile of a heterogeneous cell population. In our study, we first evaluated the performance of the NMF algorithm in the deconvolution of gene expression data by using a well-controlled data set comprising the gene expression profiles from three tissues and eleven different mixtures with known proportions. Then, NMF was applied to human whole-blood gene expression data generated from 24 kidney transplant recipients to estimate the pure gene expression profiles of five major blood cell types, which were subsequently used to identify the genes related to acute rejection of kidney transplants. The results showed that the number of DEGs (probe sets) identified from each of the gene expression profiles of the five blood cell types, between stable post-transplant kidney transplant recipients and those experiencing acute transplant rejection, was greater than that identified from whole-blood samples. Finally, the DEGs were uploaded to Gene Set Enrichment Analysis (GSEA) for the enrichment of signaling pathways and gene ontology terms. We found that several enriched pathways and gene ontology terms were significantly associated with renal transplant rejection when the uploaded DEGs were identified from the two high-content blood cell types, while no pathways or gene ontology terms were enriched when the uploaded DEGs were identified from whole-blood samples. Our results indicate that using the gene expression profile of a specific cell type deconvoluted by NMF can efficiently increase the sensitivity of discovering potentially disease-related genes. In addition, this unsupervised method can estimate the pure gene expression profile of a specific cell type from the mixtures with no prior knowledge of cell proportions.
Co-reporter:Jing Sun, Runyu Jing, Yuelong Wang, Tuanfei Zhu, Menglong Li, Yizhou Li
Computational Biology and Chemistry 2013 Volume 47() pp:8-15
Publication Date(Web):December 2013
DOI:10.1016/j.compbiolchem.2013.06.002
•PPM-Dom could predict the exact positions of each domain in any query protein.
•PPM-Dom could distinguish different domains in the same query sequence from each other.
•PPM-Dom could identify each part of discontinuous domain regions.
•The number of domains can be inferred directly from the positions of the domains.
Domains are the structural basis of the physiological functions of proteins, and predicting them benefits the study of protein structure and function. This article proposes a new, fully automatic prediction method, PPM-Dom (Domain Position Prediction Method), for predicting the positions of domains in a target protein from its atomic coordinates. The presented method integrates complex networks, community division, and a fuzzy mean operator (FMO). The whole sequence is divided into potential domain regions by the complex network and community division, and FMO makes the final determination of the domain positions. The method can predict which regions form a domain structure and which are unstructured, based on the atomic coordinate information of a completely new query sequence, and can separate different domains in the same query sequence from each other. When evaluated on an independent testing dataset, PPM-Dom reached a prediction accuracy of 91.41%, a sensitivity of 96.12% and a specificity of 92.86%. The PPM-Dom toolkit is freely available at http://cic.scu.edu.cn/bioinformatics/PPMDom.zip.
Co-reporter:Yongqing Zhang, Danling Zhang, Gang Mi, Daichuan Ma, Gongbing Li, Yanzhi Guo, Menglong Li, Min Zhu
Computational Biology and Chemistry 2012 Volume 36() pp:36-41
Publication Date(Web):February 2012
DOI:10.1016/j.compbiolchem.2011.12.003
In proteins, the number of interacting pairs is usually much smaller than the number of non-interacting ones, so an imbalanced data problem arises in the prediction of protein–protein interactions (PPIs). In this article, we introduce two ensemble methods to address the imbalanced data problem. These ensemble methods combine a cluster-based under-sampling technique with fusion classifiers. We then evaluate the ensemble methods using a dataset from the Database of Interacting Proteins (DIP) with 10-fold cross validation. All the prediction models achieve an area under the receiver operating characteristic curve (AUC) of about 95%. Our results show that the ensemble classifiers are quite effective in predicting PPIs; we also draw some valuable conclusions on the performance of ensemble methods for PPIs with imbalanced data. The prediction software and all datasets employed in the work can be obtained for free at http://cic.scu.edu.cn/bioinformatics/Ensemble_PPIs/index.html.
► Two ensemble methods are proposed to overcome the imbalanced data problem in PPIs.
► These methods combine a cluster-based under-sampling technique and fusion classifiers.
► The performance of these methods is analyzed with different base classifiers.
► An easy-to-use web server has been developed.
Co-reporter:Lijuan Zhu, Wei Yang, Yan Yan Meng, Xiuchan Xiao, Yanzhi Guo, Xuemei Pu, and Menglong Li
The Journal of Physical Chemistry B 2012 Volume 116(Issue 10) pp:3292-3304
Publication Date(Web):February 9, 2012
DOI:10.1021/jp3002405
The use of enzymes in nonaqueous solvents has been one of the most exciting facets of enzymology in recent times; however, how organic solvent and essential water influence the structure and function of enzymes has not been satisfactorily explained by experiments, which limits further application. Herein, we used molecular dynamics (MD) simulation to study γ-chymotrypsin in two types of media (viz., acetonitrile media with inclusion of 151 crystal water molecules and aqueous solution). On the basis of the MD results, truncated active-site models containing two specific solvent molecules were further studied at the B3LYP/6-31+G(d,p) level of theory within the framework of the PCM model. The results show that the acetonitrile solvent gives rise to some deviation of the enzyme structure from the native one, and to a drop in the flexibility and the total SASA of the enzyme. The QM study further reveals that the structural variation of the active pocket caused by acetonitrile leads to a weakened catalytic H-bond network, a drop in the pKa value of His57, and an increase in the proton transfer barriers from the Ser195 to the His57 residue, which may contribute to the drop in enzymatic activity in acetonitrile media. In addition, the crystal waters play an important role in retaining the catalytic H-bond network and weakening the acetonitrile-induced variations above, which may be associated with the fact that the enzyme can retain catalytic activity in microhydrated acetonitrile media.
Co-reporter:Fuyuan Tan, Chao Tan, Aiping Zhao, and Menglong Li
Journal of Agricultural and Food Chemistry 2011 Volume 59(Issue 20) pp:10839-10847
Publication Date(Web):September 6, 2011
DOI:10.1021/jf2023325
In this paper, a novel application of alternating penalty trilinear decomposition (APTLD) for high-performance liquid chromatography with fluorescence detection (HPLC-FLD) has been developed to simultaneously determine the contents of free amino acids in tea. Although the spectra of amino acid derivatives were similar and a large number of water-soluble compounds are coextracted, APTLD could predict the accurate concentrations together with reasonable resolution of chromatographic and spectral profiles for the amino acids of interest owing to its “second-order advantage”. An additional advantage of the proposed method is lower cost than traditional methods. The results indicate that it is an attractive alternative strategy for the routine resolution and quantification of amino acids in the presence of unknown interferences or when complete separation is not easily achieved.
Co-reporter:Wenjia Xiong;Yanzhi Guo;Menglong Li
The Protein Journal 2010 Volume 29( Issue 6) pp:427-431
Publication Date(Web):2010 August
DOI:10.1007/s10930-010-9269-x
Lipid–protein interactions play a vital role in various biological processes; they are involved in cellular functions and can affect the stability, folding and function of peptides and proteins. In this study, a sequence-based method using a support vector machine and the position specific scoring matrix (PSSM) was proposed to predict lipid-binding sites. To take into account the influence of the residues surrounding each amino acid, a sliding window was used to encode the PSSM profiles. By incorporating the evolutionary information and the local features of residues surrounding a lipid-binding site, the method yielded a high accuracy of 80.86% and a Matthews correlation coefficient of 0.58 in a fivefold cross-validation test. This good result indicates the applicability of the method.
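The sliding-window encoding of PSSM profiles described above can be sketched as follows: for each residue, the PSSM rows of the residue and its neighbours on both sides are concatenated into one feature vector. The function name, the odd window size and the zero padding at the sequence termini are assumptions made for this sketch; the abstract does not state the window size or how the termini were handled.

```python
def encode_windows(pssm, window_size):
    """Concatenate the PSSM rows inside a centered window around each residue.

    pssm: list of rows, one per residue, each row a list of scores.
    window_size: odd number of residues per window (assumed for this sketch).
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so the window has a center")
    half = window_size // 2
    n_cols = len(pssm[0]) if pssm else 0
    # Assumption: positions beyond the termini are padded with zeros.
    pad_row = [0.0] * n_cols
    features = []
    for center in range(len(pssm)):
        vector = []
        for pos in range(center - half, center + half + 1):
            row = pssm[pos] if 0 <= pos < len(pssm) else pad_row
            vector.extend(row)
        features.append(vector)
    return features
```

With a real PSSM of 20 columns, a window of w residues gives a feature vector of length 20 × w per residue, which can then be passed to a classifier.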
Co-reporter:Jiajian Yin;Yuanbo Diao;Zhining Wen
International Journal of Peptide Research and Therapeutics 2010 Volume 16( Issue 2) pp:111-121
Publication Date(Web):2010 June
DOI:10.1007/s10989-010-9210-3
A set of quantitative multidimensional amino acid descriptors, E (E1–E5), has been introduced for Quantitative Structure–Activity Relationship (QSAR) studies of bioactive peptides. These descriptors correlate well with hydrophobicity, size, preferences for amino acids to occur in α-helices, composition and the net charge, respectively. They were then applied to structure characterization and QSAR analysis of 48 angiotensin-converting enzyme (ACE) inhibitor dipeptides, 55 ACE inhibitor tripeptides and 48 bitter-tasting dipeptides by support vector regression (SVR). The leave-one-out cross-validation Q²(CV) values were 0.886, 0.985 and 0.912, and the root mean square errors (RMSE) were 0.250, 0.021 and 0.123, respectively. The results showed that, in comparison with conventional descriptors, the new descriptor (E) is a useful structure characterization method for peptide QSAR analysis. The importance of each parameter or property at each position in the peptides is estimated from the RMSE of the model obtained using the leave-one-parameter-out (LOPO) approach in the SVR model. This provides some guidance for designing and exploiting peptide analogues. The results also indicate that SVR can be used as an alternative powerful modeling tool for peptide QSAR studies, and suggest LOPO as a way to evaluate the importance of parameters in an SVR model. Moreover, the study suggests a nonlinear relation between the bioactivity of peptides and their structural descriptors E. Establishing such methods is meaningful for the investigation of peptide bioactivity in peptide analogue drug design.
Co-reporter:Xuan-Min Guang;Yan-Zhi Guo;Xia Wang
Interdisciplinary Sciences: Computational Life Sciences 2010 Volume 2( Issue 3) pp:241-246
Publication Date(Web):2010 September
DOI:10.1007/s12539-010-0044-7
A neurotoxin is a toxin that acts on nerve cells by interacting with membrane proteins. Different neurotoxins have different functions and sources, and better knowledge of neurotoxins would greatly help drug design. A support vector machine (SVM) was used to predict neurotoxins based on multiple feature descriptors, including the amino acid composition, the length of the protein sequence, the weight of the protein and the evolutionary information described by the position-specific scoring matrix (PSSM). In five-fold cross-validation, the method achieved an accuracy of 100% in discriminating neurotoxins from non-toxins. For classifying neurotoxins by source and by function, the accuracies were 99.50% and 99.38%, respectively. Finally, the method performed well in the sub-classification of ion channel inhibitors, with a total accuracy of 87.27%. These results indicate that this method outperforms the previously described NTXpred method.
Co-reporter:Jiang Wu;Le-Zheng Yu;Chao Wang
The Protein Journal 2010 Volume 29( Issue 1) pp:62-67
Publication Date(Web):2010 January
DOI:10.1007/s10930-009-9222-z
The purpose of this article is to identify protein structural classes by using a support vector machine (SVM) ensemble classifier, which is very efficient in enhancing prediction performance. First, auto covariance (AC) and pseudo-amino acid composition (PseAAC) were used to represent proteins. AC focuses on adjacent effects, and PseAAC takes sequence-order patterns into account. Second, SVMs were trained on the datasets represented by the different descriptors. Finally, an ensemble classifier, constructed from the individual classifiers through a voting strategy, gave the final prediction results. A very promising prediction accuracy of 93.14% was obtained in the jackknife test. The experimental results showed that the ensemble system can greatly improve prediction performance and yield more stable and reliable predictors. The current method, which fuses the protein primary sequence information captured by AC and by PseAAC, may play an important complementary role in other related applications.
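The voting step of the ensemble described above can be sketched as a simple majority vote over the labels returned by the individual classifiers. The abstract does not state how ties are resolved or how many classifiers vote, so the tie rule (first label seen wins) and the example labels below are assumptions made only for illustration.

```python
from collections import Counter
from typing import Sequence


def majority_vote(predictions: Sequence[str]) -> str:
    """Return the most frequent label; ties go to the label that appears first."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    best = max(counts.values())
    return next(label for label in predictions if counts[label] == best)


# Hypothetical outputs of three individual SVMs for one protein.
ensemble_label = majority_vote(["all-alpha", "alpha/beta", "all-alpha"])
```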
Co-reporter:Lirong Liu;Yaping Fang;Menglong Li;Cuicui Wang
The Protein Journal 2009 Volume 28( Issue 3-4) pp:175-181
Publication Date(Web):2009 May
DOI:10.1007/s10930-009-9181-4
The β-turn is a type of protein secondary structure that plays an important role in protein configuration and function. Here, we introduce an approach to β-turn prediction that uses the support vector machine (SVM) algorithm combined with predicted secondary structure information. The secondary structure information was obtained with E-SSpred, a new protein secondary structure prediction method. A 7-fold cross-validation on the benchmark dataset of 426 non-homologous protein chains was used to evaluate the performance of our method. The prediction results broke the 80% Qtotal barrier, achieving Qtotal = 80.9% and MCC = 0.44, with Qpredicted 0.9% higher than that of the best existing method. Our results are consistent with the conclusion that β-turn prediction accuracy can be improved by including secondary structure information.
Co-reporter:Jiang Wu;Yi-Zhou Li
Interdisciplinary Sciences: Computational Life Sciences 2009 Volume 1( Issue 4) pp:315-319
Publication Date(Web):2009 December
DOI:10.1007/s12539-009-0066-1
Machine learning methods play a very important role in protein secondary structure prediction and related work. For a given learning approach, prediction quality depends mostly on how protein sequences are represented as numeric features. In this paper, two support vector machine (SVM) multi-classification strategies, “one-against-one” (1-a-1) and “one-against-all” (1-a-a), were used to identify protein structural classes. Auto covariance (AC), which transforms the physicochemical properties of the amino acids of a protein into a data matrix, focuses on the neighboring effects and the interactions between residues in protein sequences. The “1-a-1” approach was used with SVM to predict protein structural classes and obtained a very promising overall accuracy of 90.69% in the jackknife test, more than 10% higher than the accuracy obtained with “1-a-a”. The experimental results indicate that the SVM predictor constructed with “1-a-1” can avoid biased prediction accuracy. The current method, using the protein primary sequence information described by AC and the “1-a-1” approach with SVM, should play an important complementary role in other related applications.
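As a rough illustration of the auto covariance (AC) descriptor named above, the sketch below computes AC values for a single per-residue property profile. It assumes the commonly used definition, in which the mean-centred property values at positions i and i + lag are multiplied and averaged over the n - lag available pairs; the abstract does not give the exact formula, the properties or the maximum lag, so these are assumptions.

```python
from typing import List, Sequence


def auto_covariance(values: Sequence[float], max_lag: int) -> List[float]:
    """AC(lag) = sum_i (x_i - mean)(x_{i+lag} - mean) / (n - lag), for lag = 1..max_lag."""
    n = len(values)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    mean = sum(values) / n
    centred = [v - mean for v in values]
    result: List[float] = []
    for lag in range(1, max_lag + 1):
        total = sum(centred[i] * centred[i + lag] for i in range(n - lag))
        result.append(total / (n - lag))
    return result


# Toy property profile standing in for, e.g., per-residue hydrophobicity.
toy_property = [1.0, 2.0, 3.0, 4.0]
ac_values = auto_covariance(toy_property, max_lag=2)
```

Repeating this for several properties and concatenating the results gives a fixed-length vector regardless of sequence length, which is what makes AC convenient as SVM input.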
Co-reporter:Y. Diao;D. Ma;Z. Wen;J. Yin;J. Xiang;M. Li
Amino Acids 2008 Volume 34( Issue 1) pp:111-117
Publication Date(Web):2008 January
DOI:10.1007/s00726-007-0550-z
Transmembrane (TM) proteins represent about 20–30% of the protein sequences in higher eukaryotes and play important roles across a range of cellular functions. Moreover, knowledge of the topology of these proteins often provides crucial hints about their function. Because experimental structure determination of TM proteins is difficult, theoretical methods that predict the topology of newly found proteins from their primary sequences are highly desirable and are useful in both basic research and drug discovery. In this paper, based on the concept of pseudo amino acid composition (PseAA), which can incorporate the sequence-order information of a protein sequence so as to remarkably enhance the power of discrete models (Chou, K. C., Proteins: Structure, Function, and Genetics, 2001, 43: 246–255), cellular automata and Lempel-Ziv complexity are introduced to predict the TM regions of integral membrane proteins, including both α-helical and β-barrel membrane proteins, with validation by the jackknife test. The results are quite promising and indicate that the current approach might be a useful high-throughput tool in the post-genomic era. The source code and dataset are available to academic users by contacting liml@scu.edu.cn.
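For readers unfamiliar with Lempel-Ziv complexity, the following minimal sketch counts the phrases in the classic 1976 Lempel-Ziv parsing of a string. How the paper combines this measure with cellular automata is not described in the abstract, so the code only illustrates the complexity measure itself; the function name is an assumption.

```python
def lempel_ziv_complexity(sequence: str) -> int:
    """Number of phrases in the Lempel-Ziv (1976) exhaustive-history parsing."""
    i, count, n = 0, 0, len(sequence)
    while i < n:
        length = 1
        # Grow the phrase while it can still be copied from the text
        # that precedes its own last character.
        while i + length <= n and sequence[i:i + length] in sequence[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count


# Works on any alphabet, including amino acid letters.
binary_example = lempel_ziv_complexity("0001101001000101")
```

Low values indicate repetitive sequences and high values indicate sequences with little internal redundancy.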
Co-reporter:Y. Fang;Y. Guo;Y. Feng;M. Li
Amino Acids 2008 Volume 34( Issue 1) pp:103-109
Publication Date(Web):2008 January
DOI:10.1007/s00726-007-0568-2
DNA-binding proteins play a pivotal role in gene regulation. It is vitally important to develop an automated and efficient method for the timely identification of novel DNA-binding proteins. In this study, we propose a method that predicts DNA-binding proteins from the primary sequences of proteins alone. DNA-binding proteins were encoded by the autocross-covariance transform, pseudo-amino acid composition and dipeptide composition, individually and in different combinations of the three encoding methods; these feature matrices were then fed to support vector machine classifiers to predict DNA-binding proteins. All modules were trained and validated by the jackknife cross-validation test. Comparing the performance of these alternative modules, the best result was obtained with pseudo-amino acid composition, with an overall accuracy of 96.6% and a sensitivity of 90.7%. The results suggest that novel DNA-binding proteins can be predicted efficiently using only the primary sequences.
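Of the three encodings listed above, dipeptide composition is the simplest to show. The sketch below maps a sequence to a 400-dimensional vector of dipeptide frequencies over the 20 standard amino acids; skipping pairs that contain non-standard letters is an assumption, as the abstract does not say how such residues were handled.

```python
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400 entries


def dipeptide_composition(sequence: str) -> List[float]:
    """Fraction of each of the 400 dipeptides among all overlapping residue pairs."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for first, second in zip(sequence, sequence[1:]):
        pair = first + second
        if pair in counts:  # ignore pairs with non-standard residues (assumption)
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


toy_vector = dipeptide_composition("ACAC")
```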
Co-reporter:Qing Xiong, Yuxi Zhang, Menglong Li
Analytica Chimica Acta 2007 Volume 593(Issue 2) pp:199-206
Publication Date(Web):19 June 2007
DOI:10.1016/j.aca.2007.04.060
Mass spectral classifiers for 16 substructures present in the basic structures of pesticides have been investigated to assist pesticide residue analysis as well as the screening of pesticide lead compounds. Mass spectral data are first transformed into 396 features, and then Genetic Algorithm-Partial Least Squares (GA-PLS) as a feature selection method and Support Vector Machine (SVM) as a validation method are applied together to obtain an optimized feature set for each substructure. Finally, a statistical method, the AdaBoost algorithm combined with Classification and Regression Tree (AdaBoost-CART), is trained to predict the presence or absence of the 16 substructures using the optimized mass spectral feature sets. It is demonstrated that the optimized feature sets, instead of the original 396 features, can be used to predict the presence or absence of the 16 pesticide substructures with recognition success rates mostly of 85–100%.
Co-reporter:Xiao-Yu Feng, Qiu-Qi Wang, Jing Zhang, Fu-Sheng Nie, Meng-Long Li
Vibrational Spectroscopy 2007 Volume 44(Issue 2) pp:243-247
Publication Date(Web):17 July 2007
DOI:10.1016/j.vibspec.2006.12.002
In this work, a support vector machine (SVM)-based model was successfully developed to study aromatic compounds from their infrared spectra. First, the support vector machine and artificial neural network (ANN) methods were applied to construct a classifier system for aromatic compounds based on entire spectra. The results showed that both approaches performed well in identifying the adjacent functional group of aromatic compounds, and that SVM performed appreciably better than ANN in distinguishing the substitution types of benzene. Hence, SVM was selected to further study the spectra–structure correlation based on segmental spectra. The experiments suggested that some segmental spectra may capture significant information concealed in the entire spectra, and that the C–H and C–C out-of-plane wagging vibrations were the most important among the characteristic absorptions of benzene. A cross-validation procedure was used in all experiments.
Co-reporter:Zhimeng WANG;Lin JIANG;Menglong LI;Lina SUN;Rongying LIN
Acta Biochimica et Biophysica Sinica 2007 Volume 39(Issue 9) pp:715-721
Publication Date(Web):16 SEP 2007
DOI:10.1111/j.1745-7270.2007.00326.x
There are approximately 10^9 proteins in a cell. A hotspot in bioinformatics is how to identify a protein's subcellular localization, if its sequence is known. In this paper, a method using fast Fourier transform-based support vector machine is developed to predict the subcellular localization of proteins from their physicochemical properties and structural parameters. The prediction accuracies reached 83% in prokaryotic organisms and 84% in eukaryotic organisms with the substitution model of the c-p-v matrix (c, composition; p, polarity; and v, molecular volume). The overall prediction accuracy was also evaluated using the “leave-one-out” jackknife procedure. The influence of the substitution model on prediction accuracy has also been discussed in the work. The source code of the new program is available on request from the authors.
Co-reporter:Li-Xia LIU;Fu-Yuan TAN;Min-Chun LU;Ke-Long WANG;Yan-Zhi GUO;Zhi-Ning WEN;Lin JIANG
Acta Biochimica et Biophysica Sinica 2006 Volume 38(Issue 6) pp:363-371
Publication Date(Web):15 JUN 2006
DOI:10.1111/j.1745-7270.2006.00177.x
In our previous work, we developed a computational tool, PreK-ClassK-ClassKv, to predict and classify potassium (K+) channels. This method performs well for K+ channel prediction (PreK) and for classification at the family level (ClassK), but less well in classifying voltage-gated potassium (Kv) channels (ClassKv). In this paper, a new method based on the local sequence information of Kv channels is introduced to classify Kv channels. Six transmembrane domains of a Kv channel protein are used to define a protein, and the dipeptide composition technique is used to transform an amino acid sequence into a numerical sequence. A Kv channel protein is represented by a vector with 2000 elements, and a support vector machine algorithm is applied to classify Kv channels. This method shows good performance, with average total accuracy (Acc), sensitivity (SE), specificity (SP), reliability (R) and Matthews correlation coefficient (MCC) of 98.0%, 89.9%, 100%, 0.95 and 0.94, respectively. The results indicate that the local sequence information-based method is better than the global sequence information-based method for classifying Kv channels.
Edited by Juan LIU
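The performance measures quoted in the Kv channel abstract above (Acc, SE, SP and MCC) can be computed from the four cells of a binary confusion matrix using their standard definitions, as in the sketch below. The reliability index R is omitted because its formula is not given in the abstract, and the example counts are invented purely to show the arithmetic.

```python
import math
from typing import Dict


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Accuracy, sensitivity, specificity and Matthews correlation coefficient."""
    total = tp + fn + tn + fp
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Acc": (tp + tn) / total if total else 0.0,
        "SE": tp / (tp + fn) if (tp + fn) else 0.0,
        "SP": tn / (tn + fp) if (tn + fp) else 0.0,
        # MCC is conventionally set to 0 when any marginal total is zero.
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Made-up counts, not results from any of the papers listed here.
example_metrics = binary_metrics(tp=45, fn=5, tn=40, fp=10)
```

Unlike accuracy, MCC stays informative when the positive and negative classes are very different in size, which is why several of the abstracts in this list report both.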
Co-reporter:Yan-Zhi GUO;Ke-Long WANG;Zhi-Ning WEN;Min-Chun LU;Li-Xia LIU;Lin JIANG
Acta Biochimica et Biophysica Sinica 2005 Volume 37(Issue 11) pp:
Publication Date(Web):15 NOV 2005
DOI:10.1111/j.1745-7270.2005.00110.x
Although the sequence information on G-protein coupled receptors (GPCRs) continues to grow, many GPCRs remain orphaned (i.e. ligand specificity unknown) or poorly characterized with little structural information available, so an automated and reliable method is urgently needed to facilitate the identification of novel receptors. In this study, a fast Fourier transform-based support vector machine method has been developed for predicting GPCR subfamilies from protein hydrophobicity. In classifying Class B, C, D and F subfamilies, the method achieved an overall Matthews correlation coefficient and accuracy of 0.95 and 93.3%, respectively, when evaluated using the jackknife test. The method achieved an accuracy of 100% on the Class B independent dataset. The results show that this method can classify GPCR subfamilies as well as their functional classification with high accuracy. A web server implementing the prediction is available at http://chem.scu.edu.cn/blast/Pred-GPCR.
Edited by Lu-Hua LAI
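The idea behind the Fourier-based encoding in the GPCR abstract above can be sketched as follows: each residue is replaced by a hydrophobicity value, and the power spectrum of the resulting numeric signal is computed. The abstract does not name the hydrophobicity scale or say how spectra are turned into fixed-length SVM inputs, so the Kyte-Doolittle scale and the function names here are assumptions. A plain discrete Fourier transform is used to keep the example dependency-free; an FFT routine yields the same coefficients faster.

```python
import cmath
from typing import List, Sequence

# Kyte-Doolittle hydropathy values (an assumed choice of scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def power_spectrum(signal: Sequence[float]) -> List[float]:
    """|X_k|^2 for k = 0..n-1, with X_k = sum_t x_t * exp(-2*pi*i*k*t/n)."""
    n = len(signal)
    spectrum: List[float] = []
    for k in range(n):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def hydrophobicity_spectrum(sequence: str) -> List[float]:
    """Power spectrum of the hydropathy profile; non-standard residues are skipped."""
    signal = [KYTE_DOOLITTLE[aa] for aa in sequence if aa in KYTE_DOOLITTLE]
    return power_spectrum(signal)


toy_spectrum = power_spectrum([1.0, 0.0, -1.0, 0.0])
```

Peaks in such a spectrum reflect periodic hydrophobicity patterns, for example the alternation expected along membrane-spanning helices.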
Co-reporter:Zhi-ning Wen, Ke-long Wang, Meng-long Li, Fu-sheng Nie, Yi Yang
Computational Biology and Chemistry 2005 Volume 29(Issue 3) pp:220-228
Publication Date(Web):June 2005
DOI:10.1016/j.compbiolchem.2005.04.007
This paper applies the discrete wavelet transform (DWT) with various protein substitution models to find functional similarity between proteins with low sequence identity. A new metric, the ‘S’ function, based on the DWT is proposed to measure pair-wise similarity. We also develop a segmentation technique, combined with the DWT, to handle long protein sequences. The results are compared with those obtained using pair-wise alignment and PSI-BLAST.
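To make the discrete wavelet transform concrete, the sketch below performs one decomposition level with the Haar wavelet on a numeric profile, such as a protein sequence after a substitution model has mapped residues to numbers. The abstract does not say which wavelet family or how many levels were used, and the ‘S’ function is not defined there, so the Haar choice is an assumption and the similarity metric is not reproduced.

```python
import math
from typing import List, Sequence, Tuple


def haar_step(signal: Sequence[float]) -> Tuple[List[float], List[float]]:
    """One level of the orthonormal Haar DWT: (approximation, detail) coefficients."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return approx, detail


toy_approx, toy_detail = haar_step([4.0, 2.0, 6.0, 6.0])
```

Applying `haar_step` again to the approximation coefficients gives the next, coarser level, so two sequences can be compared at several resolutions.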
Co-reporter:Lezheng Yu, Yanzhi Guo, Zheng Zhang, Yizhou Li, Menglong Li, Gongbing Li, Wenjia Xiong, Yuhong Zeng
Peptides (April 2010) Volume 31(Issue 4) pp:574-578
Publication Date(Web):1 April 2010
DOI:10.1016/j.peptides.2009.12.026
In contrast to the large number of classically secreted proteins (CSPs) and non-secreted proteins (NSPs), only a few proteins have been experimentally shown to enter non-classical secretory pathways. It is therefore difficult to identify non-classically secreted proteins (NCSPs), and no methods are available for distinguishing the three types of proteins simultaneously. To address this problem, data mining was first carried out, and mammalian proteins exported via ER-Golgi-independent pathways were collected through extensive literature searches. In this paper, a support vector machine (SVM)-based ternary classifier named SecretP is proposed to predict mammalian secreted proteins by using pseudo-amino acid composition (PseAA) and five additional features. When distinguishing the three types of proteins, SecretP yielded an accuracy of 88.79%. When the method was evaluated on an independent test set of 92 human proteins, 76 of them were correctly predicted as NCSPs. On another public independent data set, the prediction results of SecretP are comparable to those of other existing computational methods. Therefore, SecretP can be a useful supplementary tool for future secretome studies. The SecretP web server and all supplementary tables listed in this paper are freely available at http://cic.scu.edu.cn/bioinformatics/secretp/index.htm.
Co-reporter:Qifan Kuang, Yizhou Li, Yiming Wu, Rong Li, Yongcheng Dong, Yan Li, Qing Xiong, Ziyan Huang, Menglong Li
Chemometrics and Intelligent Laboratory Systems (15 March 2017) Volume 162() pp:104-110
Publication Date(Web):15 March 2017
DOI:10.1016/j.chemolab.2017.01.016
Co-reporter:
Analytical Methods (2009-Present) 2014 - vol. 6(Issue 17) pp:NaN6840-6840
Publication Date(Web):2014/06/17
DOI:10.1039/C4AY01240B
Multi-family enzymes are of great importance in life, disease and other domains. However, in enzyme classification, multi-family enzymes are usually removed from the dataset because of the limitations of traditional single-label prediction methods. In order to predict the multiple classes of multi-family enzymes, we adopted two multi-label learning algorithms, namely RAkEL-RF and MLKNN, and two types of protein descriptors, namely CTD and PseAAC, to generate four predictors: RAkEL-RF-CTD, RAkEL-RF-PseAAC, MLKNN-CTD and MLKNN-PseAAC. When the four predictors were tested on the training set with 10-fold cross-validation, the overall success rates reached 97.99%, 96.07%, 96.01% and 95.31%, respectively. For the independent test set, the corresponding rates reached 97.57%, 95.03%, 95.9% and 93.9%, respectively. The extremely small difference between the two sets for each predictor, together with the relatively high accuracy, demonstrates the strong prediction capability and robustness of our predictors. In addition, three of seven pairs of homologous enzymes with different functions and eighteen of twenty-three distantly related enzymes with a similar family were correctly classified by the RAkEL-RF-CTD predictor. These results indicate the broad applicability of our predictors.
Co-reporter:Li Chen, Yongzhi Zhang, Chaohong Lin, Wen Yang, Yan Meng, Yong Guo, Menglong Li and Dan Xiao
Journal of Materials Chemistry A 2014 - vol. 2(Issue 25) pp:NaN9690-9690
Publication Date(Web):2014/03/21
DOI:10.1039/C4TA00501E
A facile, economical and effective method to produce hierarchically porous nitrogen-rich carbon (HPNC) from wheat straw is reported. Acid pretreatment is introduced before KOH activation and promotes the formation of thinner pore walls. Without any N-doping, the N content is as high as 5.13%. When used as an anode for Li-ion batteries, the HPNC exhibits a superior specific capacity of 1470 mA h g−1 at 0.037 A g−1 and an ultrahigh rate capability of 344 mA h g−1 at 18.5 A g−1. Even at an extremely high current density of 37 A g−1, the reversible capacity is still as high as 198 mA h g−1.
Co-reporter:
Analytical Methods (2009-Present) 2014 - vol. 6(Issue 8) pp:NaN2698-2698
Publication Date(Web):2014/01/21
DOI:10.1039/C3AY42101E
Adverse drug reactions (ADRs) are one of the main issues restraining the development and clinical application of new drugs. Owing to the complicated molecular mechanisms of ADRs, various experimental and computational methods have been employed to detect them. It has been reported that a number of ADRs are induced by a series of actions triggered by drugs or their reactive metabolites that bind to therapeutic targets or other proteins involved in drug metabolism. Identifying these ADR-related proteins (ADRRPs) is a feasible avenue for exploring the adverse reactions of drugs. In this study, the human protein–protein interaction (PPI) network was constructed as a powerful tool for studying the molecular mechanisms of ADRs. Based on this network, five network topological properties were calculated to characterize proteins quantitatively. A random forest model for ADRRP prediction was then built on these properties. The prediction model yielded a satisfactory result, with a sensitivity of 87.3%, a specificity of 86.1% and an overall accuracy of 86.8%. Finally, text mining was applied to verify the predictions. Some of the predicted ADRRPs have been shown by experimental studies to be involved in regulating ADRs. The results suggest that the genome-wide human interaction network provides an effective channel for discovering ADRRPs.
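The abstract above does not list which five topological properties were used, so the sketch below only illustrates the general idea with two widely used ones, node degree and the local clustering coefficient, computed from an undirected edge list. The protein identifiers and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def build_adjacency(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Undirected adjacency sets; self-interactions are ignored."""
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for a, b in edges:
        if a != b:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return dict(adjacency)


def degree(adjacency: Dict[str, Set[str]], node: str) -> int:
    """Number of direct interaction partners."""
    return len(adjacency[node])


def clustering_coefficient(adjacency: Dict[str, Set[str]], node: str) -> float:
    """Fraction of neighbour pairs that also interact with each other."""
    neighbours = adjacency[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adjacency[u])
    return 2.0 * links / (k * (k - 1))


# Hypothetical toy network: a triangle P1-P2-P3 plus a pendant protein P4.
toy_network = build_adjacency([("P1", "P2"), ("P2", "P3"), ("P1", "P3"), ("P3", "P4")])
```

Each protein is thus described by a small numeric vector, which is the kind of input a random forest classifier can be trained on.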
Co-reporter:Xiuchan Xiao, Xiaojun Zeng, Yuan Yuan, Nan Gao, Yanzhi Guo, Xuemei Pu and Menglong Li
Physical Chemistry Chemical Physics 2015 - vol. 17(Issue 4) pp:NaN2522-2522
Publication Date(Web):2014/11/21
DOI:10.1039/C4CP04528A
G protein coupled receptors (GPCRs) play a crucial role in regulating signal recognition and transduction through their activation. The conformational transition along the activation pathway is of particular importance for their function. However, it remains poorly understood because of experimental difficulties in determining the conformations and the timescale limitations of conventional molecular dynamics (CMD) simulation. Thus, in this work, we employed targeted molecular dynamics (TMD) simulation to study the activation process from an inactive structure to a fully active one for the β2 adrenergic receptor (β2AR). As a reference, 110 ns CMD simulations of wild-type β2AR and its D130N mutant were also carried out. The TMD results show that there is at least one intermediate conformation cluster in the activation process, as evidenced by the principal component analysis and by the structural and dynamic differences of some important motifs. Notably, the activation of the ligand binding site lags behind that of the G-protein binding site, displaying uncoupled correlation. Comparisons between the CMD and TMD results show that the D130N mutation significantly speeds up the entry of ICL2 and the key ionic lock into the intermediate state, which to some extent facilitates the activation involving the NPxxY and DRY regions and the separation between TM3 and TM6. However, no contribution of the D130N mutation to the activation of the ligand binding site could be observed within the 110 ns timescale. These observations could provide novel insights that complement previous studies and improve the understanding of the activation mechanism of β2AR.