The final, formatted version of the article will be published soon.
METHODS article
Front. Genet.
Sec. Computational Genomics
Volume 15 - 2024
doi: 10.3389/fgene.2024.1400228
OCRClassifier: a method combining control charts and machine learning for accurately detecting open states of chromatin
Provisionally accepted
School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China
Open chromatin regions (OCRs) play a crucial role in transcriptional regulation and gene expression. Identifying OCRs with techniques such as ATAC-seq or DNase-seq can be time-consuming. In recent years, there has been growing interest in using plasma cell-free DNA (cfDNA) sequencing data to detect OCRs: by analyzing the characteristics of cfDNA fragments and their sequencing coverage, researchers can differentiate OCRs from non-OCRs. However, noise and variability in cfDNA-seq data pose a challenge for noise-tolerant, learning-based OCR estimation, because the training data contain numerous noisy labels that may reduce the accuracy of the results. Current OCR detection methods rely on statistical features derived from typical open and closed chromatin regions to decide whether a region is an OCR or a non-OCR. Some atypical regions, however, exhibit statistical features that fall between the two categories, which makes it difficult to classify them definitively as either open or closed chromatin regions (CCRs). These regions should be considered partially open chromatin regions (pOCRs). In this paper, we present OCRClassifier, a novel framework that combines control charts and machine learning to mitigate the impact of a high proportion of noisy labels in the training set and to classify chromatin open states accurately into three classes. Our method comprises two control charts. We first design a robust Hotelling T² control chart and create new run rules to identify reliable OCRs and CCRs within the initial training set. We then use only the resulting pure training set of OCRs and CCRs to create and train a sensitized T² control chart, which is designed to differentiate between the three categories of chromatin state: open, partially open, and closed.
Experimental results demonstrate that under this framework, the model exhibits not only excellent performance in terms of three-class classification, but also higher accuracy and sensitivity in binary classification compared to the state-of-the-art models currently available.
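To make the control-chart idea concrete, the following is a minimal Python sketch of a classical Hotelling T² statistic with two control limits. It is not the authors' implementation: the abstract does not specify the robust estimators, the run rules, or the sensitizing scheme, so this sketch uses the plain sample mean and covariance, and it assumes (purely for illustration) that the chart is fitted on closed-region features and that a lower "warning" limit and an upper "action" limit split regions into closed, partially open, and open. The function names and the limit values are hypothetical.

```python
import numpy as np

def fit_reference(X):
    """Estimate the mean vector and inverse covariance from reference rows.

    Assumption: plain sample estimates; the paper's robust chart would use
    robust estimators that are not described in the abstract.
    """
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)  # sample covariance (ddof=1)
    return mean, np.linalg.inv(cov)

def hotelling_t2(X, mean, cov_inv):
    """Return T^2 = (x - mean)^T S^{-1} (x - mean) for each row x of X."""
    d = X - mean
    return np.einsum("ij,jk,ik->i", d, cov_inv, d)

def classify(t2, warning_limit, action_limit):
    """Map T^2 values to three labels using two hypothetical limits."""
    labels = np.full(t2.shape, "partially_open", dtype=object)
    labels[t2 <= warning_limit] = "closed"   # close to the reference class
    labels[t2 > action_limit] = "open"       # far from the reference class
    return labels

# Toy usage with made-up two-dimensional features.
points = np.array([[0.0, 0.0], [2.0, 0.0], [3.0, 4.0]])
t2 = hotelling_t2(points, mean=np.zeros(2), cov_inv=np.eye(2))
print(classify(t2, warning_limit=1.0, action_limit=9.0))
```

In practice the limits would be derived from the reference distribution of T² rather than chosen by hand, and the features would be the fragment and coverage statistics mentioned above.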
Keywords: cell-free DNA, open chromatin region, sequencing data analysis, multivariate control chart, noisy label
Received: 13 Mar 2024; Accepted: 07 May 2024.
Copyright: © 2024 Wang, Lai, Liu, Liu and Zhu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence:
Jiayin Wang, School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China