While there have been great advances in data analytics in recent years, including distributed computing for Big Data and machine learning (deep learning among them), less attention has been paid to the data curation and data governance processes that support data analytics. A common complaint is that data scientists spend 80% of their time preparing data for analysis and only 20% on the analysis itself. This is because the tools and methods used in data preparation require substantial human time and effort for tasks such as data quality analysis, data cleaning, data enhancement, data standardization, data integration, testing, and validation. Data preparation is just one phase of data curation, which is the management of data through its entire life cycle from acquisition to disposal. Furthermore, as organizations realize the value of their data, they are implementing data governance programs to ensure that they have a complete inventory of their data and its contents, and a way to exercise authority and accountability over data as an organizational asset. As with data curation, most data governance processes require substantial human time and effort to be effective.
The aim of this Research Topic is to examine research in Automated Data Curation and Data Governance Automation that develops unsupervised methods and techniques to automate data curation and data governance processes to the greatest extent possible. The goal of fully automating data cleaning and integration has been labeled a “data washing machine” by Richard Wang, with some initial development led by John R. Talburt. Similar work has begun in industry to develop methods for automating many data governance tasks, such as “positive data control” for maintaining the enterprise data catalog. Replacing human analysis with scalable, unsupervised automation of these processes will not be easy, but it is necessary to keep pace with the increasing volume and variety of data driving modern decision systems.
Submissions to this Research Topic can address, but are not limited to, the following themes within the context of automated methods for:
• Data quality assessment and metrics
• Generating data quality validation rules
• Data cleansing (data washing machines)
• Spelling correction
• Missing value imputation
• Data standardization
• Multi-source data integration
• Entity and identity resolution
• Data governance policy and standards conformance
• Metadata generation
• Data catalog initialization and setup
• Updating data catalogs and business glossaries
• Data operations logging and data provenance
• Positive data control
• Generating data products
• Data as a service
• Data archiving, deletion, and disposal