About this Research Topic
The proverbial elephant in the room is the dataset. All data scientists and AI engineers are painfully aware of how difficult it is to collect, curate, clean, govern, and secure a quality dataset that is representative, significant, and informative about the situation to be modeled. In fact, reaching the point of having such a dataset can easily consume over 80% of the time and budget of the entire project.
In this Research Topic we will focus on ‘data-centric’ radiology AI. The papers will address and exemplify best practices for moving from a good idea to a great dataset. The work of selecting, training, tuning, and deploying the actual AI models is left to the large body of literature that already exists.
Modes of collecting and storing radiological data are important not only for regulatory reasons but also because of practical challenges unique to the medical industry, such as incompatible electronic health record systems. We are also challenged by having to maintain significant amounts of metadata that are personal and sensitive. The cloud is regarded as a panacea by some and as a precipice by others, and the extent to which it may be used remains contested.
Most of the data that can be easily collected represents the ‘normal’ case. Often, however, the normal case is not the one we wish to model. We may then have to curate the dataset so that the interesting cases are sufficiently well represented. For image data to be helpful for AI, it must usually be labeled. Labeling is a crucially important step that is usually neglected as the black sheep of the AI family. Once the dataset grows beyond a certain size, the IT system managing the data becomes important and must be designed correctly to serve its purpose.
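The dominance of the ‘normal’ case can be made concrete with a small calculation. The following is a minimal sketch in Python, not a recommended method: it counts the cases per class, reports the ratio of the largest to the smallest class, and derives inverse-frequency weights (n / (k * count), a common heuristic). The function name `class_balance`, the label strings, and the 95/5 split are hypothetical and chosen only for illustration.

```python
from collections import Counter


def class_balance(labels):
    """Return per-class counts, the majority/minority ratio, and
    inverse-frequency weights for a list of class labels."""
    counts = Counter(labels)
    n = len(labels)   # total number of cases
    k = len(counts)   # number of distinct classes
    # weight_c = n / (k * count_c): rarer classes receive larger weights
    weights = {c: n / (k * m) for c, m in counts.items()}
    ratio = max(counts.values()) / min(counts.values())
    return counts, ratio, weights


# Hypothetical labels: 95 'normal' studies and 5 studies with a finding
labels = ["normal"] * 95 + ["finding"] * 5
counts, ratio, weights = class_balance(labels)
print(counts)
print(ratio)
print(weights)
```

Weights of this kind are one option among several (others include resampling or targeted collection of rare cases); which is appropriate depends on the modeling question.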
This Research Topic will cover all of these themes that are preparatory for AI. We find, nowadays, that the real lever in model quality is the dataset. Let’s investigate how to create the best datasets so as to achieve the best possible models.
In particular, we welcome articles that address, but are not limited to, the following questions:
1. How can we best collect radiological data from many disparate sources worldwide?
2. How can we account for different legal frameworks (e.g. HIPAA and GDPR) when building such a dataset?
3. How can we effectively spot irrelevant, superfluous, damaged, or otherwise unusable data in a large radiological dataset?
4. What are the best practices for dealing with an imbalanced dataset, that is, a dataset in which the number of cases in one category far outweighs the number of cases in another?
5. How can we measure the significance and relevance of a dataset to a particular modeling question?
6. What are the best practices for labeling radiological images?
7. What are the best practices for radiological data governance?
8. What are the challenges in merging two existing datasets that may not share the same macroscopic statistical properties?
9. What pre-processing steps help prepare raw radiological data for AI?
10. And so on.
Keywords: Data Preparation, Artificial Intelligence
Important Note: All contributions to this Research Topic must be within the scope of the section and journal to which they are submitted, as defined in their mission statements. Frontiers reserves the right to guide an out-of-scope manuscript to a more suitable section or journal at any stage of peer review.