Interlaboratory Studies Using the NISTmAb to Advance Biopharmaceutical Structural Analytics

Yandrofski, Katharina; Mouchahoir, Trina; De Leoz, M. Lorna; Duewer, David; Hudgens, Jeffrey W.; Anderson, Kyle W.; Arbogast, Luke; Delaglio, Frank; Brinson, Robert G.; Marino, John P.; Phinney, Karen; Tarlov, Michael; Schiel, John E.

doi:10.3389/fmolb.2022.876780

REVIEW article

Front. Mol. Biosci., 05 May 2022

Sec. Structural Biology

Volume 9 - 2022 | https://doi.org/10.3389/fmolb.2022.876780

This article is part of the Research Topic: Structure-Function Metrology of Proteins

Katharina Yandrofski¹*

Trina Mouchahoir¹

David Duewer³

Luke Arbogast¹

Robert G. Brinson¹

Karen Phinney³

¹Institute for Bioscience and Biotechnology Research, National Institute of Standards and Technology, Rockville, MD, United States
²Agilent Technologies, Santa Clara, CA, United States
³National Institute of Standards and Technology, Gaithersburg, MD, United States

Biopharmaceuticals such as monoclonal antibodies are required to be rigorously characterized using a wide range of analytical methods. Various material properties must be characterized and well controlled to assure that clinically relevant features and critical quality attributes are maintained. A thorough understanding of analytical method performance metrics, particularly emerging methods designed to address measurement gaps, is required to assure methods are appropriate for their intended use in assuring drug safety, stability, and functional activity. To this end, a series of interlaboratory studies have been conducted using NISTmAb, a biopharmaceutical-representative and publicly available monoclonal antibody test material, to report on state-of-the-art method performance, harmonize best practices, and inform on potential gaps in the analytical measurement infrastructure. Reported here is a summary of the study designs, results, and future perspectives revealed from these interlaboratory studies which focused on primary structure, post-translational modifications, and higher order structure measurements currently employed during biopharmaceutical development.

Introduction

Monoclonal antibodies (mAbs) have become the most prevalent biopharmaceutical modality, used to treat indications from viral infections to cancer. Numerous other protein-based drugs continue to evolve including antibody drug conjugates (ADCs), bispecifics, coagulation factors, and cytokines, among others. In addition, new modalities such as viral vector mediated gene therapies and vaccines, mRNA vaccines, and adoptive cell therapies have more recently emerged to fill previously unmet medical needs. Common to all modalities is the need for comprehensive structural characterization, identification of relevant critical quality attributes, and quality control of these features to maintain safety and efficacy. Lessons learned regarding analytical best practices for mAbs, perhaps the most widely characterized and understood from a structure-function perspective, can be ported to other modalities.

Comprehensive evaluation of the fundamental performance metrics and analytical capability of a technology is a pivotal first step prior to adapting, translating, or evolving measurement methods to new systems. Innovative analytical technologies are often employed to enable deep characterization, elucidate mechanisms of action, or better understand the intricacies of a manufacturing process. These emerging technologies, despite their potential for profound leaps in product or process understanding, may have limited historical precedent, thus presenting a barrier to rapid adoption. Interlaboratory studies may serve to lower these barriers by providing a means of harmonizing technical approaches, reporting community-wide performance metrics (e.g., precision), defining method best practices, and/or understanding the underpinning principles and sources of uncertainty in a measurement system.

Publicly available biopharmaceutical product-representative test materials are a prerequisite to interlaboratory evaluation of community-wide performance metrics. The NIST monoclonal antibody (NISTmAb) was introduced as a tool for advancing analytical methods pertaining to monoclonal antibodies. The NISTmAb reference material (RM) 8671 is a recombinant humanized IgG1κ expressed in murine suspension cell culture that has undergone biopharmaceutical industry standard upstream and downstream purification to remove process-related impurities. This RM is intended primarily for use in evaluating the performance of methods for determining physicochemical and biophysical attributes of monoclonal antibodies. It also provides a representative test molecule for development of novel technologies for therapeutic protein characterization (Mouchahoir and Schiel, 2018; Schiel and Turner, 2018; Schiel et al., 2018; Turner and Schiel, 2018; Turner et al., 2018). The NISTmAb first debuted in a series of small interlaboratory characterization studies in 2015 (Schiel et al., 2014; Schiel et al., 2015a; Schiel et al., 2015b). This series of reports provided a useful baseline to identify measurements for which method advancement and regulatory assimilation would benefit from additional technology development and interlaboratory studies. Highly focused interlaboratory studies have since been reported by NIST and independent communities to harmonize best practices, deepen community consensus on method performance, and document a baseline performance upon which future method evolution may be based (DeLeoz et al., 2017; Hudgens et al., 2019a; Brinson et al., 2019; Coffman et al., 2020; Srzentić et al., 2020; Mouchahoir et al., 2021). A number of those interlaboratory studies are reviewed here, with the intention to spur future partnerships to target additional assays and/or method evolution through well-planned interlaboratory studies.

The design, coordination, implementation, writing, and publishing of each of these studies is an extensive community-wide effort that spans multiple years (Figure 1). Planning and recruitment stages of an interlaboratory study coincide and are often synergistic. Sample sets, measurement protocols, and reporting structures can evolve based on community feedback during these stages, because study design is exceedingly difficult to change post-launch. Sample selection and preparation is a critical step most often performed by the study organizers, but in consultation with participants. This stage may involve alteration of material properties to “challenge” the analytical method and/or prepare the sample for analysis (e.g., digestion, mixing, or vialing). Samples must be suitable with respect to material properties for the intended use in the measurement system, be non-proprietary to enable public dissemination/publication of results, and be of sufficient stability, homogeneity, and purity.

FIGURE 1

FIGURE 1. Timeline of Global NISTmAb Interlaboratory Studies. (A) Representative timeline identifying the key milestones for an interlaboratory study. (B) Corresponding dates and key milestones for each NISTmAb interlaboratory study (MAM, glycosylation, NMR HOS, and HDX-MS). ^a MAM New Peak Detection Publication (Mouchahoir et al., 2021). ^b MAM Attribute Analytics Publication (in progress). ^c Glycosylation Interagency Internal Report (DeLeoz et al., 2017). ^d Glycosylation Interlaboratory Publication (De Leoz et al., 2020). ^e NMR HOS Interlaboratory Publication (Brinson et al., 2019). ^f HDX-MS Interlaboratory Publication (Hudgens et al., 2019a).

The measurement phase of an interlaboratory study is conducted at each partner's individual site. Participants are typically asked to complete a pre-defined report template and, when possible, include raw data. Although reporting is templated, participation in such a study involves a significant commitment by participants and their parent organizations. The study design is intended to minimize the financial and time burden on the participants, but this aspect should not be overlooked, as the intellectual engagement of experts in the field is critical to industry-relevant impact. Submitted datasets are most commonly anonymized via third party vendors to protect potential intellectual property, after which combined analysis of the anonymized data is conducted by the study organizers. Analysis, interpretation, and formulating the discussion around results are again a community effort involving all study participants. Numerous iterations of data analyses and participant feedback lead to a first draft, which is initially approved by all co-authors and then sent for legal review at the partner organizations. Use of a non-competitive material and data anonymization are critical to assure freedom to operate. Writing, submission, and acceptance can be a rather lengthy process to allow all authors and partner institutions to ultimately agree on the presentation and interpretation of results. Each of the studies reviewed herein is a consensus of 15–75 organizations. Interlaboratory studies represent broad industry commitment to achieve a high degree of unity and enable implementation of current best practices, evolve analytical methods, and expedite their uptake. A representative sampling of NISTmAb interlaboratory studies is reviewed here, specifically those that included one or more NIST organizers (multi-attribute method, glycosylation analysis, nuclear magnetic resonance, and hydrogen deuterium exchange interlaboratory studies). Each study had a slightly different design and output, as necessitated by the intricacies of that particular method; each is reported herein with a Purpose and Method Description, Summary of Results, and Learnings and Future Perspectives (Hudgens et al., 2019a; Brinson et al., 2019; De Leoz et al., 2020; Mouchahoir et al., 2021).

Multi-Attribute Method Interlaboratory Study

Purpose and Method Description

The multi-attribute method (MAM) builds upon industry experience with mass spectrometry (MS)-based peptide mapping (Formolo et al., 2015; Rogstad et al., 2017; Noor et al., 2020) and holds promise for use in the quality control (QC) space (Rogers et al., 2013; Rogers et al., 2015; Xu et al., 2017; Zhang and Guo, 2017; Rogstad et al., 2019; Millán-Martín et al., 2020; Sokolowska et al., 2020; Zhang et al., 2020; Mouchahoir et al., 2021). MAM is designed to monitor the status of pre-defined quality attributes within a therapeutic protein sample (e.g., post-translational modifications (PTMs), enzymatic clips, isomerization, etc.) and/or detect process impurities (e.g., host cell proteins) in the sample. The basic workflow of a MAM platform begins with enzymatic digestion of the therapeutic protein, followed by separation of the resulting peptides by liquid chromatography (LC) and identification of the peptides by high-resolution mass spectrometry detection. Software platforms are then used to interrogate the data to monitor changes in PTM levels within the sample (i.e., attribute analytics) and/or to detect impurities and unanticipated PTM changes in a non-targeted manner by comparison of the sample to a reference prepared in parallel (i.e., new peak detection (NPD)). NPD is performed by first aligning reference and test sample data files according to m/z and retention time as depicted in Figure 2A. The data undergo “peak picking” where ions that meet a predetermined signal threshold (the new peak detection threshold) and display typical peptide isotope distributions are designated as peaks (bounded by blue, green or brown boxes in Figure 2A). The peaks detected in each sample are compared to the corresponding peaks (i.e., m/z and retention time match within a set tolerance) in the other sample. Peaks present in the test sample but not the reference sample are reported as new peaks; conversely, peaks present in the reference sample but not in the test sample are reported as missing peaks. If a peak is present in both samples and the difference in abundance between samples surpasses a set threshold (the fold-change threshold), it is reported as a changed peak. Unchanged peaks (below the fold-change threshold) are not reported. Prior knowledge of peak identity is not required; thus, NPD is an untargeted analysis that can potentially detect unexpected impurities or differences.
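
The NPD comparison logic described above can be illustrated with a minimal Python sketch. It is not the algorithm of any vendor software used in the study: the `Peak` class, the function names, and the default tolerances (10 ppm, 0.5 min) are hypothetical choices for illustration, retention time alignment and isotope-pattern checks are assumed to have been done already, and the NPD threshold is assumed to be greater than zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peak:
    mz: float         # mass-to-charge ratio
    rt: float         # retention time in minutes (assumed already aligned)
    abundance: float  # signal intensity, e.g. an XIC area


def find_match(peak, candidates, mz_tol_ppm, rt_tol_min):
    """Return the candidate closest in retention time that lies within
    both the m/z and retention time tolerances, or None."""
    best = None
    for c in candidates:
        ppm = abs(c.mz - peak.mz) / peak.mz * 1e6
        if ppm <= mz_tol_ppm and abs(c.rt - peak.rt) <= rt_tol_min:
            if best is None or abs(c.rt - peak.rt) < abs(best.rt - peak.rt):
                best = c
    return best


def new_peak_detection(reference, test, npd_threshold, fold_change_threshold,
                       mz_tol_ppm=10.0, rt_tol_min=0.5):
    """Classify peaks as new, missing, or changed relative to the reference.

    Tolerance defaults are illustrative assumptions, not study settings.
    """
    # Peak picking: keep only signals at or above the NPD threshold.
    ref = [p for p in reference if p.abundance >= npd_threshold]
    tst = [p for p in test if p.abundance >= npd_threshold]

    report = {"new": [], "missing": [], "changed": []}
    for p in tst:
        match = find_match(p, ref, mz_tol_ppm, rt_tol_min)
        if match is None:
            report["new"].append(p)
        else:
            # Fold-change expressed as larger over smaller abundance.
            fold = (max(p.abundance, match.abundance)
                    / min(p.abundance, match.abundance))
            if fold >= fold_change_threshold:
                report["changed"].append(p)
            # Unchanged peaks are deliberately not reported.
    for p in ref:
        if find_match(p, tst, mz_tol_ppm, rt_tol_min) is None:
            report["missing"].append(p)
    return report
```

Because filtering happens before matching in this sketch, a peak that sits just above the threshold in one sample and just below it in the other is reported as new or missing rather than as unchanged, which is one way threshold choice can shape the reported peak list.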

FIGURE 2

FIGURE 2. Overview of MAM New Peak Detection Data Analysis. (A) A representation of new peak detection data is shown for a single charge state/isotope cluster for a new peak, changed peak, and unchanged peak. (B) Peaks Reported as New, Missing, or Changed in Spike and Unknown Samples. New, missing, and changed peaks detected in the Spike (S) and Unknown (U) Samples were reported by each participant. For the Spike Sample, peaks that conformed to expectation are represented in blue: Spike Peptides, Modified Spike Peptides (e.g., Spike Peptide with a PTM) and Spike Peptide Impurities (e.g., Spike Peptide with additional residue, truncation, etc.); peaks that did not conform to expectation are represented in red: NISTmAb Peptides, Unidentified Peaks, Contaminants. Peaks detected in the Unknown Sample did not conform to expectation and are represented in red without further categorization. One participant self-reported peaks in the Unknown Sample as false positives (represented in yellow); these were therefore counted as a conforming result. Each participant is represented by a unique symbol. This figure was adapted from Mouchahoir et al., 2021 (https://pubs.acs.org/doi/10.1021/jasms.0c00415), with permission from ACS Publications; further permissions related to this material should be directed to ACS.

The potential for MAM to be implemented in the QC space as a replacement for a number of single-attribute assays has piqued the interest of the biopharmaceutical industry, and members of the industry are currently working to develop the platform for such use. Naturally, the adoption of new platforms comes with inherent risk which can slow the implementation of new technologies. The MAM interlaboratory study was therefore established to aid industry members at the beginning stages of MAM development and to provide a survey of the current performance of MAM across the industry (Mouchahoir et al., 2021). The study used the NISTmAb as a model therapeutic-like protein to evaluate the instrumentation and software processing used for MAM, and here we discuss the portion of the study that evaluated the NPD function of MAM. This was the first such industry-wide study to evaluate the performance of MAM.

Study Design and Protocols. Twenty-eight participating laboratories were recruited from members of the MAM Consortium (www.mamconsortium.org) and included representation from the industry, government, and software and instrument vendors. Each participating laboratory received a “kit” with the necessary materials for the study. The kit included four tryptic digests of the NISTmAb: one digest acted as the reference (Reference Sample) against which the other digests were compared for NPD, a second digest contained an additional set of 15 synthetic peptides spiked in to mimic impurities (Spike Sample), and a third digest was prepared from a NISTmAb sample that had first undergone high pH stress (pH Stress Sample) to test the untargeted analysis of changes in PTM levels. The fourth digest (Unknown Sample) was the same sample as the Reference; however, its identity was not revealed to the participants, and it served as a negative control. The kit also included a vial containing 15 synthetic peptides (Calibration Sample) to gauge instrument performance across laboratories and a vial of 0.1% formic acid in water for use as a blank injection to prepare the column.

Each participant followed the same LC-MS method and used the same column for sample analysis, but instruments and software packages for data analysis varied. The Calibration Sample was injected a total of three times, interspersed at the beginning, middle, and end of the queue. Participants were asked to report the retention time, observed mass, and summed extracted ion chromatogram (XIC) area for each Calibration Peptide. Each of the NISTmAb digests was injected twice. The first injection was acquired in MS-only mode to be used for the NPD analysis itself, and the second was acquired in MS-MS mode to be used for confident identification of peptides. Participants were asked to use their standard MAM analysis platforms on these samples and report any peaks detected as new, missing, or changed in abundance in the Spike, pH Stress, and Unknown Samples when compared to the Reference Sample.

Results

Instrument Performance: Calibration Sample. ASTM Standard E691-18 (Standard Practice for Conducting an Interlaboratory Study to Determine the Precision of a Test Method) (ASTM, 2018) was used to evaluate the interlaboratory precision metrics of the retention times, mass accuracy, and fold-change values for each Calibration Peptide. Retention time repeatability standard deviations within each participating lab fell below 0.25 min, while reproducibility standard deviations between laboratories were measured between 1.4 min and 2.0 min. The larger variation in retention times between laboratories was expected due to the use of different LC systems. The high-resolution mass spectrometers used for the study achieved mass accuracy values of less than ±5 ppm, which is within the expected performance range for these instruments and is well within typical mass accuracy tolerance windows for database searching and NPD peak picking. Quantitative performance was measured by calculating the fold-change in abundance for each of the 15 peptides (i.e., ratio of a given peptide XIC from one injection to the XIC of the same peptide in another injection). Since the same volume of Calibration Sample was loaded onto the LC-MS system for each injection, the theoretical fold-change was 1 for each peptide. All but one of the absolute fold-change values for the Calibration Peptides averaged less than 1.26 with reproducibility standard deviations less than 0.35. Together, these performance metrics suggested that the instruments being used across the industry for MAM are performing within expected specifications.
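
The calibration metrics above (mass accuracy in ppm, fold-change between injections, and repeatability and reproducibility standard deviations) can be sketched in a few lines of Python. This is an illustrative sketch, not the study's analysis code: the function names are hypothetical, the "absolute" fold-change is assumed here to mean the larger XIC area divided by the smaller, and the precision calculation follows the commonly cited E691-style formulas for a balanced design with the same number of replicates in every laboratory.

```python
import math
from statistics import mean, stdev, variance


def ppm_error(observed_mass, theoretical_mass):
    """Signed mass accuracy in parts per million."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


def absolute_fold_change(xic_a, xic_b):
    """Fold-change between two injections as larger/smaller XIC area,
    so that 1.0 means identical abundance (assumed definition)."""
    return max(xic_a, xic_b) / min(xic_a, xic_b)


def precision_stats(results_by_lab):
    """Return (s_r, s_R) for a dict mapping lab -> list of replicate values.

    s_r: repeatability SD, pooled from the within-lab variances.
    s_R: reproducibility SD, combining within- and between-lab variation.
    """
    replicate_counts = {len(v) for v in results_by_lab.values()}
    if len(replicate_counts) != 1:
        raise ValueError("balanced design required: same n in every lab")
    n = replicate_counts.pop()

    within_variances = [variance(v) for v in results_by_lab.values()]
    lab_means = [mean(v) for v in results_by_lab.values()]

    s_r = math.sqrt(mean(within_variances))
    s_xbar = stdev(lab_means)  # SD of the laboratory averages
    s_R = math.sqrt(s_xbar ** 2 + s_r ** 2 * (n - 1) / n)
    return s_r, max(s_R, s_r)  # s_R is never reported below s_r
```

For example, passing the three retention times recorded by each laboratory for one Calibration Peptide gives that peptide's within-lab and between-lab standard deviations in minutes.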

New Peak Detection: Spike and Unknown Samples. Participants performed NPD on the Spike and Unknown Digests and reported any peaks that were new, missing, or changed in abundance (≥five-fold) when compared to the Reference Sample (Figure 2B). For the Spike Digests, fifteen of the participants conformed to expectation by detecting all 15 Spike Peptides as new peaks, with no additional new, missing or changed peaks reported (with the exception of synthetic impurities known to be inherent to the Spike Peptide mixture). Thirteen participants reported false positives (new, missing, or changed peaks that included NISTmAb peptides, unidentified peaks, and system contaminants), false negatives (fewer than fifteen Spike Peptides detected as new peaks), or both. Conformity to expectation for the Unknown Sample was met by sixteen participants who did not report any differences between the Unknown and Reference Samples. Peaks reported by non-conforming participants included NISTmAb peptides, unidentified peaks, keratin peptides (contaminants from the digestion process), a trypsin peptide, and, for one participant, carry-over of Spike Peptides.

The authors assigned a likely source for 92% of the non-conforming peaks for which a corresponding raw data file was available. This data evaluation showed the false positive peaks to be the result of 1) inadequacy of the column conditioning steps prescribed in the study protocol (causing large retention time shifts for a few NISTmAb peptides between the Reference and other samples and thereby interfering with peak alignment during software processing); 2) sample degradation (clipped peptides unique to four participants seemed to have been generated some time between shipment of the kit to the participants and injection onto the column); 3) system contamination (e.g., plasticizer, trifluoroacetic acid adducts); 4) instrument-induced artifacts (e.g., in-source fragmentation, metal adduction); 5) peak abundance (low signal not well-distinguished from background); and 6) group-wise comparison of all four samples (rather than individual Reference to Sample comparisons; limited to one participant). False negative results (i.e., Spike Peptides not reported as new peaks) were attributed to peak signal falling below the NPD threshold set during the peak picking process (i.e., distinguishing signal arising from true peptide peaks from noise), a value that was set according to each participant’s unique platform parameters.

New Peak Detection: pH Stress Samples. The degraded pH Stress Sample was expected to contain multiple new, missing, or changed peaks, but the complexity of this sample did not lend itself to providing the authors with a definitive profile of expected differences to be found when compared to the Reference Sample. To survey the pH Stress Sample results, the authors evaluated the consensus between peaks reported as new, missing, or changed by calculating the coincidence frequency (ω^c) (number of participants reporting a given peak) of each unique peak reported across laboratories and plotting the resulting values against the calculated peak coincidence population values [M(ω^c)] (number of peaks with the given coincidence frequency) (Hudgens et al., 2019a). The six highest ω^c values ranged from 18 to 26 participants, each with a corresponding M(ω^c) value of 1 peak (Supplementary Figure S1). There were no peaks achieving the maximum possible ω^c of 28 participants (i.e., no peak was reported by all participants). Processed NPD data files from one participant were available to aid our understanding of the low consensus values attained for the pH Stress Sample. These data showed incidences of new and changed peaks falling just below the participant’s NPD and fold-change thresholds, and co-elution with overlapping mass-to-charge ratios as the likely culprits.
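
The coincidence analysis can be expressed compactly in Python. The sketch below assumes that peaks reported by different participants have already been reconciled to shared identifiers (matching across laboratories is the harder step and is not shown); the function name and the input layout are hypothetical.

```python
from collections import Counter


def coincidence_population(reports):
    """Compute coincidence frequencies and their population.

    reports: dict mapping participant -> iterable of peak identifiers
             reported as new, missing, or changed.
    Returns (frequency, population) where
      frequency[peak] = number of participants reporting that peak, and
      population[k]   = number of peaks reported by exactly k participants.
    """
    frequency = Counter()
    for peaks in reports.values():
        # set() ensures a participant counts at most once per peak.
        frequency.update(set(peaks))
    population = Counter(frequency.values())
    return dict(frequency), dict(population)
```

Plotting the population against the frequency gives a consensus profile: full agreement among 28 participants would place every peak at a frequency of 28, whereas many peaks at low frequencies indicate that laboratories are reporting largely different peak lists.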

Learnings/Future Perspective

Evaluating the results of this interlaboratory study allowed the authors to gauge the performance of MAM across the industry and provide insights for improving the platform, especially for those in the beginning stages of developing their platforms.

The Calibration Sample provided a broad overview of instrument performance, with the results indicating that the instruments being used for MAM are performing within their expected ranges. Their performance, however, was not predictive of NPD performance in the other samples. The results of participants whose Spike and Unknown Sample analyses did not conform to expectation highlighted the importance of proper sample handling and instrument preparation (i.e., column conditioning or washing) to ensure the integrity of the results. While in these cases the participants rightfully detected the differences between the samples as new peaks, these “nuisance positives” can cost valuable time and resources in a real-world situation as they would necessitate a follow-up investigation. The false positives generated by low-abundance ions and the false negatives in the Spike Sample hint that NPD thresholds need to be carefully considered to strengthen the accuracy of the results. Investigation of the low consensus results for the pH Stress Sample also appears to have revealed FCD and NPD threshold settings as the primary source of differing results between participants. Although many participants used similar settings for their MAM evaluation, they did not all achieve the same results, indicating that there is no universal threshold that may be applied across all instrument models, software platforms, or even samples.

Perhaps the key takeaway message from this study is that NPD can perform well in identifying new, changed, and missing peaks in the hands of many laboratories. MAM-specific system suitability protocols and criteria that include system performance, sample handling, and NPD/FCD thresholds may improve the success metrics. For example, a product-specific standard spiked with peptides representative of the product’s known process impurities and/or degradants could be run in parallel with the reference and product test samples. Appropriate system suitability sample design would include an empirical determination of appropriate process/product-specific impurity spike peptide quantities and associated NPD and FCD thresholds, considered in conjunction with desired process- and product-specific performance criteria. For example, the thresholds may be set by first finding the threshold value that is low enough to detect all spiked peptides, then continuing to lower that value as far as possible without generating any false positives. That NPD threshold could then be challenged by comparing process/product-specific reference and “unknown” samples and confirming that no new peaks are detected. A passing system suitability result for the optimized sample would require all spiked peptides to be detected as new or changed peaks with no additional peaks reported.
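
The empirical threshold search described above can be sketched as follows. This is a simplified illustration under stated assumptions, not a validated system suitability procedure: the function name is hypothetical, each spiked peptide and each nuisance signal (for example, a peak seen when comparing a reference to an identical "unknown") is reduced to a single abundance value, and the candidate thresholds are supplied by the user.

```python
def select_npd_threshold(spike_signals, artifact_signals, candidates):
    """Return the lowest candidate NPD threshold that detects every spiked
    peptide while reporting no artifact peak, or None if none qualifies.

    spike_signals:    abundances of the spiked peptides (must be detected)
    artifact_signals: abundances of nuisance peaks (must not be reported)
    candidates:       threshold values to evaluate
    """
    passing = []
    for threshold in candidates:
        all_spikes_detected = all(s >= threshold for s in spike_signals)
        no_false_positives = all(a < threshold for a in artifact_signals)
        if all_spikes_detected and no_false_positives:
            passing.append(threshold)
    return min(passing) if passing else None
```

A return value of `None` signals that the weakest spiked peptide is no stronger than the strongest artifact, in which case no single threshold separates them and the spike quantities or the sample handling would need to be revisited.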

Overall, the MAM NPD interlaboratory study was a valuable way to understand the status of MAM throughout the industry, to identify potential pitfalls, and to provide guidance to users for improving their MAM methods. By taking the items discussed here into consideration, the authors of the study believe that MAM NPD will be found ready for widespread implementation across the industry.