Study participants
We used data from a multi-ethnic cohort of 19,082 female participants who had treatment in one of eleven clinics geographically distributed across the United Kingdom (nine) and Poland (two) between 2005 and 2023 (Table 2). We selected treatment-naive patients who had a transvaginal ultrasound scan presenting at least three follicles >10 mm on the same day as the DoT administration. This cohort was 18–49 years of age at the time of treatment with a median body mass index (BMI) of 24.17 kg/m² and antral follicle count of 15.00. Where available, an assessment of oocyte maturity grade detailing metaphase-II oocytes (n = 14,140 patients) was used as the primary outcome (i.e., in ICSI treatment cycles). Downstream outcomes such as the number of 2PN zygotes (n = 17,822 patient cycles), and high-quality blastocysts (n = 17,488 patient cycles), were also assessed. Where specific demographic information or clinical outcomes were unavailable in patients’ electronic health records, the affected patients were excluded from the respective analyses, as noted in Table 2.
Study approvals
Data included in this manuscript were obtained from a retrospective study carried out following the sponsorship of the institutional review board and approval by the Health Research Authority (23/HRA/2849). All subjects gave written informed consent in accordance with the Declaration of Helsinki and Good Clinical Practice. All ART clinics were under a license from the Human Fertilisation and Embryology Authority (UK) or the Ministry of Health (Poland).
In vitro fertilization protocol
This was a retrospective cohort study analyzing follicle and oocyte data from IVF or ICSI cycles. The objective was to identify the follicle sizes on the DoT that are most likely to yield mature oocytes and therefore provide a target for ovarian stimulation and expected oocyte number.
Patients at elevated risk of ovarian hyperstimulation syndrome (OHSS) or who are noted to have premature progesterone elevation are often advised to have their embryos cryopreserved pending a frozen embryo transfer (called “freeze-all”)3. A freeze-all strategy can mitigate the risk of reduced implantation due to premature progesterone elevation, albeit at the expense of increasing the time-to-pregnancy and the risk of perinatal complications, e.g., large-for-gestational-age babies14. For analysis of live birth in this study, women underwent either IVF or ICSI with fresh embryo transfer and had their final ultrasound scan to assess follicle size on the day of the DoT administration. Follicle growth was induced using daily preparations containing FSH activity. To prevent premature ovulation, patients underwent a suppressant protocol co-treated with either a gonadotropin-releasing hormone (GnRH) agonist (“long” protocol; n = 6990) or a GnRH antagonist (“short” protocol; n = 7408). Following this, a trigger of either human chorionic gonadotropin (hCG; n = 13,473) or GnRH agonist (n = 1675) was administered, generally once two to three follicles had reached 17 or 18 mm in diameter (at the clinician’s discretion).
Only the first recorded IVF cycle of the patient was used in all analyses (i.e., treatment-naive patients). Primary analysis was carried out on patients who had an ultrasound scan on the DoT. Laboratory results, including assessments of oocyte maturity, embryo quality, and blastocyst quality, were included as outcome measures. Subsequent analyses were carried out on patients with ICSI treatment only and ultrasound scans available on the penultimate (DoT-1; n = 10,457) or ante-penultimate (DoT-2; n = 9533) days before the DoT as further methodological validation.
Statistical analysis
Data pre-processing
The ultrasound scans performed during ovarian stimulation (OS) of IVF/ICSI treatment were used to obtain the follicle sizes for each patient. 2D scanning of the follicles provides diameter measurements in millimeters, which were recorded to integer precision by ultrasonographers in electronic health records. Follicles of the same diameter were therefore grouped and counted in 1 mm increments from 6 to 26 mm on the DoT. We used the counts at each follicle size as input variables:
$$\mathbf{X}=X_{6},\ldots,X_{26},\quad \text{where}\quad X_{6}=\text{number of follicles sized 6 mm},\; X_{7}=\text{number of follicles sized 7 mm},\; \text{and so on.}$$
(1)
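As an illustration of Eq. (1), the following minimal sketch builds the 21-element count vector from a list of integer follicle diameters. The function name and the example scan are hypothetical, and dropping follicles outside 6–26 mm is an assumption rather than a documented step of the study.

```python
from collections import Counter

MIN_MM, MAX_MM = 6, 26  # size range covered by the input features in Eq. (1)

def follicle_profile(diameters_mm):
    """Return [X_6, ..., X_26]: the number of follicles at each integer size."""
    # Assumption: follicles outside 6-26 mm are simply ignored.
    counts = Counter(d for d in diameters_mm if MIN_MM <= d <= MAX_MM)
    return [counts.get(size, 0) for size in range(MIN_MM, MAX_MM + 1)]

# Hypothetical DoT scan, diameters already recorded to integer precision.
scan = [12, 12, 14, 17, 17, 17, 18, 21, 5]
profile = follicle_profile(scan)

print(len(profile))      # number of features (one per mm from 6 to 26)
print(profile[17 - 6])   # X_17: follicles sized 17 mm
```

Indexing with `size - 6` maps a follicle size in millimeters to its position in the vector.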
For the outcome measure, the number of all oocytes (y_ooc), and more specifically the number of MII oocytes retrieved (the subset of oocytes capable of fertilization), was used in the regression models. To reduce the right skew of its distribution, the outcome was transformed using the natural logarithm:
$$y_{\mathrm{out}}=\ln (y_{\mathrm{ooc}}+1)$$
(2)
The inverse transformation (Eq. 3) was applied during model testing and evaluation. In additional analyses, the number of 2PN zygotes and the number of high-quality blastocysts were used as the outcome measure with the same transformation strategy. Since patients often undergo several ART cycles, we ensured that only the first recorded cycle per patient was included in the dataset, so as to guard against the intra-individual correlation that arises when information from earlier cycles informs decision-making in successive treatment attempts31.
$$y_{\mathrm{ooc}}=\exp (y_{\mathrm{out}})-1$$
(3)
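The pair of transformations in Eqs. (2) and (3) can be sketched as follows. The function names and example counts are hypothetical; `math.log1p` and `math.expm1` are used only because they compute ln(1 + x) and exp(x) − 1 directly.

```python
import math

def to_model_scale(y_ooc):
    """Eq. (2): y_out = ln(y_ooc + 1)."""
    return math.log1p(y_ooc)

def to_count_scale(y_out):
    """Eq. (3): y_ooc = exp(y_out) - 1."""
    return math.expm1(y_out)

# Hypothetical oocyte counts; zero is handled because of the +1 offset.
counts = [0, 3, 12, 40]
transformed = [to_model_scale(c) for c in counts]
recovered = [to_count_scale(t) for t in transformed]
```

Because the inverse is applied before evaluation, errors such as the MAE are expressed in oocytes rather than in log units.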
Model development and validation
Several histogram-based gradient boosting regression tree models were trained15. The implementation is open source and inspired by LightGBM (Microsoft), with much faster tree building through the use of histogram data structures. We employed a leave-one-clinic-out cross-validation (LOCO-CV) procedure to train, validate, and test the models (so-called “internal-external validation”29). The mean absolute error (MAE) was used as the objective function, and relevant hyperparameters were tuned by Bayesian optimization under nested LOCO-CV over the search space given in Supplementary Table 2. In each outer LOCO-CV fold, one of the eleven clinics is held out as an independent test set. Among the remaining ten clinics, one serves as a validation set for tuning hyperparameters and the other nine form the training set, rotating through all ten under ten-fold cross-validation. The MAE was chosen because it is less sensitive to outliers than squared-error losses and is directly interpretable: the absolute error is expressed in units of oocytes.
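The nested splitting scheme described above can be sketched in plain Python. This is only an outline of the fold structure under stated assumptions: the clinic labels are hypothetical, and the model fitting and Bayesian hyperparameter search that would run inside the loops are omitted.

```python
def nested_loco_splits(clinic_ids):
    """Yield (test_clinic, validation_clinic, training_clinics) triples.

    Outer loop: each clinic is held out once as the independent test set.
    Inner loop: each remaining clinic is held out once for validation,
    and the rest form the training set.
    """
    clinics = sorted(set(clinic_ids))
    for test in clinics:
        inner = [c for c in clinics if c != test]
        for val in inner:
            train = [c for c in inner if c != val]
            yield test, val, train

def mean_absolute_error(y_true, y_pred):
    """MAE; interpretable in oocytes once predictions are back-transformed."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels for the eleven clinics.
clinic_ids = [f"clinic_{i}" for i in range(1, 12)]
splits = list(nested_loco_splits(clinic_ids))
```

With eleven clinics this gives 11 outer folds of 10 inner folds each, and every training set contains nine clinics. One plausible use, stated here as an assumption, is to pick the hyperparameters with the lowest average validation MAE across the inner folds and then score the refitted model once on the held-out test clinic.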
Ten independent model pipelines were implemented with various output measures including oocytes, MII oocytes, 2PN zygotes, and high-quality blastocysts. Further model stratifications included age and IVF protocol type (“long” GnRH agonist or “short” antagonist). To ensure that the conclusive follicle size range was not impacted by using a subset of the patient cohort to analyze MII oocytes collected (where maturity grading was available), we trained models to compare the results for all oocytes collected in both cohorts (i.e., 19,082 patients of which 14,140 had ICSI treatment). Furthermore, to investigate the impact of outliers and potential aberrant data, we restricted the dataset in a separate model to cycles with 1–30 MII oocytes retrieved, and where the number of follicles on the DoT was at least equal to the number of MII oocytes retrieved (n = 11,819 patients).
Using the same LOCO-CV procedure, we incorporated further input variables of interest, such as age, BMI, days of stimulation, type of IVF protocol (“long” GnRH agonist or “short” GnRH antagonist), estradiol on the DoT, and the type of trigger administered (hCG or GnRH agonist), to observe whether the predictive capability of the mature oocytes model improved. We assessed whether the MAE and R² improved notably relative to using only the follicle counts at each size on the DoT as input.
Similarly, to observe any notable impact in the trade-off between model complexity and explainability, we modeled the primary outcome of MII oocytes (n = 14,140 patients) using a multilayer perceptron model (a shallow artificial neural network) and reported its MAE.
Identifying the most contributory follicle sizes
Explainability is a current priority in ART, and clinicians generally prefer to avoid black-box treatment recommendations2,36. Ensemble methods therefore offer a valuable trade-off: they handle non-linear and complex underlying data while still providing explainable insights. Once each model had been trained and validated (using up to ten folds in total), we computed the permutation importance of each feature on the eleventh, independent test set in the LOCO-CV protocol and determined its mean and standard deviation across five runs37. In this paradigm, the values of one feature at a time are randomly shuffled to measure the resulting change in model loss (here, the MAE). To identify the key follicles that yield mature oocytes, we used a threshold of ≥50% normalized contribution to the model to indicate relative importance; as described in previous literature, we hypothesized this follicle size range of utmost utility to be contiguous5,10. To establish further insights from the data, we also analyzed patients who had a final ultrasound scan on the penultimate (n = 10,457) and ante-penultimate (n = 9533) days prior to the DoT administration. We used these patient cohorts to identify whether a step-wise trajectory in the size of important follicles was observed.
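The permutation procedure can be sketched as below, using a toy stand-in model rather than the trained boosting model. The normalization, which divides each mean importance by the largest one, is an assumption, since the text does not specify how the ≥50% contribution was normalized. All names and data are hypothetical.

```python
import random
from statistics import mean, stdev

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean and SD of the MAE increase when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mae(y, [predict(row) for row in X])
    results = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mae(y, [predict(r) for r in X_perm]) - baseline)
        results.append((mean(increases), stdev(increases)))
    return results

def important_features(importances, threshold=0.5):
    """Indices with mean importance >= threshold x the largest mean (assumed rule)."""
    top = max(m for m, _ in importances)
    return [j for j, (m, _) in enumerate(importances) if top > 0 and m / top >= threshold]

# Toy stand-in model: strong effect of feature 0, weak of feature 1, none of feature 2.
def predict(row):
    return 1.0 * row[0] + 0.1 * row[1]

X = [[i, (i * 7) % 5, (i * 3) % 4] for i in range(20)]
y = [predict(row) for row in X]
importances = permutation_importance(predict, X, y)
selected = important_features(importances)
```

In the study's setting, each feature index would correspond to one follicle size, so the selected indices would map back to a size range in millimeters.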
Predictive capabilities and model explainability
The Shapley Additive exPlanations (“SHAP”) package, an alternative explainability method grounded in game theory38, was used to provide a further interpretation perspective and reinforce our findings from the permutation importance analysis. As opposed to observing changes in model loss, the SHAP paradigm considers the coalition of features to estimate the contribution of each feature towards the predicted value, which can be positive or negative, in units of the loss function (i.e., unit MII oocytes). The “TreeSHAP” package is optimized for tree-based models and approximates the marginal expectation of the outcome and the contribution of each feature39.
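For reference, the Shapley value that SHAP attributes to feature i can be written as follows. The notation is ours rather than the study's: F is the set of input features (here, the follicle counts X_6 to X_26), S ranges over subsets of F that exclude i, and v(S) is the expected model output when only the features in S are known.

$$\phi_{i}=\sum_{S\subseteq F\setminus \{i\}}\frac{|S|!\,(|F|-|S|-1)!}{|F|!}\left[v(S\cup \{i\})-v(S)\right]$$

The contributions are additive, so for an input x the prediction decomposes into a baseline plus one term per feature:

$$f(x)=\phi_{0}+\sum_{i\in F}\phi_{i}$$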
Separately, to identify which single follicle range was most predictive of mature oocytes, we evaluated the predictive ability of specific size ranges using univariable linear regression with LOCO-CV across the eleven clinics (n = 14,140). Then, to identify which range was most predictive of the number of mature oocytes retrieved, we compared all possible follicle size ranges using the same method, optimized for MAE as the loss function.
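The exhaustive comparison of size ranges can be sketched as follows, assuming each patient is represented by the count vector of Eq. (1) and that the single predictor for a range is the total number of follicles within it. Whether the outcome was log-transformed in this analysis is not stated, so the sketch leaves it untransformed. The data are synthetic and constructed so that one range is perfectly predictive.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx if sxx else 0.0
    return my - b * mx, b

def loco_mae(x, y, clinics):
    """Pooled MAE over held-out clinics, refitting the line for each one."""
    errors = []
    for held_out in sorted(set(clinics)):
        train = [i for i, c in enumerate(clinics) if c != held_out]
        test = [i for i, c in enumerate(clinics) if c == held_out]
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        errors += [abs(y[i] - (a + b * x[i])) for i in test]
    return sum(errors) / len(errors)

def best_size_range(profiles, y, clinics, min_mm=6, max_mm=26):
    """Try every contiguous range [lo, hi] in mm; return (MAE, lo, hi) with lowest MAE."""
    best = None
    for lo in range(min_mm, max_mm + 1):
        for hi in range(lo, max_mm + 1):
            x = [sum(p[lo - min_mm: hi - min_mm + 1]) for p in profiles]
            score = loco_mae(x, y, clinics)
            if best is None or score < best[0]:
                best = (score, lo, hi)
    return best

# Synthetic data: 60 patients in 3 clinics; the outcome equals the number of
# follicles in an arbitrary range (13-18 mm here) chosen only for the demo.
rng = random.Random(1)
profiles = [[rng.randint(0, 3) for _ in range(21)] for _ in range(60)]
y = [sum(p[13 - 6: 18 - 6 + 1]) for p in profiles]
clinics = [i % 3 for i in range(60)]
best = best_size_range(profiles, y, clinics)
```

With sizes from 6 to 26 mm there are 231 contiguous ranges, so the full search stays cheap even with a refit per clinic.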
Determining improvements in mature oocyte yield
We considered the cohort of patients who had ICSI (n = 14,140) where the number of MII oocytes was recorded to compare the threshold-based criteria currently used in clinical practice, and a proposed approach based on maximizing the proportion of follicles within the optimal size range. For each of the four variations of the typically used threshold-based criteria to determine the DoT administration (i.e., two or three follicles greater than size 17 or 18 mm), we assessed whether each patient cycle had fulfilled this criterion or not, and grouped them accordingly. We compared the relative difference in medians of mature oocyte yield (number of mature oocytes divided by the total follicle count on the DoT) in these two groups of patients and compared the subgroups using the Mann-Whitney U-test (Fig. 4a).
For the range-based criteria, we ran the same analysis at different minimum cut-offs of the percentage of follicles within that follicle size range on the DoT (e.g., ≥5%, ≥10%, and so on), to examine any improvement in mature oocyte yield when maximizing the follicle sizes in this range (Fig. 4b). All statistical comparisons were carried out using the two-sided Mann-Whitney U-test.
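The two grouping rules can be sketched as below. Several details are assumptions: "greater than" is read as ≥, the group not meeting the criterion is taken as the reference for the relative difference, and the range bounds of 12–19 mm are placeholders rather than the study's result. The cycles are invented.

```python
from statistics import median

MIN_MM, MAX_MM = 6, 26

def profile_from(diameters_mm):
    """Counts of follicles at each integer size from 6 to 26 mm."""
    return [sum(1 for d in diameters_mm if d == s) for s in range(MIN_MM, MAX_MM + 1)]

def meets_threshold(profile, n_follicles, size_mm):
    """Threshold rule: at least n_follicles follicles of size_mm or larger (assumed >=)."""
    return sum(profile[size_mm - MIN_MM:]) >= n_follicles

def fraction_in_range(profile, lo_mm, hi_mm):
    """Share of DoT follicles whose size lies within [lo_mm, hi_mm]."""
    total = sum(profile)
    return sum(profile[lo_mm - MIN_MM: hi_mm - MIN_MM + 1]) / total if total else 0.0

def mature_yield(n_mii, profile):
    """Mature oocytes divided by the total follicle count on the DoT."""
    total = sum(profile)
    return n_mii / total if total else 0.0

def relative_median_difference(group, reference):
    return (median(group) - median(reference)) / median(reference)

# Hypothetical cycles: (follicle diameters on the DoT, MII oocytes retrieved).
cycles = [
    ([12, 13, 14, 15, 17, 18], 5),
    ([10, 11, 12, 13], 1),
    ([17, 18, 19, 20, 22], 3),
    ([8, 9, 14, 15], 2),
]
met, not_met = [], []
for diameters, n_mii in cycles:
    profile = profile_from(diameters)
    group = met if meets_threshold(profile, n_follicles=2, size_mm=17) else not_met
    group.append(mature_yield(n_mii, profile))

rel_diff = relative_median_difference(met, not_met)
share = fraction_in_range(profile_from(cycles[0][0]), lo_mm=12, hi_mm=19)
```

For the range-based rule, the same grouping would be repeated with `fraction_in_range(...) >= cutoff` for each cut-off. The two yield lists could then be passed to a two-sided Mann-Whitney U-test, for example `scipy.stats.mannwhitneyu(met, not_met, alternative="two-sided")`.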
Associations with live birth rates
To determine any associations between the proposed follicle size and live birth rate (LBR), we used logistic regression on all data with LBR recorded and input variables (n = 9843) including the percentage of follicles in the proposed follicle size range, total follicle count on the DoT, age at the time of treatment, and type of trigger administered (hCG or GnRH agonist). We then repeated this by replacing the follicle size range input variable with the mean follicle size on the DoT follicle profile. We utilized 100 bootstrapped simulations to determine a 95% confidence interval (CI) for partial dependence (marginal contribution) of each variable, highlighting any statistical significance and their associations with LBR.
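The resampling step can be illustrated with a percentile bootstrap. This sketch substitutes a simple statistic (the live birth proportion) for the partial dependence of the logistic regression, which is not reproduced here. The data and function names are hypothetical, and the index-based percentile is a simplification.

```python
import random

def bootstrap_ci(rows, statistic, n_boot=100, alpha=0.05, seed=0):
    """Percentile bootstrap interval for statistic(rows), resampling rows with replacement."""
    rng = random.Random(seed)
    n = len(rows)
    estimates = sorted(
        statistic([rows[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lower = estimates[int((alpha / 2) * (n_boot - 1))]
    upper = estimates[int((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper

def live_birth_rate(outcomes):
    return sum(outcomes) / len(outcomes)

# Hypothetical outcomes: 1 = live birth, 0 = no live birth.
outcomes = [1] * 60 + [0] * 140
ci_low, ci_high = bootstrap_ci(outcomes, live_birth_rate)
```

In the analysis described above, `statistic` would instead refit the logistic regression on each resample and return the partial dependence of one variable at a fixed value.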
Further, we plotted the mean and 95% CI of the mature oocyte yield (n = 646) and LBR (n = 427) according to serum progesterone levels (nmol/L) on the DoT in 1 nmol/L increments (Fig. 5c). We compared serum progesterone on the DoT (n = 994) according to the number of follicles sized larger than the proposed optimal follicle size range (Fig. 5d). Progesterone levels in each group were compared with those of patients with fewer than two larger follicles using Dunnett’s multiple comparison test with adjusted p values.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.