Deep learning for oncologic treatment outcomes and endpoints evaluation from CT scans in liver cancer
Cohort construction and ethics
Cohort A has the official title of “RATIONALE-208: A Phase 2, Open-label, Multicenter Study to Investigate the Efficacy, Safety, and Pharmacokinetics of the Anti-PD-1 Monoclonal Antibody BGB-A317 in Patients with Previously Treated Unresectable Hepatocellular Carcinoma” (NCT03419897). Cohort B has the official title of “A Single-Arm, Multicenter Phase 2 Study of BGB-A317 in Patients with Previously Treated PD-L1+ Locally Advanced or Metastatic Urothelial Bladder Cancer” (NCT04004221). Detailed information on both trials is available at ClinicalTrials.gov. Cohort C was recruited from XinHua Hospital Affiliated to Shanghai Jiao Tong University; it consists of liver cancer patients treated with lenvatinib, bevacizumab, camrelizumab, or tislelizumab from 2017 to 2022. The study data were fully de-identified, and patients’ confidential information was removed; the study was therefore considered exempt from informed consent of the study participants. Data in Cohorts A and B passed ethics review (application number: I2021173I) and were approved by the Human Genetic Resource Administration of China (approval number: [2021] GH5565). Data in Cohort C were approved by the Ethics Committee of XinHua Hospital (Approval No. XHEC-D-2023-081). The study was conducted in accordance with the Declaration of Helsinki.
Image processing
All patients underwent multi-slice helical CT to obtain portal venous phase scans on various scanners. For Cohorts A and B, the scanner types included Philips Brilliance ICT 256/64, GE LightSpeed VCT, Discovery CT750 HD, Ingenuity CT, Revolution HD, Revolution EVO/GSI, Optima CT 540/660/680, Aquilion Prime, Aquilion One, BrightSpeed, uCT 530/550/760/780, and IQon Spectral CT. The liver was imaged using tri-phasic scans (i.e., late arterial phase, portal venous phase, and delayed/equilibrium phase). The methodology was consistent across visits for each subject (i.e., phases acquired, timing of each phase, etc.). Scanning parameters were set by each site. For Cohort C, CT scans were performed with a 64-channel MDCT scanner (Somatom Definition, Siemens, Forchheim, Germany). Each patient fasted for more than four hours before CT scanning. The CT parameters were as follows: detector collimation, 1 mm; pitch, 0.9; gantry rotation time, 0.5 s; tube voltage, 120 kV; tube current, 240 mA; matrix, 512 × 512; slice thickness, 5 mm; and reconstruction interval, 1 mm. The portal venous phase was acquired at a fixed 65-s delay following intravenous injection of 100 mL of ionic contrast material at a rate of 3 mL/s using an automatic injector, with scan coverage from the diaphragm to the iliac crest.
We converted raw DICOM series to NIfTI format, normalized intensities to [−200, 200], and resampled using SimpleITK40 in Python. The original resolutions varied and were resampled to 0.8 mm × 0.8 mm × 5 mm. Data preprocessing is shown in Supplementary Fig. 14.
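A minimal sketch of this preprocessing with SimpleITK is given below; the function name, the B-spline interpolator, and the clamp-then-resample order are illustrative assumptions rather than the study's exact pipeline.

```python
import SimpleITK as sitk

def preprocess_series(dicom_dir, out_path,
                      spacing=(0.8, 0.8, 5.0), window=(-200, 200)):
    """Read a DICOM series, clip intensities to [-200, 200],
    resample to 0.8 x 0.8 x 5 mm, and write a NIfTI file."""
    reader = sitk.ImageSeriesReader()
    reader.SetFileNames(reader.GetGDCMSeriesFileNames(dicom_dir))
    image = reader.Execute()
    image = sitk.Clamp(image, lowerBound=window[0], upperBound=window[1])
    # Output grid size that preserves the physical extent of the volume.
    size = [int(round(sz * sp / new_sp)) for sz, sp, new_sp
            in zip(image.GetSize(), image.GetSpacing(), spacing)]
    image = sitk.Resample(image, size, sitk.Transform(), sitk.sitkBSpline,
                          image.GetOrigin(), spacing, image.GetDirection(),
                          window[0], image.GetPixelID())
    sitk.WriteImage(image, out_path)  # e.g., "scan.nii.gz"
```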
The ground truth 3D segmentation of each CT scan was performed blindly by two oncologists (each with 3 years of experience) using the ITK-SNAP software (version 3.8.2). Only liver lesions were included and annotated. New lesions on follow-up scans were labeled separately (tumor masks of longitudinal lesions were marked as 1, and new lesions as 2). During the annotation process, the two oncologists considered the patient’s baseline condition and treatment history. All cases were then handed over to a third oncologist with 11 years of experience, who primarily selected one of the two oncologists’ results and, where necessary, modified (re-annotated) the mask based on one of the segmentation results. The treatment outcome assessment results were derived from the annotated tumor masks (see ‘Definition of objective response outcomes’). All treatment response outcomes were reviewed by a fourth expert in medical image processing based on 3D CT scans of the liver region.
Inter-rater variability was assessed. The DICE coefficient between the initial segmentations of the two assessors was 0.702 [SD 0.233]. The Fleiss intraclass correlation coefficient41 (ICC [2,1]) for tumor volumes initially segmented by the two assessors was 0.924 (95% CI: 0.871, 0.964). For the treatment response outcome between the two initial assessors, the ICC was 0.708 (95% CI: 0.662, 0.745), and the ordinal Krippendorff’s alpha42 was 0.785 (95% CI: 0.750, 0.820). After the third rater selected the more appropriate segmentation results, the treatment outcome results were calculated. The fourth rater checked the treatment outcome results and reported no adjustments. The agreements between the final treatment assessment results and the initial results from the first two raters were also calculated. For the first rater, the ICC was 0.826 (95% CI: 0.794, 0.858), and Krippendorff’s alpha was 0.880 (95% CI: 0.853, 0.904). For the second rater, the ICC was 0.909 (95% CI: 0.885, 0.931), and Krippendorff’s alpha was 0.939 (95% CI: 0.921, 0.955).
Training data split
For model training and internal validation, we used five-fold cross-validation. We randomly divided Cohort A1 into five folds by patient. Four folds each contained 22 patients (with 116, 120, 121, and 109 CT scans, respectively), and the fifth fold contained 23 patients with 116 CT scans.
Definition of objective response outcomes
The treatment outcome can be assessed by comparing each follow-up scan to the baseline scan. The outcome evaluation considers both the change in tumor burden and the appearance of any new lesions. The sum of the volumes (SOV) is used to reflect the change in tumor burden. A mapping function was designed to transform the SOVs of liver lesions at baseline and follow-up into the probabilities of three response outcome classes.
The designed mapping function can be written as a combination of relative SOV change and absolute SOV change. The ground truth of SOV-based tumor status classification was generated by \(\arg\max {(p_{PR},p_{SD},p_{PD})}^T\). We visualized the mapping function by plotting a series of baseline and follow-up SOVs to show the change in three-class probability (Supplementary Fig. 15).
$$\left(\begin{array}{c}p_{PR}\\ p_{SD}\\ p_{PD}\end{array}\right)=w\cdot\left(\begin{array}{c}\sigma\left(C_1\cdot\left(r_{PR}-\tfrac{V_2}{V_1}\right)\right)\\ \sigma\left(C_1\cdot\left(r_{PD}-\tfrac{V_2}{V_1}\right)\right)-\sigma\left(C_1\cdot\left(r_{PR}-\tfrac{V_2}{V_1}\right)\right)\\ 1-\sigma\left(C_1\cdot\left(r_{PD}-\tfrac{V_2}{V_1}\right)\right)\end{array}\right)+\left(1-w\right)\cdot\left(0,1,0\right)^T$$
(1)
where,
- \(r_{PR}\) and \(r_{PD}\) are extrapolated from 1D RECIST to 3D according to our dataset; we set \(r_{PR}=0.65\) and \(r_{PD}=1.44\). The calculation of these two critical values is described in “Validations on the designed mapping function”;
- \(C_i\) are constants; we set \(C_1=15\), \(C_2=5\), and \(C_3=0.1\);
- \(\sigma\) is the sigmoid function, which maps \((-\infty,+\infty)\) to \((0,1)\);
- \(V_1\) is the baseline SOV of all liver lesions, and \(V_2\) is the follow-up SOV of all liver lesions;
- \(w\) is the trade-off between relative and absolute volume change, \(w=\sigma\left(C_2\cdot \log\left(\tfrac{\left|V_2-V_1\right|}{V}\right)+C_3\cdot \max\left(\tfrac{V_2}{V},\,\tfrac{V_1}{V}\right)\right)\), so that the second term of Eq. (1) takes effect only when both the baseline and follow-up lesions are small. \(V\) is the absolute volume required for progression; we set \(V=400\). These terms are in line with the RECIST guideline, in which a change in diameter of less than 5 mm is deemed stable disease;
- when the SOV is not particularly small (i.e., the baseline and follow-up SOVs are not both below ~1 cm³), the contribution of the second term can be neglected, and the argmax of the three-class probability vector is equivalent to: a 44% increase indicates progression, and a 35% decrease indicates response (a code sketch of this mapping follows the list below).
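To make the mapping concrete, the following Python sketch implements Eq. (1) under the definitions above; the reconstruction of the \(w\) formula (in particular the \(\left|V_2-V_1\right|/V\) term) and the small epsilon guard are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def response_probabilities(v1, v2, r_pr=0.65, r_pd=1.44,
                           c1=15.0, c2=5.0, c3=0.1, v_abs=400.0):
    """Map baseline SOV v1 and follow-up SOV v2 to (p_PR, p_SD, p_PD), Eq. (1)."""
    ratio = v2 / v1
    s_pr = sigmoid(c1 * (r_pr - ratio))
    s_pd = sigmoid(c1 * (r_pd - ratio))
    relative = np.array([s_pr, s_pd - s_pr, 1.0 - s_pd])
    # Trade-off w -> 1 for large volumes/changes (relative change dominates),
    # w -> 0 when both lesions are small (prior toward stable disease).
    w = sigmoid(c2 * np.log(abs(v2 - v1) / v_abs + 1e-8)
                + c3 * max(v2 / v_abs, v1 / v_abs))
    return w * relative + (1.0 - w) * np.array([0.0, 1.0, 0.0])

# Example: a 50% SOV increase on a large tumor burden maps to PD.
probs = response_probabilities(v1=20000.0, v2=30000.0)
outcome = ("PR", "SD", "PD")[int(np.argmax(probs))]  # -> "PD"
```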
New lesions were evaluated by radiologists. The ground truth for the treatment response evaluation integrated the results of both tumor burden classification and new lesion identification, following the rule shown in Supplementary Table 1.
Validations on the designed mapping function
The mapping function we designed referenced existing 3D volumetric lesion measurement methods. In the existing literature43,44, there are two different thresholds for progression and response evaluation: one assuming the lesions are spherical (in which case progression is defined as a 73% or more increase in volume, and response as a 65% or more decrease; since volume scales with the cube of the diameter, a 20% diameter increase gives 1.2³ ≈ 1.73 and a 30% decrease gives 0.7³ ≈ 0.34), and the other assuming the lesions are ellipsoidal (in which case progression is defined as a 20% or more increase in volume, and response as a 35% or more decrease). In actual data, however, lesions may not be ideally spherical or ellipsoidal, so using either threshold could introduce potential biases. Consequently, we aimed to establish the critical volume change thresholds for progression and response by calculating the average volume ratio change of single lesions whose diameter increased by approximately 20% or decreased by approximately 30% in our dataset. Manual diameter measurements were available for Cohort A, which was therefore used for mapping function modeling and validation. Specifically, we selected single lesions in Cohort A1 with a diameter reduction of 25–35% and calculated an average volume change ratio of 0.6508 ± 0.1051. Similarly, lesions with a diameter increase of 15–25% yielded an average volume change ratio of 1.4403 ± 0.2282. Validation in Cohort A2 showed average volume change ratios of 0.6398 ± 0.0531 and 1.4639 ± 0.1379 for response and progression, respectively. The similar volume change ratios in Cohorts A1 and A2 indicate the potential generalizability of the critical values we set.
With the selected thresholds for progression and response, our mapping function was further designed with smoothing for the three-class classification problem. In the scenario where the tumor volume increases by exactly 44%, the probabilities of stability and progression are equal; when the tumor volume increases by more than 44%, the treatment evaluation result is disease progression. The same logic applies to the response.
We compared automatically calculated response outcomes from human-annotated tumor masks to fully human-assessed RECIST v1.1 on Cohort A. The two measures may differ because the former considers all tumors while the latter considers only the two largest tumors. The automatically calculated response achieved 81.2% agreement with RECIST v1.1 for response outcomes (Supplementary Fig. 16).
RECORD architecture
The tumor segmentation step adopts both the CNN-based nnU-Net and the ViT-based Swin-Unetr.
For deformable registration, we re-implemented Advanced Normalization Tools (ANTs)45 in PyTorch to make deformable transformations differentiable. Both forward and inverse displacement fields for follow-up images were generated by ANTs. Since tumor burden is evaluated based on volume, the inverse consistency of the registration process with respect to volume was verified (Supplementary Fig. 17).
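For illustration, the core of such a differentiable warp can be expressed with PyTorch's `grid_sample`; this sketch assumes displacement fields stored in voxel units with channel order (dz, dy, dx), which may differ from the actual re-implementation.

```python
import torch
import torch.nn.functional as F

def warp_volume(moving, disp, mode="bilinear"):
    """Warp a 3D volume (N, C, D, H, W) with a dense displacement field
    disp (N, 3, D, H, W) given in voxel units, channel order (dz, dy, dx).
    Fully differentiable, so gradients can flow through the registration."""
    n, _, d, h, w = moving.shape
    # Identity sampling grid in normalized [-1, 1] coordinates, (x, y, z) order.
    zs = torch.linspace(-1, 1, d, device=moving.device)
    ys = torch.linspace(-1, 1, h, device=moving.device)
    xs = torch.linspace(-1, 1, w, device=moving.device)
    gz, gy, gx = torch.meshgrid(zs, ys, xs, indexing="ij")
    identity = torch.stack((gx, gy, gz), dim=-1).unsqueeze(0).expand(n, -1, -1, -1, -1)
    # Convert voxel displacements to normalized units and reorder to (dx, dy, dz).
    scale = torch.tensor([2.0 / max(w - 1, 1), 2.0 / max(h - 1, 1), 2.0 / max(d - 1, 1)],
                         device=moving.device)
    disp_norm = disp.permute(0, 2, 3, 4, 1)[..., [2, 1, 0]] * scale
    return F.grid_sample(moving, identity + disp_norm, mode=mode,
                         align_corners=True)  # use mode="nearest" for masks
```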
The workflow for difference map generation is described in Supplementary Algorithm 2. After registration of the predicted masks, we subtracted the baseline mask from the follow-up mask and obtained differences at the single-lesion level by 3D connected component analysis. We then marked any extremely large differences as potential false positives and checked whether they represented actual disease progression or model mispredictions, using correlated information from the other serial CT scans. In addition, a liver mask, predicted as the union of “livermask”46 and a nnU-Net pretrained on an abdominal organ segmentation task47, was applied to remove lesions located outside the liver, further guiding the optimization model to learn relevant features from the original predicted tumor mask. The following paragraph provides more implementation details.
The difference map generation involves three key steps: checking suspected abnormally large lesions (Supplementary Algorithm 2, lines 11–14), checking abnormal growth trends (Supplementary Algorithm 2, lines 15–24), and removing lesions outside the liver (Supplementary Algorithm 2, line 27). The first step, checking abnormally large lesions, is a rough selection of lesions that changed drastically between the baseline and follow-up CT scans. To distinguish natural disease progression from a misprediction by the segmentation models, the variation between the predicted second-order lesion masks is then calculated. Second-order lesion masks refer to the differential of the differences between predicted tumor masks at two consecutive time points. A larger variation of a particular lesion is more likely to indicate a misprediction, as tumors have a low probability of shrinking or expanding significantly in a short period. A lesion with extremely large second-order variation is removed from the difference map. Finally, the liver mask is used to remove any predicted lesions located outside the liver. The thresholds for these steps are set loosely, in order to retain as much original information as possible and remove only extreme cases of misleading segmentation predictions.
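A simplified sketch of these steps follows; the voxel-count threshold is illustrative, and the second-order growth-trend check across serial scans is omitted for brevity.

```python
import numpy as np
from scipy import ndimage

def difference_map(baseline_mask, followup_mask_reg, liver_mask,
                   max_voxels=50000):
    """Subtract binary masks, split the differences into connected components,
    drop suspiciously large changes, and zero out anything outside the liver."""
    diff = followup_mask_reg.astype(np.int8) - baseline_mask.astype(np.int8)
    labels, n = ndimage.label(diff != 0)  # 3D connected component analysis
    out = np.zeros_like(diff)
    for lesion in range(1, n + 1):
        component = labels == lesion
        if component.sum() > max_voxels:
            continue  # suspected false positive; verify against serial scans
        out[component] = diff[component]
    out[liver_mask == 0] = 0  # remove differences outside the liver
    return out
```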
The SOV-based treatment response classification model was built on a multi-task learning framework that considers both the accuracy of lesion segmentation and progression classification (Fig. 3B, upper branch, blue dotted line, \(L_{dice}\) and \(L_{progression}\)). The whole procedure is described in Supplementary Algorithm 1. The model received the baseline and follow-up tumor masks predicted by the segmentation model (i.e., the CNN-based nnU-Net or the ViT-based Swin-Unetr) as well as the difference map generated from the predicted masks. The concatenation of the baseline mask, the generated difference map, and the registered follow-up mask was fed into an encoder-decoder architecture with two output heads: a segmentation head and a classification head. For the segmentation head, traditional DICE and cross-entropy losses were used to ensure accurate tumor segmentations. For the classification head, an ordinal regression approach was used to transform the SOVs of the baseline and follow-up tumor masks into a three-class probability vector. During the decision-making stage, the argmax of the predicted probability vector determined whether the treatment outcome predicted by the model was response, stability, or progression. For the encoder-decoder structure, we utilized the Squeeze-and-Excitation U-Net (Se-Unet), which incorporates squeeze-and-excitation (SE) blocks48 into the 3D U-Net49 architecture. To balance computational complexity and avoid overfitting, three encoder-decoder blocks and four bottleneck blocks were used.
For multi-task learning, the model was optimized with gradients from both the segmentation and classification objectives. We designed the loss function as a weighted sum of the segmentation loss (DICE loss and cross-entropy loss) and the progression loss (ordinal regression).
$$L=L_{seg}+w\cdot L_{progression},\quad w=0.55$$
(2)
$$L_{seg}=L_{seg\_baseline}+L_{seg\_followup}$$
(3)
$$L_{progression}=-y\cdot \ln \hat{y},\quad \hat{y}_i=\frac{e^{-x_i}}{\sum_{k=1}^{3}e^{-x_k}},\quad i=1,2,3$$
(4)
where \(w\) is the trade-off between the two tasks. Based on our experiments, we recommend setting \(w\) between 0.4 and 0.7; we found that \(w=0.55\) achieved the best performance. \(L_{seg}\) represents the segmentation loss for both the baseline and follow-up images. \(L_{progression}\) represents the progression loss, which reflects the classification accuracy of the response evaluations. It is crucial to highlight that the ground truth label \(y\) is designed to be continuous during model optimization, without the argmax typically employed in traditional classification problems. Because there is an order among response, stable disease, and progression, treating this as an ordinal regression problem rather than a traditional classification problem gives more tolerance to corner cases.
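A compact sketch of Eqs. (2)–(4) is shown below, using MONAI's `DiceCELoss` for the segmentation terms; the standard (non-negated) softmax used here is an equivalent reparameterization of Eq. (4), and the tensor layouts are assumptions.

```python
import torch
from monai.losses import DiceCELoss  # DICE + cross-entropy in one objective

dice_ce = DiceCELoss(softmax=True, to_onehot_y=True)

def record_loss(baseline_logits, baseline_gt, followup_logits, followup_gt,
                cls_logits, soft_target, w=0.55):
    """L = L_seg + w * L_progression, with a soft-label progression term."""
    l_seg = (dice_ce(baseline_logits, baseline_gt)
             + dice_ce(followup_logits, followup_gt))       # Eq. (3)
    log_probs = torch.log_softmax(cls_logits, dim=-1)
    l_prog = -(soft_target * log_probs).sum(dim=-1).mean()  # Eq. (4)
    return l_seg + w * l_prog                               # Eq. (2)
```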
The SOV-based treatment response classification model was implemented using MONAI 0.9.1, a PyTorch-based repository. To handle the computational load of 3D image pairings, we adopted data parallelism and used a sliding-window strategy to distribute ROI patch computation across four GPUs.
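For reference, MONAI's sliding-window inference can be invoked as follows; `volume` and `model` denote the preprocessed input and trained network, and the ROI size, overlap, and batch size are illustrative, not the study's exact configuration.

```python
import torch
from monai.inferers import sliding_window_inference

with torch.no_grad():
    logits = sliding_window_inference(
        inputs=volume,          # (1, C, D, H, W) CT tensor
        roi_size=(96, 96, 96),  # ROI patch size distributed across GPUs
        sw_batch_size=4,
        predictor=model,
        overlap=0.5,
    )
```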
Explanations on the design of the ordinal regression-based loss \(L_{progression}\)
The loss function for response outcome classification is based on ordinal regression, modified from traditional categorical classification to align with the continuous disease progression process. Traditional cross-entropy loss treats the ground truth label as three one-hot classes, while our modified loss function converts the discrete three-class label into a probability vector. The nature of a treatment outcome is a continuous quantification of the change ratio of tumor burden, rather than a discrete value determined by several cut-offs. Specifically, thresholds between disease statuses are manually set to derive the results from the baseline-to-follow-up SOV change ratio, with a given percentage of SOV decrease denoting PR (35% in our scenario) and a given percentage of SOV increase denoting PD (44% in our scenario). One-hot discrete labels can cause information loss. For instance, clinicians would consider 43% and 45% increases in SOV to differ little in disease progression status. However, under categorical classification, the former would be one-hot encoded as (0, 1, 0) (SD) while the latter would be (0, 0, 1) (PD), leading to totally different ground truth response outcomes. The simplest way to avoid information loss is to directly predict the continuous tumor burden change ratio. However, it is also inappropriate to treat the problem as a traditional regression task with a standard mean squared error (MSE) loss, since the tumor burden change ratio exhibits boundary effects. For example, extreme change ratios such as 20-fold versus 30-fold may both indicate disease progression, yet they cause a large loss during model optimization under MSE. Conversely, small change ratios such as 0.5 (shrinkage) and 2 (expansion) can result in vastly different response evaluations, but their contribution to the loss would be smaller than in the former scenario. To sum up, we carefully designed a mapping between the SOV change ratio and the three-class probability vector, fully utilizing the label information while matching the practical response evaluation scenario.
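As a numerical illustration of the soft labels (using the `response_probabilities` sketch above, so the exact values are approximate), SOV increases of 43% and 45% on a sizeable tumor burden yield nearly identical probability vectors rather than flipping between one-hot SD and PD:

```python
# 43% vs 45% SOV increase: the soft targets stay close to (0, 0.5, 0.5)
# instead of jumping from (0, 1, 0) to (0, 0, 1).
print(response_probabilities(10000.0, 14300.0))  # ~ [0.00, 0.54, 0.46]
print(response_probabilities(10000.0, 14500.0))  # ~ [0.00, 0.46, 0.54]
```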
Longitudinal graph matching in the new lesion assessment
Graph matching was used to model the new lesion identification process in the RECORD model. The generated difference map was added to the registered masks to remove possible erroneous predictions. Subsequently, 3D connected component analysis was employed to obtain the individual lesions. Then, a longitudinal lesion graph \(G=(L_B,\,L_F,\,E)\) was constructed, where \(L_B=\{l_1^B,l_2^B,\cdots,l_n^B\}\) and \(L_F=\{l_1^F,l_2^F,\cdots,l_n^F\}\) represent the baseline (B) and follow-up (F) lesions, respectively, and \(E\) represents the edges between baseline and follow-up lesions. In the new lesion assessment, we need to avoid erroneous positive diagnoses caused by lesion splitting or merging, especially in cases where lesion splitting is present. We assume that split or merged lesions are located relatively close to each other. Therefore, during graph construction, the relative Euclidean distance \(c_{ij}=\frac{d_{ij}}{r_i^B+r_j^F}\) between baseline and follow-up lesions was calculated, where \(d_{ij}\) is the center-to-center distance between two lesions, and \(r_i^B\) and \(r_j^F\) are the radii of the baseline and follow-up lesions, respectively. When \(c_{ij} < 1.0\), an edge \(e_{ij}=(l_i^B,\,l_j^F)\) was added to \(E\). If every \(l_j^F\) in \(L_F\) had at least one edge, no new lesion was assessed by the graph matching; otherwise, new lesion(s) existed. Schematic representations are provided in Supplementary Fig. 18.
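A minimal sketch of this matching rule follows; lesions are represented as hypothetical (center, radius) pairs extracted from the connected components.

```python
import numpy as np

def has_new_lesion(baseline_lesions, followup_lesions, threshold=1.0):
    """Flag a new lesion if any follow-up lesion has no edge to a baseline
    lesion; lesions are (center, radius) pairs from the connected components."""
    for center_f, r_f in followup_lesions:
        matched = False
        for center_b, r_b in baseline_lesions:
            d = np.linalg.norm(np.asarray(center_f) - np.asarray(center_b))
            if d / (r_b + r_f) < threshold:  # relative Euclidean distance c_ij
                matched = True
                break
        if not matched:
            return True  # follow-up lesion with no edge -> new lesion(s)
    return False
```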
Model configurations
All the deep learning models were trained from scratch using random initialization. The tumor segmentation models, nnU-Net and Swin-Unetr, were trained on an NVIDIA A100 (40GB). The patch size for nnU-Net was adaptively learned based on the dataset characteristics, and for Swin-Unetr the patch size was 96 × 96 × 96. Both the nnU-Net and Swin-Unetr models were trained for 1000 epochs.
As for the RECORD model, 4 × NVIDIA A100 (40GB) GPUs were used to optimize the model at full 3D resolution for CT inputs. The learning rate was adaptively tuned throughout training: the initial learning rate was set to 1e−4, and a cosine annealing schedule (the CosineAnnealingLR method in PyTorch) then allowed the learning rate to vary periodically. The RECORD model was trained for 200 epochs. The model with the minimum loss on the validation set was selected.
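The described schedule corresponds to the following PyTorch setup; the optimizer choice, the `T_max` value, and the `train_one_epoch` helper are assumptions for illustration.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # initial LR 1e-4
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

for epoch in range(200):  # RECORD was trained for 200 epochs
    train_one_epoch(model, optimizer)  # hypothetical training step
    scheduler.step()  # cosine-annealed, periodically varying learning rate
```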
Model ensemble
The five models trained in the 5-fold cross-validation on Cohort A1 were ensembled and then used for inference on Cohorts A2, B, and C. For segmentation, we averaged the output probability maps from the five models. For treatment outcome classification, the ensembled predictions were generated by averaging the three-class probabilities predicted by the five trained models; the class with the highest averaged probability served as the final treatment assessment result.
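In pseudocode terms (`predict_probabilities` and `fold_models` are hypothetical names), the ensemble step amounts to:

```python
import numpy as np

# Average the three-class probabilities across the five cross-validation
# models, then take the argmax as the final treatment assessment.
probs = np.mean([predict_probabilities(m, scan_pair) for m in fold_models], axis=0)
assessment = ("PR", "SD", "PD")[int(np.argmax(probs))]
```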
Statistical analyses
The treatment outcome results for Cohort A1 were obtained by merging all five-fold cross-validation samples predicted by the five independently trained models. The ensemble predictions of the models trained in the five-fold cross-validation on Cohort A1 were used as the results for Cohorts A2, B, and C. For the SOV-based response classification, the area under the receiver operating characteristic curve (AUROC) was calculated for each class as well as at the micro and macro levels. The 95% confidence interval (CI) of the AUROC for each class was calculated using the DeLong method, while the 95% CIs of the micro and macro AUROCs were calculated using bootstrapping. In addition, the F1-score was calculated for each class using a one-vs-rest strategy. The overall accuracy (which equals F1-micro) was also reported. For the new lesion assessments, precision, recall, F1-score, and accuracy were calculated. For the overall treatment outcome evaluations, F1-PR, F1-SD, and F1-PD were also calculated using a one-vs-rest strategy, and the overall accuracy was reported. The 95% CIs of the F1-scores were calculated using the Wilson method50. Significance testing of the AUROCs and F1-scores used the DeLong test51 and the McNemar mid-p test52, respectively.
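A sketch of the per-class, micro, and macro AUROC and F1 computations with scikit-learn is shown below; `y_true` is assumed to be integer-coded {0: PR, 1: SD, 2: PD} and `y_prob` the (n, 3) predicted probability matrix.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.preprocessing import label_binarize

y_bin = label_binarize(y_true, classes=[0, 1, 2])           # one-vs-rest targets
auroc_per_class = [roc_auc_score(y_bin[:, c], y_prob[:, c]) for c in range(3)]
auroc_micro = roc_auc_score(y_bin.ravel(), y_prob.ravel())  # micro average
auroc_macro = float(np.mean(auroc_per_class))               # macro average
y_pred = y_prob.argmax(axis=1)
f1_per_class = f1_score(y_true, y_pred, average=None)       # F1-PR, F1-SD, F1-PD
accuracy = f1_score(y_true, y_pred, average="micro")        # equals overall accuracy
```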
For the endpoints of PFS and time to response, we compared the measurements from four different approaches: the segmentation model, the RECORD model, tumor mask annotations from clinicians, and the manually assessed RECIST v1.1 results. All four methods were able to provide an evaluation of the patient’s disease status (response, stability, or progression) at each follow-up time point. PFS and time to response were then determined from these treatment response assessments. PFS was defined as the time from the initial baseline assessment (when a patient underwent a CT scan and started the first treatment) to either the first disease progression observed on a CT scan or death, whichever occurred first. Time to response was defined as the duration from the baseline evaluation to the first response observed on a CT scan. To quantify the agreement between the PFS and response times provided by different measures, the concordance index (c-index)53 and Spearman’s correlation extended for survival data54 were employed to analyze correlations in bivariate right-censored data.
For the analyses of the clinical endpoints (PFS and time to response) as surrogates for OS, landmarking survival analyses55 were conducted. For PFS, the progression status at 3 months and at 6 months from entry was chosen as the landmark. For time to response, a specific landmarking time was not selected owing to the limited number of early responses; the response status during treatment was used instead. Each patient’s disease status at the landmarking time served as a covariate in the Cox model. A larger hazard ratio (HR) and a higher area between the survival curves (ABS) indicate the effectiveness of PFS/time to response as predictors of OS.
Cohort A (comprising the five-fold validation results from Cohort A1 and the test results from Cohort A2) was analyzed for both the clinical endpoint and OS stratification evaluations, given that it comprises an adequate and appropriate number of samples (150 patients) for conducting survival analyses. Furthermore, all patients in this cohort had primary hepatocellular carcinoma and received the same treatment, thereby ensuring the uniformity of the samples.
Calculations for AUROC, precision, recall, F1-score, and accuracy used sklearn 0.22.2 in Python 3.7.13. The agreement measures for bivariate right-censored data used survSpearman 1.0.1 in R 4.0.2. The landmarking analyses used survival 3.1.12 and survminer 0.4.9 in R 4.0.2.