A Comparative Study of Automatic Hand Bone Age Assessment Systems
- Nov. 2017
- by Ji Hoon Kim et al.
Bone age assessment is a critical procedure in pediatric radiology, used in the diagnosis of many disorders and in assessing response to treatment. It can be performed with either the Greulich and Pyle (GP) method or the Tanner-Whitehouse (TW) method. Although the GP method is convenient to use, it can lead to subjective results. To overcome this limitation, there have been several attempts at automated bone age assessment, including a commercialized system. To verify the validity of these attempts, we developed a bone age assessment system based on a convolutional neural network (CNN) and trained on Korean data, and compared it with the commercialized system (CS) and Harvard's system (HS).
METHOD AND MATERIALS
A total of 18,940 X-ray images collected at Asan Medical Center (AMC) in Seoul from 2012 to 2015 were used for model training and validation: 90% of the data were used for training and 10% for validation. Using the validation data set, we compared automated bone age assessment with manual bone age assessment to evaluate the automated system's performance. The patients' left-hand X-rays had been read by radiologists at AMC, and we define these readings as ManBA. Bone age for the same X-rays was also estimated by the automated system, and we define these estimates as AutomatedBA. We assessed the automated system's accuracy through the mean and standard deviation of the difference between the two methods. In addition, we analyzed the agreement between the two methods using the root mean square error (RMSE).
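The three agreement measures named above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `agreement_stats`, the sign convention (AutomatedBA minus ManBA), the use of the sample standard deviation, and the example values are all assumptions.

```python
import math
from statistics import mean, stdev


def agreement_stats(automated_ba, manual_ba):
    """Return (mean difference, SD of difference, RMSE), all in years.

    Assumption: differences are taken as AutomatedBA - ManBA, and the
    standard deviation is the sample SD (n - 1 denominator).
    """
    if len(automated_ba) != len(manual_ba):
        raise ValueError("both lists must contain one reading per X-ray")
    if len(automated_ba) < 2:
        raise ValueError("at least two paired readings are required")

    # Per-image difference between the automated and manual readings.
    diffs = [a - m for a, m in zip(automated_ba, manual_ba)]

    mean_diff = mean(diffs)                           # systematic bias
    sd_diff = stdev(diffs)                            # spread around the bias
    rmse = math.sqrt(mean([d * d for d in diffs]))    # overall error size
    return mean_diff, sd_diff, rmse


# Hypothetical bone ages in years, for illustration only.
automated = [10.0, 12.0, 8.0]
manual = [9.0, 12.0, 9.0]
print(agreement_stats(automated, manual))
```

Running the same function separately on the boys' and girls' subsets would give gender-specific figures of the kind reported in this abstract.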
For the validation data set, the mean difference between AutomatedBA and ManBA was -0.01 years (boys: -0.01, girls: -0.01) and the standard deviation of the difference was 0.94 years (boys: 1.15, girls: 0.79). The RMSE between the two methods was 0.94 years (boys: 1.15, girls: 0.77). For CS, the mean difference and standard deviation were -0.19 years and 0.76 years, respectively. The RMSE of HS was 0.82 years for boys and 0.93 years for girls.
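Because only the mean difference and standard deviation are given for CS, it may help to recall how these relate to RMSE. The block below is a rough cross-check, assuming d_i denotes AutomatedBA minus ManBA for image i and that the standard deviation is the population form (with an n - 1 denominator the identity holds only approximately). The CS value derived here is an estimate from the reported figures, not a number reported in the study.

```latex
% Identity linking RMSE to the mean difference \bar{d} and the SD \sigma_d
\mathrm{RMSE}^2 = \frac{1}{n}\sum_{i=1}^{n} d_i^2 = \bar{d}^{\,2} + \sigma_d^2

% Our system, all patients: consistent with the reported RMSE of 0.94
\sqrt{(-0.01)^2 + 0.94^2} = \sqrt{0.8837} \approx 0.94

% CS: implied RMSE from the reported mean difference and SD
\sqrt{(-0.19)^2 + 0.76^2} = \sqrt{0.6137} \approx 0.78
```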
In terms of RMSE, our system, trained only on Korean data, performed comparably to the existing systems trained on multi-racial data. This suggests that a deep learning system trained on GP-based data is not significantly affected by racial differences.
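Our system performed differently according to gender, and the direction of this difference contradicted the Harvard study: our RMSE was higher for boys (1.15 years) than for girls (0.77 years), whereas HS had a lower RMSE for boys (0.82 years) than for girls (0.93 years). This implies that additional research is required for both studies.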
Ji Hoon Kim, Hyun-Jun Kim, Kyu-Hwan Jung, Ilji Choi, Sangki Kim, Yeha Lee, Woo Hyun Shim, Jin Seong Lee, Hee Mang Yoon