TY - JOUR
T1 - Bayesian approach to discriminant problems for count data with application to multilocus short tandem repeat dataset
AU - Tsukuda, Koji
AU - Mano, Shuhei
AU - Yamamoto, Toshimichi
N1 - Funding Information:
The authors would like to express their gratitude to Professor Naruya Saitou, who provided part of the data used in this study, and to the referees, who provided many insightful comments. This work was partly supported by Japan Society for the Promotion of Science KAKENHI Grant Numbers 16H02791, 17H04148 and 18K13454. This study was partly carried out when the first author was a member of the Biostatistics Center, Kurume University.
Publisher Copyright:
© 2020 Walter de Gruyter GmbH, Berlin/Boston.
PY - 2020/4/1
Y1 - 2020/4/1
N2 - Short Tandem Repeats (STRs) are a type of DNA polymorphism. This study considers discriminant analysis to determine the population of test individuals using an STR database containing the lengths of STRs observed at more than one locus. The discriminant method based on the Bayes factor is discussed and an improved method is proposed. The main issues are to develop a method that is relatively robust to sample size imbalance, identify a procedure to select loci, and treat the parameter in the prior distribution. A previous study achieved a classification accuracy of 0.748 for the g-mean (geometric mean of classification accuracies for two populations) and 0.867 for the AUC (area under the receiver operating characteristic curve). We improve the maximum values for the g-mean to 0.830 and the AUC to 0.935. Computer simulations indicate that the previous method is susceptible to sample size imbalance, whereas the proposed method is more robust while achieving almost identical classification accuracy. Furthermore, the results confirm that threshold adjustment is an effective countermeasure to sample size imbalance.
UR - http://www.scopus.com/inward/record.url?scp=85084822957&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85084822957&partnerID=8YFLogxK
U2 - 10.1515/sagmb-2018-0044
DO - 10.1515/sagmb-2018-0044
M3 - Article
C2 - 32364524
AN - SCOPUS:85084822957
SN - 1544-6115
VL - 19
JO - Statistical Applications in Genetics and Molecular Biology
JF - Statistical Applications in Genetics and Molecular Biology
IS - 2
M1 - 20180044
ER -