Camilo Ernesto Sarmiento Torres, Néstor Diaz, Rubiel Vargas Cañas
Camilo Ernesto Sarmiento Torres
In this work, a system for classifying crime news from several digital press outlets was developed, supported by natural language processing techniques and machine learning algorithms. First, a crime news data set was built in which eight types of crime were identified. The documents were then pre-processed: stop words were removed, lemmatization was applied, and each document was represented with the bag-of-words model, with terms weighted by their term frequency-inverse document frequency (tf-idf) coefficient.
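The pre-processing and representation steps can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word set and the lemma lookup table below are tiny hypothetical stand-ins for real linguistic resources, and the tf-idf formula is the plain unsmoothed variant, since the exact variant used is not stated here.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny resources used only for illustration.
STOP_WORDS = {"the", "a", "was", "in", "of", "and"}
LEMMAS = {"robbed": "rob", "stolen": "steal", "arrested": "arrest"}


def preprocess(text):
    """Lowercase, tokenize, drop stop words, and map tokens to lemmas."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document (bag-of-words, tf-idf)."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # tf = relative frequency in the document; idf = log(N / df).
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors
```

For example, `tfidf(["The bank was robbed", "A car was stolen in the bank"])` gives the term `bank` a weight of zero in both documents, because it appears in every document and therefore carries no discriminative information.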
In addition, eight word dictionaries were built according to the types of crime and used to estimate the performance of five supervised classification algorithms. The random forest algorithm performed best, with 97.22% accuracy, 98.36% precision, 98.35% sensitivity, an F1 score of 98.32%, and an MCC of 0.97 in the test performed.
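As a reference for how the reported metrics are defined, the following sketch computes accuracy, precision, sensitivity (recall), F1 score, and the multiclass Matthews correlation coefficient (MCC) from true and predicted labels. Macro averaging over classes is an assumption made for this example; the averaging scheme actually used is not stated here. The function name `classification_metrics` is hypothetical.

```python
import math


def classification_metrics(y_true, y_pred):
    """Accuracy, macro precision/recall/F1, and multiclass MCC."""
    labels = sorted(set(y_true) | set(y_pred))
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))

    precisions, recalls, f1s = [], [], []
    sum_pt = sum_pp = sum_tt = 0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        pred_k = sum(p == label for p in y_pred)  # times the class was predicted
        true_k = sum(t == label for t in y_true)  # times the class truly occurs
        prec = tp / pred_k if pred_k else 0.0
        rec = tp / true_k if true_k else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
        sum_pt += pred_k * true_k
        sum_pp += pred_k ** 2
        sum_tt += true_k ** 2

    # Multiclass MCC (generalization of the binary coefficient).
    denom = math.sqrt((n * n - sum_pp) * (n * n - sum_tt))
    mcc = (correct * n - sum_pt) / denom if denom else 0.0

    k = len(labels)
    return {
        "accuracy": correct / n,
        "precision": sum(precisions) / k,
        "sensitivity": sum(recalls) / k,
        "f1": sum(f1s) / k,
        "mcc": mcc,
    }
```

Unlike the other four metrics, which are proportions and are usually reported as percentages, MCC is a correlation coefficient that ranges from -1 to 1, where 1 indicates perfect agreement between predictions and true labels.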