Abstract of Classification of criminal news based on natural language processing and machine learning algorithms

Camilo Ernesto Sarmiento Torres, Néstor Diaz, Rubiel Vargas Cañas

In this work, a system for classifying criminal news from different digital press media was developed, supported by natural language processing techniques and machine learning algorithms. First, a criminal news data set was constructed in which eight types of crime were identified. The documents were then pre-processed: stop words were removed, lemmatization was applied, and each document was represented with the bag-of-words model, with terms weighted by the term frequency-inverse document frequency (tf-idf) coefficient.
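
The pre-processing and weighting steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the stop word set, the lemma table, the example sentences, and the function names `preprocess` and `tfidf` are all hypothetical, and the tf-idf variant shown (relative term frequency times the natural log of N/df) is only one common choice, since the abstract does not state which formula was used.

```python
import math
from collections import Counter

# Hypothetical toy resources (assumptions for illustration only); a real system
# would use a full stop word list and a proper lemmatizer for the news language.
STOP_WORDS = {"the", "a", "was", "in", "of", "and"}
LEMMAS = {"robbed": "rob", "robbers": "robber", "stolen": "steal", "cars": "car"}


def preprocess(text):
    """Lowercase, strip punctuation, drop stop words, and map tokens to lemmas."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    return [LEMMAS.get(t, t) for t in tokens if t and t not in STOP_WORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    bags = [Counter(preprocess(d)) for d in docs]      # bag-of-words counts
    n_docs = len(bags)
    # Document frequency: number of documents in which each term appears.
    doc_freq = Counter(term for bag in bags for term in bag)
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return weights


docs = [
    "The robbers robbed a bank.",
    "A car was stolen in the city.",
    "The bank was closed.",
]
vectors = tfidf(docs)
```

In this sketch a term that occurs in every document receives a weight of zero, while terms concentrated in a few documents are weighted more heavily, which is the property that makes tf-idf useful for separating crime categories.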

In addition, eight word dictionaries were built according to the types of crime and used to estimate the performance of five supervised classification algorithms. The random forest algorithm performed best in the test, with 97.22% accuracy, 98.36% precision, 98.35% sensitivity, an F1 score of 98.32%, and an MCC of 0.97.
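
To make the reported measures concrete, the following sketch computes accuracy, precision, sensitivity (recall), F1 score, and the Matthews correlation coefficient (MCC) from true and predicted labels. It rests on assumptions: the abstract does not say how the per-class values were averaged, so macro averaging is used here, the MCC is the standard multiclass generalization, and the function name `evaluate` and the labels are hypothetical toy data, not the paper's results.

```python
import math
from collections import Counter


def evaluate(y_true, y_pred):
    """Compute macro-averaged classification metrics and multiclass MCC."""
    labels = sorted(set(y_true) | set(y_pred))
    n = len(y_true)
    true_counts = Counter(y_true)    # samples per actual class
    pred_counts = Counter(y_pred)    # samples per predicted class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n_correct = sum(correct.values())

    # Per-class precision and recall, guarding against empty classes.
    precisions = [correct[k] / pred_counts[k] if pred_counts[k] else 0.0 for k in labels]
    recalls = [correct[k] / true_counts[k] if true_counts[k] else 0.0 for k in labels]
    f1s = [2 * p * r / (p + r) if p + r else 0.0 for p, r in zip(precisions, recalls)]

    # Multiclass MCC: a correlation coefficient in [-1, 1], not a percentage.
    numerator = n_correct * n - sum(pred_counts[k] * true_counts[k] for k in labels)
    denominator = math.sqrt(
        (n ** 2 - sum(pred_counts[k] ** 2 for k in labels))
        * (n ** 2 - sum(true_counts[k] ** 2 for k in labels))
    )
    return {
        "accuracy": n_correct / n,
        "precision": sum(precisions) / len(labels),
        "sensitivity": sum(recalls) / len(labels),
        "f1": sum(f1s) / len(labels),
        "mcc": numerator / denominator if denominator else 0.0,
    }


# Hypothetical toy labels, for illustration only.
y_true = ["theft", "theft", "homicide", "homicide", "fraud", "fraud"]
y_pred = ["theft", "theft", "homicide", "fraud", "fraud", "fraud"]
scores = evaluate(y_true, y_pred)
```

Unlike the other four measures, which are proportions often quoted as percentages, the MCC takes values between -1 and 1, with 1 indicating perfect agreement between predicted and actual classes.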
