diff --git a/jose.00223/10.21105.jose.00223.crossref.xml b/jose.00223/10.21105.jose.00223.crossref.xml new file mode 100644 index 0000000..956b142 --- /dev/null +++ b/jose.00223/10.21105.jose.00223.crossref.xml @@ -0,0 +1,169 @@ + + + + 20241226183959-8ab335f81c08fe580800107b24fe6e1dab058df6 + 20241226183959 + + JOSS Admin + admin@theoj.org + + The Open Journal + + + + + Journal of Open Source Education + JOSE + 2577-3569 + + 10.21105/jose + https://jose.theoj.org + + + + + 12 + 2024 + + + 7 + + 82 + + + + An R Companion for Introduction to Data Mining + + + + Michael + Hahsler + + Department of Computer Science, Southern Methodist University, USA + + https://orcid.org/0000-0003-2716-1405 + + + + 12 + 26 + 2024 + + + 223 + + + 10.21105/jose.00223 + + + http://creativecommons.org/licenses/by/4.0/ + http://creativecommons.org/licenses/by/4.0/ + http://creativecommons.org/licenses/by/4.0/ + + + + Software archive + 10.6084/m9.figshare.26750404.v3 + + + GitHub review issue + https://github.com/openjournals/jose-reviews/issues/223 + + + + 10.21105/jose.00223 + https://jose.theoj.org/papers/10.21105/jose.00223 + + + https://jose.theoj.org/papers/10.21105/jose.00223.pdf + + + + + + Introduction to data mining + Tan + 978-0133128901 + 2017 + Tan, P.-N., Steinbach, M. S., Karpatne, A., & Kumar, V. (2017). Introduction to data mining (2nd Edition). Pearson. ISBN: 978-0133128901 + + + arules – A computational environment for mining association rules and frequent item sets + Hahsler + Journal of Statistical Software + 15 + 14 + 10.18637/jss.v014.i15 + 1548-7660 + 2005 + Hahsler, M., Grün, B., & Hornik, K. (2005). arules – A computational environment for mining association rules and frequent item sets. Journal of Statistical Software, 14(15), 1–25. https://doi.org/10.18637/jss.v014.i15 + + + Welcome to the tidyverse + Wickham + Journal of Open Source Software + 43 + 4 + 10.21105/joss.01686 + 2019 + Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. 
D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686 + + + dbscan: Fast density-based clustering with R + Hahsler + Journal of Statistical Software + 1 + 91 + 10.18637/jss.v091.i01 + 2019 + Hahsler, M., Piekenbrock, M., & Doran, D. (2019). dbscan: Fast density-based clustering with R. Journal of Statistical Software, 91(1), 1–30. https://doi.org/10.18637/jss.v091.i01 + + + Getting things in order: An introduction to the R package seriation + Hahsler + Journal of Statistical Software + 3 + 25 + 10.18637/jss.v025.i03 + 1548-7660 + 2008 + Hahsler, M., Hornik, K., & Buchta, C. (2008). Getting things in order: An introduction to the R package seriation. Journal of Statistical Software, 25(3), 1–34. https://doi.org/10.18637/jss.v025.i03 + + + arulesViz: Interactive visualization of association rules with R + Hahsler + R Journal + 2 + 9 + 10.32614/RJ-2017-047 + 2073-4859 + 2017 + Hahsler, M. (2017). arulesViz: Interactive visualization of association rules with R. R Journal, 9(2), 163–175. https://doi.org/10.32614/RJ-2017-047 + + + Building predictive models in R using the caret package + Kuhn + Journal of Statistical Software + 5 + 28 + 10.18637/jss.v028.i05 + 2008 + Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of Statistical Software, 28(5), 1–26. https://doi.org/10.18637/jss.v028.i05 + + + Tidymodels: A collection of packages for modeling and machine learning using tidyverse principles. + Kuhn + 10.32614/CRAN.package.tidymodels + 2020 + Kuhn, M., & Wickham, H. (2020). Tidymodels: A collection of packages for modeling and machine learning using tidyverse principles. 
https://doi.org/10.32614/CRAN.package.tidymodels + + + + + + diff --git a/jose.00223/10.21105.jose.00223.pdf b/jose.00223/10.21105.jose.00223.pdf new file mode 100644 index 0000000..9c8aec6 Binary files /dev/null and b/jose.00223/10.21105.jose.00223.pdf differ diff --git a/jose.00223/paper.jats/10.21105.jose.00223.jats b/jose.00223/paper.jats/10.21105.jose.00223.jats new file mode 100644 index 0000000..1ff4d1e --- /dev/null +++ b/jose.00223/paper.jats/10.21105.jose.00223.jats @@ -0,0 +1,361 @@ + + +
+ + + + +Journal of Open Source Education +JOSE + +2577-3569 + +Open Journals + + + +223 +10.21105/jose.00223 + +An R Companion for Introduction to Data +Mining + + + +https://orcid.org/0000-0003-2716-1405 + +Hahsler +Michael + + + + + +Department of Computer Science, Southern Methodist +University, USA + + + + +30 +5 +2023 + +7 +82 +223 + +Authors of papers retain copyright and release the +work under a Creative Commons Attribution 4.0 International License (CC +BY 4.0) +2024 +The article authors + +Authors of papers retain copyright and release the work under +a Creative Commons Attribution 4.0 International License (CC BY +4.0) + + + +R +data mining + + + + + + Summary +

An R Companion for Introduction to Data Mining is an open-source + learning and teaching resource that covers how to implement data + mining concepts using R. It is designed to accompany the popular data + mining textbook Introduction to Data Mining + (Tan et + al., 2017) and supports studying the implementation of basic data mining + concepts, including data preparation, classification, clustering, and + association analysis. The resource uses complete, annotated examples + to demonstrate how data mining concepts can be translated into R + code.

+

The materials have been made publicly available at: + https://github.com/mhahsler/Introduction_to_Data_Mining_R_Examples + and licensed under the + Creative + Commons Attribution 4.0 (CC BY 4.0) License.

+
+ + Statement of Need +

The textbook Introduction to Data Mining + (Tan et + al., 2017) has been one of the most popular choices for learning + and teaching data mining concepts. Several chapters have been made + available for free by the authors on the + book’s + website. One of the authors also provides Python Jupyter + notebooks with examples, but complete R code examples were still + needed. Given the R community’s interest in data analysis, data + science, and machine learning, and the broad support of R packages for + data mining, this learning resource fills a noticeable gap. + It targets advanced undergraduate and + graduate students and can be used as a component of a first + introduction to data mining.

+
+ + Learning Objectives and Content +

The resource assumes basic knowledge of programming and statistics. + The learning objectives are to:

+ + +

prepare and understand data,

+
+ +

perform classification,

+
+ +

perform association analysis, and

+
+ +

perform cluster analysis.

+
+
+

The resource presents self-contained and annotated R code examples + that work with small datasets carefully chosen to show the learner + many important aspects of data mining. The learner can copy and paste + the examples into a new R Markdown notebook to experiment with the + code and the provided example data. Small exercises encourage the + learner to modify the code by applying it to a different dataset. This + learning-by-doing approach has worked well in preparing students to + work with more complex real-world datasets by initially relieving them + of too many low-level implementation details while + exploring the concepts.

+

The resource mirrors the textbook’s structure so it can be used + along with it easily. After a short introduction, Chapter 2 discusses + data types in R, data quality concerns, and data preprocessing. Data + exploration and visualization examples are included. Chapters 3 and 4 + cover classification methods, model selection, model evaluation, + different types of classifiers, and essential practical issues like + class imbalance. Chapter 5 introduces association analysis with a + strong emphasis on visualization. Chapter 7 presents examples of + cluster analysis, including popular algorithms, cluster evaluation, + and the effect of outliers.

+
+ + Instructional Design +

This resource does not replace the Introduction to Data + Mining textbook or instruction by a teacher. It instead + provides supporting material for learning to implement data mining + concepts in R. The learner is expected to have some programming + experience and basic statistics knowledge.

+

The resource can be used for self-study by any interested person + together with reading the Introduction to Data Mining + textbook, but its main purpose is to serve as a component for + designing an introductory data mining course for advanced + undergraduate or graduate students. To support instructors, in + addition to the documented code examples, complete presentation slide + sets are provided on the book’s GitHub page in PDF and PowerPoint + formats. The slides are organized in the same way as the resource. An + R symbol on a slide marks where example code is available, directly + connecting the slides to the code examples. The + code examples can be assigned for students to study outside of + class or used by the instructor in class.

+

Designing assignments and assessments is left to the instructor + since they depend on the level and field of study of the students + (e.g., computer science, statistics, economics, or business). For + example, for undergraduates, we suggest asking the students to apply + the data mining techniques to a small, clean instructional dataset + (sample exercises are available in the resource at the end of each + chapter), while graduate students may be asked to analyze larger + real-world datasets, which may require a significant amount of + cleaning and preprocessing.

+
+ + Story of the Project +

Since starting to teach data mining with R in the Spring of 2013, I + have been developing the Companion for Introduction to Data Mining + resource mainly based on caret + (Kuhn, + 2008) and a set of packages developed with students to better + support different data mining tasks (e.g., arules + (Hahsler + et al., 2005), seriation + (Hahsler + et al., 2008), arulesViz + (Hahsler, + 2017), and dbscan + (Hahsler + et al., 2019)). The resource grew from a collection of short, + unconnected R scripts to a complete set of documented code examples + that walk the learner step-by-step through how to implement data + mining methods and how to interpret the results. It went through an + update to incorporate the popular tidyverse package collection + (Wickham + et al., 2019) and a transition from the 1st edition of the + Introduction to Data Mining textbook to the second.

+

The companion resource has been used successfully in the Department + of Computer Science at Southern Methodist University for many years + and by several instructors as a key component of an introductory data + mining course delivered in person and in a distance education setting. + It is also linked on the textbook website as an official resource. + Faculty in the department actively maintain the resource, and we will + update it with new R tools like tidymodels + (Kuhn + & Wickham, 2020) over time.

+
+ + + + + + + + TanPang-Ning + SteinbachMichael S. + KarpatneAnuj + KumarVipin + + Introduction to data mining + Pearson + 2017 + 2nd Edition + 978-0133128901 + https://www-users.cs.umn.edu/~kumar001/dmbook + + + + + + HahslerMichael + GrünBettina + HornikKurt + + arules – A computational environment for mining association rules and frequent item sets + Journal of Statistical Software + 2005 + 14 + 15 + 1548-7660 + 10.18637/jss.v014.i15 + 1 + 25 + + + + + + WickhamHadley + AverickMara + BryanJennifer + ChangWinston + McGowanLucy D’Agostino + FrançoisRomain + GrolemundGarrett + HayesAlex + HenryLionel + HesterJim + KuhnMax + PedersenThomas Lin + MillerEvan + BacheStephan Milton + MüllerKirill + OomsJeroen + RobinsonDavid + SeidelDana Paige + SpinuVitalie + TakahashiKohske + VaughanDavis + WilkeClaus + WooKara + YutaniHiroaki + + Welcome to the tidyverse + Journal of Open Source Software + 2019 + 4 + 43 + 10.21105/joss.01686 + 1686 + + + + + + + HahslerMichael + PiekenbrockMatthew + DoranDerek + + dbscan: Fast density-based clustering with R + Journal of Statistical Software + 2019 + 91 + 1 + 10.18637/jss.v091.i01 + 1 + 30 + + + + + + HahslerMichael + HornikKurt + BuchtaChristian + + Getting things in order: An introduction to the R package seriation + Journal of Statistical Software + 2008 + 25 + 3 + 1548-7660 + 10.18637/jss.v025.i03 + 1 + 34 + + + + + + HahslerMichael + + arulesViz: Interactive visualization of association rules with R + R Journal + 2017 + 9 + 2 + 2073-4859 + https://journal.r-project.org/archive/2017/RJ-2017-047/RJ-2017-047.pdf + 10.32614/RJ-2017-047 + 163 + 175 + + + + + + KuhnMax + + Building predictive models in R using the caret package + Journal of Statistical Software + 2008 + 28 + 5 + https://www.jstatsoft.org/index.php/jss/article/view/v028i05 + 10.18637/jss.v028.i05 + 1 + 26 + + + + + + KuhnMax + WickhamHadley + + Tidymodels: A collection of packages for modeling and machine learning using tidyverse principles. 
+ 2020 + https://www.tidymodels.org + 10.32614/CRAN.package.tidymodels + + + + +