I started with the Iris dataset because it is a standard benchmark, which makes it a good first test for the hard task. The goal was to establish a simple pattern: download a Hugging Face (HF) dataset and convert it directly into an 'mlr3' task.
+-------------------------------------------------------------------------------+
| IRIS DOWNLOAD PROCEDURE                                                       |
+-------------------------------------------------------------------------------+
|                                                                               |
| 1] OPEN THE SITE MENTIONED ON THE WIKI                                        |
|    (https://huggingface.co/datasets)                                          |
|         |                                                                     |
|         v                                                                     |
| 2] FIND THE IRIS DATASET                                                      |
|    (Repo: scikit-learn/iris)                                                  |
|         |                                                                     |
|         v                                                                     |
| 3] OPEN 'FILES AND VERSIONS'                                                  |
|    ('Iris.csv')                                                               |
|         |                                                                     |
|         v                                                                     |
| 4] COPY THE DOWNLOAD LINK                                                     |
|    (https://huggingface.co/datasets/scikit-learn/iris/resolve/main/Iris.csv)  |
|                                                                               |
+-------------------------------------------------------------------------------+
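The link copied in step 4 follows the pattern `https://huggingface.co/datasets/<repo>/resolve/<revision>/<file>`, so it can also be assembled in code rather than copied by hand. The sketch below is a minimal illustration; `hf_dataset_url` is a hypothetical helper name introduced here, not a function from mlr3 or from any Hugging Face package, and it assumes the file sits at the top level of the repository on the `main` branch.

```r
# Hypothetical helper (an assumption for illustration, not a library function):
# build the direct-download URL for a file in a Hugging Face dataset repo.
hf_dataset_url <- function(repo, file, revision = "main") {
  sprintf("https://huggingface.co/datasets/%s/resolve/%s/%s", repo, revision, file)
}

# Reproduces the link copied manually in step 4
iris_url <- hf_dataset_url("scikit-learn/iris", "Iris.csv")
print(iris_url)
```

The same call with a different `repo` and `file` should yield the link for any other CSV hosted in a dataset repository.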
library(mlr3)
# 1] Download the CSV; string columns (Species) are read as factors
iris_url <- "https://huggingface.co/datasets/scikit-learn/iris/resolve/main/Iris.csv"
iris_df <- read.csv(iris_url, stringsAsFactors = TRUE)
# Drop the row identifier so it is not treated as a feature
iris_df$Id <- NULL
# 2] Initialize the task
task_iris <- as_task_classif(iris_df, target = "Species", id = "iris")
# 3] Print the task summary
print(task_iris)
##
## ── <TaskClassif> (150x5) ───────────────────────────────────────────────────────
## • Target: Species
## • Target classes: Iris-setosa (33%), Iris-versicolor (33%), Iris-virginica
## (33%)
## • Properties: multiclass
## • Features (4):
## • dbl (4): PetalLengthCm, PetalWidthCm, SepalLengthCm, SepalWidthCm
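The download and conversion steps above could be wrapped in one small function so that the same pattern can be reused for other datasets. This is only a sketch: `csv_to_task_classif` and its `drop` argument are hypothetical names introduced here, and the function assumes the CSV has a header row and a single target column.

```r
library(mlr3)

# Hypothetical helper (not part of mlr3): read a CSV from a URL or a local
# path and wrap it in a classification task.
csv_to_task_classif <- function(path_or_url, target, id, drop = character(), ...) {
  df <- read.csv(path_or_url, stringsAsFactors = TRUE)
  # Remove identifier columns that should not be used as features
  df <- df[, setdiff(names(df), drop), drop = FALSE]
  # A classification target must be a factor, even if stored as 0/1 integers
  df[[target]] <- as.factor(df[[target]])
  # Extra arguments in ... are forwarded to as_task_classif()
  as_task_classif(df, target = target, id = id, ...)
}
```

Under these assumptions, calling `csv_to_task_classif(iris_url, target = "Species", id = "iris", drop = "Id")` should produce the same task as the step-by-step code.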
Once the Iris workflow was in place, I applied it to the Pima Indians Diabetes dataset to complete the hard task. This dataset calls for more data handling: the target is binary and has to be converted to a factor, and the feature types need to be checked, all in a real-world medical context.
+---------------------------------------------------------------------------------------------------------+
| PIMA DOWNLOAD PROCEDURE                                                                                 |
+---------------------------------------------------------------------------------------------------------+
|                                                                                                         |
| 1] OPEN THE SITE MENTIONED ON THE WIKI                                                                  |
|    (https://huggingface.co/datasets)                                                                    |
|         |                                                                                               |
|         v                                                                                               |
| 2] FIND THE PIMA INDIANS DIABETES DATASET                                                               |
|    (Repo: khoaguin/pima-indians-diabetes-database)                                                      |
|         |                                                                                               |
|         v                                                                                               |
| 3] OPEN 'FILES AND VERSIONS'                                                                            |
|    ('diabetes.csv')                                                                                     |
|         |                                                                                               |
|         v                                                                                               |
| 4] COPY THE DOWNLOAD LINK                                                                               |
|    (https://huggingface.co/datasets/khoaguin/pima-indians-diabetes-database/resolve/main/diabetes.csv)  |
|                                                                                                         |
+---------------------------------------------------------------------------------------------------------+
library(mlr3)
# 1] Download the CSV
pima_url <- "https://huggingface.co/datasets/khoaguin/pima-indians-diabetes-database/resolve/main/diabetes.csv"
pima_df <- read.csv(pima_url)
# The target y is read as a number; a classification task needs a factor
pima_df$y <- as.factor(pima_df$y)
# 2] Initialize the task
task_pima <- as_task_classif(pima_df, target = "y", id = "pima")
# 3] Print the task summary
print(task_pima)
##
## ── <TaskClassif> (768x9) ───────────────────────────────────────────────────────
## • Target: y
## • Target classes: 0 (positive class, 65%), 1 (35%)
## • Properties: twoclass
## • Features (8):
## • int (6): Age, BloodPressure, Glucose, Insulin, Pregnancies, SkinThickness
## • dbl (2): BMI, DiabetesPedigreeFunction
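The output above reports class 0 as the positive class. As far as I understand, mlr3 takes the first factor level as the positive class unless told otherwise, and in a medical setting it is usually the diagnosis (here assumed to be y = 1) that should count as positive. The sketch below shows one way to set this through the `positive` argument of `as_task_classif()`. It uses a small stand-in data frame with illustrative values in place of `pima_df` so that it runs without a download; `demo_df` and `task_demo` are names introduced only for this example.

```r
library(mlr3)

# Stand-in for pima_df: illustrative values only, same column layout idea
demo_df <- data.frame(
  Glucose = c(148L, 85L, 183L, 89L),
  BMI     = c(33.6, 26.6, 23.3, 28.1),
  y       = factor(c(1, 0, 1, 0))
)

# Declare class "1" as the positive class instead of the default first level
task_demo <- as_task_classif(demo_df, target = "y", positive = "1", id = "pima_demo")
print(task_demo$positive)
```

For the real data, the equivalent change would be to pass `positive = "1"` when creating `task_pima`, which matters for measures such as sensitivity and precision that are defined relative to the positive class.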