
How to split data into training/testing sets using the sample function

bestprogram 2023. 7. 6. 22:26


I've just started using R and I'm not sure how to apply the following sample code to my dataset:

sample(x, size, replace = FALSE, prob = NULL)

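I have a dataset that I need to split into a training set (75%) and a testing set (25%). I'm not sure what information I'm supposed to put into x and size. Is x the dataset file, and is size the number of samples I have?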

There are numerous approaches to data partitioning. For a more complete approach, take a look at the createDataPartition function in the caret package.

Here is a simple example:

data(mtcars)

## 75% of the sample size
smp_size <- floor(0.75 * nrow(mtcars))

## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(mtcars)), size = smp_size)

train <- mtcars[train_ind, ]
test <- mtcars[-train_ind, ]
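
The steps in the example above could be wrapped into a small helper so that the same split can be reused on other data frames. This is only a sketch: split_train_test, prop, and seed are names made up for illustration and are not part of base R.

```r
# Hypothetical helper (not from the original answer): split a data frame
# into train/test parts by sampling row indices without replacement.
split_train_test <- function(df, prop = 0.75, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)    # optional, for a reproducible split
  all_rows  <- seq_len(nrow(df))
  n_train   <- floor(prop * nrow(df))   # number of training rows
  train_ind <- sample(all_rows, size = n_train)
  test_ind  <- setdiff(all_rows, train_ind)  # every row not used for training
  list(train = df[train_ind, , drop = FALSE],
       test  = df[test_ind, , drop = FALSE])
}

parts <- split_train_test(mtcars, prop = 0.75, seed = 123)
nrow(parts$train)  # 24 of the 32 rows in mtcars
nrow(parts$test)   # the remaining 8 rows
```

Using setdiff() rather than negative indexing keeps the test set correct even in the edge case where no rows are selected for training.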

It can easily be done as follows:

set.seed(101) # set the seed so that the same sample can be reproduced later
# Select 75% of the rows of `data` as the training sample
# (the index vector is named train_ind so it does not mask the sample() function)
train_ind <- sample.int(n = nrow(data), size = floor(0.75 * nrow(data)), replace = FALSE)
train <- data[train_ind, ]
test  <- data[-train_ind, ]

Using the caTools package:

require(caTools)
set.seed(101) 
sample = sample.split(data$anycolumn, SplitRatio = .75)
train = subset(data, sample == TRUE)
test  = subset(data, sample == FALSE)

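I would use dplyr for this; it makes it very simple. It does require an id variable in your data set, which is a good idea anyway, not only for creating the sets but also for traceability during your project. Add one if the data does not contain it already.

library(dplyr)  # provides %>%, sample_frac() and anti_join()

mtcars$id <- 1:nrow(mtcars)
train <- mtcars %>% dplyr::sample_frac(.75)
test  <- dplyr::anti_join(mtcars, train, by = 'id')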

This is almost the same code, but laid out more neatly.

bound <- floor((nrow(df)/4)*3)         #define % of training and test set

df <- df[sample(nrow(df)), ]           #sample rows 
df.train <- df[1:bound, ]              #get training set
df.test <- df[(bound+1):nrow(df), ]    #get test set

library(caret)
# Build the partition from the same data frame that is subset below
intrain <- createDataPartition(y = m_train$classe, p = 0.7, list = FALSE)
training <- m_train[intrain, ]
testing <- m_train[-intrain, ]

Split 'a' into train (70%) and test (30%):

    a # original data frame
    library(dplyr)
    train<-sample_frac(a, 0.7)
    sid<-as.numeric(rownames(train)) # because rownames() returns character
    test<-a[-sid,]

Done.

My solution is basically the same as dickoa's, but a little easier to interpret.

data(mtcars)
n = nrow(mtcars)
trainIndex = sample(1:n, size = round(0.7*n), replace=FALSE)
train = mtcars[trainIndex ,]
test = mtcars[-trainIndex ,]

I can suggest using the rsample package:

library(rsample)  # provides initial_split(), training() and testing()

# choosing 75% of the data to be the training data
data_split <- initial_split(data, prop = .75)
# extracting training data and test data as two separate data frames
data_train <- training(data_split)
data_test  <- testing(data_split)

Looking through the methods posted here, I didn't see anyone use TRUE/FALSE to select and deselect data, so I thought I would share a method that uses that technique.

n = nrow(dataset)
split = sample(c(TRUE, FALSE), n, replace=TRUE, prob=c(0.75, 0.25))

training = dataset[split, ]
testing = dataset[!split, ]

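Explanation

There are multiple ways of selecting data in R. Most commonly, people use positive/negative indices to select/deselect respectively. However, the same result can be achieved by using TRUE/FALSE to select/deselect.

Consider the following example.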

# let's explore ways to select every other element
data = c(1, 2, 3, 4, 5)


# using positive indices to select wanted elements
data[c(1, 3, 5)]
[1] 1 3 5

# using negative indices to remove unwanted elements
data[c(-2, -4)]
[1] 1 3 5

# using booleans to select wanted elements
data[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
[1] 1 3 5

# R recycles the TRUE/FALSE vector if it is not the correct dimension
data[c(TRUE, FALSE)]
[1] 1 3 5
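
One consequence of drawing TRUE/FALSE with prob is that the training share is only approximately 75%. If an exact count matters, a logical mask with a fixed number of TRUE values can be shuffled instead. A minimal sketch, where n_train and mask are illustrative names:

```r
set.seed(42)
n <- nrow(mtcars)                      # 32 rows
n_train <- floor(0.75 * n)             # exactly 24 training rows

# Build a vector with n_train TRUEs and (n - n_train) FALSEs, then shuffle it
mask <- sample(rep(c(TRUE, FALSE), times = c(n_train, n - n_train)))

training <- mtcars[mask, ]
testing  <- mtcars[!mask, ]
c(train = nrow(training), test = nrow(testing))
```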

A simpler way, using the excellent dplyr library:

library(dplyr)
set.seed(275) #to get repeatable data

data.train <- sample_frac(Default, 0.7)

train_index <- as.numeric(rownames(data.train))
data.test <- Default[-train_index, ]

The scorecard package has a useful function for this, in which you can specify the ratio and the seed.

library(scorecard)

dt_list <- split_df(mtcars, ratio = 0.75, seed = 66)

The test and train data are stored in a list and can be accessed by calling dt_list$train and dt_list$test.

If you type:

?sample

R will open a help page explaining what the parameters of the sample function mean.
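
As a rough illustration of the two arguments the question asks about: x is the set of items to draw from (or a single positive integer n, meaning 1:n), and size is how many items to draw. For splitting a data frame, x is normally the row numbers rather than the data set itself.

```r
set.seed(1)
picked <- sample(x = 10, size = 3)       # draw 3 distinct numbers from 1:10
letters_picked <- sample(x = c("a", "b", "c", "d"), size = 2)

# For a data frame, sample row numbers and then index with them
rows <- sample(x = nrow(iris), size = 5)
iris_subset <- iris[rows, ]
```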

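I am not an expert, but here is some code I have:

data <- data.frame(matrix(rnorm(400), nrow=100))
# randomly assign each row to one of four equally sized groups
splitdata <- split(data[1:nrow(data),],sample(rep(1:4,as.integer(nrow(data)/4))))
# the fourth group is held out for testing so that it does not overlap with train
test <- splitdata[[4]]
train <- rbind(splitdata[[1]],splitdata[[2]],splitdata[[3]])

This will give you 75% train and 25% test.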

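This solution shuffles the rows, then uses the first 75% of the rows as train and the last 25% as test. Very simple!

row_count <- nrow(orders_pivotted)
shuffled_rows <- sample(row_count)
train_count <- floor(row_count * 0.75)
train <- orders_pivotted[head(shuffled_rows, train_count), ]
# take all remaining rows, so that none is dropped when row_count is not a multiple of 4
test <- orders_pivotted[tail(shuffled_rows, row_count - train_count), ]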

Here you can split the data in a particular ratio, for example 80% train and 20% test:

ind <- sample(2, nrow(dataName), replace = TRUE, prob = c(0.8,0.2))
train <- dataName[ind==1, ]
test <- dataName[ind==2, ]

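Below is a function that creates a list of sub-samples of equal size. It is not exactly what you asked for, but it might prove useful to others. In my case, I use it to build multiple classification trees on smaller samples to test for overfitting.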

df_split <- function (df, number){
  sizedf      <- length(df[,1])
  bound       <- sizedf/number
  list        <- list() 
  for (i in 1:number){
    list[i] <- list(df[((i*bound+1)-bound):(i*bound),])
  }
  return(list)
}

Example:

x <- matrix(c(1:10), ncol=1)
x
# [,1]
# [1,]    1
# [2,]    2
# [3,]    3
# [4,]    4
# [5,]    5
# [6,]    6
# [7,]    7
# [8,]    8
# [9,]    9
#[10,]   10

x.split <- df_split(x,5)
x.split
# [[1]]
# [1] 1 2

# [[2]]
# [1] 3 4

# [[3]]
# [1] 5 6

# [[4]]
# [1] 7 8

# [[5]]
# [1] 9 10
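
The function above takes consecutive blocks and assumes the row count divides evenly by number. A possible variation, sketched here with base R's split(), shuffles the rows first and tolerates uneven group sizes. random_folds and k are hypothetical names, not part of the answer above.

```r
random_folds <- function(df, k) {
  # Assign each row a group label 1..k, cycling through the labels,
  # then shuffle the labels so that group membership is random
  fold_id <- sample(rep(seq_len(k), length.out = nrow(df)))
  split(df, fold_id)                   # list of k data frames
}

set.seed(7)
folds <- random_folds(mtcars, 5)
sapply(folds, nrow)                    # group sizes differ by at most one row
```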

Sample R code using the caTools package:

data
split = sample.split(data$DependentcoloumnName, SplitRatio = 0.6)
training_set = subset(data, split == TRUE)
test_set = subset(data, split == FALSE)

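Use base R. The runif function generates uniformly distributed values from 0 to 1. Whatever cutoff value you choose (train.size in the example below), approximately that proportion of the random records will fall below the cutoff.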

data(mtcars)
set.seed(123)

#desired proportion of records in training set
train.size<-.7
#true/false vector of values above/below the cutoff above
train.ind<-runif(nrow(mtcars))<train.size

#train
train.df<-mtcars[train.ind,]


#test
test.df<-mtcars[!train.ind,]

require(caTools)

set.seed(101)            #This is used to create same samples everytime

split1=sample.split(data$anycol,SplitRatio=2/3)

train=subset(data,split1==TRUE)

test=subset(data,split1==FALSE)

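The sample.split() function returns a logical vector with one entry per row (stored here in split1); it does not add a column to the data frame. About 2/3 of the entries are TRUE and the rest are FALSE. The rows where split1 is TRUE are then copied into train, and the other rows are copied into the test data frame.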
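
sample.split() is documented as preserving the relative ratios of the labels in the vector it is given. For readers without caTools, that label-aware idea can be approximated in base R. The sketch below is an assumption-laden illustration, not the caTools implementation; stratified_mask is a hypothetical name.

```r
stratified_mask <- function(y, prop) {
  idx_by_class <- split(seq_along(y), y)        # row positions per label
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    # sample positions within the class; sample.int avoids the
    # sample(x) surprise when a class has a single row
    idx[sample.int(length(idx), size = round(prop * length(idx)))]
  }), use.names = FALSE)
  seq_along(y) %in% train_idx                   # TRUE = training row
}

set.seed(101)
in_train <- stratified_mask(iris$Species, prop = 0.8)
train <- iris[in_train, ]
test  <- iris[!in_train, ]
table(train$Species)                            # 40 rows per species
```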

Assuming df is your data frame and that you want to create 75% train and 25% test:

all <- 1:nrow(df)
train_i <- sort(sample(all, round(nrow(df)*0.75,digits = 0),replace=FALSE))
test_i <- all[-train_i]

Then create the train and test data frames:

df_train <- df[train_i,]
df_test <- df[test_i,]

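Beware of using sample for splitting if you are looking for reproducible results. If your data changes even slightly, the split will vary even if you use set.seed. For example, imagine the sorted list of IDs in your data is all the numbers between 1 and 10. If you dropped just one observation, say 4, sampling by position would yield different results because 5 to 10 have all moved places.

An alternative method is to use a hash function to map the IDs to pseudo-random numbers and then sample on the mod of these numbers. This sample is more stable, because the assignment is now determined by the hash of each observation and not by its relative position.

For example: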

require(openssl)  # for md5
require(data.table)  # for the demo data

set.seed(1)  # this won't help `sample`

population <- as.character(1e5:(1e6-1))  # some made up ID names

N <- 1e4  # sample size

sample1 <- data.table(id = sort(sample(population, N)))  # randomly sample N ids
sample2 <- sample1[-sample(N, 1)]  # randomly drop one observation from sample1

# samples are all but identical
sample1
sample2
nrow(merge(sample1, sample2))

[1] 9999

# row splitting yields very different test sets, even though we've set the seed
test <- sample(N-1, N/2, replace = F)

test1 <- sample1[test, .(id)]
test2 <- sample2[test, .(id)]
nrow(test1)

[1] 5000

nrow(merge(test1, test2))

[1] 2653

# to fix that, we can use some hash function to sample on the last digit

md5_bit_mod <- function(x, m = 2L) {
  # Inputs: 
  #  x: a character vector of ids
  #  m: the modulo divisor (modify for split proportions other than 50:50)
  # Output: remainders from dividing the first digit of the md5 hash of x by m
  as.integer(as.hexmode(substr(openssl::md5(x), 1, 1)) %% m)
}

# hash splitting preserves the similarity, because the assignment of test/train 
# is determined by the hash of each obs., and not by its relative location in the data
# which may change 
test1a <- sample1[md5_bit_mod(id) == 0L, .(id)]
test2a <- sample2[md5_bit_mod(id) == 0L, .(id)]
nrow(merge(test1a, test2a))

[1] 5057

nrow(test1a)

[1] 5057

The sample size is not exactly 5000 because the assignment is probabilistic, but this should not be a problem in large samples thanks to the law of large numbers.
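
The same idea can be sketched without extra packages when the IDs are integers. The example below swaps md5 for a simple multiplicative hash, which is an assumption made for illustration only (it is not cryptographic and is not what the answer above uses); hash_unit and the in_test_* vectors are hypothetical names.

```r
# Map integer ids to pseudo-random values in [0, 1) with a multiplicative hash.
# Assumes ids are whole numbers below roughly 3e6 so the product stays exact
# in double precision.
hash_unit <- function(id) {
  ((id * 2654435761) %% 2^32) / 2^32
}

ids_full    <- 1:10000
ids_dropped <- ids_full[-4]            # the same data with one observation removed

in_test_full    <- hash_unit(ids_full) < 0.25
in_test_dropped <- hash_unit(ids_dropped) < 0.25

# Every id that survives keeps its original assignment
identical(in_test_full[-4], in_test_dropped)
mean(in_test_full)                     # share of ids assigned to test, close to 0.25
```

Because the assignment depends only on the id itself, adding or removing rows never moves an existing observation between train and test.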

See also: http://blog.richardweiss.org/2016/12/25/hash-splits.html and https://crypto.stackexchange.com/questions/20742/statistical-properties-of-hash-functions-when-calculating-modulo

I came across this one; it may help too.

set.seed(12)
data = Sonar[sample(nrow(Sonar)),] # reshuffles the data
bound = floor(0.7 * nrow(data))
df_train = data[1:bound,]
df_test = data[(bound+1):nrow(data),]

Create an index column "rowid" and use anti_join to filter with by = "rowid". After the split, you can remove the rowid column with %>% select(-rowid).

data <- tibble::rowid_to_column(data)

set.seed(11081995)

testdata <- data %>% slice_sample(prop = 0.2)

traindata <- anti_join(data, testdata, by = "rowid")

set.seed(123)
# use nrow(): length() on a data frame returns the number of columns, not rows
llwork<-sample(1:nrow(mydata),round(0.75*nrow(mydata),digits=0))
wmydata<-mydata[llwork, ]
tmydata<-mydata[-llwork, ]

I think this would solve the problem:

df = data.frame(read.csv("data.csv"))
# Split the dataset into 80-20
numberOfRows = nrow(df)
bound = as.integer(numberOfRows *0.8)
train=df[1:bound ,2]
test1= df[(bound+1):numberOfRows ,2]

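I prefer using dplyr to mutate the values:

set.seed(1)
mutate(x, train = runif(n()) < 0.75)  # n() draws one random number per row

You can then continue with dplyr::filter in a helper function such as:

data.split <- function(is_train = TRUE) {
    set.seed(1)  # same seed on every call, so train and test come from one split
    mutate(x, train = runif(n()) < 0.75) %>%
    filter(train == is_train)
}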

If you work with multiple data tables and don't want to repeat code, I wrote a function to make this faster (it may not work perfectly at first):

xtrain <- function(data, proportion, t1, t2){
  data <- data %>% rowid_to_column("rowid")
  train <- slice_sample(data, prop = proportion)
  assign(t1, train, envir = .GlobalEnv)
  test <- data %>% anti_join(as.data.frame(train), by = "rowid")
  assign(t2, test, envir = .GlobalEnv)
}

xtrain(iris, .80, 'train_set', 'test_set')

You will need to have dplyr and tibble loaded. The function takes a given data set, the proportion to use for sampling, and two object names. It creates the tables and then assigns them as objects in the global environment.

Try using

idx <- sample(2, nrow(data), replace = TRUE, prob = c(0.75, 0.25))

and use the resulting ids to access the split data:

training <- data[idx == 1, ]
testing <- data[idx == 2, ]

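There is a very simple way to select a number of rows using R's indexing for rows and columns. This lets you cleanly split a data set by a given number of rows, say the first 80% of your data.

In R all rows and columns are indexed, so DataSetName[1,1] is the value in the first row and first column of "DataSetName". You can select rows using [x,] and columns using [,x].

For example: if you have a data set conveniently named "data" with 100 rows, you can view the first 80 rows using

View(data[1:80,])

In the same way, you can select these rows and assign the subsets using:

train = data[1:80,]

test = data[81:100,]

Now the data is split into two parts with no possibility of resampling. Quick and easy.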

Source URL: https://stackoverflow.com/questions/17200114/how-to-split-data-into-training-testing-sets-using-sample-function
