I am trying to build a term-document matrix with NLTK and pandas. I wrote the following function:

def fnDTM_Corpus(xCorpus):
    '''Create a term-document matrix from an NLTK corpus.'''
    import nltk
    import pandas as pd
    fileids = xCorpus.fileids()
    # One frequency distribution (term -> count) per file
    fd_list = [nltk.FreqDist(xCorpus.words(fileid)) for fileid in fileids]
    # Rows are documents; terms missing from a document become NaN, then 0
    DTM = pd.DataFrame(fd_list, index = fileids).fillna(0)
    # Transpose so that terms are rows and documents are columns
    return DTM.T

Running it:

import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'C:/Data/'

newcorpus = PlaintextCorpusReader(corpus_root, '.*')

x = fnDTM_Corpus(newcorpus)

It works fine for a few small files in the corpus, but when I run it on a corpus of 4,000 files (about 2 KB each) it raises a MemoryError.

Am I missing something?

I am using 32-bit Python (on Windows 7, 64-bit OS, Core Quad CPU, 8 GB RAM). Do I really need 64-bit for a corpus of this size?
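
A rough back-of-the-envelope estimate suggests why a dense matrix may not fit in a 32-bit process. The vocabulary size below is an assumed figure for illustration only (the real number depends on the corpus), and the 2 GB figure is the usual default user address space of a 32-bit process on Windows. Intermediate copies made while building and transposing the DataFrame would add to the total.

```python
# Rough memory estimate for a dense term-document matrix.
# ASSUMPTION: n_terms is a hypothetical vocabulary size, not measured from the corpus.
n_docs = 4000
n_terms = 50_000       # hypothetical
bytes_per_cell = 8     # float64, the dtype pandas uses when counts are mixed with NaN

dense_bytes = n_docs * n_terms * bytes_per_cell
dense_gib = dense_bytes / 2**30

# ASSUMPTION: default 2 GiB user address space for a 32-bit Windows process.
limit_gib = 2.0

print(f"dense matrix: {dense_gib:.2f} GiB, 32-bit limit: about {limit_gib:.0f} GiB")
```

With these assumed numbers a single copy of the matrix already takes most of the available address space, so one temporary copy is enough to exhaust it.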

Solution

Thanks to Radim and Larsmans. My objective was to have a DTM like the one you get from the tm package in R.

I am posting it here in case someone else finds it useful.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer 

def fn_tdm_df(docs, xColNames = None, **kwargs):
    ''' create a term document matrix as pandas DataFrame
    with **kwargs you can pass arguments of CountVectorizer
    if xColNames is given the dataframe gets columns Names'''

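    # Initialize the vectorizer and count the terms (fit_transform returns a sparse matrix)
    vectorizer = CountVectorizer(**kwargs)
    x1 = vectorizer.fit_transform(docs)
    # Create the DataFrame: terms as rows, documents as columns.
    # get_feature_names_out() needs scikit-learn >= 1.0; older releases used get_feature_names().
    df = pd.DataFrame(x1.toarray().transpose(), index = vectorizer.get_feature_names_out())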
    if xColNames is not None:
        df.columns = xColNames

    return df
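
Part of the reason CountVectorizer copes with a corpus of this size is that fit_transform returns a sparse matrix, which stores only the non-zero counts; the toarray() call is what makes it dense again. The idea can be sketched with the standard library alone. The helper name sparse_term_counts and the whitespace tokenizer are assumptions made for illustration and do not reflect how scikit-learn tokenizes.

```python
from collections import Counter

def sparse_term_counts(docs):
    """Return (vocabulary, rows), where rows[i] maps term index -> count for docs[i].

    Hypothetical helper: only non-zero counts are stored, which is the idea
    behind a sparse document-term matrix.
    """
    vocabulary = {}
    rows = []
    for text in docs:
        counts = Counter(text.lower().split())  # naive whitespace tokenizer (assumption)
        row = {}
        for term, n in counts.items():
            # Assign the next free column index to a term seen for the first time
            idx = vocabulary.setdefault(term, len(vocabulary))
            row[idx] = n
        rows.append(row)
    return vocabulary, rows

docs = ["the cat sat", "the dog sat on the mat"]
vocabulary, rows = sparse_term_counts(docs)

stored = sum(len(r) for r in rows)          # cells actually stored
dense_cells = len(docs) * len(vocabulary)   # cells a dense matrix would need
print(stored, "stored vs", dense_cells, "dense")
```

On a real corpus most documents contain only a small fraction of the vocabulary, so the gap between stored and dense cells grows with the number of documents.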

To use it on a list of texts in a directory:

DIR = 'C:/Data/'

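def fn_CorpusFromDIR(xDIR):
    ''' Function to create a corpus from a directory
    Input: Directory
    Output: A dictionary with
             Names of files ['ColNames']
             the text in corpus ['docs']'''
    import os
    # List the directory once so that docs and ColNames stay aligned
    files = os.listdir(xDIR)
    docs = []
    for f in files:
        # Read raw bytes and close the file promptly; CountVectorizer decodes bytes itself
        with open(os.path.join(xDIR, f), 'rb') as fh:
            docs.append(fh.read())
    # A list rather than map(), which is a one-shot iterator in Python 3
    Res = dict(docs = docs,
               ColNames = ['P_' + f[0:6] for f in files])
    return Res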

To create the dataframe:

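# Read the directory once instead of once per argument
corpus = fn_CorpusFromDIR(DIR)
# decode_error was called charset_error in old scikit-learn releases
d1 = fn_tdm_df(docs = corpus['docs'],
          xColNames = corpus['ColNames'],
          stop_words=None, decode_error = 'replace')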

Reference: https://stackoverflow.com/questions/15899861
