How to use word_tokenize on a Python DataFrame

I recently started using the nltk module for text analysis, and I am stuck at one point. I want to use word_tokenize on a DataFrame so that I can get all the words used in a particular row of the DataFrame.

data example:
       text
1.   This is a very good site. I will recommend it to others.
2.   Can you please give me a call at 9983938428. have issues with the listings.
3.   good work! keep it up
4.   not a very helpful site in finding home decor. 

expected output:

1.   'This','is','a','very','good','site','.','I','will','recommend','it','to','others','.'
2.   'Can','you','please','give','me','a','call','at','9983938428','.','have','issues','with','the','listings'
3.   'good','work','!','keep','it','up'
4.   'not','a','very','helpful','site','in','finding','home','decor'

Basically, I want to separate all the words and find the length of each text in the DataFrame.

I know word_tokenize works on a string, but how do I apply it to the entire DataFrame?

Please help!

Thanks in advance...

Solution

You can use the apply method of the DataFrame API.

import pandas as pd
import nltk  # word_tokenize needs NLTK's tokenizer data to be installed

df = pd.DataFrame({'sentences': [
    'This is a very good site. I will recommend it to others.',
    'Can you please give me a call at 9983938428. have issues with the listings.',
    'good work! keep it up',
]})
# Tokenize each row's text; axis=1 passes one row at a time to the lambda
df['tokenized_sents'] = df.apply(lambda row: nltk.word_tokenize(row['sentences']), axis=1)

Output:

>>> df
                                           sentences  \
0  This is a very good site. I will recommend it ...   
1  Can you please give me a call at 9983938428. h...   
2                              good work! keep it up   

                                     tokenized_sents  
0  [This, is, a, very, good, site, ., I, will, re...  
1  [Can, you, please, give, me, a, call, at, 9983...  
2                      [good, work, !, keep, it, up]
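
Because only one column is involved, the same result can probably be had without row-wise apply. The following is a minimal sketch, not part of the original answer: it calls Series.apply on the text column so that each cell is passed straight to the tokenizer. The helper choose_tokenizer is a hypothetical name introduced here; it assumes that falling back to NLTK's regex-based wordpunct_tokenize is acceptable when the Punkt data that word_tokenize relies on is not installed. The two tokenizers can split punctuation differently on some inputs.

```python
import pandas as pd
from nltk.tokenize import word_tokenize, wordpunct_tokenize


def choose_tokenizer():
    """Return word_tokenize if its data is installed, else a regex-based fallback.

    Hypothetical helper for this sketch only.
    """
    try:
        # NLTK raises LookupError when the required tokenizer data is missing
        word_tokenize("probe sentence.")
        return word_tokenize
    except LookupError:
        # Assumption: wordpunct_tokenize is an acceptable stand-in; it needs
        # no downloaded data but may split punctuation differently.
        return wordpunct_tokenize


df = pd.DataFrame({'sentences': [
    'This is a very good site. I will recommend it to others.',
    'good work! keep it up',
]})

tokenizer = choose_tokenizer()
# Series.apply hands each cell (a string) directly to the tokenizer
df['tokenized_sents'] = df['sentences'].apply(tokenizer)
print(df['tokenized_sents'].tolist())
```

Working on a single Series avoids building a row object for every record, which tends to matter on larger frames.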

To find the length of each text, use apply with a lambda function again:

df['sents_length'] = df.apply(lambda row: len(row['tokenized_sents']), axis=1)

>>> df
                                           sentences  \
0  This is a very good site. I will recommend it ...   
1  Can you please give me a call at 9983938428. h...   
2                              good work! keep it up   

                                     tokenized_sents  sents_length  
0  [This, is, a, very, good, site, ., I, will, re...            14  
1  [Can, you, please, give, me, a, call, at, 9983...            15  
2                      [good, work, !, keep, it, up]             6
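
As an alternative sketch for the length step, pandas' Series.str.len() also works on list-valued cells, so the token counts can be computed without a lambda. The frame below is built by hand from token lists matching the example above, so the snippet runs without NLTK.

```python
import pandas as pd

# Token lists copied from the example above, so no tokenizer is needed here
df = pd.DataFrame({'tokenized_sents': [
    ['This', 'is', 'a', 'very', 'good', 'site', '.',
     'I', 'will', 'recommend', 'it', 'to', 'others', '.'],
    ['good', 'work', '!', 'keep', 'it', 'up'],
]})

# .str.len() returns the length of each element, including lists
df['sents_length'] = df['tokenized_sents'].str.len()
print(df['sents_length'].tolist())
```

An equivalent spelling is df['tokenized_sents'].apply(len).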

Reference: https://stackoverflow.com/questions/33098040
