First off, I'm a complete beginner when it comes to Python. Still, I have written code that reads an RSS feed, opens each link, and extracts the text from the article. This is what I have so far:
from BeautifulSoup import BeautifulSoup
import feedparser
import re  # needed for the re.sub calls further down
import urllib
# Dictionaries
links = {}
titles = {}
# Variables
n = 0
rss_url = "feed://www.gfsc.gg/_layouts/GFSC/GFSCRSSFeed.aspx?Division=ALL&Article=All&Title=News&Type=doc&List=%7b66fa9b18-776a-4e91-9f80- 30195001386c%7d%23%7b679e913e-6301-4bc4-9fd9-a788b926f565%7d%23%7b0e65f37f-1129-4c78-8f59-3db5f96409fd%7d%23%7bdd7c290d-5f17-43b7-b6fd-50089368e090%7d%23%7b4790a972-c55f-46a5-8020-396780eb8506%7d%23%7b6b67c085-7c25-458d-8a98-373e0ac71c52%7d%23%7be3b71b9c-30ce-47c0-8bfb-f3224e98b756%7d%23%7b25853d98-37d7-4ba2-83f9-78685f2070df%7d%23%7b14c41f90-c462-44cf-a773-878521aa007c%7d%23%7b7ceaf3bf-d501-4f60-a3e4-2af84d0e1528%7d%23%7baf17e955-96b7-49e9-ad8a-7ee0ac097f37%7d%23%7b3faca1d0-be40-445c-a577-c742c2d367a8%7d%23%7b6296a8d6-7cab-4609-b7f7-b6b7c3a264d6%7d%23%7b43e2b52d-e4f1-4628-84ad-0042d644deaf%7d"
# Parse the RSS feed
feed = feedparser.parse(rss_url)
# view the entire feed, one entry at a time
for post in feed.entries:
    # Create variables from posts
    link = post.link
    title = post.title
    # Add the link to the dictionary
    n += 1
    links[n] = link
for k,v in links.items():
    # Open RSS feed
    page = urllib.urlopen(v).read()
    page = str(page)
    soup = BeautifulSoup(page)
    # Find all of the text between paragraph tags and strip out the html
    page = soup.find('p').getText()
    # Strip ampersand codes and WATCH:
    page = re.sub('&\w+;','',page)
    page = re.sub('WATCH:','',page)
    # Print Page
    print(page)
    print(" ")
    # To stop after 3rd article, just whilst testing ** to be removed **
    if (k >= 3):
        break
This produces the following output:
>>> (executing lines 1 to 45 of "RSS_BeautifulSoup.py")
?Total deposits held with Guernsey banks at the end of June 2012 increased 2.1% in sterling terms by £2.1 billion from the end of March 2012 level of £101 billion, up to £103.1 billion. This is 9.4% lower than the same time a year ago. Total assets and liabilities increased by £2.9 billion to £131.2 billion representing a 2.3% increase over the quarter though this was 5.7% lower than the level a year ago. The higher figures reflected the effects both of volume and exchange rate factors.
The net asset value of total funds under management and administration has increased over the quarter ended 30 June 2012 by £711 million (0.3%) to reach £270.8 billion.For the year since 30 June 2011, total net asset values decreased by £3.6 billion (1.3%).
The Commission has updated the warranties on the Form REG, Form QIF and Form FTL to take into account the Commission’s Guidance Notes on Personal Questionnaires and Personal Declarations. In particular, the following warranty (varies slightly dependent on the application) has been inserted in the aforementioned forms,
>>>
The problem is that this prints only the first paragraph of each article, but I need the whole article. Any help would be gratefully received.
Solution
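You're getting close!
# Find all of the text between paragraph tags and strip out the html
page = soup.find('p').getText()
find returns only the first matching element, which is why you see just the first paragraph of each article. A lookup like this one:
soup.find('div',{'id':'ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField'})
focuses on the body of the article.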
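Putting the two ideas together, here is a minimal sketch of how the whole article might be collected. It rests on several assumptions: it uses the newer bs4 package (BeautifulSoup 4) rather than the BeautifulSoup 3 import from the question, the helper name extract_article_text is hypothetical, and the sample HTML is made up purely to show the behaviour. Only the div id comes from the answer.

```python
from bs4 import BeautifulSoup

# id of the container div named in the answer
BODY_ID = "ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField"


def extract_article_text(html):
    """Return the text of every <p> in the article body, one paragraph per line.

    Hypothetical helper, not part of the original question or answer.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Narrow the search to the article body; fall back to the whole page
    # if the container div is not present.
    body = soup.find("div", {"id": BODY_ID})
    if body is None:
        body = soup
    # find_all returns every match, whereas find stops at the first one.
    paragraphs = [p.get_text(strip=True) for p in body.find_all("p")]
    return "\n".join(text for text in paragraphs if text)


# Made-up sample page, used only for illustration.
SAMPLE_HTML = """
<html><body>
<p>Site navigation</p>
<div id="ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
</body></html>
"""

print(extract_article_text(SAMPLE_HTML))
```

In the question's loop, a call like this would take the place of the soup.find('p').getText() line, provided the article pages all share the same container id. That has not been verified here.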
Reference page: https://stackoverflow.com/questions/12451997