python - BeautifulSoup - get_text, output in a single line
I am trying to extract the text of the following page and save it to a single cell of a CSV file. However, I keep getting line breaks at places where I don't see any "special" characters (i.e., there is no "\n", "\t", etc. in the text). The second line of the CSV file also has more than one non-empty cell, instead of all the text being saved to a single cell.
Here is my code:
# -*- coding: utf-8 -*-
# python 3.x
from bs4 import BeautifulSoup
import urllib.request, urllib.parse
import requests, csv, re, sys

csvfile = open('test.csv', 'w', encoding='cp850', errors='replace')
writer = csv.writer(csvfile)

list_url = ["http://www.sec.gov/archives/edgar/data/1025315/0000950127-05-000239.txt"]

for url in list_url:
    base_url_parts = urllib.parse.urlparse(url)
    while True:
        raw_html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(raw_html)

        #### scrape the page for the desired info
        text_10k = []
        ten_k = soup.get_text()
        ten_k = ten_k.strip().replace("\t", " ").replace("\r", " ").replace('\n', ' ')
        text_10k.append(ten_k)

        # zip the data
        output_data = zip([text_10k])

        # write the observations to the csv file
        writer = csv.writer(open('test_10k.csv', 'a', newline='', encoding='cp850', errors='replace'))
        writer.writerows(output_data)
        csvfile.flush()
I'm sure the error is something simple; it's been months since I've used Python and I could use a refresher. Many thanks!
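Editor's aside: soup.get_text() can also emit whitespace that the three replace() calls never touch, e.g. vertical tabs (\x0b), form feeds (\x0c), or the Unicode line separator (\u2028), and spreadsheet programs may render those as row breaks. A hedged sketch that collapses every whitespace run in one pass (ten_k is the variable name from the code above; the sample string is invented for illustration):

import re

ten_k = "line one\x0bline two\u2028line three"  # stand-in for soup.get_text() output

# \s in a Python 3 str pattern matches Unicode whitespace, so this also
# catches \x0b, \x0c and \u2028, not just \t, \r and \n
ten_k = re.sub(r"\s+", " ", ten_k).strip()
print(ten_k)  # -> "line one line two line three"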
Edit: The output is too long to copy in full, so here is an example:
line 1, cell 1: ['-----BEGIN PRIVACY-ENHANCED MESSAGE-----\nProc-Type ..... -8-", 'The change in working cap
line 2, cell 1: tal attributable to the loss for the\nyear
line 2, cell 2: , and a reduction in cash due to payments made on long-term notes payable.\n\n
What I want is all of it in a single cell (line 1, cell 1), with no line-break characters. So:
line 1, cell 1: ['-----BEGIN PRIVACY-ENHANCED MESSAGE-----\nProc-Type ..... -8-", 'The change in working captal attributable to the loss for the\nyear, and a reduction in cash due to payments made on long-term notes payable.\n\n
*Notice that the "i" goes missing from the word "capital" when it gets split between lines 1 and 2. I'm not sure what causes the line to break that way.
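Editor's aside: one hedged explanation for the silent mid-word split is a spreadsheet cell-size cap. Excel, for example, holds at most 32,767 characters per cell, and when it opens a CSV whose field is longer than that, the overflow can end up in the following row; the split would then come from the viewer, not from the csv module. A quick diagnostic (the stand-in string is invented; the real ten_k is the scraped text):

ten_k = "x" * 40000  # stand-in: the real ten_k is soup.get_text() after the replaces

# Excel caps a single cell at 32,767 characters; anything longer may be
# split across rows or truncated by the viewer rather than by Python
print(len(ten_k), len(ten_k) > 32767)  # -> 40000 True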
Edit 2: I made it work by saving to a .txt file instead (which works fine as long as I open the output in Notepad++ or similar). I still don't know why it does not work as a CSV, though.
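Editor's aside: the leading [' in line 1, cell 1 is a hint. zip([text_10k]) produces one-element tuples whose single field is the list text_10k itself, so csv.writer stringifies the whole list, brackets, quotes, and escaped \n included, which is why the literal \n sequences survive the replace() calls. Passing the string itself as a one-element row avoids that. A minimal sketch reusing the names from the question (the sample text is a placeholder):

import csv

ten_k = "example scraped text, standing in for soup.get_text() output"  # placeholder

with open('test_10k.csv', 'a', newline='', encoding='cp850', errors='replace') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)  # quote every field defensively
    writer.writerow([ten_k])  # one row with exactly one cell holding the raw string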
It would appear that, with while True:, the program ends up stuck in the while loop forever. Changing it to if url: should let it run once per URL. I should note that it would not run for me until I added the 'lxml' parser to BeautifulSoup: soup = BeautifulSoup(raw_html, 'lxml'). This does appear to put each URL's information into a single cell; because the amount of information in that cell is so large, it cannot be displayed in a standard spreadsheet.
# -*- coding: utf-8 -*-
# python 3.x
from bs4 import BeautifulSoup
import urllib.request, urllib.parse
import csv

csvfile = open('test.csv', 'w', encoding='cp850', errors='replace')
writer = csv.writer(csvfile)

list_url = ["http://www.sec.gov/archives/edgar/data/1025315/0000950127-05-000239.txt"]

for url in list_url:
    base_url_parts = urllib.parse.urlparse(url)
    if url:
        raw_html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(raw_html, 'lxml')

        #### scrape the page for the desired info
        text_10k = []
        ten_k = soup.get_text()
        ten_k = ten_k.strip().replace("\t", " ").replace("\r", " ").replace('\n', ' ')
        text_10k.append(ten_k)

        # zip the data
        output_data = zip([text_10k])

        # write the observations to the csv file
        writer = csv.writer(open('test_10k.csv', 'a', newline='', encoding='cp850', errors='replace'))
        writer.writerows(output_data)
        csvfile.flush()
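For what it's worth, here is a slightly tidier variant of the same approach (an editor's sketch, not tested against the SEC page): the with block closes and flushes the file automatically, and writerow([ten_k]) replaces the zip round-trip so each URL lands in exactly one cell.

# -*- coding: utf-8 -*-
# python 3.x -- an untested sketch of the same approach
from bs4 import BeautifulSoup
import urllib.request
import csv

list_url = ["http://www.sec.gov/archives/edgar/data/1025315/0000950127-05-000239.txt"]

with open('test_10k.csv', 'a', newline='', encoding='cp850', errors='replace') as csvfile:
    writer = csv.writer(csvfile)
    for url in list_url:
        raw_html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(raw_html, 'lxml')
        ten_k = soup.get_text()
        ten_k = ten_k.strip().replace('\t', ' ').replace('\r', ' ').replace('\n', ' ')
        writer.writerow([ten_k])  # one row, one cell per url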