python - BeautifulSoup - get_text, output in a single line
I am trying to extract the text of the following page and save it to a single cell of a CSV file. However, I keep getting line breaks at places where I don't see any "special" characters (i.e., there is no "\n", "\t", etc. in the text). The second line of the CSV file also has more than one non-empty cell, instead of all the text being saved to a single cell.
Here is my code:
# -*- coding: utf-8 -*-
# python 3.x
from bs4 import BeautifulSoup
import urllib.request, urllib.parse
import requests, csv, re, sys

csvfile = open('test.csv', 'w', encoding='cp850', errors='replace')
writer = csv.writer(csvfile)

list_url = ["http://www.sec.gov/archives/edgar/data/1025315/0000950127-05-000239.txt"]

for url in list_url:
    base_url_parts = urllib.parse.urlparse(url)
    while True:
        raw_html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(raw_html)

        #### scrape the page for the desired info
        text_10k = []
        ten_k = soup.get_text()
        ten_k = ten_k.strip().replace("\t", " ").replace("\r", " ").replace('\n', ' ')
        text_10k.append(ten_k)

        # zip the data
        output_data = zip([text_10k])

        # write the observations to the csv file
        writer = csv.writer(open('test_10k.csv', 'a', newline='', encoding='cp850', errors='replace'))
        writer.writerows(output_data)
        csvfile.flush()
I'm sure the error is something simple; it's been months since I've used Python and I could use a refresher. Many thanks!
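Editor's aside: soup.get_text() can also emit whitespace that the three replace() calls never touch, e.g. vertical tabs (\x0b), form feeds (\x0c), or the Unicode line separator (\u2028), and spreadsheet programs may render those as row breaks. A hedged sketch that collapses every whitespace run in one pass (ten_k is the variable name from the code above; the sample string is invented for illustration):

import re

ten_k = "line one\x0bline two\u2028line three"  # stand-in for soup.get_text() output

# \s in a Python 3 str pattern matches Unicode whitespace, so this also
# catches \x0b, \x0c and \u2028, not just \t, \r and \n
ten_k = re.sub(r"\s+", " ", ten_k).strip()
print(ten_k)  # -> "line one line two line three"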
Edit: The output is too long to copy in full, so here is an example:
line 1, cell 1: ['-----BEGIN PRIVACY-ENHANCED MESSAGE-----\nProc-Type ..... -8-", 'The change in working cap
line 2, cell 1: tal attributable to the loss for the\nyear
line 2, cell 2: , and a reduction in cash due to payments made on long-term notes payable.\n\n
What I want is all of it in a single cell (line 1, cell 1), with no line-break characters. So:
line 1, cell 1: ['-----BEGIN PRIVACY-ENHANCED MESSAGE-----\nProc-Type ..... -8-", 'The change in working captal attributable to the loss for the\nyear, and a reduction in cash due to payments made on long-term notes payable.\n\n
*Notice that the "i" goes missing from the word "capital" when it gets split between lines 1 and 2. I'm not sure what causes the line to break that way.
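Editor's aside: one hedged explanation for the silent mid-word split is a spreadsheet cell-size cap. Excel, for example, holds at most 32,767 characters per cell, and when it opens a CSV whose field is longer than that, the overflow can end up in the following row; the split would then come from the viewer, not from the csv module. A quick diagnostic (the stand-in string is invented; the real ten_k is the scraped text):

ten_k = "x" * 40000  # stand-in: the real ten_k is soup.get_text() after the replaces

# Excel caps a single cell at 32,767 characters; anything longer may be
# split across rows or truncated by the viewer rather than by Python
print(len(ten_k), len(ten_k) > 32767)  # -> 40000 True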
Edit 2: I made it work by saving to a .txt file instead (which works fine as long as I open the output in Notepad++ or similar). I still don't know why it does not work as a CSV, though.
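Editor's aside: the leading [' in line 1, cell 1 is a hint. zip([text_10k]) produces one-element tuples whose single field is the list text_10k itself, so csv.writer stringifies the whole list, brackets, quotes, and escaped \n included, which is why the literal \n sequences survive the replace() calls. Passing the string itself as a one-element row avoids that. A minimal sketch reusing the names from the question (the sample text is a placeholder):

import csv

ten_k = "example scraped text, standing in for soup.get_text() output"  # placeholder

with open('test_10k.csv', 'a', newline='', encoding='cp850', errors='replace') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)  # quote every field defensively
    writer.writerow([ten_k])  # one row with exactly one cell holding the raw string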
It would appear that, with while True:, the program ends up stuck in the while loop forever. Changing it to if url: should let it run once per URL. I should note that it would not run for me until I added the 'lxml' parser to BeautifulSoup: soup = BeautifulSoup(raw_html, 'lxml'). This does appear to put each URL's information into a single cell; because the amount of information in that cell is so large, it cannot be displayed in a standard spreadsheet.
# -*- coding: utf-8 -*-
# python 3.x
from bs4 import BeautifulSoup
import urllib.request, urllib.parse
import csv

csvfile = open('test.csv', 'w', encoding='cp850', errors='replace')
writer = csv.writer(csvfile)

list_url = ["http://www.sec.gov/archives/edgar/data/1025315/0000950127-05-000239.txt"]

for url in list_url:
    base_url_parts = urllib.parse.urlparse(url)
    if url:
        raw_html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(raw_html, 'lxml')

        #### scrape the page for the desired info
        text_10k = []
        ten_k = soup.get_text()
        ten_k = ten_k.strip().replace("\t", " ").replace("\r", " ").replace('\n', ' ')
        text_10k.append(ten_k)

        # zip the data
        output_data = zip([text_10k])

        # write the observations to the csv file
        writer = csv.writer(open('test_10k.csv', 'a', newline='', encoding='cp850', errors='replace'))
        writer.writerows(output_data)
        csvfile.flush()
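For what it's worth, here is a slightly tidier variant of the same approach (an editor's sketch, not tested against the SEC page): the with block closes and flushes the file automatically, and writerow([ten_k]) replaces the zip round-trip so each URL lands in exactly one cell.

# -*- coding: utf-8 -*-
# python 3.x -- an untested sketch of the same approach
from bs4 import BeautifulSoup
import urllib.request
import csv

list_url = ["http://www.sec.gov/archives/edgar/data/1025315/0000950127-05-000239.txt"]

with open('test_10k.csv', 'a', newline='', encoding='cp850', errors='replace') as csvfile:
    writer = csv.writer(csvfile)
    for url in list_url:
        raw_html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(raw_html, 'lxml')
        ten_k = soup.get_text()
        ten_k = ten_k.strip().replace('\t', ' ').replace('\r', ' ').replace('\n', ' ')
        writer.writerow([ten_k])  # one row, one cell per url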