Python - seek in HTTP response stream


Using urllib (or urllib2), I want to seek within an HTTP response stream, but this seems hopeless. Is there a solution?

I'm not sure how the C# implementation works but, as internet streams are generally not seekable, my guess is that it downloads the data to a local file or in-memory object and seeks within it from there. The Python equivalent of this would be to do as abafei suggested and write the data to a file or StringIO and seek in there.
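The download-then-seek idea above can be sketched in a few lines. This uses io.BytesIO (the byte-oriented sibling of StringIO, available in both Python 2 and 3) and a made-up payload standing in for the downloaded response body:

```python
from io import BytesIO

# Pretend this is the full body of an HTTP response we already downloaded
# (a hypothetical payload; any bytes behave the same way).
buf = BytesIO()
buf.write(b"0123456789abcdef")

# Unlike the network stream itself, the in-memory copy is seekable.
buf.seek(10)           # jump to the 11th byte
tail = buf.read()      # everything from there on: b'abcdef'
buf.seek(0)            # and back to the start
head = buf.read(4)     # the first four bytes: b'0123'
```

The same calls work on a local file object if the response is written to disk instead.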

However, if, as your comment on abafei's answer suggests, you want to retrieve a particular part of the file (rather than seeking backwards and forwards through the returned data), there is another possibility. urllib2 can be used to retrieve a certain section (or 'range' in HTTP parlance) of a webpage, provided that the server supports this behaviour.

The Range header

When you send a request to a server, the parameters of the request are given in various headers. One of these is the Range header, defined in section 14.35 of RFC 2616 (the specification defining HTTP/1.1). This header allows you to do things such as retrieve all data starting from the 10,000th byte, or the data between bytes 1,000 and 1,500.
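Building the header value is simple string formatting; a tiny helper (the name make_range_header is hypothetical, not part of any library) shows the two forms described above:

```python
def make_range_header(start, end=None):
    """Build a Range header value for a byte range.

    "bytes=10000-" asks for everything from byte 10000 onwards;
    "bytes=1000-1500" asks for bytes 1000 through 1500 inclusive.
    """
    if end is None:
        return "bytes=%d-" % start
    return "bytes=%d-%d" % (start, end)
```

For example, make_range_header(10000) gives "bytes=10000-", and make_range_header(1000, 1500) gives "bytes=1000-1500".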

Server support

There is no requirement for a server to support range retrieval. Some servers will return the Accept-Ranges header (section 14.5 of RFC 2616) along with a response to report whether they support ranges or not. This could be checked using a HEAD request. However, there is no particular need to do this; if the server does not support ranges, it will return the entire page and we can then extract the desired portion of data in Python as before.
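Interpreting the Accept-Ranges value, should you check it, comes down to one comparison; a sketch (the helper name supports_ranges is hypothetical):

```python
def supports_ranges(accept_ranges):
    """Return True if an Accept-Ranges header value advertises byte ranges.

    A value of "bytes" means byte-range requests are supported; "none"
    (or a missing header, passed here as None) means they are not.
    """
    return accept_ranges is not None and accept_ranges.strip().lower() == "bytes"
```

Note that even a server that does not send Accept-Ranges may still honour a Range request, so trying the request directly is usually the simpler route.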

Checking if a range was returned

If the server returns a range, it must send the Content-Range header (section 14.16 of RFC 2616) along with the response. If this is present in the headers of the response, we know a range was returned; if it is not present, the entire page was returned.
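The Content-Range value has the form "bytes start-end/total", where total may be an asterisk if the full size is unknown. Pulling it apart takes two splits; a sketch (the name parse_content_range is hypothetical):

```python
def parse_content_range(value):
    """Split a Content-Range value into (byte_range, total).

    e.g. "bytes 17387-19386/19387" -> ("17387-19386", "19387")
         "bytes 0-499/*"           -> ("0-499", "*")
    """
    byte_range, total = value.split(" ")[-1].split("/")
    return byte_range, total
```

This is the same parsing used in the full script below the next section.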

Implementation with urllib2

urllib2 allows us to add headers to a request, allowing us to ask the server for a range rather than the entire page. The following script takes a URL, a start position, and (optionally) a length on the command line, and tries to retrieve the given section of the page.

import sys
import urllib2

# Check the command line arguments.
if len(sys.argv) < 3:
    sys.stderr.write("Usage: %s url start [length]\n" % sys.argv[0])
    sys.exit(1)

# Create a request for the given URL.
request = urllib2.Request(sys.argv[1])

# Add the header to specify the range to download.
if len(sys.argv) > 3:
    start, length = map(int, sys.argv[2:])
    request.add_header("Range", "bytes=%d-%d" % (start, start + length - 1))
else:
    request.add_header("Range", "bytes=%s-" % sys.argv[2])

# Try to get the response. This will raise a urllib2.URLError if there is a
# problem (e.g., invalid URL).
response = urllib2.urlopen(request)

# If a Content-Range header is present, partial retrieval worked.
if "Content-Range" in response.headers:
    print "Partial retrieval successful."

    # The header contains the string 'bytes', followed by a space, then the
    # range in the format 'start-end', followed by a slash and then the total
    # size of the page (or an asterisk if the total size is unknown). Let's get
    # the range and total size from this.
    range, total = response.headers['Content-Range'].split(' ')[-1].split('/')

    # Print a message giving the range information.
    if total == '*':
        print "Bytes %s of an unknown total were retrieved." % range
    else:
        print "Bytes %s of a total of %s were retrieved." % (range, total)

# No header, so partial retrieval was unsuccessful.
else:
    print "Unable to use partial retrieval."

# And for good measure, let's check how much data we downloaded.
data = response.read()
print "Retrieved data size: %d bytes" % len(data)

Using this, we can retrieve the final 2,000 bytes of the Python homepage:

blair@blair-eeepc:~$ python retrieverange.py http://www.python.org/ 17387
Partial retrieval successful.
Bytes 17387-19386 of a total of 19387 were retrieved.
Retrieved data size: 2000 bytes

Or 400 bytes from the middle of the homepage:

blair@blair-eeepc:~$ python retrieverange.py http://www.python.org/ 6000 400
Partial retrieval successful.
Bytes 6000-6399 of a total of 19387 were retrieved.
Retrieved data size: 400 bytes

However, the Google homepage does not support ranges:

blair@blair-eeepc:~$ python retrieverange.py http://www.google.com/ 1000 500
Unable to use partial retrieval.
Retrieved data size: 9621 bytes

In this case, it would be necessary to extract the data of interest in Python prior to any further processing.

