Python - seek in HTTP response stream
Using urllib (or urllib2), getting what I want seems hopeless. Is there a solution?
I'm not sure how the C# implementation works but, since internet streams are generally not seekable, my guess is that it downloads all the data to a local file or in-memory object and seeks within that. The Python equivalent would be to do as Abafei suggested: write the data to a file or StringIO and seek from there.
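As a rough illustration of that buffering approach, the sketch below copies a response into an in-memory buffer and then seeks within it. The helper name `buffer_stream` is made up for this example, and `io.BytesIO` is used in place of StringIO on the assumption that the body is handled as bytes. Any object with a `read()` method (such as the result of `urllib2.urlopen`) should work as the source; a fake response is used here so the example needs no network.

```python
import io


def buffer_stream(stream, chunk_size=8192):
    """Copy a non-seekable stream into a seekable in-memory buffer.

    `stream` is any object with a read(size) method, e.g. an HTTP response.
    (Hypothetical helper, not part of urllib2.)
    """
    buf = io.BytesIO()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf.write(chunk)
    buf.seek(0)  # rewind so the caller starts at the beginning
    return buf


# Stand-in for an HTTP response body.
fake_response = io.BytesIO(b"0123456789abcdefghij")

buffered = buffer_stream(fake_response, chunk_size=4)
buffered.seek(10)        # jump forwards
tail = buffered.read(5)  # b"abcde"
buffered.seek(2)         # and backwards again
head = buffered.read(3)  # b"234"
```

The cost is that the whole body is downloaded and held in memory before any seeking happens, which is why the range approach may be preferable for large files.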
However, if, as your comment on Abafei's answer suggests, you only want to retrieve a particular part of the file (rather than seeking backwards and forwards through the returned data), there is another possibility. urllib2 can be used to retrieve a certain section (or 'range' in HTTP parlance) of a webpage, provided that the server supports this behaviour.
The Range header
When you send a request to a server, the parameters of the request are given in various headers. One of these is the Range header, defined in section 14.35 of RFC 2616 (the specification defining HTTP/1.1). This header allows you to do things such as retrieve all data starting from the 10,000th byte, or the data between bytes 1,000 and 1,500.
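To make the header format concrete, here is a minimal sketch of how the header value can be built. The function names are my own, not part of urllib2. Byte positions are zero-based and the end position is inclusive, which is why the length has to be converted with `start + length - 1`. The second helper uses the suffix form (`bytes=-N`), which the same section of the RFC defines for requesting the final N bytes without knowing the total size.

```python
def range_header(start, length=None):
    """Return a Range header value for `length` bytes starting at `start`.

    With no length, the range runs to the end of the resource.
    (Hypothetical helper for illustration.)
    """
    if start < 0:
        raise ValueError("start must be non-negative")
    if length is None:
        return "bytes=%d-" % start
    if length <= 0:
        raise ValueError("length must be positive")
    # The end position is inclusive, hence the "- 1".
    return "bytes=%d-%d" % (start, start + length - 1)


def suffix_range_header(count):
    """Return a Range header value asking for the final `count` bytes."""
    if count <= 0:
        raise ValueError("count must be positive")
    return "bytes=-%d" % count


everything_from_10000 = range_header(10000)     # "bytes=10000-"
bytes_1000_to_1500 = range_header(1000, 501)    # "bytes=1000-1500"
last_2000_bytes = suffix_range_header(2000)     # "bytes=-2000"
```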
Server support
There is no requirement for a server to support range retrieval. Some servers will return the Accept-Ranges header (section 14.5 of RFC 2616) along with a response to report whether or not they support ranges. This could be checked using a HEAD request. However, there is no particular need to do this; if a server does not support ranges, it will return the entire page, and you can then extract the desired portion of the data in Python as before.
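If you do want to check up front, the interpretation of the header could look something like the sketch below. `supports_byte_ranges` is a hypothetical helper that takes any mapping with a `get()` method; the headers object of a urllib2 response should fit, and a plain dict is used here for illustration. Sending the HEAD request itself is left out.

```python
def supports_byte_ranges(headers):
    """Interpret the Accept-Ranges header from a response's headers.

    Returns True if the server advertises byte ranges, False if it
    advertises something else (typically 'none'), and None if the header
    is absent, in which case the server may or may not honour a Range
    request.
    """
    value = headers.get("Accept-Ranges")
    if value is None:
        return None
    units = [unit.strip().lower() for unit in value.split(",")]
    return "bytes" in units


advertised = supports_byte_ranges({"Accept-Ranges": "bytes"})  # True
refused = supports_byte_ranges({"Accept-Ranges": "none"})      # False
unknown = supports_byte_ranges({})                             # None
```

Note that a plain dict lookup is case-sensitive, whereas HTTP header names are not, so with a dict the key has to be spelled exactly as shown.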
Checking if a range is returned
If a server returns a range, it must send the Content-Range header (section 14.16 of RFC 2616) along with the response. If this is present in the headers of the response, we know a range was returned; if it is not present, the entire page was returned.
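For a successful partial response the header value looks like `bytes 6000-6399/19387`, with `*` in place of the total when the size is unknown. A small parser for that form might look like this; `parse_content_range` is a name I made up, and the sketch only handles the `bytes first-last/total` shape, not the `bytes */total` form a server may send with an error response.

```python
def parse_content_range(value):
    """Parse a header value like 'bytes 6000-6399/19387'.

    Returns (first, last, total), where total is None if the server
    sent '*' for an unknown total size. Raises ValueError for anything
    that does not match the expected shape.
    """
    unit, _, spec = value.strip().partition(" ")
    if unit.lower() != "bytes":
        raise ValueError("unsupported range unit: %r" % unit)
    byte_range, _, total = spec.strip().partition("/")
    first, _, last = byte_range.partition("-")
    total = total.strip()
    return int(first), int(last), (None if total == "*" else int(total))


known = parse_content_range("bytes 6000-6399/19387")   # (6000, 6399, 19387)
unknown_total = parse_content_range("bytes 0-499/*")   # (0, 499, None)
```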
Implementation with urllib2
urllib2 allows us to add headers to a request, thus allowing us to ask the server for a range rather than the entire page. The following script takes a URL, a start position, and (optionally) a length on the command line, and tries to retrieve the given section of the page.
    # Python 2 script (urllib2 does not exist in Python 3).
    import sys
    import urllib2

    # Check command line arguments.
    if len(sys.argv) < 3:
        sys.stderr.write("Usage: %s url start [length]\n" % sys.argv[0])
        sys.exit(1)

    # Create a request for the given URL.
    request = urllib2.Request(sys.argv[1])

    # Add the header to specify the range to download.
    if len(sys.argv) > 3:
        start, length = map(int, sys.argv[2:4])
        request.add_header("Range", "bytes=%d-%d" % (start, start + length - 1))
    else:
        request.add_header("Range", "bytes=%s-" % sys.argv[2])

    # Try to get the response. This will raise a urllib2.URLError if there is a
    # problem (e.g., invalid URL).
    response = urllib2.urlopen(request)

    # If a Content-Range header is present, partial retrieval worked.
    if "content-range" in response.headers:
        print "Partial retrieval successful."

        # The header contains the string 'bytes', followed by a space, then the
        # range in the format 'start-end', followed by a slash and then the total
        # size of the page (or an asterisk if the total size is unknown). Let's
        # get the range and total size from this.
        byte_range, total = response.headers['content-range'].split(' ')[-1].split('/')

        # Print a message giving the range information.
        if total == '*':
            print "Bytes %s of an unknown total were retrieved." % byte_range
        else:
            print "Bytes %s of a total of %s were retrieved." % (byte_range, total)

    # No header, so partial retrieval was unsuccessful.
    else:
        print "Unable to use partial retrieval."

    # And for good measure, let's check how much data we downloaded.
    data = response.read()
    print "Retrieved data size: %d bytes" % len(data)
Using this, I can retrieve the final 2,000 bytes of the Python homepage:

    blair@blair-eeepc:~$ python retrieverange.py http://www.python.org/ 17387
    Partial retrieval successful.
    Bytes 17387-19386 of a total of 19387 were retrieved.
    Retrieved data size: 2000 bytes
Or 400 bytes from the middle of the homepage:

    blair@blair-eeepc:~$ python retrieverange.py http://www.python.org/ 6000 400
    Partial retrieval successful.
    Bytes 6000-6399 of a total of 19387 were retrieved.
    Retrieved data size: 400 bytes
However, the Google homepage does not support ranges:

    blair@blair-eeepc:~$ python retrieverange.py http://www.google.com/ 1000 500
    Unable to use partial retrieval.
    Retrieved data size: 9621 bytes
In this case, it would be necessary to extract the data of interest in Python prior to any further processing.
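That fallback amounts to slicing the downloaded body. A minimal sketch, assuming the body has already been read into a bytes object and that you know whether a Content-Range header came back (`extract_range` and its `got_partial` flag are names invented for this example):

```python
def extract_range(data, start, length=None, got_partial=False):
    """Return the requested bytes from a response body.

    If the server honoured the Range header (got_partial=True), the body
    already is the requested range. Otherwise the body is the whole page
    and the range has to be sliced out locally.
    """
    if got_partial:
        return data
    if length is None:
        return data[start:]
    return data[start:start + length]


# Stand-in for a full page returned by a server that ignored the Range header.
page = b"x" * 1000 + b"y" * 500 + b"z" * 100
section = extract_range(page, 1000, 500)   # the 500 "y" bytes
```

Bear in mind that in this case the whole page has still been transferred, so the slice saves nothing on bandwidth; it only gives later code the same bytes it would have received from a server that supports ranges.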