BitTorrent - Creating a daemon using Python libtorrent for fetching metadata of 100k+ torrents
I am trying to fetch metadata for around 10k+ torrents per day using Python libtorrent.
This is the current flow of the code:
- Start a libtorrent session.
- Get the total count of torrents that need metadata and were uploaded within the last day.
- Get the torrent hashes from the DB in chunks.
- Create magnet links from the hashes and add the magnet URIs to the session, creating a handle for each magnet URI.
- Sleep for a second while metadata is fetched, and keep checking whether the metadata has been found.
- If metadata is received, add it to the DB; otherwise check whether we have been looking for the metadata for around 10 minutes, and if so remove the handle, i.e. stop looking for that metadata.
- Do the above indefinitely, and save the session state for the future.
So far I have tried this:
#!/usr/bin/env python
# This file runs as a client or daemon and fetches torrent metadata
# (i.e. .torrent files) from magnet URIs.

import libtorrent as lt        # libtorrent library
import tempfile                # temp dir to hold data while fetching metadata
import shutil                  # removing the temp directory tree
import os
import os.path                 # checking for the saved session state file
import MySQLdb                 # DB connectivity
from time import sleep
from datetime import date, timedelta

session = lt.session(lt.fingerprint("UT", 3, 4, 5, 0), flags=0)
session.listen_on(6881, 6891)
session.add_extension('ut_metadata')
session.add_extension('ut_pex')
session.add_extension('smart_ban')
session.add_extension('metadata_transfer')

session_save_filename = "/magnet2torrent/magnet_to_torrent_daemon.save_state"

if os.path.isfile(session_save_filename):
    fileread = open(session_save_filename, 'rb')
    session.load_state(lt.bdecode(fileread.read()))
    fileread.close()
    print('session loaded from file')
else:
    print('new session started')

session.add_dht_router("router.utorrent.com", 6881)
session.add_dht_router("router.bittorrent.com", 6881)
session.add_dht_router("dht.transmissionbt.com", 6881)
session.add_dht_router("dht.aelitis.com", 6881)

session.start_dht()
session.start_lsd()
session.start_upnp()
session.start_natpmp()

alive = True
while alive:
    # open database connection
    db_conn = MySQLdb.connect(host='', user='', passwd='', db='',
                              unix_socket='/mysql/mysql.sock')

    # get all records where enabled = 0 and uploaded within the last day
    subset_count = 100

    yesterday = date.today() - timedelta(1)
    yesterday = yesterday.strftime('%Y-%m-%d %H:%M:%S')

    total_count_query = ("SELECT COUNT(*) AS total_count FROM content "
                         "WHERE upload_date > '" + yesterday + "' AND enabled = '0'")
    try:
        total_count_cursor = db_conn.cursor()
        total_count_cursor.execute(total_count_query)
        total_count_results = total_count_cursor.fetchone()
        total_count = total_count_results[0]
        print(total_count)
    except:
        print "Error: unable to select data"

    total_pages = total_count / subset_count

    current_page = 1
    while current_page <= total_pages:
        from_count = (current_page * subset_count) - subset_count

        hashes = []
        get_mysql_data_query = ("SELECT hash FROM content "
                                "WHERE upload_date > '" + yesterday + "' AND enabled = '0' "
                                "ORDER BY record_num DESC "
                                "LIMIT " + str(from_count) + " , " + str(subset_count))
        try:
            get_mysql_data_cursor = db_conn.cursor()
            get_mysql_data_cursor.execute(get_mysql_data_query)
            get_mysql_data_results = get_mysql_data_cursor.fetchall()
            for row in get_mysql_data_results:
                hashes.append(row[0].upper())
        except:
            print "Error: unable to select data"

        handles = []
        for hash in hashes:
            tempdir = tempfile.mkdtemp()
            add_magnet_uri_params = {
                'save_path': tempdir,
                'duplicate_is_error': True,
                'storage_mode': lt.storage_mode_t(2),
                'paused': False,
                'auto_managed': True,
            }
            magnet_uri = ("magnet:?xt=urn:btih:" + hash.upper() +
                          "&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80"
                          "&tr=udp%3A%2F%2Ftracker.publicbt.com%3A80"
                          "&tr=udp%3A%2F%2Ftracker.ccc.de%3A80")
            handle = lt.add_magnet_uri(session, magnet_uri, add_magnet_uri_params)
            handles.append(handle)  # push handle into the handles list

        while len(handles) != 0:
            # iterate over a copy so removing from 'handles' is safe
            for h in list(handles):
                if h.has_metadata():
                    torinfo = h.get_torrent_info()
                    final_info_hash = str(torinfo.info_hash()).upper()
                    torfile = lt.create_torrent(torinfo)
                    torcontent = lt.bencode(torfile.generate())
                    try:
                        insert_cursor = db_conn.cursor()
                        insert_cursor.execute(
                            """INSERT INTO dht_tfiles (hash, tdata) VALUES (%s, %s)""",
                            [final_info_hash, torcontent])
                        db_conn.commit()
                    except MySQLdb.Error, e:
                        try:
                            print "MySQL Error [%d]: %s" % (e.args[0], e.args[1])
                        except IndexError:
                            print "MySQL Error: %s" % str(e)
                    shutil.rmtree(h.save_path())  # remove temp data directory
                    session.remove_torrent(h)     # remove torrent handle from session
                    handles.remove(h)             # remove handle from list
                else:
                    # give up on a handle that is more than 10 minutes (600 s) old
                    if h.status().active_time > 600:
                        shutil.rmtree(h.save_path())
                        session.remove_torrent(h)
                        handles.remove(h)
            sleep(1)

        current_page = current_page + 1

        # save session state
        filewrite = open(session_save_filename, "wb")
        filewrite.write(lt.bencode(session.save_state()))
        filewrite.close()

    print('sleep60')
    sleep(60)

    # save session state
    filewrite = open(session_save_filename, "wb")
    filewrite.write(lt.bencode(session.save_state()))
    filewrite.close()
I kept the above script running overnight and found that only around 1,200 torrents' metadata had been fetched in the overnight session. I am looking to improve the performance of the script.
I have even tried decoding the save_state file and noticed there are 700+ DHT nodes I am connected to, so it's not as if the DHT is not running.
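For reference, a minimal sketch of that check. The key layout ('dht state' holding compact 26-byte node entries) is an assumption based on 0.16-era libtorrent; inspect the decoded dict for your own version:

import libtorrent as lt

filehandle = open('/magnet2torrent/magnet_to_torrent_daemon.save_state', 'rb')
state = lt.bdecode(filehandle.read())
filehandle.close()

dht_state = state.get('dht state', {})   # assumed key name; varies by version
nodes = dht_state.get('nodes', [])
# depending on the libtorrent version this is either a list of entries or one
# concatenated compact string (20-byte node ID + 6-byte endpoint = 26 bytes)
if isinstance(nodes, list):
    print('DHT nodes in saved state: %d' % len(nodes))
else:
    print('DHT nodes in saved state: %d' % (len(nodes) / 26))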
What I am planning to do is keep the handles active in the session indefinitely while metadata has not been fetched, i.e. not remove the handles after 10 minutes as I am currently doing. A sketch of that change follows.
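In code terms the planned change is roughly this (a sketch only; store_metadata is a hypothetical helper standing in for the bencode-and-INSERT logic from the script above):

while handles:
    for h in list(handles):            # copy, so removal during iteration is safe
        if h.has_metadata():
            store_metadata(h)          # hypothetical helper: bencode + INSERT as above
            shutil.rmtree(h.save_path())
            session.remove_torrent(h)
            handles.remove(h)
        # no timeout branch: unfinished handles simply stay active
    sleep(1)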
I have a few questions regarding the libtorrent Python bindings:
- How many handles can I keep running? Is there a limit on the number of running handles?
- Will running 10k+ or 100k handles slow down my system or eat up resources? If yes, which resources? I mean RAM and network?
- I am behind a firewall; can a blocked incoming port be causing the slow speed of metadata fetching?
- Can a DHT server like router.bittorrent.com (or any other) ban my IP address for sending too many requests?
- Can other peers ban my IP address if they find out I am making too many requests only for fetching metadata?
- Can I run multiple instances of this script, or maybe use multi-threading? Will that give better performance?
- If using multiple instances of the same script, each script will get a unique node-id depending on the IP and port it is using; is this a viable solution?
Is there a better approach to achieving what I am trying to do?
I can't answer questions specific to libtorrent's APIs, but some of your questions apply to BitTorrent in general.
Will running 10k+ or 100k handles slow down my system or eat up resources? If yes, which resources? I mean RAM and network?
Metadata downloads shouldn't use many resources, since they are not full torrent downloads yet, i.e. they can't allocate the actual files or anything like that. They only need some RAM/disk space for the metadata itself once they grab the first chunk of it.
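One way to observe this from the bindings (a hedged sketch; field names as in the 0.16-era Python bindings) is to poll each handle's status: before metadata arrives there is no torrent_info, so no payload files exist, and total_download stays tiny:

for h in handles:
    s = h.status()
    # total_download covers protocol/metadata traffic only until metadata arrives
    print('%s peers=%d downloaded=%d bytes has_metadata=%s' %
          (str(h.info_hash()), s.num_peers, s.total_download, h.has_metadata()))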
I am behind a firewall; can a blocked incoming port be causing the slow speed of metadata fetching?
Yes: by reducing the number of peers that can establish connections, it becomes more difficult to fetch metadata (or to establish any connection at all) on swarms with a low peer count.
NATs can cause the same issue.
Can a DHT server like router.bittorrent.com (or any other) ban my IP address for sending too many requests?
router.bittorrent.com is a bootstrap node, not a server per se. Lookups don't query a single node, they query many different ones (among millions). But yes, individual nodes can ban, or more likely rate-limit, you.
This can be mitigated by looking up randomly distributed IDs to spread the load across the DHT keyspace.
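Since info-hashes are themselves essentially uniformly distributed over the keyspace, even something as simple as randomising the lookup order within each chunk helps avoid bursts against one DHT region (a sketch against the 'hashes' list from the script above):

import random
random.shuffle(hashes)  # spread consecutive lookups across the DHT keyspace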
Can I run multiple instances of this script, or maybe use multi-threading? Will that give better performance?
AIUI libtorrent is sufficiently non-blocking or multi-threaded that it can schedule many torrents at once.
I don't know whether libtorrent has a rate limit on outgoing DHT requests.
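If it turns out not to, a simple client-side throttle on how fast magnet URIs are added is easy to bolt on (a sketch; the 5/s figure is an arbitrary starting point to tune):

from time import time, sleep

class RateLimiter(object):
    """Spaces out actions to at most per_second per second."""
    def __init__(self, per_second):
        self.interval = 1.0 / per_second
        self.next_slot = time()
    def wait(self):
        now = time()
        if now < self.next_slot:
            sleep(self.next_slot - now)
        self.next_slot = max(now, self.next_slot) + self.interval

limiter = RateLimiter(5)   # at most 5 new magnet lookups per second
for hash in hashes:
    limiter.wait()
    # ... lt.add_magnet_uri(session, magnet_uri, add_magnet_uri_params) as above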
If using multiple instances of the same script, each script will get a unique node-id depending on the IP and port it is using; is this a viable solution?
If you mean the DHT node ID, it is derived from the IP (as per BEP 42), not the port. Although some random element is included, only a limited number of IDs can be obtained per IP.
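For illustration, here is roughly how the BEP 42 restriction works: the top 21 bits of a node ID must match a CRC32C of the masked external IP combined with a 3-bit random value r, so a single IPv4 address can legitimately claim only 8 distinct ID prefixes. This sketch uses a pure-Python CRC32C to stay dependency-free; consult BEP 42 itself for the authoritative algorithm:

def crc32c(data):
    # bitwise CRC-32C (Castagnoli polynomial, reflected form)
    crc = 0xFFFFFFFF
    for byte in bytearray(data):
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def bep42_prefix(ip_bytes, r):
    # mask the IPv4 address, fold in the 3-bit r, keep the top 21 bits of the CRC
    masked = bytearray(a & m for a, m in
                       zip(bytearray(ip_bytes), [0x03, 0x0F, 0x3F, 0xFF]))
    masked[0] |= (r & 0x07) << 5
    return crc32c(bytes(masked)) >> 11

# one IPv4 address yields only 8 valid prefixes:
# set(bep42_prefix(b'\x01\x02\x03\x04', r) for r in range(8))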
And some of this might be applicable to your scenario: http://blog.libtorrent.org/2012/01/seeding-a-million-torrents/
Another option is my own DHT implementation, which includes a CLI to bulk-fetch torrents.