bittorrent - Creating a daemon using Python libtorrent for fetching metadata of 100k+ torrents


I am trying to fetch metadata for around 10k+ torrents per day using Python libtorrent.

This is the current flow of the code:

  1. Start a libtorrent session.
  2. Get the total count of torrents that need metadata and were uploaded within the last day.
  3. Get the torrent hashes from the DB in chunks.
  4. Create a magnet link from each hash and add the magnet URIs to the session, creating a handle for each magnet URI.
  5. Sleep for a second while metadata is fetched, and keep checking whether the metadata has been found or not.
  6. If the metadata is received, add it to the DB; otherwise check whether we have been looking for the metadata for around 10 minutes, and if yes remove the handle, i.e. don't look for that metadata any more for now.
  7. Do the above indefinitely, and save the session state for the future.

So far I have tried this:

#!/usr/bin/env python
# this file will run as a client or daemon and fetch torrent metadata, i.e. torrent files, from magnet URIs

import libtorrent as lt     # libtorrent library
import tempfile             # temp dir for save_path while fetching metadata
import sys                  # getting arguments from shell or exiting the script
from time import sleep      # sleep
import shutil               # removing directory tree from temp directory
import os.path              # getting pwd and other things
from pprint import pprint   # debugging, showing object data
import MySQLdb              # DB connectivity
import os
from datetime import date, timedelta

session = lt.session(lt.fingerprint("UT", 3, 4, 5, 0), flags=0)
session.listen_on(6881, 6891)
session.add_extension('ut_metadata')
session.add_extension('ut_pex')
session.add_extension('smart_ban')
session.add_extension('metadata_transfer')

session_save_filename = "/magnet2torrent/magnet_to_torrent_daemon.save_state"

if(os.path.isfile(session_save_filename)):
    fileread = open(session_save_filename, 'rb')
    session.load_state(lt.bdecode(fileread.read()))
    fileread.close()
    print('session loaded from file')
else:
    print('new session started')

session.add_dht_router("router.utorrent.com", 6881)
session.add_dht_router("router.bittorrent.com", 6881)
session.add_dht_router("dht.transmissionbt.com", 6881)
session.add_dht_router("dht.aelitis.com", 6881)

session.start_dht()
session.start_lsd()
session.start_upnp()
session.start_natpmp()

alive = True
while alive:

    db_conn = MySQLdb.connect(host='', user='', passwd='', db='', unix_socket='/mysql/mysql.sock') # open database connection
    #print('reconnecting')
    # get all records where enabled = 0 and uploaded within the last day
    subset_count = 100

    yesterday = date.today() - timedelta(1)
    yesterday = yesterday.strftime('%Y-%m-%d %H:%M:%S')
    #print(yesterday)

    total_count_query = ("SELECT COUNT(*) AS total_count FROM content WHERE upload_date > '" + yesterday + "' AND enabled = '0' ")
    #print(total_count_query)
    try:
        total_count_cursor = db_conn.cursor()                   # prepare a cursor object using cursor() method
        total_count_cursor.execute(total_count_query)           # execute the SQL command
        total_count_results = total_count_cursor.fetchone()     # fetch the single count row
        total_count = total_count_results[0]
        print(total_count)
    except:
        print "Error: unable to select data"

    total_pages = total_count / subset_count
    #print(total_pages)

    current_page = 1
    while(current_page <= total_pages):
        from_count = (current_page * subset_count) - subset_count

        #print(current_page)
        #print(from_count)

        hashes = []

        get_mysql_data_query = ("SELECT hash FROM content WHERE upload_date > '" + yesterday + "' AND enabled = '0' ORDER BY record_num DESC LIMIT " + str(from_count) + " , " + str(subset_count) + " ")
        #print(get_mysql_data_query)
        try:
            get_mysql_data_cursor = db_conn.cursor()                     # prepare a cursor object using cursor() method
            get_mysql_data_cursor.execute(get_mysql_data_query)          # execute the SQL command
            get_mysql_data_results = get_mysql_data_cursor.fetchall()    # fetch all the rows as a list of tuples
            for row in get_mysql_data_results:
                hashes.append(row[0].upper())
        except:
            print "Error: unable to select data"

        #print(hashes)

        handles = []

        for hash in hashes:
            tempdir = tempfile.mkdtemp()
            add_magnet_uri_params = {
                'save_path': tempdir,
                'duplicate_is_error': True,
                'storage_mode': lt.storage_mode_t(2),
                'paused': False,
                'auto_managed': True
            }
            magnet_uri = "magnet:?xt=urn:btih:" + hash.upper() + "&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80&tr=udp%3A%2F%2Ftracker.publicbt.com%3A80&tr=udp%3A%2F%2Ftracker.ccc.de%3A80"
            #print(magnet_uri)
            handle = lt.add_magnet_uri(session, magnet_uri, add_magnet_uri_params)
            handles.append(handle)  # push handle into handles list

        #print("handles length is:")
        #print(len(handles))

        while(len(handles) != 0):
            for h in list(handles):  # iterate over a copy, since handles is modified inside the loop
                #print("inside handles for-each loop")
                if h.has_metadata():
                    torinfo = h.get_torrent_info()
                    final_info_hash = str(torinfo.info_hash())
                    final_info_hash = final_info_hash.upper()
                    torfile = lt.create_torrent(torinfo)
                    torcontent = lt.bencode(torfile.generate())
                    tfile_size = len(torcontent)
                    try:
                        insert_cursor = db_conn.cursor()  # prepare a cursor object using cursor() method
                        insert_cursor.execute("""INSERT INTO dht_tfiles (hash, tdata) VALUES (%s, %s)""", [final_info_hash, torcontent])
                        db_conn.commit()
                        #print "data inserted in DB"
                    except MySQLdb.Error, e:
                        try:
                            print "MySQL Error [%d]: %s" % (e.args[0], e.args[1])
                        except IndexError:
                            print "MySQL Error: %s" % str(e)

                    shutil.rmtree(h.save_path())   # remove temp data directory
                    session.remove_torrent(h)      # remove torrent handle from session
                    handles.remove(h)              # remove handle from list

                else:
                    if(h.status().active_time > 600):   # check if handle is more than 10 minutes old, i.e. 600 seconds
                        #print('remove_torrent')
                        shutil.rmtree(h.save_path())   # remove temp data directory
                        session.remove_torrent(h)      # remove torrent handle from session
                        handles.remove(h)              # remove handle from list
                sleep(1)
                #print('sleep1')

        #print('sleep10')
        #sleep(10)
        current_page = current_page + 1

        # save session state
        filewrite = open(session_save_filename, "wb")
        filewrite.write(lt.bencode(session.save_state()))
        filewrite.close()

    print('sleep60')
    sleep(60)

    # save session state
    filewrite = open(session_save_filename, "wb")
    filewrite.write(lt.bencode(session.save_state()))
    filewrite.close()

I kept the above script running overnight and found that only around 1200 torrents' metadata was fetched in the overnight session. I am looking to improve the performance of the script.

I have tried decoding the save_state file and noticed there are 700+ DHT nodes I am connected to, so it's not that the DHT is not running.
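
For reference, a minimal sketch of the decoding step. lt.bdecode just returns a plain dictionary; the exact key names ('dht state', 'nodes', ...) vary between libtorrent versions, so the code prints whatever keys are present before drilling down:

# sketch: inspect the saved session state
import libtorrent as lt
from pprint import pprint

with open("/magnet2torrent/magnet_to_torrent_daemon.save_state", "rb") as f:
    state = lt.bdecode(f.read())

pprint(state.keys())                        # see what the top level actually contains

dht_state = state.get('dht state', {})      # key name assumed, may differ by version
nodes = dht_state.get('nodes', [])
print('saved dht node entries: %d' % len(nodes))   # only a rough indicator; entry format differs across versions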

What I am planning to do is keep the handles active in the session indefinitely while metadata has not been fetched, i.e. not remove a handle after 10 minutes if no metadata has been fetched in that time, the way I am doing it now.
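
A rough sketch of that restructuring, assuming one long-lived pool of handles that is swept once per loop instead of blocking until every handle in the current chunk has produced metadata (the names pending, add_chunk and sweep are just illustrative):

# sketch only: handles without metadata stay in the session indefinitely
pending = {}    # info-hash -> handle, survives across DB chunks

def add_chunk(hashes):
    for ih in hashes:
        if ih in pending:
            continue    # this hash is already being fetched
        params = {'save_path': tempfile.mkdtemp(),
                  'storage_mode': lt.storage_mode_t(2)}
        magnet_uri = "magnet:?xt=urn:btih:" + ih
        pending[ih] = lt.add_magnet_uri(session, magnet_uri, params)

def sweep():
    for ih, h in list(pending.items()):
        if not h.has_metadata():
            continue    # keep waiting, no 10-minute timeout
        torinfo = h.get_torrent_info()
        # ... bencode and insert into dht_tfiles here, as in the main script ...
        shutil.rmtree(h.save_path())
        session.remove_torrent(h)
        del pending[ih]

The open question then becomes how large this pool of handles can be allowed to grow, which is what the questions below are about.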

I have a few questions regarding the libtorrent Python bindings.

  1. How many handles can I keep running? Is there a limit on running handles?
  2. Will running 10k+ or 100k handles slow down my system? Or eat up resources? If yes, which resources? I mean RAM, network?
  3. I am behind a firewall; can a blocked incoming port be causing the slow speed of metadata fetching?
  4. Can a DHT server like router.bittorrent.com or any other ban my IP address for sending too many requests?
  5. Can other peers ban my IP address if they find out I am making too many requests only to fetch metadata?
  6. Can I run multiple instances of this script? Or maybe use multi-threading? Would that give better performance?
  7. If using multiple instances of the same script, each script will get a unique node-id depending on the IP and port it is using; is this a viable solution?

Is there a better approach to achieving what I am trying to do?

I can't answer questions specific to libtorrent's APIs, but some of your questions apply to bittorrent in general.

Will running 10k+ or 100k handles slow down my system? Or eat up resources? If yes, which resources? I mean RAM, network?

Metadata downloads shouldn't use much in the way of resources since they are not full torrent downloads yet, i.e. they can't allocate the actual files or anything like that. But they will need some RAM/disk space for the metadata itself once they grab the first chunk of it.
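
If you want to see what the open handles are actually costing, something along these lines gives a rough picture (it uses the same torrent_status object your script already reads; num_peers is assumed to be available in your binding version):

# sketch: rough visibility into what the open handles are doing
def report(handles):
    with_meta = sum(1 for h in handles if h.has_metadata())
    peers = sum(h.status().num_peers for h in handles)                     # open peer connections
    oldest = max(h.status().active_time for h in handles) if handles else 0
    print("%d handles, %d with metadata, %d peer connections, oldest %ds"
          % (len(handles), with_meta, peers, oldest))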

I am behind a firewall; can a blocked incoming port be causing the slow speed of metadata fetching?

Yes, by reducing the number of peers that can establish connections, it becomes more difficult to fetch metadata (or establish any connection at all) on swarms with a low peer count.

NATs can cause the same issue.

Can a DHT server like router.bittorrent.com or any other ban my IP address for sending too many requests?

router.bittorrent.com is a bootstrap node, not a server per se. Lookups don't query a single node; they query many different ones (among millions). But yes, individual nodes can ban, or more likely rate-limit, you.

This can be mitigated by looking up randomly distributed IDs to spread the load across the DHT keyspace.

Can I run multiple instances of this script? Or maybe use multi-threading? Would that give better performance?

AIUI libtorrent is sufficiently non-blocking or multi-threaded that you can schedule many torrents at once.

I don't know whether libtorrent has a rate limit on outgoing DHT requests.
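
So rather than multiple processes, a single session driven by the alert queue may already scale further than polling every handle once per second. A sketch, assuming your binding exposes metadata_received_alert and pop_alert (newer versions also offer session.pop_alerts() for draining in bulk):

# sketch: harvest metadata via the alert queue instead of polling each handle
session.set_alert_mask(lt.alert.category_t.status_notification)

def drain_alerts(timeout_ms=1000):
    session.wait_for_alert(timeout_ms)        # block until something happened or the timeout expires
    alert = session.pop_alert()
    while alert is not None:
        if isinstance(alert, lt.metadata_received_alert):
            h = alert.handle
            torinfo = h.get_torrent_info()
            # ... bencode and insert into dht_tfiles as in the main script ...
            session.remove_torrent(h)
        alert = session.pop_alert()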

If using multiple instances of the same script, each script will get a unique node-id depending on the IP and port it is using; is this a viable solution?

If you mean the DHT node ID, they're derived from the IP (as per BEP 42), not the port. Although some random element is included, only a limited number of IDs can be obtained per IP.

And some of this might be applicable to your scenario: http://blog.libtorrent.org/2012/01/seeding-a-million-torrents/

And another option is my own DHT implementation, which includes a CLI to bulk-fetch torrents.

