HttpClient: scrape data from a website with login (C#)
I want to scrape data from the following website:
The website contains table tennis data. The current season can be accessed without a login, but past seasons require one. For the current season I have written code that gets the data out of the page, and it works fine. I am using HttpClient and HtmlAgilityPack. The code looks like this:
    HttpClient http = new HttpClient();
    var response = await http.GetByteArrayAsync(website);
    // Decode the complete response buffer as UTF-8.
    string source = Encoding.UTF8.GetString(response);
    source = WebUtility.HtmlDecode(source);
    HtmlDocument resultat = new HtmlDocument();
    resultat.LoadHtml(source);

I then get the relevant data by scanning the DocumentNode of resultat...

Now I want to fetch data from the part of the website that needs a login. Does anyone have an idea how to log in to the website and get the data? The login must be done by clicking on "Ergebnishistorie freischalten ..." and entering a username and password.
There are many ways to perform a login to a website, and the right one depends on the authentication method used by the specific site (forms authentication, basic authentication, Windows authentication, etc.). Usually, websites use forms authentication.
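For contrast with forms authentication, here is a minimal sketch of what basic authentication looks like with HttpClient. The class and method names (BasicAuthExample, BuildHeader, CreateClient) are made up for illustration, and nothing here implies that the site in the question uses basic authentication.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;

// Hypothetical helper, for illustration only.
public static class BasicAuthExample
{
    // Builds the Authorization header value for HTTP Basic authentication:
    // the scheme "Basic" followed by base64("user:password").
    public static AuthenticationHeaderValue BuildHeader(string userName, string password)
    {
        string token = Convert.ToBase64String(
            Encoding.UTF8.GetBytes(userName + ":" + password));
        return new AuthenticationHeaderValue("Basic", token);
    }

    // Creates an HttpClient that sends the header with every request.
    public static HttpClient CreateClient(string userName, string password)
    {
        var client = new HttpClient();
        client.DefaultRequestHeaders.Authorization = BuildHeader(userName, password);
        return client;
    }
}
```

With basic authentication the credentials travel in a request header on every call, so no cookie handling is needed. With forms authentication the credentials are posted once and the session is then carried by cookies.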
To perform a login on a standard forms authentication website using HttpClient, you need to set a CookieContainer, because the authentication data is stored in cookies.
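To illustrate why the CookieContainer matters, the following sketch works without any network access: it stores a cookie the way a login response would and then asks the container which Cookie header it would attach to a later request. The cookie name "sessionid", its value, and the class name are assumptions for the example; a real site chooses its own.

```csharp
using System;
using System.Net;

// Hypothetical helper, for illustration only.
public static class CookieContainerExample
{
    // Stores a session cookie for 'site' and returns the Cookie header
    // the container would send along with a request to 'laterRequest'.
    public static string CookieHeaderAfterLogin(Uri site, Uri laterRequest)
    {
        var container = new CookieContainer();

        // Assumed session cookie, valid for every path ("/") of the site.
        container.Add(site, new Cookie("sessionid", "abc123", "/"));

        // The container only returns cookies whose domain and path match the URI.
        return container.GetCookieHeader(laterRequest);
    }
}
```

When the same container is assigned to HttpClientHandler.CookieContainer, this bookkeeping happens automatically: cookies set by the login response are replayed on later requests to the same site.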
In your specific example, the login form makes a POST to a page over HTTPS; I used https://wttv.click-tt.de/cgi-bin/webobjects/nuligattde.woa/wa/teamportrait?teamtable=1673669&pagestate=rueckrunde&championship=sk+bez.+bb+13%2f14&group=204559 as an example. This is the code to make the request using HttpClient:
    using System;
    using System.Collections.Generic;
    using System.Net;
    using System.Net.Http;

    var baseAddress = new Uri("https://wttv.click-tt.de/");
    var cookieContainer = new CookieContainer();
    using (var handler = new HttpClientHandler() { CookieContainer = cookieContainer })
    using (var client = new HttpClient(handler) { BaseAddress = baseAddress })
    {
        // Usually you make a standard request without authentication first, e.g. to the home page.
        // This request stores the initial cookie values, which might be used in the
        // subsequent login request and checked by the server.
        var homePageResult = client.GetAsync("/");
        homePageResult.Result.EnsureSuccessStatusCode();

        var content = new FormUrlEncodedContent(new[]
        {
            // The names of the form values must match the names of the <input /> tags
            // of the login form, in this case <input type="text" name="username">.
            new KeyValuePair<string, string>("username", "username"),
            new KeyValuePair<string, string>("password", "password"),
        });
        var loginResult = client.PostAsync("/cgi-bin/webobjects/nuligattde.woa/wa/teamportrait?teamtable=1673669&pagestate=rueckrunde&championship=sk+bez.+bb+13%2f14&group=204559", content).Result;
        loginResult.EnsureSuccessStatusCode();

        // Make the subsequent web requests using the same HttpClient object.
    }

However, many websites use JavaScript-loaded form values or even CAPTCHA controls, and then this solution will not work. In that case it might be done, as you said, with a WebBrowser control (by automating the user input on the form fields and the login button click; this link has an example: https://social.msdn.microsoft.com/forums/vstudio/en-us/0b77ca8c-48ce-4fa8-9367-c7491aa359b0/yahoo-login-via-systemnetsockets-namespace?forum=vbgeneral).
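To see what the login POST above actually puts on the wire, this small sketch builds the same kind of FormUrlEncodedContent and returns the encoded body as a string. The class name and the sample credentials in the usage note are placeholders; the field names "username" and "password" are taken from the answer's code.

```csharp
using System.Collections.Generic;
using System.Net.Http;

// Hypothetical helper, for illustration only.
public static class LoginFormExample
{
    // Returns the application/x-www-form-urlencoded body that would be posted.
    public static string BuildBody(string userName, string password)
    {
        using (var content = new FormUrlEncodedContent(new[]
        {
            new KeyValuePair<string, string>("username", userName),
            new KeyValuePair<string, string>("password", password),
        }))
        {
            // The content is already in memory, so reading it completes immediately.
            return content.ReadAsStringAsync().Result;
        }
    }
}
```

Calling LoginFormExample.BuildBody("alice", "p@ss word") yields username=alice&password=p%40ss+word: special characters are percent-encoded and spaces become "+". This is useful to compare against the body a browser sends for the same form.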
As a general rule, to inspect how the login works on the desired website, use Fiddler (http://www.telerik.com/fiddler). When you click the login button on the website, watch Fiddler and find the login request (usually it is the first request after you click the "login" button, and it is a POST request).
Then inspect the request data (select the request and go to the "Inspectors" - "TextView" tab) and try to replicate the request in your code.
In the left pane are the requests intercepted by Fiddler; in the right pane are the request and response inspectors (request inspectors on top, response inspectors at the bottom).
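One way to replicate a captured request is to copy the raw form body from Fiddler's TextView and turn it back into name/value pairs, which can then be passed to FormUrlEncodedContent. The sketch below assumes the body is ordinary form-urlencoded text; the class name and the sample field names in the usage note are made up, and a real login form may contain additional hidden fields that must be sent as well.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

// Hypothetical helper, for illustration only.
public static class CapturedFormBody
{
    // Splits a raw form-urlencoded body ("a=1&b=2") into decoded name/value pairs.
    public static List<KeyValuePair<string, string>> Parse(string rawBody)
    {
        var fields = new List<KeyValuePair<string, string>>();
        string[] pairs = rawBody.Split(new[] { '&' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string pair in pairs)
        {
            int eq = pair.IndexOf('=');
            string name = eq < 0 ? pair : pair.Substring(0, eq);
            string value = eq < 0 ? string.Empty : pair.Substring(eq + 1);

            // UrlDecode turns "+" into a space and resolves %XX escapes.
            fields.Add(new KeyValuePair<string, string>(
                WebUtility.UrlDecode(name), WebUtility.UrlDecode(value)));
        }

        return fields;
    }
}
```

For example, CapturedFormBody.Parse("username=alice&password=p%40ss+word&submit=Login") returns three pairs, with the password decoded to "p@ss word". Replace the captured credentials with your own and pass the list to new FormUrlEncodedContent(fields).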
Edit

The same code using the old WebRequest class: http://rextester.com/llp86817
    using System;
    using System.IO;
    using System.Net;
    using System.Text;

    var cookieContainer = new CookieContainer();

    HttpWebRequest request = (HttpWebRequest)WebRequest.Create("https://wttv.click-tt.de/");
    request.CookieContainer = cookieContainer;
    // Set the user agent and accept header values to simulate a real web browser.
    request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36";
    request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
    // Set automatic decompression.
    request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;

    Console.WriteLine("first response");
    Console.WriteLine();
    using (WebResponse response = request.GetResponse())
    {
        using (StreamReader sr = new StreamReader(response.GetResponseStream()))
        {
            Console.WriteLine(sr.ReadToEnd());
        }
    }

    request = (HttpWebRequest)WebRequest.Create("https://wttv.click-tt.de/cgi-bin/webobjects/nuligattde.woa/wa/teamportrait?teamtable=1673669&pagestate=rueckrunde&championship=sk+bez.+bb+13%2f14&group=204559");
    // Set the same cookie container object on the new request.
    request.CookieContainer = cookieContainer;
    request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36";
    request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
    // Set the method to POST and the content type to application/x-www-form-urlencoded.
    request.Method = "POST";
    request.ContentType = "application/x-www-form-urlencoded";
    // Set automatic decompression.
    request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;

    // Insert your username and password.
    string data = string.Format("username={0}&password={1}", "username", "password");
    byte[] bytes = Encoding.UTF8.GetBytes(data);
    request.ContentLength = bytes.Length;
    using (Stream dataStream = request.GetRequestStream())
    {
        dataStream.Write(bytes, 0, bytes.Length);
    }

    Console.WriteLine("login response");
    Console.WriteLine();
    using (WebResponse response = request.GetResponse())
    {
        using (StreamReader sr = new StreamReader(response.GetResponseStream()))
        {
            Console.WriteLine(sr.ReadToEnd());
        }
    }

    //request = (HttpWebRequest)WebRequest.Create("internal protected page address");
    // After a successful login, you must use the same cookie container for every request.
    //request.CookieContainer = cookieContainer;
    //....