ASP.NET C#: How to scrape pages that require login

A note first: the code snippets here were found online and then modified by me. I think good things should be shared.

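First, the principle. When the site we want to scrape requires a login, access is controlled either by cookies directly or by a server-side session, and in both cases the state travels in cookies (a session is normally identified by a session-ID cookie). Every HTTP request we send carries request headers, and those headers include the cookies the site expects. When the site receives the request, it reads the cookie or session information from the headers and its code decides whether we are allowed to access the current page.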
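
To make the principle concrete, here is a minimal sketch that prints the Cookie request header a CookieContainer would produce for a given site. The class name CookieHeaderDemo, the host www.example.com, and the sample values are assumptions for illustration only; a real site defines its own cookie names.

```csharp
using System;
using System.Net;

public static class CookieHeaderDemo
{
    // Returns the value of the Cookie request header that would be sent to `site`.
    public static string BuildCookieHeader(Uri site, string username, string id)
    {
        var container = new CookieContainer();
        // Hypothetical cookie names; a real site decides which cookies it expects.
        container.Add(site, new Cookie("username", username));
        container.Add(site, new Cookie("id", id));
        return container.GetCookieHeader(site);
    }

    public static void Main()
    {
        var site = new Uri("http://www.example.com/"); // placeholder host
        // Prints a header line containing both name=value pairs.
        Console.WriteLine("Cookie: " + BuildCookieHeader(site, "youraccount", "yourid"));
    }
}
```

This header line is what the server inspects when it decides whether the request belongs to a logged-in user.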

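With the principle clear, the rest is straightforward: when scraping (that is, when HttpWebRequest submits the request), all we have to do is put the cookie information into the HTTP request headers.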

Here are two ways to do it.

Method 1: put the cookie information directly into the HttpWebRequest's CookieContainer. The code:

    // Requires: using System; using System.Collections; using System.IO;
    //           using System.Net; using System.Text;
    protected void Page_Load(object sender, EventArgs e)
    {
        // Put the cookie values into a Hashtable
        Hashtable ht = new Hashtable();
        ht.Add("username", "youraccount");
        ht.Add("id", "yourid");
        this.Collect(ht);
    }

    public void Collect(Hashtable ht)
    {
        string content = string.Empty;
        string url = "http://www.ibest100.com/page-that-requires-login";
        string host = "http://www.ibest100.com";
        try
        {
            // Bytes to submit (the request body is empty here)
            byte[] bs = Encoding.UTF8.GetBytes(content);
            // Set up the request
            HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
            req.Method = "POST";
            req.ContentType = "application/json;charset=utf-8";
            req.ContentLength = bs.Length;
            // Put the cookies into a CookieContainer, then attach it to the HttpWebRequest
            CookieContainer cc = new CookieContainer();
            cc.Add(new Uri(host), new Cookie("username", ht["username"].ToString()));
            cc.Add(new Uri(host), new Cookie("id", ht["id"].ToString()));
            req.CookieContainer = cc;
            // Submit the request data
            using (Stream reqStream = req.GetRequestStream())
            {
                reqStream.Write(bs, 0, bs.Length);
            }
            // Read the returned page; this step is required and cannot be skipped
            using (WebResponse wr = req.GetResponse())
            using (Stream respStream = wr.GetResponseStream())
            using (StreamReader reader = new StreamReader(respStream, Encoding.UTF8))
            {
                string t = reader.ReadToEnd();
                System.Web.HttpContext.Current.Response.Write(t);
            }
        }
        catch (Exception ex)
        {
            System.Web.HttpContext.Current.Response.Write("Exception in Collect: " + ex.Source + ": " + ex.Message);
        }
    }

Method 2: each time the scraper starts, first simulate a login on the target site to obtain a populated CookieContainer, then scrape with it. The code:

    // Requires: using System; using System.IO; using System.Net; using System.Text;
    protected void Page_Load(object sender, EventArgs e)
    {
        try
        {
            CookieContainer cookieContainer = new CookieContainer();
            string formatString = "username={0}&password={1}"; // ***************
            // URL-encode the values so special characters do not break the form body
            string postString = string.Format(formatString,
                Uri.EscapeDataString("youradminaccount"), Uri.EscapeDataString("yourpassword"));
            // Convert the form string into a byte array
            byte[] postData = Encoding.UTF8.GetBytes(postString);
            // Set up the login request
            string URI = "http://www.ibest100.com/login-page"; // ***************
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(URI);
            request.Method = "POST";
            request.KeepAlive = false;
            request.ContentType = "application/x-www-form-urlencoded";
            request.CookieContainer = cookieContainer;
            request.ContentLength = postData.Length;
            // Submit the request data
            using (Stream outputStream = request.GetRequestStream())
            {
                outputStream.Write(postData, 0, postData.Length);
            }
            // Read the returned page; this step is required and cannot be skipped.
            // The login cookies are stored in cookieContainer as a side effect.
            string srcString;
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding("gb2312")))
            {
                srcString = reader.ReadToEnd();
            }
            // Open the page you actually want, reusing the same CookieContainer
            URI = "http://www.ibest100.com/page-that-requires-login"; // ***************
            request = (HttpWebRequest)WebRequest.Create(URI);
            request.Method = "GET";
            request.KeepAlive = false;
            request.CookieContainer = cookieContainer;
            // Read the returned page
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding("gb2312")))
            {
                srcString = reader.ReadToEnd();
            }
            // Output or further process the fetched page
            Response.Write(srcString);
        }
        catch (WebException we)
        {
            string msg = we.Message;
            Response.Write(msg);
        }
    }
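
With the second method it can help to check what the login request actually stored in the CookieContainer before fetching the protected page. The following is a minimal sketch under assumptions: the class name LoginCookieCheck and the host www.example.com are hypothetical, and the SetCookies call only simulates a Set-Cookie header so that the example runs without a real login.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

public static class LoginCookieCheck
{
    // Lists the cookies the container would send to `site`, as "name=value" strings.
    public static List<string> ListCookies(CookieContainer container, Uri site)
    {
        var result = new List<string>();
        foreach (Cookie cookie in container.GetCookies(site))
        {
            result.Add(cookie.Name + "=" + cookie.Value);
        }
        return result;
    }

    public static void Main()
    {
        var site = new Uri("http://www.example.com/"); // placeholder host
        var container = new CookieContainer();
        // Stand-in for a real login response: simulate a Set-Cookie header.
        container.SetCookies(site, "sessionid=abc123; path=/");
        foreach (string line in ListCookies(container, site))
        {
            Console.WriteLine(line);
        }
    }
}
```

If the list is empty after a real login POST, the login probably did not succeed, or the site may track the session in some other way.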

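Some may ask: what if the site requires a CAPTCHA at login? In that case use the first method; you will just have to analyze the cookies the site sets yourself.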
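
One way to analyze the cookies is to log in manually in a browser, copy the Cookie header value from the developer tools, and load it into a CookieContainer. The sketch below assumes the copied string has the usual "name1=value1; name2=value2" form; the class name BrowserCookieImport, the host, and the sample values are hypothetical.

```csharp
using System;
using System.Net;

public static class BrowserCookieImport
{
    // Parses a "name1=value1; name2=value2" string into a CookieContainer
    // whose cookies are scoped to `site`.
    public static CookieContainer FromHeader(Uri site, string cookieHeader)
    {
        var container = new CookieContainer();
        foreach (string part in cookieHeader.Split(';'))
        {
            string pair = part.Trim();
            int eq = pair.IndexOf('=');
            if (eq <= 0) continue; // skip empty or malformed fragments
            string name = pair.Substring(0, eq).Trim();
            string value = pair.Substring(eq + 1).Trim();
            container.Add(site, new Cookie(name, value));
        }
        return container;
    }

    public static void Main()
    {
        var site = new Uri("http://www.example.com/"); // placeholder host
        CookieContainer cc = FromHeader(site, "username=youraccount; id=yourid");
        Console.WriteLine(cc.GetCookieHeader(site));
    }
}
```

The returned container can be assigned to req.CookieContainer as in the first method. Cookie values that contain commas or other special characters may be rejected by the Cookie class; this sketch does not handle that case.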

Use cases: data scraping, posting to forums, publishing blog posts.

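Tags: asp.net, c#, HTTP headers, cookie

Author: Leo_wl

Source: http://www.cnblogs.com/Leo_wl/

Copyright of this article is shared by the author and cnblogs (博客园). Reposting is welcome, but unless the author consents otherwise, this notice must be retained and a link to the original article must be placed prominently on the page; otherwise the right to pursue legal liability is reserved.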
