Node crawler

2026-05-19 Update From: SLTechnology News&Howtos

Node.js is a server-side JavaScript runtime, so it can crawl websites much as Python can. Here we use Node to crawl the home page of cnblogs (Blog Park) and collect the information for every post listed there.

Step 1: create a crawl directory, then run npm init inside it.

Step 2: create a crawl.js file. The code that simply crawls the entire page is as follows:

    var http = require("http");
    var url = "http://www.cnblogs.com";

    http.get(url, function (res) {
        var html = "";
        // Append each chunk of the response as it arrives.
        res.on("data", function (data) {
            html += data;
        });
        // Once the response is complete, print the whole page.
        res.on("end", function () {
            console.log(html);
        });
    }).on("error", function () {
        console.log("error fetching the page!");
    });

That is, we load the http module and call its get method. When the script runs, Node sends a GET request for the page, and the response comes back as res. The handler bound to the data event with on receives the body chunk by chunk and appends each chunk to html, and the handler for the end event prints the accumulated HTML to the console.
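The chunk-by-chunk pattern described above can be tried without touching the network. The following is a minimal sketch, not the author's code: collectBody and fakeResponse are hypothetical names, and a Readable stream stands in for res. It also assumes you would rather keep the raw chunks and decode them once at the end, which avoids damaging a multi-byte character that happens to be split across two chunks.

```javascript
const { Readable } = require("stream");

// Collect every chunk of a readable stream and decode the bytes once at the end.
// (collectBody is a hypothetical helper, not part of Node's API.)
function collectBody(stream) {
    return new Promise(function (resolve, reject) {
        const chunks = [];
        stream.on("data", function (chunk) {
            chunks.push(Buffer.from(chunk));
        });
        stream.on("end", function () {
            resolve(Buffer.concat(chunks).toString("utf8"));
        });
        stream.on("error", reject);
    });
}

// Stand-in for a response: the bytes of a short page, split in the middle
// of a three-byte UTF-8 character.
const bytes = Buffer.from("<p>博客园</p>", "utf8");
const fakeResponse = Readable.from([bytes.subarray(0, 5), bytes.subarray(5)]);

collectBody(fakeResponse).then(function (html) {
    console.log(html);
});
```

Because the res object passed to the http.get callback is also a readable stream, the same helper could be called as collectBody(res) inside that callback.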

The console output (only part of the page is shown here) matches what you see when you inspect the page's elements in the browser.
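If the printed output turns out empty or much shorter than the page in the browser, one possible cause is that the server answered with a redirect or an error status. http.get neither follows redirects nor treats such a status as an error. The sketch below shows one way to handle this. nextStep and fetchPage are hypothetical helpers, and the limit on redirects is an assumption, not something the original tutorial uses.

```javascript
const http = require("http");
const https = require("https");

// Decide what to do with a response from its status code and Location header.
// Returns { action: "read" }, { action: "redirect", url } or { action: "fail", reason }.
function nextStep(currentUrl, statusCode, location) {
    if (statusCode >= 300 && statusCode < 400 && location) {
        // Location may be relative, so resolve it against the current URL.
        return { action: "redirect", url: new URL(location, currentUrl).href };
    }
    if (statusCode !== 200) {
        return { action: "fail", reason: "unexpected status " + statusCode };
    }
    return { action: "read" };
}

// Fetch a page, following at most `redirectsLeft` redirects.
function fetchPage(url, redirectsLeft) {
    return new Promise(function (resolve, reject) {
        // Pick the module that matches the URL scheme.
        const client = url.startsWith("https:") ? https : http;
        client.get(url, function (res) {
            const step = nextStep(url, res.statusCode, res.headers.location);
            if (step.action === "redirect") {
                res.resume(); // drain the unused body
                if (redirectsLeft <= 0) {
                    reject(new Error("too many redirects"));
                } else {
                    resolve(fetchPage(step.url, redirectsLeft - 1));
                }
                return;
            }
            if (step.action === "fail") {
                res.resume();
                reject(new Error(step.reason));
                return;
            }
            let html = "";
            res.setEncoding("utf8");
            res.on("data", function (chunk) { html += chunk; });
            res.on("end", function () { resolve(html); });
        }).on("error", reject);
    });
}

// Offline check of the decision logic; no request is sent here.
console.log(nextStep("http://www.cnblogs.com/", 301, "https://www.cnblogs.com/"));
console.log(nextStep("http://www.cnblogs.com/", 200, undefined));
```

A real call would look like fetchPage("http://www.cnblogs.com", 5).then(console.log, console.error), which sends an actual request.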

We only need to extract the title and the related information of each post from it.

Step 3: install the cheerio module as follows (it can be installed from Git Bash; cmd kept failing for the author):

    cnpm install cheerio --save-dev

This module makes it convenient to work with the DOM on the server, in the same style as jQuery.

Step 4: operate on the DOM to extract the useful information.

    var http = require("http");
    var cheerio = require("cheerio");
    var url = "http://www.cnblogs.com";

    // Extract the title and author of every post from the page HTML.
    function filterData(html) {
        var $ = cheerio.load(html);
        var items = $(".post_item");
        var result = [];
        items.each(function () {
            var tit = $(this).find(".titlelnk").text();
            var aut = $(this).find(".lightblue").text();
            var one = { title: tit, author: aut };
            result.push(one);
        });
        return result;
    }

    // Print the title and author of each post.
    function printInfos(allInfos) {
        allInfos.forEach(function (item) {
            console.log("article title: " + item["title"] + "\n" + "author: " + item["author"] + "\n");
        });
    }

    http.get(url, function (res) {
        var html = "";
        res.on("data", function (data) {
            html += data;
        });
        res.on("end", function () {
            var allInfos = filterData(html);
            printInfos(allInfos);
        });
    }).on("error", function () {
        console.log("failed to crawl the cnblogs home page");
    });

In other words, the code above crawls the title and author of each post on the page.

The final console output lists the title and author of each post, which is consistent with the content of the cnblogs home page.
