Web Scraping Blog Posts using Node.js

In this article, we will see how to scrape Medium blog posts using Node.js.

Set up

  • Express - we will use Express to show the scraped results in the browser.
  • Request - it makes the HTTP requests to Medium to fetch the page data.
  • Cheerio - it parses and traverses the DOM in the response from the URL. Its API is modeled on jQuery.
  • Handlebars - the view engine that renders the web pages in the Express application.

Let's set up the project to scrape Medium blogs. Create a project directory.

$ mkdir nodescraper
$ cd nodescraper
$ npm init --yes

Install all the dependencies mentioned above.

$ npm install express request cheerio express-handlebars

Getting Blog posts from Medium

We will scrape blog posts by tag. Medium provides a search bar where we can search for blog posts by tag.

We are going to use it to scrape all the blog posts for a particular tag.

For example, to scrape Node.js blog posts on Medium, you can search through the URL https://medium.com/search?q=node.js .
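A tag can contain spaces or characters such as # that are not safe to place directly in a query string. As a minimal sketch, the helper below builds the same search URL with the tag encoded. The name buildSearchUrl is made up for this illustration and is not part of any library.

```javascript
// Hypothetical helper: builds the Medium search URL for a given tag.
function buildSearchUrl(tag) {
  // encodeURIComponent escapes spaces, "&", "#" and similar characters.
  return `https://medium.com/search?q=${encodeURIComponent(tag.trim())}`
}

console.log(buildSearchUrl("node.js"))      // https://medium.com/search?q=node.js
console.log(buildSearchUrl("web scraping")) // https://medium.com/search?q=web%20scraping
```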

After that, open the Inspector in Chrome DevTools and look at the DOM elements of the results page.

If you look carefully, the markup follows a pattern, so we can scrape it using the element class names.

First, fetch the page's HTML using the request package.

const request = require("request")

request(`https://medium.com/search?q=${tag}`, (err, response, html) => {
  // html holds the markup of the whole page
})
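Note that the request package has been deprecated since 2020, although it still works. As an alternative sketch, assuming Node.js 18 or later where fetch is available globally, the same page could be fetched as below. The names fetchHtml and fetchImpl are hypothetical; fetchImpl exists only so that a stub can be swapped in for testing.

```javascript
// Sketch, not the article's original method: fetch a page with the global fetch (Node.js 18+).
// fetchHtml and fetchImpl are hypothetical names chosen for this example.
function fetchHtml(url, fetchImpl = fetch) {
  return fetchImpl(url).then((response) => {
    // fetch resolves even on 4xx/5xx responses, so check the status explicitly.
    if (!response.ok) {
      throw new Error(`Request failed with status ${response.status}`)
    }
    // text() resolves to the response body as a string of HTML.
    return response.text()
  })
}
```

Calling fetchHtml("https://medium.com/search?q=node.js").then((html) => { ... }) would hand the HTML to the same cheerio steps used in the rest of the article.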

Once you have the HTML, load it into cheerio to extract the data that you need.

const cheerio = require("cheerio")
const $ = cheerio.load(html)

This loads the parsed document into the $ variable. If you have used jQuery before, you will recognize the name; we use $ here simply to follow that old-school naming convention.

Now, you can traverse the DOM tree.

Since we need only the title and link of each blog post on the page, we will select the elements in the HTML using either their own class name or the class name of a parent element.

First, we need to select all the blog post elements, which have js-block as a class name.

$(".js-block").each((i, el) => {
  // .js-block is the class name shared by every blog post DIV
})

The each method loops through every element that has the class name js-block.

Second, we scrape the title and link of each blog post.

$(".js-block").each((i, el) => {
  const title = $(el)
    .find("h3")
    .text()
  const article = $(el)
    .find(".postArticle-content")
    .find("a")
    .attr("href")

  let data = {
    title,
    article,
  }

  console.log(data)
})

This will scrape the blog posts for a given tag.
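The loop only logs each result, and some entries may come back with an empty title or an undefined link when an element does not match. As a rough sketch, the objects could be pushed into an array and tidied with a helper like the one below. The name cleanPosts is hypothetical, and dropping the query string assumes it carries only tracking data, which may not hold for every link.

```javascript
// Hypothetical helper: tidy the { title, article } objects produced by the loop.
function cleanPosts(rawPosts) {
  const seen = new Set()
  const posts = []

  for (const { title, article } of rawPosts) {
    // .attr("href") yields undefined when no link was found, so skip those entries.
    if (!title || !article) continue

    // Resolve relative links against the site root and drop the query string
    // (assumption: the query string is not needed to reach the post).
    const url = new URL(article, "https://medium.com")
    url.search = ""
    const link = url.toString()

    // Keep only the first occurrence of each link.
    if (seen.has(link)) continue
    seen.add(link)

    posts.push({ title: title.trim(), article: link })
  }

  return posts
}
```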

Next, we will wrap this functionality in an Express application that takes a tag name as input and returns the blog posts for that tag.
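One way to sketch that wrapper is a route handler that reads the tag from the query string and hands it to the scraper. In the sketch below, makeSearchHandler and scrapeTag are hypothetical names: scrapeTag(tag, callback) stands for the request and cheerio steps above, and it is passed in so the handler can be tried with a stub instead of a live request. The query parameter name tag and the JSON response are also assumptions; a Handlebars view would call res.render instead of res.json.

```javascript
// Hypothetical factory: builds an Express-style route handler around a scraper.
// scrapeTag(tag, callback) is an assumed helper wrapping the request + cheerio steps.
function makeSearchHandler(scrapeTag) {
  return function searchHandler(req, res) {
    const tag = typeof req.query.tag === "string" ? req.query.tag.trim() : ""

    // Reject empty input before making an outbound request.
    if (tag === "") {
      return res.status(400).json({ error: "Query parameter 'tag' is required" })
    }

    scrapeTag(tag, (err, posts) => {
      if (err) {
        // The upstream page could not be fetched or parsed.
        return res.status(502).json({ error: "Could not fetch posts" })
      }
      res.json({ tag, posts })
    })
  }
}
```

With Express, this could be mounted as app.get("/search", makeSearchHandler(scrapeTag)), so that a request to /search?tag=node.js returns the posts for that tag.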

The complete source code can be found here
