Botsol Blog

Blog about web scraping and web bots

How to extract ANY information from websites

If you have a list of websites and want to get Contact details or any other piece of information from these websites, Botsol’s Web Extractor can help you.

It has built-in features to extract email addresses and social media links, and users can extract any other information with a few simple actions.

In this example we will extract the Title and the Meta Description from a list of websites. The app already extracts email addresses and social media links by default.

Here is how to configure the app to extract this information.

Download and install the Botsol Web Extractor app from here https://www.botsol.com/bots/web-extractor

Run the Botsol Web Extractor application.

Click Options and select “Add/Customize Data Fields”. A new window will open.

Click the “Add New Item” button, enter the name of your new field, and select the type (XPath or Regex). Here we will use XPath for our required fields.

The Heading has the XPath //h1

The Title tag has the XPath //title

The Meta Description has the XPath //meta[@name='description']/@content
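To see what these expressions select, here is a minimal Python sketch that runs equivalent queries outside the app. It is only an illustration and not how Botsol Web Extractor works internally. The sample page and the `extract_fields` helper are made up for this example. It uses Python's built-in `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup, so the final `/@content` step is done with `.get("content")` instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page. Real-world HTML is often not valid
# XML and would need a more forgiving HTML parser.
SAMPLE_PAGE = """<html>
  <head>
    <title>Example Store</title>
    <meta name="description" content="We sell example widgets." />
  </head>
  <body>
    <h1>Welcome to Example Store</h1>
  </body>
</html>"""


def extract_fields(page):
    """Return the heading, title and meta description of a well-formed page."""
    root = ET.fromstring(page)
    heading = root.find(".//h1")                      # same idea as //h1
    title = root.find(".//title")                     # same idea as //title
    meta = root.find(".//meta[@name='description']")  # //meta[@name='description']
    return {
        "heading": heading.text if heading is not None else None,
        "title": title.text if title is not None else None,
        # ElementTree cannot select attributes with /@content, so read it here.
        "meta_description": meta.get("content") if meta is not None else None,
    }


fields = extract_fields(SAMPLE_PAGE)
print(fields)
```

Note that the attribute value is wrapped in straight quotes ('description'). Curly quotes copied from a web page are not valid in an XPath expression.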

As you can see in the screenshot above, we have added two data fields. Now close this window.

Paste all your URLs into the text area in the Botsol Web Extractor app, and click the “Start Bot” button.

It will visit each page and extract contact info along with the title and meta description. By default the app visits the URLs in the background, but it can also open them in the Chrome browser. To do that, click Options > Settings and select the option to open URLs in the Chrome browser. This is helpful for websites that rely heavily on JavaScript to show content.

That’s it. It’s really simple and fast to extract any information from a URL, and you can export the data to CSV/Excel when the bot is done.

Read more about XPath (https://www.w3schools.com/xml/xpath_syntax.asp) and regex (http://www.rexegg.com/regex-quickstart.html).
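If you choose the Regex type for a field instead, the idea is the same: a pattern describes the text you want to capture. The Python sketch below shows how a regular expression can pull email addresses out of page text. The pattern, the `find_emails` helper and the sample text are assumptions made for illustration. The pattern is deliberately simplified and is not the one the app uses.

```python
import re

# Simplified, illustrative email pattern (an assumption, not a full
# RFC-compliant matcher and not the app's built-in pattern).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text):
    """Return the unique email addresses found in text, in order of appearance."""
    seen = []
    for match in EMAIL_RE.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


# Hypothetical page text for the example.
SAMPLE_TEXT = (
    "Contact sales@example.com or support@example.org. "
    "Sales again: sales@example.com"
)

emails = find_emails(SAMPLE_TEXT)
print(emails)
```

A quick test like this on a few sample pages can help you check a pattern before adding it as a data field.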

 
