Recently one of our clients asked us to extract data from Amazon. The required data was available on Amazon but was not easy to extract.
The client wanted to do some analysis and needed the top 1000 products in some categories by Amazon sales rank. For example, let's assume that he wanted this for the Women's Wrist Watches category. Getting the top 100 watches by sales rank was not a problem, as this list is easily available on Amazon at this URL: https://www.amazon.com/gp/bestsellers/watches/6358544011/ref=pd_zg_hrsr_watch_1_3_last . The real challenge was how to get the remaining 900 watches, ranked from 101 to 1000. Amazon does not list more than 100 products (top products by sales rank) on its website.
We tried some workarounds to get the top 1000 products. There were basically two techniques.
1: Direct data scraping from amazon.com
We tried to get the maximum number of products from the Women's Wrist Watches category at this URL: https://www.amazon.com/Womens-Wrist-Watches/b/ref=dp_bc_4?ie=UTF8&node=6358544011 . We scraped products from all the pages and also applied some filters to get more products. We extracted all the relevant information for each product, including its sales rank, and saved everything in a database. Then we sorted the products in the database by sales rank to get the top products.
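The save-then-sort step can be sketched roughly as below. This is only an illustration and not the script we actually ran: the table layout, the column names, the function name top_products_by_rank and the use of an in-memory SQLite database are all assumptions made for the example.

```python
import sqlite3


def top_products_by_rank(products, limit=1000):
    """Store scraped products and return the best-ranked ones first.

    `products` is an iterable of (product_id, title, sales_rank) tuples.
    The schema here is hypothetical, chosen only for this sketch.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE products ("
        "product_id TEXT PRIMARY KEY, title TEXT, sales_rank INTEGER)"
    )
    # INSERT OR REPLACE keeps a single row per product when the same
    # product is scraped more than once (the later row wins).
    conn.executemany(
        "INSERT OR REPLACE INTO products VALUES (?, ?, ?)", products
    )
    rows = conn.execute(
        "SELECT product_id, title, sales_rank FROM products "
        "WHERE sales_rank IS NOT NULL "
        "ORDER BY sales_rank ASC LIMIT ?",
        (limit,),
    ).fetchall()
    conn.close()
    return rows
```

Products without a sales rank are skipped, because they cannot be placed in the ordering.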
We were able to get the required products using this technique, but we realized that it requires scraping a very large number of products, and doing this for many other categories would be time consuming, so we tried another technique.
2: Using Google search to scrape data
We decided to use Google search to make the scraping faster and to minimize the number of requests we send to Amazon. For example, look at the text below. It is for the product that is ranked #17 in women's wrist watches.
#17 in Watches > Women > Wrist Watches
If we have to find this product using Google, we can enter the search query ( "#17 in Watches > Women > Wrist Watches" site:amazon.com ) and Google will give us this product in the search results. So we wrote a script that runs this search on Google for #101 to #1000, for example:
"#173 in Watches > Women > Wrist Watches" site:amazon.com
"#174 in Watches > Women > Wrist Watches" site:amazon.com
and so on.
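Generating these queries is simple string formatting. The sketch below shows one way to do it. The function name build_rank_queries and its parameters are made up for this example, and it only builds the query strings; it does not send anything to Google.

```python
def build_rank_queries(category_path, first_rank, last_rank, site="amazon.com"):
    """Build one exact-phrase search query per sales rank.

    The quoted phrase matches the rank text as it appears on a product
    page, and the site: operator limits the results to one domain.
    """
    queries = []
    for rank in range(first_rank, last_rank + 1):
        queries.append(f'"#{rank} in {category_path}" site:{site}')
    return queries


# Example: the 900 queries for ranks 101 to 1000 in one category.
watch_queries = build_rank_queries("Watches > Women > Wrist Watches", 101, 1000)
```

The same function can be reused for other categories by changing the category text.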
Google returned results for almost every number, but there was one problem: for some products the sales rank on Amazon was different from the one shown in the Google search results. This was because Amazon had changed the sales rank after Google indexed the product page. However, the difference in rank was not big, so we decided to widen the search and run it for numbers from #101 to #1500. We scraped these 1500 Amazon products and saved them in the database. Now, when we sorted them by Amazon sales rank, the results were better: we got more than 90% of the top 1000 products.
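The final selection step can be sketched like this: take every scraped product with its current sales rank, drop duplicates and anything that now falls outside the target range, sort by rank, and measure how much of the range is covered. The function name select_top_products and the (product_id, rank) input shape are assumptions for this example, not our production code.

```python
def select_top_products(scraped, top_n=1000):
    """Pick the products whose current rank falls within the top `top_n`.

    `scraped` is an iterable of (product_id, current_rank) pairs, where
    current_rank is the rank read from the product page at scrape time,
    not the possibly outdated rank shown in the search result.
    Returns (ranked, coverage): a list of (product_id, rank) sorted by
    rank, and the fraction of ranks 1..top_n that were found.
    """
    best_rank = {}
    for product_id, rank in scraped:
        if rank is None or rank > top_n:
            continue  # no rank available, or the product moved out of range
        # Keep the lowest rank seen if a product shows up more than once.
        if product_id not in best_rank or rank < best_rank[product_id]:
            best_rank[product_id] = rank
    ranked = sorted(best_rank.items(), key=lambda item: item[1])
    covered_ranks = {rank for _, rank in ranked}
    coverage = len(covered_ranks) / top_n
    return ranked, coverage
```

Searching a wider range than needed (up to #1500 in our case) gives this step more candidates, so products whose rank shifted after Google indexed them still have a chance to be picked up.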
What we can learn from this small project is that even if some information is not easily available on a website, we can still use workarounds to get what we want. Doing all this manually can take a very long time, and it does not make sense to spend that time on something that can be automated with a script. If you want to scrape data from Amazon or any other website, contact us and we will be glad to help you get the data you need.