Successfully scrape data from any website with the power of Python
This book is the ultimate guide to using Python to scrape data from websites. The early chapters cover how to extract data from static web pages and how to use caching to manage the load on servers. After the basics, we'll get our hands dirty building a more sophisticated crawler with threads and other advanced topics. Learn step by step how to use Ajax URLs, employ the Firebug extension for monitoring, and scrape data indirectly. Discover the nitty-gritty of scraping, such as using the browser renderer, managing cookies, and submitting forms to extract data from complex websites protected by CAPTCHA. The book wraps up with how to create high-level scrapers with the Scrapy library and apply what you have learned to real websites.
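As a taste of the simplest case described above, extracting fields from a static web page can be as little as a regular expression applied to the downloaded HTML. This is an illustrative sketch only; the HTML snippet and class names are invented for the example, not taken from the book:

```python
import re

# A small HTML snippet standing in for a downloaded static page
# (in practice you would fetch it first, e.g. with urllib.request.urlopen).
html = """
<table>
  <tr><td class="country">Afghanistan</td><td class="area">647,500</td></tr>
  <tr><td class="country">Albania</td><td class="area">28,748</td></tr>
</table>
"""

# Pull out every country name with a non-greedy regular expression.
countries = re.findall(r'<td class="country">(.*?)</td>', html)
print(countries)  # ['Afghanistan', 'Albania']
```

Regular expressions are the quickest approach to get started with, though, as the book discusses, they are brittle when a site's layout changes.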
About This Book
- A hands-on guide to web scraping with real-life problems and solutions
- Techniques to download and extract data from complex websites
- Create a number of different web scrapers to extract information
Who This Book Is For
This book is aimed at developers who want to use web scraping for legitimate purposes. Prior programming experience with Python would be useful but is not essential. Anyone with general knowledge of programming languages should be able to pick up the book and understand the principles involved.
What You Will Learn
- Extract data from web pages with simple Python programming
- Build a threaded crawler to process web pages in parallel
- Follow links to crawl a website
- Cache downloads to reduce bandwidth
- Use multiple threads and processes to scrape faster
- Learn how to parse JavaScript-dependent websites
- Interact with forms and sessions
- Solve CAPTCHAs on protected web pages
- Discover how to track the state of a crawl
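The threaded crawling mentioned above can be sketched with Python's standard library alone. In this sketch the download function is a stand-in that simply echoes its URL so the snippet runs offline; a real crawler would fetch each page over HTTP:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a real download function (which would use urllib or requests);
# it just echoes the URL so the example runs without network access.
def download(url):
    return f"<html>page for {url}</html>"

urls = [f"http://example.com/page/{i}" for i in range(5)]

# Download several pages in parallel with a pool of worker threads.
with ThreadPoolExecutor(max_workers=3) as pool:
    pages = list(pool.map(download, urls))

print(len(pages))  # 5
```

Because downloading is I/O-bound, threads give a near-linear speedup here even under Python's global interpreter lock.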
In Detail
The Internet contains the most useful set of data ever assembled, largely publicly accessible for free. However, this data is not easily reusable: it is embedded within the structure and style of websites and needs to be carefully extracted to be useful. Web scraping is becoming increasingly useful as a means to gather and make sense of the wealth of information available online. Using a simple language like Python, you can crawl the information out of complex websites with straightforward programming.
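The caching idea mentioned above can be sketched as a small class that maps each URL to a file on disk. The directory name and MD5-based naming scheme here are illustrative choices, not the book's exact implementation:

```python
import hashlib
import os

class DiskCache:
    """Minimal disk cache sketch: one file per URL, named by URL hash."""

    def __init__(self, cache_dir="cache"):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, url):
        # Hash the URL to get a safe, fixed-length filename.
        name = hashlib.md5(url.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, name)

    def get(self, url):
        path = self._path(url)
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                return f.read()
        return None  # not cached yet

    def set(self, url, html):
        with open(self._path(url), "w", encoding="utf-8") as f:
            f.write(html)

cache = DiskCache()
cache.set("http://example.com", "<html>hello</html>")
print(cache.get("http://example.com"))  # <html>hello</html>
```

A crawler can then check the cache before downloading, avoiding repeated requests to the same server.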
Style and approach
This book is a hands-on guide with real-life examples and solutions, starting simple and progressively becoming more complex. Each chapter introduces a problem and then provides one or more possible solutions.
Chapter 1: Introduction to Web Scraping 1
When is web scraping useful? 1
Is web scraping legal? 2
Background research 2
Checking robots.txt 3
Examining the Sitemap 4
Estimating the size of a website 4
Identifying the technology used by a website 6
Finding the owner of a website 6
Crawling your first website 7
Downloading a web page 8
Retrying downloads 8
Setting a user agent 10
Sitemap crawler 11
ID iteration crawler 11
Link crawler 14
Advanced features 16
Summary 20
Chapter 2: Scraping the Data 21
Analyzing a web page 22
Three approaches to scrape a web page 24
Regular expressions 24
Beautiful Soup 26
Lxml 27
CSS selectors 28
Comparing performance 29
Scraping results 30
Overview 32
Adding a scrape callback to the link crawler 32
Summary 34
Chapter 3: Caching Downloads 35
Adding cache support to the link crawler 35
Disk cache 37
Implementation 39
Testing the cache 40
Saving disk space 41
Expiring stale data 41
Drawbacks 43
Database cache 44
What is NoSQL? 44
Install MongoDB 44
Overview of MongoDB 45
MongoDB cache implementation 46
Compression 47
Testing the cache 48
Summary 48
Chapter 4: Concurrent Downloading 49
One million web pages 49
Parsing the Alexa list 50
Sequential crawler 51
Threaded crawler 52
How threads and processes work 52
Implementation 53
Cross-process crawler 55
Performance 58
Summary 59
Chapter 5: Dynamic Content 61
An example dynamic web page 62
Reverse engineering a dynamic web page 64
Edge cases 67
Rendering a dynamic web page 69
PyQt or PySide 69
Executing JavaScript 70
Website interaction with WebKit 72
Waiting for results 73
The Render class 74
Selenium 76
Summary 78
Chapter 6: Interacting with Forms 79
The Login form 80
Loading cookies from the web browser 83
Extending the login script to update content 87
Automating forms with the Mechanize module 90
Summary 91
Chapter 7: Solving CAPTCHA 93
Registering an account 94
Loading the CAPTCHA image 95
Optical Character Recognition 96
Further improvements 100
Solving complex CAPTCHAs 100
Using a CAPTCHA solving service 101
Getting started with 9kw 102
9kw CAPTCHA API 103
Integrating with registration 108
Summary 109
Chapter 8: Scrapy 111
Installation 111
Starting a project 112
Defining a model 113
Creating a spider 114
Tuning settings 115
Testing the spider 116
Scraping with the shell command 117
Checking results 118
Interrupting and resuming a crawl 121
Visual scraping with Portia 122
Installation 122
Annotation 124
Tuning a spider 127
Checking results 129
Automated scraping with Scrapely 130
Summary 131
Chapter 9: Overview 133
Google search engine 133
Facebook 137
The website 138
The API 139
Gap 140
BMW 142