Beautiful Soup: Build a Web Scraper With Python Part 2

Gauloran

Global Moderator
Jul 7, 2013
Part 2: Scrape HTML Content From a Page
Now that you have an idea of what you’re working with, it’s time to get started using Python. First, you’ll want to get the site’s HTML code into your Python script so that you can interact with it. For this task, you’ll use Python’s requests library. Type the following in your terminal to install it:

$ pip3 install requests
Then open up a new file in your favorite text editor. All you need to retrieve the HTML are a few lines of code:

Code:
import requests

URL = 'https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia'
page = requests.get(URL)
This code performs an HTTP request to the given URL. It retrieves the HTML data that the server sends back and stores that data in a Python object.

If you take a look at the downloaded content, then you’ll notice that it looks very similar to the HTML you were inspecting earlier with developer tools. To improve the structure of how the HTML is displayed in your console output, you can print the object’s .content attribute with pprint().
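
As a rough sketch of what pprint() does with a long bytes value, the example below uses a short, made-up bytes string as a stand-in for page.content, so it runs without a network connection. The name sample_content and its HTML are assumptions for illustration only.

```python
from pprint import pformat, pprint

# Hypothetical stand-in for page.content: a bytes value holding HTML.
sample_content = (
    b'<html><head><title>Jobs</title></head><body>'
    b'<section class="card-content"><h2 class="title">Python Developer</h2>'
    b'</section></body></html>'
)

# print() would emit one very long line; pprint() wraps the bytes literal
# so that each output line fits the default width of 80 characters.
pprint(sample_content)

# pformat() returns the same wrapped text as a string instead of printing it.
formatted = pformat(sample_content)
```

With a real response you would call pprint(page.content) in the same way.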

Static Websites
The website you’re scraping in this tutorial serves static HTML content. In this scenario, the server that hosts the site sends back HTML documents that already contain all the data you’ll get to see as a user.

When you inspected the page with developer tools earlier on, you discovered that a job posting consists of the following long and messy-looking HTML:

Code:
<section class="card-content" data-jobid="4755ec59-d0db-4ce9-8385-b4df7c1e9f7c" onclick="MKImpressionTrackingMouseDownHijack(this, event)">
<div class="flex-row">
<div class="mux-company-logo thumbnail"></div>
<div class="summary">
<header class="card-header">
<h2 class="title"><a data-bypass="true" data-m_impr_a_placement_id="JSR2CW" data-m_impr_j_cid="4" data-m_impr_j_coc="" data-m_impr_j_jawsid="371676273" data-m_impr_j_jobid="0" data-m_impr_j_jpm="2" data-m_impr_j_jpt="3" data-m_impr_j_lat="30.1882" data-m_impr_j_lid="619" data-m_impr_j_long="-95.6732" data-m_impr_j_occid="11838" data-m_impr_j_p="3" data-m_impr_j_postingid="4755ec59-d0db-4ce9-8385-b4df7c1e9f7c" data-m_impr_j_pvc="4496dab8-a60c-4f02-a2d1-6213320e7213" data-m_impr_s_t="t" data-m_impr_uuid="0b620778-73c7-4550-9db5-df4efad23538" href="https://job-openings.monster.com/python-developer-woodlands-wa-us-lancesoft-inc/4755ec59-d0db-4ce9-8385-b4df7c1e9f7c" onclick="clickJobTitle('plid=619&pcid=4&poccid=11838','Software Developer',''); clickJobTitleSiteCat('{"events.event48":"true","eVar25":"Python Developer","eVar66":"Monster","eVar67":"JSR2CW","eVar26":"_LanceSoft Inc","eVar31":"Woodlands_WA_","prop24":"2019-07-02T12:00","eVar53":"1500127001001","eVar50":"Aggregated","eVar74":"regular"}')">Python Developer
</a></h2>
</header>
<div class="company">
<span class="name">LanceSoft Inc</span>
<ul class="list-inline">
</ul>
</div>
<div class="location">
<span class="name">
Woodlands, WA
</span>
</div>
</div>
<div class="meta flex-col">
<time datetime="2017-05-26T12:00">2 days ago</time>
<span class="mux-tooltip applied-only" data-mux="tooltip" title="Applied">
<i aria-hidden="true" class="icon icon-applied"></i>
<span class="sr-only">Applied</span>
</span>
<span class="mux-tooltip saved-only" data-mux="tooltip" title="Saved">
<i aria-hidden="true" class="icon icon-saved"></i>
<span class="sr-only">Saved</span>
</span>
</div>
</div>
</section>
It can be difficult to wrap your head around such a long block of HTML code. To make it easier to read, you can use an HTML formatter to automatically clean it up a little more. Good readability helps you better understand the structure of any code block. While it may or may not help to improve the formatting of the HTML, it’s always worth a try.

Note: Keep in mind that every website will look different. That’s why it’s necessary to inspect and understand the structure of the site you’re currently working with before moving forward.

The HTML above definitely has a few confusing parts in it. For example, you can scroll to the right to see the large number of attributes that the <a> element has. Luckily, the class names on the elements that you’re interested in are relatively straightforward:

Code:
class="title": the title of the job posting
class="company": the company that offers the position
class="location": the location where you’d be working
In case you ever get lost in a large pile of HTML, remember that you can always go back to your browser and use developer tools to further explore the HTML structure interactively.
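
To see why those class names are useful, here is a minimal sketch that pulls the text out of elements by class. It uses Python's standard library html.parser as a stand-in, not Beautiful Soup, and it runs on a trimmed, hypothetical version of the job posting above. The names SAMPLE_HTML and ClassTextCollector are made up for this example, and the third class name is assumed to be location.

```python
from html.parser import HTMLParser

# Trimmed, hypothetical version of the job posting markup shown earlier.
SAMPLE_HTML = """
<section class="card-content">
  <h2 class="title"><a href="#">Python Developer
</a></h2>
  <div class="company"><span class="name">LanceSoft Inc</span></div>
  <div class="location"><span class="name">
Woodlands, WA
</span></div>
</section>
"""


class ClassTextCollector(HTMLParser):
    """Collect the text found inside elements that carry a wanted class."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        # One entry per open tag: the matched class name, or None.
        # Assumes every tag is closed; void tags like <br> would skew it.
        self.stack = []
        self.found = {}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        match = next((c for c in classes if c in self.wanted), None)
        self.stack.append(match)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Credit the text to every enclosing element with a wanted class.
        for name in self.stack:
            if name:
                self.found[name] = (self.found.get(name, "") + " " + text).strip()


collector = ClassTextCollector(["title", "company", "location"])
collector.feed(SAMPLE_HTML)
collector.close()
print(collector.found)
```

This only illustrates the idea of selecting by class name. A dedicated parsing library handles messy real-world HTML far more robustly than this sketch does.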

By now, you’ve successfully harnessed the power and user-friendly design of Python’s requests library. With only a few lines of code, you managed to scrape the static HTML content from the web and make it available for further processing.

However, there are a few more challenging situations you might encounter when you’re scraping websites. Before you begin using Beautiful Soup to pick the relevant information from the HTML that you just scraped, take a quick look at two of these situations.

Hidden Websites
Some pages contain information that’s hidden behind a login. That means you’ll need an account to be able to see (and scrape) anything from the page. Making an HTTP request from your Python script is different from accessing a page in your browser. Just because you can log in to the page through your browser doesn’t mean you’ll be able to scrape it with your Python script.

However, there are some advanced techniques that you can use with the requests library to access the content behind logins. These techniques will allow you to log in to websites while making the HTTP request from within your script.
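
One common pattern, sketched below under several assumptions, is to post the login form through a requests.Session so that the cookies from the login are reused on later requests. The function name fetch_behind_login, the URLs, and the form field names are all hypothetical. Real sites often require extra steps such as CSRF tokens, so treat this as a starting point rather than a recipe that works everywhere.

```python
def fetch_behind_login(session, login_url, protected_url, credentials, timeout=10):
    """Log in with a form POST, then fetch a page that requires the login.

    `session` is expected to be a requests.Session (or any object with
    compatible post() and get() methods). It keeps the cookies set by the
    login response and sends them along with the second request.
    """
    login_response = session.post(login_url, data=credentials, timeout=timeout)
    login_response.raise_for_status()  # stop early on an HTTP error status

    page = session.get(protected_url, timeout=timeout)
    page.raise_for_status()
    return page.text


# Usage sketch (needs the requests package and a real site;
# the URLs and field names below are placeholders):
#
# import requests
# with requests.Session() as session:
#     html = fetch_behind_login(
#         session,
#         "https://example.com/login",
#         "https://example.com/members",
#         {"username": "your-name", "password": "your-password"},
#     )
```

Passing the session in as a parameter keeps the function easy to try out with a stand-in object before pointing it at a real site.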

Dynamic Websites
Static sites are easier to work with because the server sends you an HTML page that already contains all the information as a response. You can parse an HTML response with Beautiful Soup and begin to pick out the relevant data.

On the other hand, with a dynamic website the server might not send back any HTML at all. Instead, you’ll receive JavaScript code as a response. This will look completely different from what you saw when you inspected the page with your browser’s developer tools.

Note: To offload work from the server to the clients’ machines, many modern websites avoid crunching numbers on their servers whenever possible. Instead, they’ll send JavaScript code that your browser will execute locally to produce the desired HTML.

As mentioned before, what happens in the browser is not related to what happens in your script. Your browser will diligently execute the JavaScript code it receives back from a server and create the DOM and HTML for you locally. However, making a request to a dynamic website from your Python script will not provide you with the HTML page content.

When you use requests, you’ll only receive what the server sends back. In the case of a dynamic website, you’ll end up with some JavaScript code, which you won’t be able to parse using Beautiful Soup. The only way to go from the JavaScript code to the content you’re interested in is to execute the code, just like your browser does. The requests library can’t do that for you, but there are other solutions that can.
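
A quick way to tell which kind of site you're dealing with is to check whether text you can see in the browser also appears in the raw response. The sketch below uses two made-up response bodies, so it runs offline. STATIC_RESPONSE, DYNAMIC_RESPONSE, and is_in_raw_html are hypothetical names, and the dynamic sample assumes a common case where the server sends a nearly empty page plus a script reference.

```python
# Hypothetical response bodies, standing in for page.text from two sites.
STATIC_RESPONSE = """<html><body>
<section class="card-content"><h2 class="title">Python Developer</h2></section>
</body></html>"""

DYNAMIC_RESPONSE = """<html><body>
<div id="app"></div>
<script src="/bundle.js"></script>
</body></html>"""


def is_in_raw_html(html: str, expected_text: str) -> bool:
    """Return True if text seen in the browser is present in the raw response."""
    return expected_text.casefold() in html.casefold()


# The job title is visible in the browser for both pages, but only the
# static response already contains it before any JavaScript has run.
print(is_in_raw_html(STATIC_RESPONSE, "Python Developer"))
print(is_in_raw_html(DYNAMIC_RESPONSE, "Python Developer"))
```

If the check fails for content you can clearly see in your browser, the page is probably being filled in by JavaScript after it loads.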

For example, requests-html is a project created by the author of the requests library that allows you to easily render JavaScript using syntax that’s similar to the syntax in requests. It also includes capabilities for parsing the data by using Beautiful Soup under the hood.

Note: Another popular choice for scraping dynamic content is Selenium. Selenium automates a real web browser, which executes the JavaScript code for you before the rendered HTML is passed on to your script.

You won’t go deeper into scraping dynamically-generated content in this tutorial. For now, it’s enough for you to remember that you’ll need to look into the above-mentioned options if the page you’re interested in is generated in your browser dynamically.

this post is quoted​
 