Python Scrapy spiders that scrape job data, people profiles, and company profiles from LinkedIn.com.
This Scrapy project contains 3 separate spiders:
| Spider | Description |
|---|---|
| `linkedin_people_profile` | Scrapes people data from LinkedIn people profile pages. |
| `linkedin_jobs` | Scrapes job data from LinkedIn (https://www.linkedin.com/jobs/search). |
| `linkedin_company_profile` | Scrapes company data from LinkedIn company profile pages. |
The following articles go through in detail how these LinkedIn spiders were developed, which you can use to understand the spiders and edit them for your own use case.
- Python Scrapy: Build A LinkedIn.com People Profile Scraper
- Python Scrapy: Build A LinkedIn.com Jobs Scraper
- Python Scrapy: Build A LinkedIn.com Company Profile Scraper
These LinkedIn spiders use ScrapeOps Proxy as the proxy solution. ScrapeOps has a free plan that allows you to make up to 1,000 requests per month, which makes it ideal for the development phase, but it can easily be scaled up to millions of pages per month if need be.
You can sign up for a free API key here.
⚠️ Important: You will need to first validate your email and sometimes your phone number before your API key will work in this project!
The cost for one LinkedIn request is 30 credits. Our free plan offers 1,000 credits, so you would only have enough to make 33 requests to LinkedIn on the free plan. Paid plans start at $9 per month for 25k credits (approx. 833 LinkedIn requests).
Follow these steps exactly to get your project set up & running.
- Clone this repo in your project folder (we presume you already have git installed!)
```
git clone https://github.com/python-scrapy-playbook/linkedin-python-scrapy-scraper
```
- Set up a virtual environment. Go into the downloaded project and create a new virtual environment.
```
cd linkedin-python-scrapy-scraper
python -m venv venv
```
- Activate the virtual environment
Mac/Linux:

```
source venv/bin/activate
```

Windows (Command Prompt):

```
venv\Scripts\activate
```

Windows (PowerShell):

```
venv\Scripts\Activate.ps1
```

- Install Scrapy & the ScrapeOps proxy middleware
```
pip install scrapy scrapeops-scrapy
pip install scrapeops-scrapy-proxy-sdk
```
- Add your ScrapeOps API key. If you don't have one already you can sign up for a free API key here.
Add your API key to SCRAPEOPS_API_KEY in the project's settings.py file.
```python
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'
SCRAPEOPS_PROXY_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725,
}
```

- You should now be able to run the scraper you would like:
```
scrapy crawl linkedin_people_profile
scrapy crawl linkedin_company_profile
scrapy crawl linkedin_jobs
```
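If you want the results of a single run written to a specific file, you can also pass Scrapy's standard -O flag (overwrite) or -o flag (append) when launching a spider. The file path below is just an example:

```
scrapy crawl linkedin_jobs -O data/jobs.json
```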
The following are instructions on how to modify the LinkedIn People Profile scraper for your particular use case.
Check out this guide to building a LinkedIn.com Scrapy people profile spider if you need any more information.
To change the query parameters for the people profile search, just change the profiles in the profile_list list in the spider.
For example:
```python
def start_requests(self):
    profile_list = ['reidhoffman', 'other_person']
    for profile in profile_list:
        linkedin_people_url = f'https://www.linkedin.com/in/{profile}/'
        yield scrapy.Request(url=linkedin_people_url, callback=self.parse_profile, meta={'profile': profile, 'linkedin_url': linkedin_people_url})
```
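If you have a lot of profiles to scrape, you could load them from a file instead of hardcoding the list. A minimal sketch, assuming a hypothetical profiles.txt file with one public profile ID per line:

```python
def start_requests(self):
    # profiles.txt is a hypothetical file with one LinkedIn profile ID
    # per line, e.g. "reidhoffman"
    with open('profiles.txt') as f:
        profile_list = [line.strip() for line in f if line.strip()]
    for profile in profile_list:
        linkedin_people_url = f'https://www.linkedin.com/in/{profile}/'
        yield scrapy.Request(url=linkedin_people_url, callback=self.parse_profile, meta={'profile': profile, 'linkedin_url': linkedin_people_url})
```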
LinkedIn People Profile pages contain a lot of useful data. This spider extracts:
- Profile Info: name, description, location, followers, connections, about
- Experience: title, company, organisation_profile, location, description, start_time, end_time, duration
- Education: organisation, organisation_profile, course_details, description, start_time, end_time
- Volunteering: role, organisation, organisation_profile, cause, description, start_time, end_time, duration
- Skills: list of skill names
- Recommendations: recommender_name, recommender_profile, content
You can expand or change the data that gets extracted by adding additional parsers and adding the data to the item that is yielded in the parse_profile method.
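For example, here is a minimal sketch of extracting one extra field inside parse_profile. The `headline` field and its CSS selector are hypothetical placeholders; inspect the live page HTML to find the element you actually need:

```python
def parse_profile(self, response):
    item = {}
    item['profile'] = response.meta['profile']
    item['url'] = response.meta['linkedin_url']

    # hypothetical extra field -- replace the selector with one that
    # matches the element you want on the real page
    item['headline'] = response.css('h2::text').get(default='').strip()

    yield item
```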
The spiders are set to only use 1 concurrent thread in the settings.py file as the ScrapeOps Free Proxy Plan only gives you 1 concurrent thread.
However, if you upgrade to a paid ScrapeOps Proxy plan you will have more concurrent threads. Then you can increase the concurrency limit in your scraper by updating the CONCURRENT_REQUESTS value in your settings.py file.
```python
# settings.py
CONCURRENT_REQUESTS = 10
```

The spiders are set to save the scraped data into JSONL files in a data folder using Scrapy's Feed Exports functionality:
```python
custom_settings = {
    'FEEDS': { 'data/%(name)s_%(time)s.jsonl': { 'format': 'jsonlines',}}
}
```

If you would like to save your files to an AWS S3 bucket then check out our Saving CSV/JSON Files to Amazon AWS S3 Bucket guide here.
Or if you would like to save your data to another type of database then be sure to check out these guides (a minimal SQLite sketch follows this list):
- Saving Data to JSON
- Saving Data to SQLite Database
- Saving Data to MySQL Database
- Saving Data to Postgres Database
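As a rough illustration of the item-pipeline approach those guides cover, here is a minimal sketch of a pipeline that writes each scraped item to a local SQLite database. The class name, database file, and schema are all assumptions you would adapt to your own project:

```python
# pipelines.py - a minimal sketch, not the pipeline from the guides above
import json
import sqlite3


class SqliteSavePipeline:
    def open_spider(self, spider):
        self.con = sqlite3.connect('linkedin.db')  # hypothetical database file
        self.cur = self.con.cursor()
        self.cur.execute('CREATE TABLE IF NOT EXISTS items (spider TEXT, data TEXT)')

    def process_item(self, item, spider):
        # store the whole item as a JSON blob so one table works for all 3 spiders
        self.cur.execute('INSERT INTO items VALUES (?, ?)', (spider.name, json.dumps(dict(item))))
        self.con.commit()
        return item

    def close_spider(self, spider):
        self.con.close()
```

You would then enable it via ITEM_PIPELINES in settings.py (the module path here assumes your Scrapy project package is called linkedin):

```python
ITEM_PIPELINES = {
    'linkedin.pipelines.SqliteSavePipeline': 300,
}
```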
To deactivate the ScrapeOps Proxy & Monitor, simply comment out the following code in your settings.py file:
```python
# settings.py

# SCRAPEOPS_API_KEY = 'YOUR_API_KEY'
# SCRAPEOPS_PROXY_ENABLED = True

# EXTENSIONS = {
#     'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
# }

# DOWNLOADER_MIDDLEWARES = {
#     ## ScrapeOps Monitor
#     'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
#     'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
#     ## Proxy Middleware
#     'scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725,
# }
```