This is planned as a two-part article: this part 1 dives into how to obtain the data for further analysis, i.e. a “corpus” of job offers / job descriptions that reflects the current state of affairs in the domain you are interested in.
As an example, I will concentrate on leadership jobs in the finance domain. And, if possible, I will focus on jobs in or around my home town of Düsseldorf. But I will try to keep the code easily adjustable to other scenarios in terms of domain and geographic location.
So the 1st part is essentially “web scraping”, with a focus on career websites. That’s the “blood, sweat & tears” part: not analytical in itself, but it provides the basis to build further analysis on.
The 2nd part will look into how to extract the sought-after insights from the collected data. It’s the truly “analytical” part. As we are collecting text data (i.e. job offers / job profiles / job descriptions), we will be entering the analytical field of NLP (natural language processing).
And as our goal is to identify certain patterns and/or commonalities in a large number of different job offers, we will be using techniques for “topic extraction”. These techniques basically provide a means to synthesize topics from a collection of texts (the individual job offers as single documents, together forming the “corpus” of texts) by creating clusters of words with a high level of co-occurrence and relevance (indicating that any such cluster addresses a common, underlying topic or, in other words, the larger trends we’re looking for).
If you have already made first contact with other machine learning techniques, this may sound a bit like “k-means clustering”, a very popular clustering algorithm. There are some similarities – but please be aware that they are NOT the same.
Outlook on part 2: visualization from an LDA analysis of job offers
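Just to give a first taste of what part 2 will do with the collected texts, here is a minimal, hypothetical sketch of topic extraction using scikit-learn’s LatentDirichletAllocation. The actual part 2 may well use different libraries and parameters; the file name and the column name are the ones produced at the end of this article, while the number of topics and the stop word list are arbitrary placeholder choices.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Load the scraped job descriptions (file produced at the end of this article)
df = pd.read_excel("_LINKEDIN_JOB_POSTINGS.xlsx")

# Turn the job descriptions into a document-term matrix
# (the stop word list would need to match the language of the postings)
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
dtm = vectorizer.fit_transform(df['JobDescDetails'].astype(str))

# Fit an LDA model with an arbitrarily chosen number of 5 topics
lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(dtm)

# The top words per topic are the word clusters hinting at the underlying topics
words = vectorizer.get_feature_names_out()
for topic_no, topic in enumerate(lda.components_):
    print(f'Topic {topic_no}:', [words[i] for i in topic.argsort()[-10:]])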
Both articles will link to a supporting GitHub repository. The code works as of October 2022. I say “as of October 2022” on purpose, as e.g. the scraping part depends a lot on the HTML structure of the website(s) scraped in the process of collecting data.
As you will see, the code structure depends on the details and conventions of the CMS (content management system) used to manage the site(s) scraped – and thus on the whims and objectives of those administering the CMS.
In other words: the site structure may change at any moment, rendering details of the code obsolete in the process – and there is no definitive answer on how to scrape a specific site in the long run (and this only addresses the technical aspect; it intentionally leaves out potential legal ramifications!).
However, I will try to give some hints that will allow you to adjust the code to any such changes with a bit of technical understanding: if code that once ran smoothly suddenly stops working, stay calm, observe what might cause the problem & apply common sense! Look at error messages, update packages in your environment … or learn to ignore warnings until “rien ne va plus”.
First things first: if you’re only interested in some “code recipe” and not in the discovery & thought process behind it, you can skip right to “The code” part and download the corresponding Jupyter Notebook from GitHub.
I wanted to code a tool that allows me to keep abreast with the requirements and evolutions in my personal professional field, aka Finance.
You can regard the sum of job offers as a sort of majority vote for what’s relevant at the time in a given professional field and in what direction it will likely evolve. As usual, this will be subject to fads & trends. But overall, job offer specifications reflect rather well what companies expect in terms of qualifications in a specific field. And they show the skills and techniques that companies regard as relevant for shaping the respective field in the near and mid-term future.
And technically, it is easier, both for your own reading and for more advanced analysis, to have a large result set available in one file, compared to having to go thru e.g. a number of result pages in steps of 25 jobs.
The result also lends itself to being used for ATS automation. Hence the title of this article, as I assume that more people will have this use case than that of monitoring broad developments in their field.
ATS (applicant tracking systems) are used by companies or recruiters to automatically track incoming applications. They use machine learning algorithms to check applications for their fit with open positions. This means that, when applying, you’re engaging in a sort of cat-and-mouse game: the better you anticipate the company’s expectations (modelled both in the job description and in the algorithm employed by the ATS), the more likely your application will make it to the next step. Analysing a great number of job offers will help you identify the “hygiene factors”, aka the standard check boxes your CV and cover letter MUST tick in order to make it past the ATS filter for the job you want. The details of “what resonates with the ATS” will, of course, vary depending on the field, the industry and the hierarchical level you’re applying for.
I set out from the start with LinkedIn. This is a no-brainer, as it has imposed itself as THE international job site. But you can examine other job / recruitment sites. E.g. there are often specialist sites for specific professional fields or industries. It all depends on what you want or need. But LinkedIn is the go-to place if you want to start out broadly – and hence worth writing code for (again: leaving out all potential legal aspects). Accordingly, the code from this article can only serve as an inspiration if you are looking for content from other sites: in this case, the scraping logic and details would need to be adjusted to the structure and logic of these specialist sites.
Given my personal objective of keeping abreast of larger developments in my field, approaches other than job sites could also work, and I will perhaps try them in the future. E.g. scraping management seminar sites and their programmes may provide valuable information on specific domains and the current trends and evolutions within these domains.
Web scraping is basically having your code pretend it’s a normal user, surfing on a website via a normal browser – and saving the resulting pages (or parts thereof) in the process of this automated browsing.
The two most popular packages for doing this with Python are BeautifulSoup and Selenium.
I needed a while to understand if and how these are different. But after diving a bit into the details, the answer, as I understood it, is: BeautifulSoup only parses HTML that has already been downloaded (e.g. via the requests package), which is fine for static pages. Selenium, in contrast, remote-controls a real browser and can therefore also handle content that is loaded dynamically via JavaScript, e.g. in response to scrolling or clicking.
As you will see, the LinkedIn site serves its content dynamically – and hence Selenium will be used.
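To make the difference tangible, here is a minimal, hypothetical sketch (the URL and the “h3” selector are just placeholders, not real job site details): the static approach downloads the delivered HTML once and parses it, while the Selenium approach drives a real browser that also executes the page’s JavaScript.

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

# Static approach: one HTTP request, then parse whatever HTML was delivered
html = requests.get("https://example.com/jobs").text            # placeholder URL
titles_static = [h.text for h in BeautifulSoup(html, "html.parser").find_all("h3")]

# Dynamic approach: a real browser session that also executes JavaScript
driver = webdriver.Chrome()
driver.get("https://example.com/jobs")                          # placeholder URL
titles_dynamic = [e.text for e in driver.find_elements(By.CSS_SELECTOR, "h3")]
driver.quit()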
As I was totally new to this, I needed to get some inspiration. And I found it in the following other Medium articles, which explain a lot in more detail:
I did have first code running about 6 months ago – but in the meantime, LinkedIn changed the search logic and the way it serves information in response to URL calls and button clicks, plus it introduced filters. The present code reflects these changes and runs as of October 2022.
The search logic change: before, LinkedIn allowed you to specify a geo-location (city, state, country) in the search PLUS a radius around it. So you could e.g. enter Düsseldorf plus 100 miles and would receive all offers for places in a radius of ca. 160 km around Düsseldorf (incidentally including a lot of offers from Belgium and the Netherlands). This was completely dropped in favour of a geoID system. Any manual search filtered on a location returns a geoID in the result page URL that you can use to define the location for which you want to search with your code. I did not fully understand the methodology behind the geoID system – but searching e.g. for “Germany” or “France” would always reveal the same geoID in the URL. And using this geoID in the code would reliably produce results only from the chosen country. The same applies if you zoom in on states, regions or individual cities. In short: it works… but I can’t explain 100% how 😉
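For illustration, here is a small sketch of how you could pull the geoID out of a result-page URL copied from such a manual search. The URL below is a made-up example of the pattern (the geoID 91000006 is the one used for the DACH region later in this article):

from urllib.parse import urlparse, parse_qs

# Made-up example of a result-page URL after manually filtering on a location
manual_url = "https://www.linkedin.com/jobs/search/?keywords=Director%20Finance&geoId=91000006"

# Extract the geoId parameter so it can be reused in the scripted search later on
geo_id = parse_qs(urlparse(manual_url).query).get("geoId", [None])[0]
print(geo_id)   # -> 91000006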
The serving logic change: this will be explained in detail below. The short story: before, the detailed job (aka card) information was loaded immediately for all 25 jobs usually displayed on a single result page when calling the result page URL; now, only the first 7 (out of a total of 25) result cards are loaded in response to the URL call. After the changes, an additional “artificial scrolling” step needed to be added to force the loading of all 25 cards and their details before they can be scraped.
The filter logic: similar to the geoID, the settings for additional filters are now also hard-coded into the search URL. This is relevant because, for my case, I was only interested in leadership roles. This can be achieved e.g. by setting the filter “job experience” (“Berufserfahrung” with German settings) to show only jobs at the level “Director / VP” or “C-Level”. It took a bit of intuition to figure out that this produces the “f_E=5%2C6” part in the search URL (and I assume the “f_E” stands for “filter employment level”), but it works reliably.
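Putting the pieces together, a filtered search URL follows roughly this pattern. This is an illustrative example built from the parameters discussed above, not an official URL specification:

# Illustrative only: keywords, geoID and the experience-level filter as URL parameters
example_url = ("https://www.linkedin.com/jobs/search/"
               "?keywords=Director%20Finance"
               "&geoId=91000006"      # the geoID found via a manual search (here: DACH)
               "&f_E=5%2C6")          # experience level: Director / VP or C-Level
print(example_url)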
Now finally for the site structure of LinkedIn’s search results: understanding this structure is necessary to come up with the right coding logic to access and extract the info I am interested in. The typical structure of the result of a job search looks like this:
On the left pane (marked in green), you see a list of jobs as a result of your search. In terms of the LinkedIn-CMS, these are “cards” (you can learn this from inspecting the page via the famous CTRL+SHIFT+I keyboard combination).
The first single card is marked as “2” in the image. The right pane displays the details of the currently selected card (marked as “3”). It contains the details we are interested in (and want to write to a data frame for further analysis).
Finally, marked with “1”, you see the overall number of results found for your search. This figure is important, as the LinkedIn CMS systematically only shows 25 cards at a time from the total result set.
The pseudo-code for the scraping task is surprisingly simple and mainly consists of two nested loops:
Define search parameters
Log in
Loop 1: loop thru the total results in steps of 25 results per page
    Scroll thru the left pane of result cards to make sure all 25 cards and their
    detail info are loaded (this step was made necessary by the latest changes
    on the LinkedIn side)
    Loop 2: loop thru the 25 cards of the current single result page
        Click on each card to display its detail info
        Scrape the detail info from the card details
Transform list of scraped results to data frame
Save data frame to file
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import re
import time
import pandas as pd
import random
Setting Selenium and Pandas options
options = Options()
options.add_argument("start-maximized")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
Get User/Password combination to log on to LinkedIn
USERNAME = input("Enter the username: ")
PASSWORD = input("Enter the password: ")
print(USERNAME)
print(PASSWORD)
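A small optional variation: if you prefer the password not to be echoed to the screen, Python’s standard getpass module can replace the plain input() call (and the print statements can then be dropped):

from getpass import getpass

USERNAME = input("Enter the username: ")
PASSWORD = getpass("Enter the password: ")   # the input stays hidden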
Define search string details. This is the important part where the filter criteria for the job search are set.
[!CAUTION] In other words: this is the part of the code where you would intervene when you want to adjust the results to your personal needs.
As explained above, you would need to tinker with manual searches yourself first to find out e.g. the geoID or the other filter settings that fit your purpose.
keywords4search = ['Director Finance']
s_search_loc = 'Deutschland%2C%20Österreich%20und%20die%20Schweiz'
s_geo_loc = 'geoId=91000006'   # 'geoId=91000006' = DACH
s_pos_level = 'f_E=5%2C6'      # only Director, VP or C-Level
s_search_distance = '25'       # search radius in miles (referenced in the search URL below; adjust as required)
The definition of the “artificial scrolling”: the search URL is called, but after the recent LinkedIn-CMS changes, not all 25 cards and their details are loaded immediately. The LinkedIn-CMS identifies these cards via the class “disabled ember-view job-card-container__link job-card-list__title”. The code checks how many of these elements were found. If the number is lower than 25, Selenium forces the loading of additional cards and their data by scrolling down ( driver.execute_script(“return arguments[0].scrollIntoView();”, card_list[len(card_list)-1]) ) and checks whether 25 are now visible. The scrolling step is repeated until 25 is reached … or 8 scrolling attempts have been made. This last check is implemented via the v_while_cycles variable. Observing the results produced by the code, scrolling usually added 5 cards per scroll after the initial load of 7 cards, but sometimes a scroll would not produce any new cards (probably a time-out problem); after 8 cycles, however, all cards were always visible. The print statements produce an output that lets you verify the outcome of the scrolling.
The function returns a variable card_list, a list of Selenium elements containing all 25 cards AND the associated card details.
def f_artificial_scrolling(search_URL):
    v_while_cycles = 0
    card_list_length = 0
    card_list_length_old = 0
    driver.get(search_URL)
    # Trick: joining the compound class name with dots lets By.CLASS_NAME act like a compound selector
    card_list = driver.find_elements(By.CLASS_NAME, '.'.join("disabled ember-view job-card-container__link job-card-list__title".split()))
    card_list_length = len(card_list)
    print(f'CONTROL1: card_list_length: value = {card_list_length} and card_list_length_old: value = {card_list_length_old}')
    # Scroll to the last visible card until all 25 cards are loaded (or 8 attempts have been made)
    while (card_list_length < 25) and (v_while_cycles < 8):
        v_while_cycles += 1
        driver.execute_script("return arguments[0].scrollIntoView();", card_list[len(card_list)-1])
        card_list = driver.find_elements(By.CLASS_NAME, '.'.join("disabled ember-view job-card-container__link job-card-list__title".split()))
        card_list_length_old = card_list_length
        card_list_length = len(card_list)
        print(f'CONTROL2: card_list_length: value = {card_list_length} and card_list_length_old: value = {card_list_length_old}')
    return card_list
The following function is called after an individual card has been clicked: for this card, a list is returned that contains its details. For every detail, first an ‘initial_value’ is set; then a try/except block is run, which lets you see from the result whether the code was able to find the respective element (via By.CLASS_NAME) in the card details. This is the actual content scraping!
def f_scrap_position_details():
    # Details to be scraped from the individual clicked card aka job posting
    JobTitle = 'initial_value'
    JobTitleLink = 'initial_value'
    CompanyName = 'initial_value'
    CompanyLocation = 'initial_value'
    CompanySize = 'initial_value'
    PositionLevel = 'initial_value'
    JobDescDetails = 'initial_value'
    try:
        JobTitle = driver.find_element(By.CLASS_NAME, "jobs-unified-top-card__job-title").text
    except:
        JobTitle = "JobTitle not found"
    try:
        JobTitleLink = driver.find_element(By.CLASS_NAME, "jobs-unified-top-card__content--two-pane").find_element(By.TAG_NAME, "a").get_attribute('href')
    except:
        JobTitleLink = "JobTitleLink not found"
    try:
        CompanyName = driver.find_element(By.CLASS_NAME, "jobs-unified-top-card__company-name").find_element(By.TAG_NAME, "a").text
    except:
        CompanyName = "CompanyName not found"
    try:
        CompanyLocation = driver.find_element(By.CLASS_NAME, "jobs-unified-top-card__bullet").text
    except:
        CompanyLocation = "CompanyLocation not found"
    # The job description is spread over several <span> elements: keep the first one
    # that is long enough to be the actual description text
    JobDescElements = driver.find_element(By.CLASS_NAME, "jobs-description-content__text").find_elements(By.TAG_NAME, "span")
    for element in JobDescElements:
        try:
            if len(element.text) > 50:
                JobDescDetails = element.text
                break
            else:
                JobDescDetails = "These are not the JobDescDetails you were looking for"
        except:
            JobDescDetails = "JobDescDetails not found"
    try:
        CompanyInsights = driver.find_elements(By.CLASS_NAME, '.'.join("jobs-unified-top-card__job-insight".split()))
        PositionLevel = CompanyInsights[0].text
        CompanySize = CompanyInsights[1].text
    except:
        PositionLevel = "PositionLevel not found"
        CompanySize = "CompanySize not found"
    l_return = [JobTitle, JobTitleLink, CompanyName, CompanyLocation, CompanySize, PositionLevel, JobDescDetails]
    return l_return
# Use driver to open the link
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get("https://www.linkedin.com/uas/login")
time.sleep(4)
# Use login credentials to login
email=driver.find_element(By.ID, "username")
email.send_keys(USERNAME)
password=driver.find_element(By.ID, "password")
password.send_keys(PASSWORD)
time.sleep(3)
password.send_keys(Keys.RETURN)
[!CAUTION] Mind that you must adjust the construction of the s_searchURL string in case you want to include further variables (ideally defined above), e.g. filters other than s_pos_level.
# Make keywords that may contain blanks "URL-ready"
s_search_pos = ''   # string that will contain the URL-ready position description
for i in keywords4search:
    i2 = i.replace(' ', '%20')
    s_search_pos += i2 + '%20'
s_search_pos = s_search_pos[:-3]
s_searchURL = f'https://www.linkedin.com/jobs/search/?&{s_pos_level}&{s_geo_loc}&distance={s_search_distance}&keywords={s_search_pos}&location={s_search_loc}'
print(s_searchURL)
driver.get(s_searchURL)
Initialize the result list; retrieve and store key search information:
result = []
link = driver.current_url
no_posts = int(driver.find_element(By.CLASS_NAME,'.'.join("display-flex t-12 t-black--light t-normal".split())).text.split(' ')[0])
print(f'For position {s_search_pos} there are {no_posts} vacancies.')
npages = (no_posts//25)+1
print(f'No. of pages with results: {npages}')
Loop over result pages, force “slow scrolling” thru all results and retrieve detail info from all cards.
Some (random) sleep commands are inserted after each new page call or card click. The reason is to give the page enough time to load its content and to make the automated browsing look a bit less like a bot.
for i in range(0, npages*25, 25):
    print("Page:", int((i/25)+1), "of", npages)
    s_searchURL = link + "&start=" + str(i)
    time.sleep(7)
    loop = 0
    card_list = f_artificial_scrolling(s_searchURL)
    print('Length of card list on this page: ' + str(len(card_list)))
    while True:
        try:
            current_card = card_list[loop]
            loop += 1
            current_card.click()
            time.sleep(random.randint(1, 6))
            try:
                result.append(f_scrap_position_details())
            except:
                pass
        except:
            # an IndexError signals that all cards on this page have been processed
            break
df = pd.DataFrame(result, columns = ["JobTitle", "JobTitleLink", "CompanyName","CompanyLocation", "CompanySize", "PositionLevel","JobDescDetails"])
# replace line breaks / whitespace runs from the HTML via regex
df['JobDescDetails'] = df['JobDescDetails'].replace(r'\s+|\n', ' ', regex=True)
# Split long link so as to keep only the essential part, the direct link to the job post
df[['DirectJobLink','garbage']] = df['JobTitleLink'].str.split("?", n=1, expand=True)
df = df.drop(['JobTitleLink', 'garbage'], axis = 1)
df.to_excel("_LINKEDIN_JOB_POSTINGS.xlsx")
And here is the link to the full Python notebook:
https://github.com/syrom/ATS/blob/main/LinkedIn_Scrape_V30b.ipynb