Data Scientist Gilvandro Neto has written a tutorial on how to extract keywords from news articles and then “create a dataset to use a function that applies the concept of POS tagging to identify keywords.”
The tutorial uses articles about the coronavirus as a timely topic example, retrieving articles with the News API and analyzing them with spaCy. Neto breaks the tutorial into four parts: setup, coding, conclusion, and future work.
Part One: Setup
Using a notebook on Google Colab (an IDE or a local Python notebook would also work), Neto installs spaCy with pip. The News API also has a Python library, which can be installed with pip.
The spaCy English language model is then downloaded in the largest size offered. After installation, you'll import the spaCy library, along with another library used to support the NLP analysis. The spaCy model is then loaded into a variable Neto calls nlp_eng.
Part Two: Coding
Neto uses the News API because of its simple and speedy ability to search a broad range of publications, and an API key can be created for free. However, he notes that “When we send an HTTP Request, the API returns as much as 100 articles for maximum — you have to pay the dev account to get the total of results.”
The results are returned in pages of 20 articles each, so accessing all 100 free articles requires implementing some form of pagination. Also note that the maximum date range for searches is 30 days.
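A minimal sketch of that pagination, assuming the `newsapi-python` client (the request calls are commented out because they need a real API key; the key and query shown are placeholders):

```python
# from newsapi import NewsApiClient          # needs: pip install newsapi-python
# api = NewsApiClient(api_key="YOUR_KEY")    # placeholder key

def pages_needed(max_results=100, page_size=20):
    """Page numbers required to cover max_results at page_size per page."""
    n_pages = -(-max_results // page_size)   # ceiling division
    return list(range(1, n_pages + 1))

articles = []
for page in pages_needed():
    # A query mirroring the tutorial's coronavirus search might look like:
    # resp = api.get_everything(q="coronavirus", language="en",
    #                           page_size=20, page=page)
    # articles.extend(resp["articles"])
    pass

print(pages_needed())  # [1, 2, 3, 4, 5] -- five pages cover the 100 free articles
```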
Since the notebook runs on Google Colab, the dataset should be saved to Google Drive (just in case)! Neto uses the Pickle library to save all of the articles, and includes a how-to for creating and saving the .pckl file.
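Saving and restoring the articles with Pickle can be sketched like this (the file name and article fields are illustrative, not taken from the tutorial; in Colab you would point the path at a mounted Drive folder):

```python
import pickle
from pathlib import Path

# Illustrative article records; the real ones come from the News API responses
articles = [{"title": "Example headline", "content": "Example body text."}]

# Dump the list to a .pckl file
path = Path("articles.pckl")
with path.open("wb") as f:
    pickle.dump(articles, f)

# Load it back to confirm the round trip
with path.open("rb") as f:
    restored = pickle.load(f)

print(restored == articles)  # True
```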
Part Three: Conclusion
Once the dataset has been saved and reloaded with the Pickle library, the results are ready to share. Neto advises that once you have “a dataset with the 5 most common keywords of each article concatenate with another several articles regarding the COVID-19, it’s time to choose the best way of show our results. I’ve choose a WordCloud, a picture that show the words of a text according to it’s frequency..” (sic)
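Assuming each article has already been reduced to its five most common keywords (the lists below are made up for illustration), concatenating and counting them for the word cloud might look like:

```python
from collections import Counter

# Hypothetical per-article keyword lists; in the tutorial these come from
# POS-tagging each article with spaCy and keeping the most frequent terms
article_keywords = [
    ["coronavirus", "vaccine", "health", "cases", "lockdown"],
    ["coronavirus", "economy", "cases", "travel", "vaccine"],
]

# Concatenate every article's top-5 keywords and count overall frequencies
all_keywords = [kw for kws in article_keywords for kw in kws]
freq = Counter(all_keywords)
print(freq.most_common(3))

# Rendering the word cloud from these frequencies would look roughly like:
#   from wordcloud import WordCloud
#   cloud = WordCloud().generate_from_frequencies(freq)
```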
Part Four: Future Work
This project has the capacity to evolve into a more powerful application, moving well beyond creating word clouds to find commonalities in datasets. The approach can be applied to other parts of the DataFrame, or new methods can be developed that link more than one field. Drawing data from the major social media platforms is another promising direction. Neto shares his final thoughts on the project:
“One of my investigations shows that we have enough dataset to train a language model with the COVID-19 news and get answers for some questions with these news. I’m trying this now with BERT and OpenAI GPT-2 — more results soon!"