Top Words in COVID & Corona Domain Names – Text Basics
A few basics on determining the most popular terms in a dataframe of text. In this case, the dataframe is domain names that have been exported into a CSV. For my example, I have a list of ~205,000 domain names that contain ‘covid’ and/or ‘corona’. My purpose is to understand the other terms most frequently associated with these two.
[css autolinks=”false” classname=”myclass” collapse=”false” wraplines=”true” firstline=”1″ gutter=”true” highlight=”5, 6, 10″ htmlscript=”false” light=”false” padlinenumbers=”false” smarttabs=”true” tabsize=”4″ toolbar=”true” title=”Import the csv file”]
import pandas as pd
import glob
from pandas import ExcelWriter
from pandas import ExcelFile
domains = pd.read_csv("c:/users/chris/allcorona2.csv", delimiter=',')
domains.head(n=10)
#domains.tail(n=10)
#create a subset of the dataframe
domains = domains[['tld', 'domainname', 'term_split', 'termlength', 'spellerror', 'lema']]
#or, as an alternative to subsetting, delete specific columns instead
#domains = domains.drop(columns = ['ref', 'spcorrpercent'])
domains
[/css]
Clean the text by removing any spaces at the beginning of the word, lowercasing the text and removing extra spaces. You’ll also see that I found some words that my dictionary did not properly split, e.g. ‘co vid’, so I’ve taken these words and replaced them with the correct term – ‘covid’.
[css autolinks=”false” classname=”myclass” collapse=”false” firstline=”1″ gutter=”true” highlight=”5, 11, 14, 17, 20, 27,30 ” htmlscript=”false” light=”false” padlinenumbers=”false” smarttabs=”true” tabsize=”4″ toolbar=”true” title=”Clean the text”]
#split the words into unique elements, clean, stack, etc.
import pandas as pd
import re
#create a new variable using term_split
domains['clean'] = domains['term_split']
#remove any spaces at the beginning of the text
domains['clean'] = domains['clean'].str.strip()
# change all text to be lowercase
domains['clean'] = domains['clean'].str.lower()
#remove hyphens
domains['clean'] = domains['clean'].astype(str).apply(lambda x: x.replace('-', ''))
#my dictionary wasn't always so perfect in separating 'covid' and 'corona', so I'm padding the split variants first
domains['clean'] = domains['clean'].replace(["c ovid", "co vid", "covi d", "cov id"], " covid ", regex=True)
domains['clean'] = domains['clean'].replace(["c orona", "co rona", "cor ona", "coro na", "coron a"], " corona ", regex=True)
#ensure the intact terms are padded too, using regex replace and a lambda
domains['clean'] = domains['clean'].astype(str)
domains['clean'] = domains['clean'].replace(to_replace="virus", value=" virus ", regex=True)
domains['clean'] = domains['clean'].replace(to_replace="corona", value=" corona ", regex=True)
domains['clean'] = domains['clean'].apply(lambda x: x.replace('covid', ' covid '))
# then clean up any double spacing I created
domains['clean'] = domains['clean'].str.replace(r'\s+', ' ', regex=True).str.strip()
#quick sanity checks on the cleaned terms
domains['clean'].str.split().tolist() #expand=False
set(domains['clean'])
#Other useful methods to clean text for ease of reference (a short sketch applying these follows this block)
# .stack()
# .drop_duplicates()
# .replace(r'[\-\!\@\#\$\%\^\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\&\?\|]', '')
[/css]
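To make those reference notes concrete, here is a minimal sketch of my own (not part of the original pipeline) that applies the same methods, the punctuation-stripping replace plus split/stack/drop_duplicates, to a copy of the cleaned column:
[css gutter="true" tabsize="4" toolbar="true" title="Sketch: other cleaning methods applied"]
#a minimal sketch (my own addition) applying the reference methods above to the cleaned column
#strip common punctuation characters into a throwaway copy
clean_nopunct = domains['clean'].str.replace(r'[\-\!\@\#\$\%\^\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\&\?\|]', '', regex=True)
#split each cleaned name into words, stack into one long Series and drop duplicates
unique_words = clean_nopunct.str.split(expand=True).stack().drop_duplicates()
unique_words.head(20)
[/css]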
With the terms all cleaned, I consolidate them by flattening everything into a single list and run my counter.
[css autolinks=”false” classname=”myclass” collapse=”false” firstline=”1″ gutter=”true” highlight=”5, 8, 11″ htmlscript=”false” light=”false” padlinenumbers=”false” smarttabs=”true” tabsize=”4″ toolbar=”true” title=”Flatten into a list”]
import itertools
from itertools import chain
import collections
#add in a column for the number of words in each domainname
domains['num_words'] = [len(sentence.split()) for sentence in domains['clean']]
# Flatten the list of words
domainwords = list(chain.from_iterable(map(str.split, domains['clean'])))
# Create counter
word_counts = collections.Counter(domainwords)
# show the top x words
word_counts.most_common(20)
[/css]
Looking at the results of the counter, I noticed there are some words I want to exclude. Note that, as these are domain names, an out-of-the-box cleaner does not fit my purpose, so I’m doing it by hand. This approach is particularly helpful if you have scraped files and want to remove HTML coding (a quick sketch of that follows the next block).
[css autolinks=”false” classname=”myclass” collapse=”false” firstline=”1″ gutter=”true” highlight=”6-7, 12″ htmlscript=”false” light=”false” padlinenumbers=”false” smarttabs=”true” tabsize=”4″ toolbar=”true” title=”Create a list of words to remove from my list.”]
import itertools
from itertools import chain
import collections
# remove words not wanted in the list. Keep in mind the domain names had to have the term covid or corona to be included in the dataset
remove_words = ['a', 'and', 'to', 'the', 'in', 'of', 'for', 'is', 'on', 'e', 'it', 'be', 'as', 's', 'covid', 'corona', '19']
domainwords = [word.lower() for word in domainwords if word.lower() not in remove_words]
# Create counter
word_counts = collections.Counter(domainwords)
# show the top x words
word_counts.most_common(100)
[/css]
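As an aside on the scraped-files point above: a small, purely hypothetical sketch of removing HTML tags before counting could look like the following (the pages dataframe and scraped_text column are assumptions for illustration, not part of the domain-name dataset):
[css gutter="true" tabsize="4" toolbar="true" title="Sketch: stripping html from scraped text"]
#hypothetical example: remove html tags before splitting and counting words
#assumes a dataframe 'pages' with a 'scraped_text' column; not part of the domain-name dataset
import pandas as pd
pages = pd.DataFrame({'scraped_text': ['<p>corona virus <b>update</b></p>', '<div>covid testing info</div>']})
pages['text_clean'] = pages['scraped_text'].str.replace(r'<[^>]+>', ' ', regex=True).str.strip()
pages['text_clean']
[/css]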
Lists are boring…so a quick bar graph to show the top 25 terms.
[css autolinks=”false” classname=”myclass” collapse=”false” firstline=”1″ gutter=”true” highlight=”2, 5, 7″ htmlscript=”false” light=”false” padlinenumbers=”false” smarttabs=”true” tabsize=”4″ toolbar=”true” title=”Create a quick bar graph”]
from matplotlib import pyplot as plt
word_counts_a = pd.DataFrame(word_counts.most_common(25), columns=['words', 'count'])
word_counts_a.head()
fig, ax = plt.subplots(figsize=(15, 8))
# Plot horizontal bar graph
word_counts_a.sort_values(by = 'count').plot.barh(x = 'words',
    y = 'count',
    ax = ax,
    color = "blue")
ax.set_title("Top Words contained in ‘COVID’ & ‘Corona’ Domain Names Registered in 2020")
plt.show()
len(word_counts_a)
[/css]
We can easily see the biggest term paired with corona/covid is ‘virus’, which isn’t a big surprise, but let’s zoom in a bit more with a bar chart that skips the top three words.
[css autolinks=”false” classname=”myclass” collapse=”false” firstline=”1″ gutter=”true” highlight=”2″ htmlscript=”false” light=”false” padlinenumbers=”false” smarttabs=”true” tabsize=”4″ toolbar=”true” title=”Zoom in on the terms after the top three”]
#view a subset: skip the top three words
word_counts_b = pd.DataFrame(word_counts.most_common(40), columns=['words', 'count'])[3:]
fig, ax = plt.subplots(figsize=(15, 17))
# Plot horizontal bar graph
word_counts_b.sort_values(by = 'count').plot.barh(x = 'words',
    y = 'count',
    ax = ax,
    color = "purple")
ax.set_title("Top Words (after the top three) contained in ‘COVID’ & ‘Corona’ Domain Names Registered in 2020")
plt.show()
len(word_counts_b)
[/css]
So now we have a better perspective on the relative volume of the other terms.
One last step: create a Pandas dataframe with the top 5,000 terms for further analysis.
[css autolinks=”false” classname=”myclass” collapse=”false” firstline=”1″ gutter=”true” highlight=”1, 4″ htmlscript=”false” light=”false” padlinenumbers=”false” smarttabs=”true” tabsize=”4″ toolbar=”true” title=”Create a Pandas dataframe with the top 5,000 names”]
word_counts_df = pd.DataFrame(word_counts.most_common(5000), columns = ['words', 'count'])
#view a subset
word_counts_b = word_counts_df[1:40]
word_counts_df
[/css]
And now, I can work on analyzing these words or take the entire dataset for further analysis.
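If you want to hand the counts off for that further analysis, one option is to write the dataframe back out, for example to CSV or, using the ExcelWriter imported at the top, to Excel. A quick sketch (the file paths are placeholders):
[css gutter="true" tabsize="4" toolbar="true" title="Sketch: export the word counts for further analysis"]
#example only: the output paths below are placeholders
word_counts_df.to_csv("c:/users/chris/covid_word_counts.csv", index=False)
#or write to Excel with the ExcelWriter imported earlier (requires an Excel engine such as openpyxl)
with ExcelWriter("c:/users/chris/covid_word_counts.xlsx") as writer:
    word_counts_df.to_excel(writer, sheet_name="top_terms", index=False)
[/css]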