Handy Python Code Snippets
For those moments when you know you've done something before but can't quickly find it, here is some quick Python code that has served me well.
Use Variable Filenames
# variables
keyword = 'term_to_use'
timeframe = '2020'
folder_in = 'raw_data/'
folder_out = 'processed/'
# Open file (assumes pandas is imported as pd and one_file holds a file name without its extension):
df3 = pd.read_csv("C:/users/" + folder_in + "all_raw_files/" + one_file + ".csv", low_memory=False)
Write to file: outputfile3 = open("C:/users/" + folder_out + subfolder + keyword + timeframe + "_OUT" + ".csv", 'w', encoding="utf-8", newline='') # subfolder must be defined and end with "/"; if you write with df.to_csv() instead, add "index=False" to eliminate the index column
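The same idea can be sketched with pathlib, which inserts the separators itself so a missing slash cannot glue two folder names together. This is only a sketch: base_dir, subfolder and the two sample rows are hypothetical, and a temporary directory stands in for "C:/users/" so it runs anywhere.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical values; a temporary directory stands in for "C:/users/"
base_dir = Path(tempfile.mkdtemp())
folder_out = "processed"
subfolder = "reports"  # assumed name, not from the snippet above
keyword = "term_to_use"
timeframe = "2020"

# The "/" operator joins path parts with the right separator
out_dir = base_dir / folder_out / subfolder
out_dir.mkdir(parents=True, exist_ok=True)
out_path = out_dir / f"{keyword}{timeframe}_OUT.csv"

# newline="" stops the csv module from writing blank rows on Windows
with open(out_path, "w", encoding="utf-8", newline="") as outputfile:
    writer = csv.writer(outputfile)
    writer.writerow(["term", "count"])
    writer.writerow([keyword, 1])

print(out_path.name)
```

Using a with block also closes the file automatically, which the bare open() call does not.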
Compare Two Data Frames
# Get all different values
diff_in_df = pd.merge(df1, df2, on="optional_columnname", how='outer', indicator='Exist')
diff_in_df = diff_in_df.loc[diff_in_df['Exist'] != 'both']
diff_in_df.info(verbose=True)
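To see what the indicator column does, here is a small worked example. The two frames and their values are made up for illustration. Rows found in only one frame are tagged left_only or right_only, and the filter drops everything tagged both.

```python
import pandas as pd

# Made-up sample frames: ids 2 and 3 appear in both, 1 and 4 in only one
df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "name": ["b", "c", "d"]})

# indicator="Exist" adds a column saying where each row came from
merged = pd.merge(df1, df2, on=["id", "name"], how="outer", indicator="Exist")

# Keep only the rows that are NOT present in both frames
diff_in_df = merged.loc[merged["Exist"] != "both"]
print(diff_in_df)
```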
Create New Data Frame using Column Names
df2 = df[["name", "id"]].copy()
Rename all column names: df2.columns = ["col1_new_name", "col2_new_name2"]
Rename one column: df.rename(columns={'index': 'ranking'}, inplace=True)
Or just drop specific columns: df2 = df2.drop(columns=['column_to_dropA', 'column_to_dropB'])
Drop duplicates in a specific column: df.drop_duplicates(subset=['col_name'], keep="first", inplace=True)
Get shape of df: df.shape
Get type: type(df)
Text Cleaning
Remove Characters and Pop String (leaves NO spaces)
import string
string_to_clean = "STr1ing 2x space @Some4weird_ch^r-2021"
#string_to_clean = str(df['term'])
# Set up a translation table that removes ONLY the digits and leaves no spaces where they were
translation_table = str.maketrans('', '', string.digits)
clean_string = string_to_clean.translate(translation_table)
print(clean_string)
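The third argument of str.maketrans is simply the set of characters to delete, so the same approach extends to punctuation. A minimal sketch with a made-up sample string:

```python
import string

sample = "abc123 def-456!"

# Delete digits only
digits_table = str.maketrans("", "", string.digits)
no_digits = sample.translate(digits_table)

# Delete digits and ASCII punctuation in one pass
both_table = str.maketrans("", "", string.digits + string.punctuation)
no_digits_punct = sample.translate(both_table)

print(no_digits)        # abc def-!
print(no_digits_punct)  # abc def
```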
String Split – keep last item
df['stringA'] = df.column_name.str.rsplit('.').str.get(-1)
Drop words that are shorter than x characters: df2 = df['col_name_with_term'].str.findall(r'\w{x,}').str.join(' ') # replace x with the desired minimum length, e.g. r'\w{4,}'
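Here is a worked example of the short-word filter with the minimum length held in a variable. The sample sentences and the name min_len are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col_name_with_term": ["the quick brown fox", "a big elephant"]})

min_len = 4  # hypothetical threshold: keep words of 4 or more characters
pattern = rf"\w{{{min_len},}}"  # builds the regex \w{4,}

# findall returns a list of matching words per row; join glues them back together
df2 = df["col_name_with_term"].str.findall(pattern).str.join(" ")
print(df2.tolist())  # ['quick brown', 'elephant']
```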
Drop Duplicate Terms in a Column
df2 = df.drop_duplicates(subset='col_name')
Remove Custom Words
removewords = {'goofyword1', 'omggoofyword2', 'supergoofyword3'}
df2 = df[~df.term.isin(removewords)]
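Note that isin matches whole cell values exactly, including case. The sketch below uses a made-up frame to show that behaviour, plus a case-insensitive variant that lower-cases the column before the test.

```python
import pandas as pd

removewords = {"goofyword1", "omggoofyword2", "supergoofyword3"}
df = pd.DataFrame({"term": ["apple", "goofyword1", "Banana", "OMGgoofyword2"]})

# Exact match: "OMGgoofyword2" survives because its case differs
exact = df[~df["term"].isin(removewords)]

# Case-insensitive variant: lower-case the column before the membership test
relaxed = df[~df["term"].str.lower().isin(removewords)]

print(exact["term"].tolist())    # ['apple', 'Banana', 'OMGgoofyword2']
print(relaxed["term"].tolist())  # ['apple', 'Banana']
```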
Clean Strings using RegEx
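pattern = "[,.;@#?!&$0-9-]+" # remove punctuation and digits; the hyphen goes last so it is read literally rather than as a range
pattern2 = r'(?<!\d)[.,;:-](?!\d)' # second option: remove punctuation only when it is not next to a digit
df['term2'] = df['term'].str.replace(pattern, ' ', regex=True)
df['term2']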
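Hyphens inside a character class are easy to get wrong, so here is a small check using the standard re module. It shows that a backwards range such as "?-!" is rejected, then runs two patterns with the hyphen moved to the end of the class. The sample strings and the helper clean() are made up for illustration.

```python
import re

# A hyphen between two characters is read as a range; "?-!" runs backwards, so it fails
try:
    re.compile("[,.;@#?-!&$0123456789]+")
    range_error = None
except re.error as exc:
    range_error = str(exc)

pattern = r"[,.;@#?!&$0-9-]+"      # punctuation and digits, hyphen last
pattern2 = r"(?<!\d)[.,;:-](?!\d)"  # punctuation that is not next to a digit

def clean(text, regex):
    """Replace every match with a space, then collapse repeated whitespace."""
    return " ".join(re.sub(regex, " ", text).split())

print(range_error)
print(clean("cost $45, ok?! e-mail", pattern))   # cost ok e mail
print(clean("Total: 3.14 approx.", pattern2))     # Total 3.14 approx
```

The second pattern leaves "3.14" intact because the lookbehind sees a digit before the dot.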
Process Time
# timeit is part of the standard library, so no pip install is needed
import timeit
tic = timeit.default_timer()
<code to process and time>
toc = timeit.default_timer()
toc - tic # elapsed time in seconds (ns = nanoseconds, 10^-9 s; µs = microseconds, 10^-6 s)
print(toc - tic)
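If you time things often, the tic/toc pair can be wrapped in a small context manager. This is a sketch, not part of timeit: the name timed and the sample workload are made up, and it uses time.perf_counter, the same clock that timeit.default_timer points to.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Time the body of a with block and store the result in a dict."""
    result = {}
    tic = time.perf_counter()
    try:
        yield result
    finally:
        result["seconds"] = time.perf_counter() - tic
        print(f"{label}: {result['seconds']:.6f} s")

# Sample workload to time
with timed("sum of squares") as t:
    total = sum(i * i for i in range(100_000))
```

After the block finishes, t["seconds"] holds the elapsed time for later use.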
Quick Table
#!pip install sidetable
import sidetable
frame.stb.freq(['category_name', 'cat2'], style=True) # other options: value='numeric_variable', thresh=5
table = frame.stb.freq(['category_name', 'cat2'], style=False) # style=False returns a plain DataFrame that can be filtered
table = table[table['category_name'].apply(lambda x: len(x) > 1)] # keep rows where 'category_name' has more than one character
Dates
Format Date/Time
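frame['ddate'] = pd.to_datetime(frame['ddate']) # infer_datetime_format=True is deprecated as of pandas 2.0, where format inference is the default
import numpy as np
frame['r_date'] = np.datetime64('2021-01-12')
frame['days'] = frame['r_date'] - frame['ddate']
# keep ddate where the difference is non-zero, otherwise store "0" (an if statement cannot test a whole column at once)
frame['pre_date'] = frame['ddate'].astype(object).where(frame['days'].dt.total_seconds() != 0, "0")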
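A small worked example of the date difference, using made-up dates. The reference date is held in a hypothetical variable r_date, and .dt.days turns the resulting timedeltas into whole days.

```python
import pandas as pd

# Made-up sample dates stored as text, as they usually arrive from a CSV
frame = pd.DataFrame({"ddate": ["2021-01-10", "2021-01-12", "2020-12-31"]})
frame["ddate"] = pd.to_datetime(frame["ddate"])

r_date = pd.Timestamp("2021-01-12")  # reference date

# Subtracting a column from a timestamp gives timedeltas; .dt.days extracts whole days
frame["days"] = (r_date - frame["ddate"]).dt.days

# A boolean mask selects the rows whose difference is not zero
non_zero = frame.loc[frame["days"] != 0]
print(frame["days"].tolist())  # [2, 0, 12]
```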
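# combine all words into one big long string
import spacy # nlp() and doc.ents come from spaCy, not nltk
nlp = spacy.load("en_core_web_sm") # assumes this English model is installed; any spaCy pipeline with an NER component works
text_combined = str(text)
doc = nlp(text_combined)
for ent in doc.ents:
    print(ent.text, ent.label_)
text_combined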
Read a CSV file as a dictionary object
import csv
with open('C:/users/…filename.csv', newline='') as myFile:
    reader = csv.DictReader(myFile)
    for row in reader:
        print(row['term']) # change 'term' to the column(s) you want to read
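To try DictReader without a file on disk, an in-memory buffer works the same way. The sample data below is made up. Each row comes back as a dictionary keyed by the header line, and every value is a string.

```python
import csv
import io

# io.StringIO stands in for an open file; the content is made-up sample data
sample = io.StringIO("term,count\npython,3\npandas,5\n")

reader = csv.DictReader(sample)
rows = list(reader)

terms = [row["term"] for row in rows]
counts = [int(row["count"]) for row in rows]  # values are strings until converted

print(reader.fieldnames)  # ['term', 'count']
print(terms)              # ['python', 'pandas']
```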
Build a Dataframe
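import pandas as pd
# splitted and evaluated_s are two lists of equal length
rows = []
for wo, simi in zip(splitted, evaluated_s):
    rows.append({'word': wo, 'sim': simi})
df = pd.DataFrame(rows, columns=['word', 'sim']) # DataFrame.append was removed in pandas 2.0, so collect the rows first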
With CSV File:
import pandas as pd
c = pd.read_csv("C:/filename.csv", index_col=False)
print(c.dtypes) # view datatypes
c = c.drop(columns=['Unnamed: 0', 'date']) # drop columns
Setup and use a Virtual Environment
Launch Anaconda Prompt and create the virtual environment named 'HF_VE': conda create -n HF_VE python=3.11 anaconda
Proceed = Y
To activate the virtual environment: conda activate HF_VE
To deactivate: conda deactivate
Change the directory to my project directory: cd followed by the project path
Then launch Jupyter Notebook from the terminal: jupyter notebook
How to determine if you have a GPU?
Install Torch: !pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 -U
import torch
print(torch.cuda.is_available())
# Install GPUtil to find out details of your GPU such as the number, load, temperature
!pip install GPUtil
import GPUtil
GPUtil.getAvailable() # returns a list of available GPU IDs, e.g. [0] on a single-GPU machine
# Get the list of available GPUs
gpus = GPUtil.getGPUs()
for gpu in gpus:
    print("GPU ID:", gpu.id)
    print("GPU Name:", gpu.name)
    print("GPU Utilization:", gpu.load * 100, "%")
    print("GPU Memory Utilization:", gpu.memoryUtil * 100, "%")
    print("GPU Temperature:", gpu.temperature, "C")
    print("GPU Total Memory:", gpu.memoryTotal, "MB")
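If you are not sure whether torch is even installed, the check can be guarded so it never raises an ImportError. This is a sketch: the function name describe_gpu and its messages are made up, and it relies only on torch.cuda.is_available, torch.cuda.device_count and torch.cuda.get_device_name.

```python
import importlib.util

def describe_gpu():
    """Return a short description of CUDA availability, or a note if torch is missing."""
    # find_spec looks for the package without importing it
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch
    if not torch.cuda.is_available():
        return "torch found, but no CUDA GPU is available"
    names = [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]
    return "CUDA GPUs: " + ", ".join(names)

status = describe_gpu()
print(status)
```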