Python Code Snippets

Handy Python Code Snippets

For when those moments arise and you know you’ve done something before and yet can’t seem to quickly find it, some quick Python Code that has served me well.

Use Variable Filenames

#variables
keyword = ‘term_to_use’
timeframe = ‘2020’
folder_in = ‘raw_data/’
folder_out = ‘processed’

#Open file:
df3 = pd.read_csv (“C:/users/” + ” folder_in + “all_raw_files/” + one_file + “.csv”, low_memory=False)

Write to file: outputfile3 = open(“C:/users/” + folder_out + subfolder + keyword + timeframe +”_OUT” + “.csv”, ‘w’, encoding=”utf-8″, newline=”) #add “index=False” to eliminate index column

Compare Two Data Frames

# Get all different values
diff_in_df = pd.merge(df1, df2, on=”optional_columnname”, how=’outer’, indicator=’Exist’)
diff_in_df = diff_in_df.loc[diff_in_df[‘Exist’] != ‘both’]
diff_in_df.info(verbose=True)

Create New Data Frame using Column Names

df2 = df[[“name”, “id”]].copy()
Rename all column names: df2.columns = [“col1_new_name”, “col2_new_name2”]
Rename one column: df.rename(columns = {‘index’: ‘ranking’}, inplace=True)

Or just drop specific columns: df2 = df2.drop(columns = [‘column_to_dropA’, ‘column_to_dropB’])

Drop duplicates in specific column: df.drop_duplicates(subset = [‘col_name’], keep = “first”, inplace = True)

Get shape of dt: df.shape

Get type: type(df)

Text Cleaning

Remove Characters and Pop String (leaves NO spaces)

import string
string_to_clean = “STr1ing 2x space @Some4weird_ch^r-2021″
#string_to_clean = str(df[‘term’])

# Translation table which removes ONLY all digits and leaves no spaces between the characters
# setup the translation table that removes all numbers
translation_table = str.maketrans(”, ”, string.digits)

clean_string = string_to_clean.translate(translation_table)
print(clean_string)

String Split – keep last item
df[‘stringA’] = df.column_name.str.rsplit(‘.’).str.get(-1)

Drop text than is less than x in character length: df2 = df[‘col_name_with_term’].str.findall(‘\w{x,}’).str.join(‘ ‘) #change x to desired value

Drop Duplicate Terms in a Column

#df2 = df.drop_duplicates(‘col_name’)

Remove Custom Words

removewords = {‘goofyword1’, ‘omggoofyword2’, ‘supergoofyword3’ }
df2 = df[~df.term.isin(removewords)]

Clean Strings using RegEx

pattern = “[,.;@#?-!&$0123456789]+” #remove numeric values
pattern2 = ‘(?<!\d)[.,-;:](?!\d)’ #second option
df[‘term2’]= df[‘term’].str.replace(pattern,’ ‘, regex = True)
df[‘term2’]

Process Time

#pip install pytest-timeit
import timeit
tic = timeit.default_timer()
<code to process and time>
toc = timeit.default_timer()
toc – tic #elapsed time in seconds (ns = nanoseconds (10^-9) µs is micro seconds (10^-6)
print (toc – tic)

Quick Table

#!pip install sidetable
import sidetable
frame.stb.freq([‘category_name’, ‘cat2′], style=True) #, value=’numeric_variable’, thresh=5, #other options: “value_counts, freq

table = table[tabl[‘category_name’].apply(lambda x: len(x)>1)] #to filter the table with variable > 1 in column ‘category_name’ but can only be done with ‘style=False’
note: tabl should be table but WP will create a table with the code.

Dates

Format Date/Time

pd.to_datetime(df[‘ddate’], infer_datetime_format=True)

import datetime
frame[‘r_date’] = np.datetime64(‘2021-01-12’)
frame[‘days’] = r_date – frame[‘ddate’]
if frame[‘days’].dt.total_seconds !=0:
frame[‘pre_date’] = frame[‘ddate’]
else:
“0”

# combine all words into one big long string
import nltk
text_combined = str(text)
doc = nlp(text_combined)

for ent in doc.ents:
print(ent.text,ent.label_)
text_combined

Read a CSV file as a dictionary object

import csv
with open(‘C:/users/…filename.csv’) as myFile:

reader = csv.DictReader(myFile)
for row in reader:

print(row[‘term’]) #update the variable for the column(s) you want to import

Build a Dataframe

import pandas as pd

df = pd.DataFrame({‘w’: [], ‘sim’: []})

for wo, simi in zip(splitted, evaluated_s):

df = df.append({‘word’: wo, ‘sim’: evaluated_s}, ignore_index=True)

With CSV File:

import pandas as pd
c = pd.read_csv(“C:/filename.csv”, index_col=False)
print(c.dtypes) #view datatypes

c = c.drop(columns = [‘Unnamed: 0’, ‘date’) #drop columns

Setup and use a Virtual Environment

Launch Anaconda Prompt: conda create -n HF_VE python=3.11 anaconda (Creating the virtual named ‘HF_VE’)

Proceed = Y

To Activate the virtual environment: conda activate HF_VE

To Deactivate: Conda deactivate

Then launch Jupyter notebook from the terminal: jupyter notebook

Change the directory to my project directory: Jupyter notebook

How to determine if you have a GPU?

Install Torch: !pip3 install torch torchvision torchaudio –index-url https://download.pytorch.org/whl/cu121 -U

import torch
print(torch.cuda.is_available())

#Install GPUtil to find out details of your GPU such as the number, load, temperature
!pip install GPUtil

#note it should show 0 as a result
import GPUtil
GPUtil.getAvailable()

#Get the list of available GPUs
gpus = GPUtil.getGPUs()

for gpu in gpus:
print(“GPU ID:”, gpu.id)
print(“GPU Name:”, gpu.name)
print(“GPU Utilization:”, gpu.load * 100, “%”)
print(“GPU Memory Utilization:”, gpu.memoryUtil * 100, “%”)
print(“GPU Temperature:”, gpu.temperature, “C”)
print(“GPU Total Memory:”, gpu.memoryTotal,”MB”)