Posts (Latest 10 updated) : Read all

Link List (Edit):
Contents:
  1. Working with Pandas is a gigantic pain in the ass!
    1. Cheat Sheet
#  ____        ____                 _
# |  _ \ _   _|  _ \ __ _ _ __   __| | __ _ ___
# | |_) | | | | |_) / _` | '_ \ / _` |/ _` / __|
# |  __/| |_| |  __/ (_| | | | | (_| | (_| \__ \
# |_|    \__, |_|   \__,_|_| |_|\__,_|\__,_|___/
#        |___/
#

Working with Pandas is a gigantic pain in the ass!

Make no mistake about it, python pandas is a brilliant feat of programming genius. The pandas library provides a robust datastructure that allows processing massive amounts of data efficiently and with great speed. The Pandas library also allows programmers to convert massive amounts of data into nearly every datastructure available. It can read directly from csv files, json files, databases, and whatever other type of data storage you can thing of. It is without a doubt a zenith of modern pythonistic engineering.

Unfortunately, due to it’s abstract and robust nature, it’s a tremendous pain in the ass to work with. There are simply too many methods and required parameters to keep up with unless you work with it on a daily basis. The Pandas library also has it’s own way of doing things, which may or may not jive with how you do things. Then there are a number of informal “hacks” not mentioned in the docs, on how to get common tasks accomplished. It is brilliant, yes, but so was the first land mine.

Cheat Sheet

  1. Create Dataframe: dataframe = pd.DataFrame(record, columns = ['$COLUMN_NAMES'])
  2. Select Row by Condition: $DATAFRAME_NAME[$DATAFRAME_NAME['$COLUMN_NAME'] == $VALUE] or $DATAFRAME_NAME.loc[$DATAFRAME_NAME['$COLUMN_NAME'] == $VALUE]
  3. Test for equality: $DATAFRAME_NAME.$PROPERTY.equals($TEST_VALUE)
  4. Add column headers to pre-existing dataframe: $DATAFRAME_NAME.columns = ['$COLUMN_NAMES']
  5. Find the most common occurring value in a column:
     df = pd.DataFrame({'care': 98, 'robert': 43, 'shit': 100}, columns=['word', 'score'])
     grouped = df.groupby('word')
     word_rep = grouped['word'].agg(lambda x: x.mode().iloc[0])
    
  6. find the maximum value of a column: df.loc[df['score'].idxmax()]
  7. Get first item from series: ‘<$Series>.item()’
  8. Create series from dataframe: df.<$COLUMN_LABEL> or df['<$COLUMN_LABEL>']

Adding a row to a dataframe

Instead of adding a single row to a dataframe, it is best to add your data in bulk. Often, the common way to do this would be to structure your data into a dict() if not already so, and then lst.append() your data to a lst(). Once you have run all iterations, you would then simply create a dataframe containing all your data, and if so desired, concat your new dataframe with the dataframe of your choosing.

Concat dataframes

Concatenation of dataframes is fairly easy, and solely dependent on how you desire the concatenation to be performed.

Row Concatenation
pd.concat([df1, df2])
Column Concatenation
pd.concat([df1, df2], axis=1)