I have a data frame whose rows I want to aggregate by the Location column (US states).
The Location column has the following format:
        Location
    0   Texas, USA
    1   Middle of nowhere
    2   NaN
    3   Largo, Florida
    4   NaN
    5   Indiana
    6   Upstate NY
    7   People's Republic of Chicago
    8   South Florida, USA
    9   Texas, USA
    10  NaN
    11  NaN
    12  Cardiff, Wales
    13  NaN
    14  Long Beach CA
    15  Texas
    16  NaN
    17  WithLove StandingWithIsrael
    18  Suffolk , Lake Ronkonkoma , NY
    19  Illinois, USA
All tweets that do not belong to a US location, such as "Middle of nowhere", "WithLove StandingWithIsrael" and NaN, will be treated as missing values.
The real problem arises when filtering tweets by location, because the values do not follow a standard format. For example, tweets from Texas appear as "Texas USA", "Tx", "Texas" or "Austin, Texas". How can I normalize the location so that it is easier to filter by US state? Any help would be greatly appreciated.
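One possible starting point, sketched in plain Python under several assumptions: the lookup table `STATE_ABBREV` below is a deliberately partial, hypothetical mapping (it would need all 50 states), and the helper name `normalize_state` is made up for illustration. The idea is to look for a full state name anywhere in the string first, then fall back to a two-letter abbreviation in the last token, and return `None` for anything unrecognised so it can be treated as missing.

```python
import re

# Partial, illustrative lookup table (an assumption; extend to all 50 states).
STATE_ABBREV = {
    "california": "CA",
    "florida": "FL",
    "illinois": "IL",
    "indiana": "IN",
    "new york": "NY",
    "texas": "TX",
}
ABBREVS = set(STATE_ABBREV.values())

# Longest names first, so that once the table is complete a name such as
# "west virginia" is tried before "virginia".
NAME_RE = re.compile(
    r"\b("
    + "|".join(sorted(map(re.escape, STATE_ABBREV), key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def normalize_state(location):
    """Return a two-letter state code, or None when no state is recognised."""
    # NaN (a float) and any other non-string value count as missing.
    if not isinstance(location, str):
        return None

    # 1. A full state name anywhere in the string ("Largo, Florida").
    match = NAME_RE.search(location)
    if match:
        return STATE_ABBREV[match.group(1).lower()]

    # 2. A trailing abbreviation ("Long Beach CA", "Tx"),
    #    ignoring a final "USA" / "US" token.
    tokens = [t for t in re.split(r"[,\s]+", location.strip()) if t]
    while tokens and tokens[-1].upper() in {"USA", "US"}:
        tokens.pop()
    if tokens and tokens[-1].upper() in ABBREVS:
        return tokens[-1].upper()

    return None
```

With pandas, this could be applied as `df["State"] = df["Location"].map(normalize_state)` and followed by `df.groupby("State")`. The sketch has known gaps: city-only values such as "People's Republic of Chicago" stay unmatched unless a city-to-state table is added, and a location that merely ends in a word like "in" or "or" would be misread as a state code once those abbreviations are in the table.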