pandas apply lambda: Pandas Exercises

weixin_39605706 · Published 2020-11-22 21:05:08 · 1081 views

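Anyone who has learned Python and wants to work in data analysis knows that pandas is a library you will use all the time, so it is worth learning well.

After a long search I finally found a suitable set of exercises. They are hosted on GitHub, here:

guipsamora/pandas_exercises (github.com)

If you are interested, give them a try. As the author says: to learn is to do. So unless you practice you won't learn.

The overall structure is clear.

There are 11 exercise chapters in total, ordered from easy to hard, so they are a good way to check your own pandas level.

OK, let's get started.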

Part 1: Getting & Knowing Your Data (Chipotle)

The questions in this chapter are all basic, so I will go through them quickly without elaborating.

Ex2 - Getting and Knowing your Data

This time we are going to pull data directly from the internet. Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

Step 1. Import the necessary libraries

In [ ]:
import pandas as pd

Step 2. Import the dataset from this address.
Step 3. Assign it to a variable called chipo.
In [ ]:
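# These two steps are simple too: load the data from the given address.
# The file is tab-separated, so the separator must be the tab character '\t'.

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'
chipo = pd.read_csv(url, sep='\t')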
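Since chipotle.tsv is tab-separated, the separator passed to read_csv needs to be the tab character '\t'. Here is a minimal offline sketch of what that does, using a made-up two-row sample in place of the real file (the sample text and the name demo are only for illustration):

```python
import io
import pandas as pd

# Hypothetical two-row sample in a tab-separated layout, not the real file
sample = "order_id\tquantity\titem_name\n1\t1\tChips\n2\t2\tCanned Soda\n"

# sep='\t' splits each line on tab characters
demo = pd.read_csv(io.StringIO(sample), sep='\t')

print(demo.shape)
print(list(demo.columns))
```

If the separator is wrong, the whole line usually ends up in a single column, so checking the shape right after loading is a cheap sanity test.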

Step 4. See the first 10 entries

In [ ]:
chipo.head(10)

# Output: the first 10 rows of the chipo dataset

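Step 5. What is the number of observations in the dataset?

# This one is simple too: find the number of rows in the dataset, in two different ways.
In [1]:
# Solution 1
chipo.shape[0]
In [2]:
# Solution 2
chipo.info()

# Both give the number of rows: shape[0] returns it directly, and info() prints it in the "entries" line of its summary.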

Step 6. What is the number of columns in the dataset?

# Essentially the same idea as Step 5: shape[1] is the number of columns
chipo.shape[1]

Step 7. Print the name of all the columns.

chipo.columns

Step 8. How is the dataset indexed?

In [ ]:
# This one is about the index of the dataset
chipo.index

Step 9. Which was the most-ordered item?

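In [ ]:
# Find the item with the largest total quantity ordered.
# Taking max() of each group and then head(1) only returns the alphabetically
# first item, so sum the quantities per item and sort instead.
c = chipo.groupby('item_name')['quantity'].sum()
c.sort_values(ascending=False).head(1)

Step 10. For the most-ordered item, how many items were ordered?

In [ ]:
# Building on Step 9: the summed quantity of the top item is the answer
c = chipo.groupby('item_name')[['quantity']].sum()
c = c.sort_values(['quantity'], ascending=False)
c.head(1)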
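To see the sum-then-rank idea in isolation, here is a minimal sketch on made-up order lines (the data and the names orders, totals, top_item and top_quantity are assumptions for illustration, not part of the exercise):

```python
import pandas as pd

# Made-up order lines for illustration only
orders = pd.DataFrame({
    'item_name': ['Chips', 'Canned Soda', 'Chips', 'Steak Bowl'],
    'quantity': [1, 2, 3, 1],
})

# Total quantity ordered per item
totals = orders.groupby('item_name')['quantity'].sum()

top_item = totals.idxmax()    # name of the most-ordered item
top_quantity = totals.max()   # how many of it were ordered

print(top_item, top_quantity)
```

idxmax and max answer Step 9 and Step 10 in one go, without sorting the whole table.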

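Step 11. What was the most ordered item in the choice_description column?

In [ ]:
# Same approach as Step 10, grouped by choice_description; only quantity needs to be summed
c = chipo.groupby('choice_description')[['quantity']].sum()
c = c.sort_values(['quantity'], ascending=False)
c.head(1)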

Step 12. How many items were ordered in total?

In [ ]:
# Sum the quantity column
total_items_ordered = chipo['quantity'].sum()
total_items_ordered

Step 13. Turn the item price into a float

# First, check what type of data the price column holds.

Step 13.a. Check the item price type

In [ ]:chipo['item_price'].dtype

Step 13.b. Create a lambda function and change the type of item price

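In [ ]:
# The exercise asks for a lambda function to convert the data type.
# x[1:-1] drops the first and last characters of each string:
# the leading '$' and the trailing space.
func = lambda x: float(x[1:-1])
chipo.item_price = chipo.item_price.apply(func)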

Step 13.c. Check the item price type

In [ ]:
chipo.item_price.dtype

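# There is another function worth knowing here: pd.to_numeric converts a whole column at once.
# It cannot parse the '$' prefix, so on the raw strings the '$' would have to be stripped first.
# Run at this point, after Step 13.b, the column is already float, and the call
# only downcasts it to a smaller float type where possible.

chipo.item_price = pd.to_numeric(chipo.item_price,
                                downcast='float')
chipo.item_price.dtype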
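The same conversion can also be written without apply, using the vectorized string methods. This is a minimal sketch on made-up prices in the "$x.xx " format; the Series name prices and its values are assumptions for illustration:

```python
import pandas as pd

# Made-up prices in the "$x.xx " format (note the trailing space)
prices = pd.Series(['$2.39 ', '$3.39 ', '$16.98 '])

# Strip surrounding whitespace, drop the leading '$',
# then convert the whole column to float at once
clean = prices.str.strip().str.lstrip('$').astype(float)

print(clean.dtype)
print(clean.tolist())
```

Unlike the x[1:-1] slice, this does not depend on every value having exactly one trailing character.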

Step 14. How much was the revenue for the period in the dataset?

In [ ]:
# revenue = quantity * price, summed over all rows
revenue = (chipo['quantity'] * chipo['item_price']).sum()
print('Revenue was: $' + str(round(revenue, 2)))

Step 15. How many orders were made in the period?

In [ ]:
orders = chipo.order_id.value_counts().count()
orders
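Counting the entries of value_counts() is one way to count distinct orders; nunique() gives the same number more directly. Here is a small sketch on made-up order ids (the Series and variable names are only for illustration):

```python
import pandas as pd

# Made-up order ids: order 1 has two lines, order 2 has three, order 3 has one
order_id = pd.Series([1, 1, 2, 2, 2, 3])

# Both expressions count the distinct order ids
via_value_counts = order_id.value_counts().count()
via_nunique = order_id.nunique()

print(via_value_counts, via_nunique)
```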

Step 16. What is the average revenue amount per order?

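In [3]:
# Solution 1
# First compute the total revenue
revenue = (chipo['quantity'] * chipo['item_price']).sum()
# and the total number of orders
orders = chipo.order_id.value_counts().count()
avg = revenue/orders
avg
In [4]:
# Solution 2
# groupby can be used directly, but the revenue column has to be created first
chipo['revenue'] = chipo['quantity'] * chipo['item_price']
chipo.groupby(by=['order_id'])['revenue'].sum().mean()
# Group by order id, sum the revenue within each order, then average over the orders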
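To check that the two solutions agree, here is a minimal sketch on three made-up order lines where the answer is easy to verify by hand (the frame lines and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up order lines; item_price is already numeric here
lines = pd.DataFrame({
    'order_id': [1, 1, 2],
    'quantity': [1, 2, 1],
    'item_price': [2.0, 3.0, 10.0],
})

# Revenue per line: 2.0, 6.0 and 10.0
lines['revenue'] = lines['quantity'] * lines['item_price']

# Solution 1: total revenue divided by the number of distinct orders
avg1 = lines['revenue'].sum() / lines['order_id'].nunique()

# Solution 2: revenue per order (8.0 and 10.0), then the mean over orders
avg2 = lines.groupby('order_id')['revenue'].sum().mean()

print(avg1, avg2)
```

Both routes give 9.0 here, because the mean of the per-order sums is the total divided by the number of orders.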

Step 17. How many different items are sold?

In [ ]:

chipo.item_name.value_counts().count()

That wraps up the first part of the first chapter.

On to the next part.

Occupation

Ex3 - Getting and Knowing your Data

This time we are going to pull data directly from the internet. Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

Step 1. Import the necessary libraries

In [ ]:
import pandas as pd


Step 2. Import the dataset from this address.

url = ('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user')
users = pd.read_csv(url,sep='|')

Step 3. Assign it to a variable called users and use the 'user_id' as index

In [ ]:url = ('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user')
users = pd.read_csv(url,sep='|',index_col='user_id')

Step 4. See the first 25 entries

In [ ]:users.head(25)

Step 5. See the last 10 entries

In [ ]:
# tail(10) returns the last 10 rows, in their original order
users.tail(10)

Step 6. What is the number of observations in the dataset?

In [ ]:users.shape[0]

Step 7. What is the number of columns in the dataset?

In [ ]:
users.shape[1]

Step 8. Print the name of all the columns.

In [ ]:print(users.columns)

Step 9. How is the dataset indexed?

In [ ]:users.index

Step 10. What is the data type of each column?

In [ ]:
users.dtypes

Step 11. Print only the occupation column

In [ ]:users.occupation
#or
users['occupation']

Step 12. How many different occupations there are in this dataset?

In [ ]:users.occupation.nunique()

Step 13. What is the most frequent occupation?

In [ ]:
users.occupation.value_counts().head(1)
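value_counts() sorts by count in descending order, which is why head(1) works here. If only the label is needed, idxmax() on the counts returns it directly. Here is a small sketch on made-up occupations (the data and names are assumptions for illustration):

```python
import pandas as pd

# Made-up occupations for illustration
occupation = pd.Series(['student', 'engineer', 'student', 'writer'])

counts = occupation.value_counts()

most_common = counts.idxmax()   # label with the highest count
how_many = counts.max()         # the count itself

print(most_common, how_many)
```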

Step 14. Summarize the DataFrame.

In [ ]:
users.describe()

Step 15. Summarize all the columns

In [ ]:
# By default describe only summarizes numeric columns, so include='all' is needed
users.describe(include='all')
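The difference between the two describe calls is easy to see on a tiny mixed-type frame. This is a sketch on made-up data; the frame people is an assumption, not the users dataset:

```python
import pandas as pd

# Made-up frame with one numeric and one text column
people = pd.DataFrame({
    'age': [24, 53, 23],
    'occupation': ['technician', 'other', 'writer'],
})

numeric_only = people.describe()             # numeric columns only
everything = people.describe(include='all')  # all columns

print(list(numeric_only.columns))
print(list(everything.columns))
```

With include='all', statistics that do not apply to a column (for example the mean of a text column) show up as NaN.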

Step 16. Summarize only the occupation column

In [ ]: users.occupation.describe()

Step 17. What is the mean age of users?

In [ ]:
users.age.mean()

Step 18. What is the age with least occurrence?

In [ ]:

# value_counts() sorts by count in descending order, so the rarest ages are at the end
users.age.value_counts().tail()
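tail() shows the last five counts, which may include ages that are not the rarest, or cut off some that are tied. A hedged alternative is to keep every age whose count equals the minimum, sketched here on made-up ages:

```python
import pandas as pd

# Made-up ages for illustration: 41 and 52 each appear once
ages = pd.Series([20, 20, 20, 30, 30, 41, 52])

counts = ages.value_counts()

# Keep every age whose count equals the smallest count, so ties are all kept
rarest = counts[counts == counts.min()].index.tolist()

print(sorted(rarest))
```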

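Quick summary: getting to know a dataset is a fairly routine process. It mostly comes down to the calls used above: head, tail, count, describe, mean, index, columns, info, nunique and so on. Use them to take a first look at a dataset and understand what it contains.

Now for the second part: filtering and sorting data.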

Filtering and Sorting Data

This time we are going to pull data directly from the internet. Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

Ex1 - Filtering and Sorting Data

Step 1. Import the necessary libraries
In [ ]:


Step 2. Import the dataset from this address.

Step 3. Assign it to a variable called chipo.
In [ ]:
# Already done above, so not repeated here

Step 5. Sort by the name of the item

In [ ]:
chipo.item_name.sort_values()
#or
chipo.sort_values(by='item_name')['item_name']

Step 6. What was the quantity of the most expensive item ordered?

In [ ]:
chipo.sort_values(by = "item_price", ascending = False).head(1)
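Sorting and taking head(1) shows the whole row. To pull out just the quantity, one option is to locate the row with idxmax and read that one cell. Here is a minimal sketch on made-up items (the frame items and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up items for illustration only
items = pd.DataFrame({
    'item_name': ['Chips', 'Steak Bowl', 'Canned Soda'],
    'quantity': [15, 1, 2],
    'item_price': [44.25, 11.75, 2.18],
})

# Index label of the row with the highest price, then read its quantity
row_label = items['item_price'].idxmax()
qty = items.loc[row_label, 'quantity']

print(qty)
```

If several rows share the top price, idxmax returns only the first of them.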

Step 7. How many times were a Veggie Salad Bowl ordered?

In [ ]:chipo_salad = chipo[chipo.item_name == "Veggie Salad Bowl"]

len(chipo_salad)

Step 8. How many times did people order more than one Canned Soda?

In [ ]:
chipo_drink_steak_bowl = chipo[(chipo.item_name == "Canned Soda") & (chipo.quantity > 1)]
len(chipo_drink_steak_bowl)
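The filter above combines two boolean masks with &. Here is a minimal sketch of the same pattern on made-up rows, showing that counting the True values in the mask gives the same answer as taking the length of the filtered frame (the data and names are assumptions for illustration):

```python
import pandas as pd

# Made-up order lines for illustration only
orders = pd.DataFrame({
    'item_name': ['Canned Soda', 'Canned Soda', 'Chips', 'Canned Soda'],
    'quantity': [1, 2, 3, 4],
})

# Each condition needs its own parentheses, because & binds tighter than == and >
mask = (orders.item_name == 'Canned Soda') & (orders.quantity > 1)

n_filtered = len(orders[mask])   # length of the filtered frame
n_mask = int(mask.sum())         # number of True values in the mask

print(n_filtered, n_mask)
```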

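After a full day of typing code, that is enough for now!

If you found this post useful, I would appreciate a like. I write this column mainly to push myself to study every day, but your likes are a great encouragement. Thank you!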
