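I have a set of large arrays (about 6 million elements each), and I basically want to perform an np.digitize, but over multiple axes. I'm looking for suggestions on how to do this efficiently and on how to store the results.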

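I need all the indices (or all the values, or a mask) of array A where the values of array B fall in one range, the values of array C fall in another range, and the values of D fall in yet another. I want the values, indices, or mask so that I can compute some as-yet-undecided statistics on the values of A in each bin. I also need the number of elements in each bin, but len() can take care of that.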
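One way to make this requirement concrete is to collapse the per-axis bin indices into a single flat bin id. The sketch below is not from the question: it assumes small hypothetical bin counts, and the names `flat`, `shape`, `counts` and `sums` are illustrative. It digitizes against the interior edges only, so every element gets a 0-based index in a valid bin, then uses `np.ravel_multi_index` and `np.bincount` to get per-bin counts and sums without building one mask per bin.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
A = rng.random(n)
B = (rng.random(n) + 10) * 20
C = (rng.random(n) + 20) * 40
D = (rng.random(n) + 80) * 80

# Hypothetical bin edges: 4, 5 and 6 bins along B, C and D.
B_edges = np.linspace(B.min(), B.max(), 5)
C_edges = np.linspace(C.min(), C.max(), 6)
D_edges = np.linspace(D.min(), D.max(), 7)

# Digitizing against the interior edges gives 0-based indices in
# [0, nbins - 1]; the maximum value lands in the last bin.
b_idx = np.digitize(B, B_edges[1:-1])
c_idx = np.digitize(C, C_edges[1:-1])
d_idx = np.digitize(D, D_edges[1:-1])

# One flat bin id per element instead of three separate indices.
shape = (len(B_edges) - 1, len(C_edges) - 1, len(D_edges) - 1)
flat = np.ravel_multi_index((b_idx, c_idx, d_idx), shape)

# Number of elements and sum of A in every (B, C, D) bin.
n_bins = int(np.prod(shape))
counts = np.bincount(flat, minlength=n_bins).reshape(shape)
sums = np.bincount(flat, weights=A, minlength=n_bins).reshape(shape)

# The values of A in one particular bin, via an explicit mask.
mask = (b_idx == 1) & (c_idx == 2) & (d_idx == 3)
values = A[mask]
```

Here `counts[1, 2, 3]` equals `len(values)`, and `sums / counts` would give the per-bin mean wherever a bin is non-empty.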

Here is an example that seems reasonable to me:

import itertools

import numpy as np

n = 10_000  # the size must be an integer, not the float 1e4
A = np.random.random_sample(n)
B = (np.random.random_sample(n) + 10)*20
C = (np.random.random_sample(n) + 20)*40
D = (np.random.random_sample(n) + 80)*80

# make the edges of the bins
Bbins = np.linspace(B.min(), B.max(), 10)
Cbins = np.linspace(C.min(), C.max(), 12) # note different number
Dbins = np.linspace(D.min(), D.max(), 24) # note different number

# bin index of every element along each axis
B_Bidx = np.digitize(B, Bbins)
C_Cidx = np.digitize(C, Cbins)
D_Didx = np.digitize(D, Dbins)

# collect the A values for every (B, C, D) bin combination
a_bins = []
for bb, cc, dd in itertools.product(np.unique(B_Bidx),
                                    np.unique(C_Cidx),
                                    np.unique(D_Didx)):
    # combine the masks with &; np.bitwise_and only takes two inputs
    mask = (B_Bidx == bb) & (C_Cidx == cc) & (D_Didx == dd)
    a_bins.append([(bb, cc, dd), [A[mask]]])

However, this makes me nervous that I will run out of memory on the large arrays.
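If memory is the concern, one option is to avoid storing a mask or a copy of the data per bin and keep a single sorted index array instead. This is a sketch under assumptions, not the approach used in the question: it assumes each element already has one flat bin id (the `flat` array here is random stand-in data), and `indices_in_bin` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
A = rng.random(n)

# Stand-in for a flat bin id per element (hypothetical: 24 bins).
n_bins = 24
flat = rng.integers(0, n_bins, size=n)

# Sort positions by bin id once; a stable sort keeps the original
# order of the elements inside each bin.
order = np.argsort(flat, kind="stable")

# Each bin occupies one contiguous slice of `order`.
counts = np.bincount(flat, minlength=n_bins)
starts = np.concatenate(([0], np.cumsum(counts)[:-1]))

def indices_in_bin(k):
    """Return the positions in A that fall into flat bin k."""
    return order[starts[k]:starts[k] + counts[k]]

idx = indices_in_bin(5)
values = A[idx]
```

The extra storage is one integer array of length n plus two arrays of length `n_bins`, regardless of how many bins there are, and `indices_in_bin` returns a view rather than a copy.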

I could also do it this way:

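# one boolean column per bin; the builtin bool works across NumPy versions
b_inds = np.empty((len(A), 10), dtype=bool)
c_inds = np.empty((len(A), 12), dtype=bool)
d_inds = np.empty((len(A), 24), dtype=bool)

# note: np.digitize returns 1-based indices for in-range values, so column 0
# stays all False and the maximum (index 10, 12 or 24) is not captured here
for i in range(10):
    b_inds[:,i] = B_Bidx == i  # comparison (==), not assignment (=)
for i in range(12):
    c_inds[:,i] = C_Cidx == i
for i in range(24):
    d_inds[:,i] = D_Didx == i

# get the A data for the 1,2,3 B,C,D bin
print(A[b_inds[:,1] & c_inds[:,2] & d_inds[:,3]])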

At least here the output has a known and constant size.
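To put a rough number on that constant size, here is a back-of-the-envelope calculation. It assumes the 6 million elements mentioned at the top, the 10, 12 and 24 columns used above, and 8-byte integer indices for the comparison, which is an assumption about a 64-bit platform.

```python
import numpy as np

n = 6_000_000                 # elements per array, as stated in the question
bins_per_axis = (10, 12, 24)  # columns of b_inds, c_inds, d_inds

# Boolean arrays use one byte per element.
bool_bytes = n * sum(bins_per_axis) * np.dtype(bool).itemsize

# For comparison: three integer index arrays, assuming 8-byte integers.
index_bytes = n * len(bins_per_axis) * np.dtype(np.int64).itemsize

print(bool_bytes / 1e6, "MB for the boolean arrays")
print(index_bytes / 1e6, "MB for the three index arrays")
```

Under these assumptions, the boolean arrays take roughly twice the memory of the three digitized index arrays they are derived from.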

Does anyone have better ideas on how to do this in a smarter way? Or is any clarification needed?

Based on the answer by HYRY, this is the path I decided to take.

import numpy as np
import pandas as pd

np.random.seed(42)

n = 10_000_000  # the size must be an integer, not the float 1e7
A = np.random.random_sample(n)
B = (np.random.random_sample(n) + 10)*20
C = (np.random.random_sample(n) + 20)*40
D = (np.random.random_sample(n) + 80)*80

# make the edges of the bins we want
Bbins = np.linspace(B.min(), B.max(), 9)
Cbins = np.linspace(C.min(), C.max(), 10) # note different number
Dbins = np.linspace(D.min(), D.max(), 11) # note different number

sA = pd.Series(A)
cB = pd.cut(B, Bbins, include_lowest=True)
cC = pd.cut(C, Cbins, include_lowest=True)
cD = pd.cut(D, Dbins, include_lowest=True)

# .codes holds the integer bin index of each element
# (older pandas versions called this attribute .labels)
dat = pd.DataFrame({'A':A, 'cB':cB.codes, 'cC':cC.codes, 'cD':cD.codes})

g = sA.groupby([cB.codes, cC.codes, cD.codes]).indices

# this then gives all the indices that match the group
print(g[0,1,2])

# this is all the array A data for that B,C,D bin
# (the indices are positional, hence .iloc)
print(sA.iloc[g[0,1,2]])

This method seems fast even for large arrays.
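Once the statistics are decided, the same grouping can compute them directly rather than going through the index arrays. The following is a minimal sketch on a smaller two-axis example, with illustrative sizes and names (`df`, `stats`). It assumes a pandas version where `pd.cut` on a NumPy array returns a Categorical with a `.codes` attribute.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100_000
A = rng.random(n)
B = (rng.random(n) + 10) * 20
C = (rng.random(n) + 20) * 40

# Bin edges: 8 bins along B, 9 bins along C.
Bbins = np.linspace(B.min(), B.max(), 9)
Cbins = np.linspace(C.min(), C.max(), 10)

cB = pd.cut(B, Bbins, include_lowest=True)
cC = pd.cut(C, Cbins, include_lowest=True)

# Group A by the integer bin codes and aggregate in one pass.
df = pd.DataFrame({"A": A, "cB": cB.codes, "cC": cC.codes})
stats = df.groupby(["cB", "cC"])["A"].agg(["count", "mean", "std"])

# Statistics for the (0, 1) bin.
print(stats.loc[(0, 1)])
```

`stats` has one row per populated (cB, cC) bin, so the element count per bin comes out of the same call as the other statistics.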
