# Slow data.table subsetting vs dplyr

I have been using the following format of dplyr code

```
group_by(dt, ID) %>%
  filter(any(colY == 1 & colX == 10))
```

To subset a data.table like the following

```
ID    colX  colY
1111  3     1
1111  2     1
1111  6     0
1111  9     0
2222  10    1
2222  3     1
2222  5     0
2222  7     0
3333  8     1
3333  10    1
3333  3     0
3333  2     0
```

into

```
ID    colX  colY
2222  10    1
2222  3     1
2222  5     0
2222  7     0
3333  8     1
3333  10    1
3333  3     0
3333  2     0
```
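For reference, the small example above can be reconstructed and filtered like this (a minimal sketch; the real `dt` has ~900k rows):

```r
library(data.table)
library(dplyr)

# Rebuild the example table from the question
dt <- data.table(
  ID   = rep(c(1111, 2222, 3333), each = 4),
  colX = c(3, 2, 6, 9, 10, 3, 5, 7, 8, 10, 3, 2),
  colY = rep(c(1, 1, 0, 0), times = 3)
)

# Keep every row of any ID group that contains a row
# with colX == 10 and colY == 1
result <- group_by(dt, ID) %>%
  filter(any(colY == 1 & colX == 10))
```

Group `1111` has no row matching the condition, so all of its rows are dropped; groups `2222` and `3333` each have one matching row, so all of their rows are kept.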

Filtering through about 900k rows to get the second table takes about 1.3 seconds.

I have been trying to implement an equivalent data.table subset that would be faster, but so far my attempts have only been slower. The following data.table subset

```
dt[, .SD[any(colY == 1 & colX == 10)], ID]
```

takes ~14 seconds. What is the problem here?

---

This might be faster. It avoids the use of `.SD`, which has to materialize a subset of the table for every group, and uses the row numbers given in `.I` instead.

```
dt[dt[, .I[any(colX == 10 & colY == 1)], by = ID]$V1]
#      ID colX colY
# 1: 2222   10    1
# 2: 2222    3    1
# 3: 2222    5    0
# 4: 2222    7    0
# 5: 3333    8    1
# 6: 3333   10    1
# 7: 3333    3    0
# 8: 3333    2    0
```

The inner data.table call `dt[, .I[any(colX == 10 & colY == 1)], by = ID]$V1` gives us the row numbers of the groups that match our condition: `.I` holds each row's position in the original table, computed per group. We can see the intermediate result by printing it:

```
dt[, print(.I[any(colX == 10 & colY == 1)]), by = ID]
# integer(0)
# [1] 5 6 7 8
# [1]  9 10 11 12
# Empty data.table (0 rows) of 1 col: ID
```

Then we just use that result as a row subset on the original data table.
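Another option, not shown in the answer above but worth sketching, is to skip grouping entirely: first collect the IDs that have at least one matching row, then subset on those IDs. This scans the condition once over the whole table instead of evaluating it per group (the small `dt` below stands in for the real 900k-row table):

```r
library(data.table)

# Hypothetical stand-in for the real data
dt <- data.table(
  ID   = rep(c(1111, 2222, 3333), each = 4),
  colX = c(3, 2, 6, 9, 10, 3, 5, 7, 8, 10, 3, 2),
  colY = rep(c(1, 1, 0, 0), times = 3)
)

# IDs that have at least one row with colX == 10 and colY == 1
ids <- dt[colX == 10 & colY == 1, unique(ID)]

# Keep all rows belonging to those IDs
res <- dt[ID %in% ids]
```

On large tables this kind of two-step subset is often competitive with the `.I` approach, since the condition is vectorized over all rows at once.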