Still More Stream-Mining Frequent Itemsets

Still More Stream-Mining

Counting Items

Extensions

Approaches

Elephants and Troops

Experimental Setup

After Each Night’s Processing . . .

After a Week . . .

Mining Streams Vs. Mining DB’s (New Topic)

Stationarity

Some Options for Frequent Itemsets

Exponentially Decaying Windows

Example: Counting Items

Counting Items --- (2)

Extension to Larger Itemsets*

Initiation of New Counts

How Many Counts?

Dostları ilə paylaş:

Still More Stream-Mining Frequent Itemsets

Still More Stream-Mining

Frequent Itemsets

Elephants and Troops

Exponentially Decaying Windows

Counting Items

Problem: given a stream, which items appear more than s times in the window?

Possible solution: think of the stream of baskets as one binary stream per item.

Extensions

In principle, you could count frequent pairs or even larger sets the same way.

Drawbacks:

Approaches

“Elephants and troops”: a heuristic way to converge on unusually strongly connected itemsets.

Exponentially decaying windows: a heuristic for selecting likely frequent itemsets.

Elephants and Troops

When Sergey Brin wasn’t worrying about Google, he tried the following experiment.

Goal: find unusually correlated sets of words.

Experimental Setup

The data was an early Google crawl of the Stanford Web.

Each night, the data would be streamed to a process that counted a preselected collection of itemsets.

After Each Night’s Processing . . .

Find the most correlated sets counted.

Construct a new collection of itemsets to count the next night.

After a Week . . .

The pair {“elephants”, “troops”} came up as the big winner.

Why? It turns out that Stanford students were playing a Punic-War simulation game internationally, where moves were sent by Web pages.

Mining Streams Vs. Mining DB’s (New Topic)

Unlike mining databases, mining streams doesn’t have a fixed answer.

We’re really mining in the “Stat” point of view, e.g., “Which itemsets are frequent in the underlying model that generates the stream?”

Stationarity

Two different assumptions make a big difference.

Some Options for Frequent Itemsets

We could:

Exponentially Decaying Windows

If stream is a1, a2,… and we are taking the sum of the stream, take the answer at time t to be: Σi = 1,2,…,t ai e -c (t-i).

c is a constant, presumably tiny, like 10-6 or 10-9.

Example: Counting Items

If each ai is an “item” we can compute the characteristic function of each possible item x as an E.D.W.

That is: Σi = 1,2,…,t δi e -c (t-i), where δi = 1 if ai = x, and 0 otherwise.

Counting Items --- (2)

Suppose we want to find those items of weight at least ½.

Important property: sum over all weights is e c/(e c – 1) or very close to 1/c.

Thus: at most 2/c items have weight at least ½.

Extension to Larger Itemsets*

Count (some) itemsets in an E.D.W.

When a basket B comes in:

Initiation of New Counts

Start a count for an itemset S ⊆B if every proper subset of S had a count prior to arrival of basket B.

Example: Start counting {i, j } iff both i and j were counted prior to seeing B.

Example: Start counting {i, j, k } iff {i, j }, {i, k }, and {j, k } were all counted prior to seeing B.

How Many Counts?

Counts for single items = (2/c ) times the average number of items in a basket.

Counts for larger itemsets = ??. But we are conservative about starting counts of large sets.