Stop words

Ask a question or request a feature related to the website or forum...

Moderator: scott

scott
Site Admin
Posts: 1409
Joined: Tue Nov 04, 2003 7:05 am
Location: Colorado

Stop words

Post by scott »

Here is the list of stop words that came with phpbb2. A stop word is a word which is filtered out of the search index. I just looked at this list recently and was surprised by quite a few of the words I saw there.

Using stop words improves the quality of search results and dramatically improves performance. The trade-off is this: the more stop words on your list, the more fine-tuned and faster your results will be *as long as* you're not searching for one of the stop words.
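To make the trade-off concrete, here is a minimal sketch of a search index that skips stop words. It is not the phpbb2 code; the names `tokenize` and `build_index`, the three-word stop list, and the sample posts are all assumptions made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical three-word stop list; the real lists in this thread are longer.
STOP_WORDS = {"a", "the", "of"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(posts, stop_words):
    """Map each word that is not a stop word to the ids of posts containing it."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for word in tokenize(text):
            if word not in stop_words:
                index[word].add(post_id)
    return index

# Made-up sample posts.
posts = {
    1: "The weights of the wheel",
    2: "A question about the pulley",
}

index = build_index(posts, STOP_WORDS)   # stop words filtered out
full_index = build_index(posts, set())   # no stop words at all

print(len(index), len(full_index))  # 5 8
print("the" in index)               # False: a search for "the" finds nothing
```

The filtered index is smaller, which is where the speed comes from, but any search for a word on the stop list comes back empty.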

On the other hand, now that I have implemented support for searching on quoted phrases, maybe we should have no stop words at all. I'm afraid, though, that the cost in lower performance and quality of results would be pretty bad.

A lot but not all of the words on this list have got to go. What do you think?

>>>>> 257 words >>>>>
a
about
after
ago
all
almost
along
alot
also
am
an
and
answer
any
anybody
anybodys
anywhere
are
arent
around
as
ask
askd
at
bad
be
because
been
before
being
besslers
best
better
between
big
btw
but
by
can
cant
come
could
couldnt
day
days
days
did
didnt
do
does
doesnt
dont
down
each
etc
either
else
even
ever
every
everybody
everybodys
everyone
far
find
for
found
from
get
go
going
gone
good
got
gotten
had
has
have
havent
having
her
here
hers
him
his
home
how
hows
href
I
i
Ive
if
in
ini
into
is
isnt
it
its
its
just
ken
know
large
less
like
liked
little
looking
look
looked
looking
lot
maybe
many
me
more
most
much
must
mustnt
my
near
need
never
new
news
no
none
not
nothing
now
of
off
often
old
on
one
once
only
oops
or
other
our
ours
out
over
page
please
put
question
questions
questioned
quote
rather
really
recent
said
same
saw
say
says
she
see
sees
should
sites
small
so
some
something
sometime
somewhere
soon
take
than
true
thank
that
thatd
thats
the
their
theirs
theres
theirs
them
then
there
these
they
theyll
theyd
theyre
this
those
though
through
thus
time
times
to
too
under
until
untrue
up
upon
use
users
version
very
via
want
was
way
we
weights
well
went
were
werent
what
when
where
which
who
whom
whose
why
wide
will
with
within
without
wont
world
worse
worst
would
wrote
www
yes
yet
you
youd
youll
your
youre
yours
AFAIK
IIRC
LOL
ROTF
ROTFLMAO
YMMV
>>>>>>>>>>>>>>>>>>>>>>>>
Last edited by scott on Sat Jul 31, 2010 4:22 am, edited 2 times in total.
Thanks for visiting BesslerWheel.com

"Liberty is the Mother, not the Daughter of Order."
- Pierre Proudhon, 1881

"To forbid us anything is to make us have a mind for it."
- Michel de Montaigne, 1559

"So easy it seemed, once found, which yet unfound most would have thought impossible!"
- John Milton, 1667

re: Stop words

Post by scott »

Here's a first attempt at a minimalist approach. Using this list would dramatically increase the database size and probably slow most searches to a crawl, but it's a good starting point from the other side of the equation.


>>>>> 32 words >>>>>
a
about
am
an
are
as
at
be
by
com
for
from
how
I
in
is
it
of
on
or
that
the
this
to
was
what
when
where
who
will
with
www
>>>>>>>>>>>>>>>>>
Last edited by scott on Sat Jul 31, 2010 7:59 pm, edited 1 time in total.
jim_mich
Addict
Posts: 7467
Joined: Sun Dec 07, 2003 12:02 am
Location: Michigan
Contact:

Post by jim_mich »

This will take a little contemplation.

It looks like the original list was compiled by the computer based upon how often a word was used; otherwise, why would 'Besslers' and 'weights' have been added?

PS. You have 'the' listed twice. [edit] You corrected/changed the list.


Post by scott »

You're right, Jim, thanks for reminding me. I did edit this file long ago to improve search results. I definitely added a few words like "besslers" and "weights" back then. Notice it's just the plurals, though. I'm not exactly sure at this point, but I vaguely remember thinking that it improved search performance and quality of results. Looks like I made a mistake.
Last edited by scott on Sat Jul 31, 2010 7:58 pm, edited 1 time in total.
path_finder
Addict
Posts: 2372
Joined: Wed Dec 10, 2008 9:32 am
Location: Paris (France)

re: Stop words

Post by path_finder »

Dear Scott,
IMHO there is no major inconvenience in ignoring the words on your list, because almost all of them are just universal words with no particular significance, which should not be present in any pertinent search criteria.

I take the opportunity here to thank you again for your work and for the way we have here to share our ideas.
I cannot imagine why nobody thought of this before, including myself. It is so simple!

Post by scott »

Thanks path_finder.

Here's another common stop word list that's much more aggressive with 429 words: http://www.lextek.com/manuals/onix/stopwords1.html

I'm pretty sure we don't want to go this direction though.

Post by scott »

I tracked down the original phpbb2 stopwords list and compared it to mine. I only added the following 6 words:

besslers
i
ken
one
same
weights
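
For anyone who wants to repeat this kind of comparison, a set difference does the job. This is only a sketch: the two short lists are made-up excerpts, `load_words` is a hypothetical helper, and in practice each string would be read from the two stop word files. The comparison is deliberately case-sensitive so that "I" and "i" count as different entries.

```python
def load_words(text):
    """Return the set of words, one per line, ignoring blank lines.

    Case is preserved on purpose so that "I" and "i" stay distinct.
    """
    return {line.strip() for line in text.splitlines() if line.strip()}

# Made-up excerpts standing in for the two stop word files.
original_text = """a
about
I
"""
current_text = """a
about
besslers
I
i
ken
weights
"""

original = load_words(original_text)
current = load_words(current_text)

added = sorted(current - original)    # words only in the edited list
removed = sorted(original - current)  # words only in the original list

print(added)    # ['besslers', 'i', 'ken', 'weights']
print(removed)  # []
```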

Post by scott »

OK I have changed the stop word list to the minimalist one above, just 32 words, and rebuilt the search index. Let me know how it goes!

There are apparently still issues with searching for "climbing back up" for reasons I don't understand, perhaps because one of the words is so short. A search for "climbing back" gives good results along with all variations of "equivalent effective weight."
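
If the short word really is the cause, the effect would look something like the sketch below. This is a guess, not a description of the actual indexer: `MIN_WORD_LEN` and `indexable` are hypothetical names, and the threshold of 3 is an assumption chosen only to show the mechanism.

```python
MIN_WORD_LEN = 3  # hypothetical threshold, not taken from phpbb2

def indexable(words, stop_words, min_len=MIN_WORD_LEN):
    """Keep only words that are long enough and not on the stop list."""
    return [w for w in words if len(w) >= min_len and w not in stop_words]

phrase = "climbing back up".split()
kept = indexable(phrase, stop_words=set())

# "up" has only two letters, so under this assumption it never reaches the
# index and the full three-word phrase cannot be matched.
print(kept)  # ['climbing', 'back']
```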

Jim, you'll be happy to know that a search for pulley question now returns 6023 matches for ANY terms, 73 matches for ALL terms, and 19 matches for the quoted string "pulley question", including this post. Big improvement!
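
The three search modes narrow the results in that order for a simple reason, which the sketch below tries to show. It is an illustration only, not the forum's search code: the function names and the three sample posts are made up.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def match_any(query, text):
    """True if the text contains at least one of the query words."""
    return bool(set(tokenize(query)) & set(tokenize(text)))

def match_all(query, text):
    """True if the text contains every query word, in any order."""
    return set(tokenize(query)) <= set(tokenize(text))

def match_phrase(query, text):
    """True if the query words appear in the text consecutively, in order."""
    q, t = tokenize(query), tokenize(text)
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))

# Made-up sample posts.
posts = [
    "A pulley is a simple machine",
    "My question is about the pulley",
    "Here is a pulley question for Jim",
]

query = "pulley question"
any_hits = [p for p in posts if match_any(query, p)]
all_hits = [p for p in posts if match_all(query, p)]
phrase_hits = [p for p in posts if match_phrase(query, p)]

print(len(any_hits), len(all_hits), len(phrase_hits))  # 3 2 1
```

Every phrase match is also an ALL match, and every ALL match is also an ANY match, so the counts can only shrink from one mode to the next.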

There will probably be some side effects, but hopefully the search feature will now work better than it did before in most cases, and without a noticeable performance hit as far as I can tell. If you find problems, please let me know.

Thanks,
Scott
Stewart
Devotee
Posts: 1352
Joined: Wed Nov 05, 2003 11:04 am
Location: Devon, England

Post by Stewart »

Great job Scott - many thanks!

Stewart
danieljones2006
Dabbler
Posts: 2
Joined: Sun Aug 29, 2010 4:13 pm
Location: suite no.-20, apartment-5, Near W21st street,Zenia,California-95595

Post by danieljones2006 »

Hi,
Good job. I really appreciate it. Thanks.