N-Gram Analysis

It’s been two weeks since I last posted anything on my blog. Not because nothing interesting happened in those two weeks, but because they were taken up by my research project proposal, a summer school, and taking part in Hackathon4Nation.

Now it’s Sunday night, and both Serie A and the Premier League are off. I got bored and didn’t want to leave my brain idle, so I’m doing some research on N-Gram analysis. Why am I researching this? Because it relates to my research project on comparing text. I wrote the code in Java, did it with TDD (YEAYYYYY), and have tests for it. You can check it out on GitHub and see how to use it in the tests.

OK, first #letmeexplainyou N-Gram. According to Wikipedia, an n-gram is a contiguous sequence of n items from a given sequence of text. So basically you slide over the sequence, group it into chunks of n items, and can then use each chunk as a key in a dictionary (for example, to count how often it occurs). It is used in many different things such as text analysis, speech recognition, DNA analysis, spelling correction, etc. For example, if you are familiar with DNA sequencing:

You have the sequence: AGCTTCGA
You can map it into n-grams where n is 2: AG, GC, CT, TT, TC, CG, GA
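The DNA example above can be sketched in a few lines of Java. This is only a minimal illustration, not the code from my GitHub repository; the class name CharNGrams and the method name of are made up for this post.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for this post: character-level n-grams of a sequence.
class CharNGrams {

    // Slide a window of length n over the sequence, one position at a time.
    static List<String> of(String sequence, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= sequence.length(); i++) {
            grams.add(sequence.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // prints [AG, GC, CT, TT, TC, CG, GA]
        System.out.println(of("AGCTTCGA", 2));
    }
}
```

If the sequence is shorter than n, the loop never runs and the result is simply an empty list.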

Another common example is counting the number of words (n-grams where n is 1):

You have the sentence: number of people going to get number of queue is too high.
You can map it into: number = 2, of = 2, people = 1, going = 1, to = 1, get = 1, queue = 1, is = 1, too = 1, and high = 1
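Here is a rough Java sketch of that word count. It assumes words are separated by anything that is not a letter or digit and that upper and lower case should be treated the same; both are my assumptions for this example, and WordCounts is a made-up class name.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for this post: counts how often each word occurs.
class WordCounts {

    static Map<String, Integer> count(String sentence) {
        // LinkedHashMap keeps the words in order of first appearance.
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Assumption: lowercase everything, split on non-alphanumeric characters.
        for (String word : sentence.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (word.isEmpty()) {
                continue; // happens when the sentence starts with punctuation
            }
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {number=2, of=2, people=1, going=1, to=1, get=1, queue=1, is=1, too=1, high=1}
        System.out.println(count("number of people going to get number of queue is too high."));
    }
}
```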

There are also libraries that support N-Gram analysis, like Apache Lucene, which provides NGramTokenizer. There is also an interesting read about N-Gram analysis on the Google Research Blog. One thing to remember when doing N-Gram analysis: the sentences you want to analyze must be clean. So the first thing we need to do is preprocessing, such as stemming the words using our corpus dictionary.
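To give an idea of what that cleaning step could look like, here is a small Java sketch. The STEMS map is a tiny hypothetical stand-in for a real corpus dictionary (its entries are invented for this example), and a plain dictionary lookup is a simplification, not a real stemming algorithm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for this post: cleans a sentence before n-gram analysis.
class Preprocess {

    // Made-up stand-in for a corpus dictionary: inflected form -> base form.
    static final Map<String, String> STEMS = Map.of(
            "going", "go",
            "queues", "queue",
            "words", "word");

    static List<String> clean(String sentence) {
        List<String> tokens = new ArrayList<>();
        // Lowercase, then split on anything that is not a letter or digit.
        for (String raw : sentence.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            // Replace the word with its base form when the dictionary knows it.
            tokens.add(STEMS.getOrDefault(raw, raw));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [go, to, get, queue]
        System.out.println(clean("Going to get Queues!"));
    }
}
```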

So basically the algorithm that I used is:

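list_of_words = split sentence by space
n = 1    (size of each gram; 1 gives single words)
ngrams = empty list
for i = 0 to length(list_of_words) - n
    gram = empty string
    for j = 0 to n-1
        gram = gram + list_of_words[i + j]
        if j is not n-1
            gram = gram + space
    end for
    add gram to ngrams
end for
return ngrams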
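The algorithm above could look roughly like this in Java. It is a sketch written for this post rather than the exact code in my repository, so the class name WordNGrams and the choice to split on any run of whitespace are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for this post: word-level n-grams of a sentence.
class WordNGrams {

    static List<String> of(String sentence, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> grams = new ArrayList<>();
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return grams;
        }
        // Assumption: words are separated by one or more whitespace characters.
        String[] words = trimmed.split("\\s+");
        // Each window starts at word i and covers the next n words.
        for (int i = 0; i + n <= words.length; i++) {
            StringBuilder gram = new StringBuilder();
            for (int j = 0; j < n; j++) {
                if (j > 0) {
                    gram.append(' ');
                }
                gram.append(words[i + j]);
            }
            grams.add(gram.toString());
        }
        return grams;
    }

    public static void main(String[] args) {
        // prints [number of, of people, people going]
        System.out.println(of("number of people going", 2));
    }
}
```

With n = 1 this just returns the individual words, which is the input for the word count example.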

 