User:Sudeepam: Difference between revisions

13,023 bytes removed ,  21 December 2018
 
(3 intermediate revisions by the same user not shown)
Line 15: Line 15:


:- I have been coding since my 6th standard and as such, I have developed a working knowledge of how programming problems should be approached.
:- I have been coding since my 6th standard and as such, I have developed a working knowledge of how programming problems should be approached.
:- My areas of interest are Machine Learning, Digital Signal Processing, and Algorithms.
:- My areas of interest are Machine Learning, Data Structures, and Algorithms.
:- I learn Machine Learning through online resources such as open research papers, blogs, and MOOCs since my college does not offer a course on this subject till the final year.
:- I learn Machine Learning through online resources such as open research papers, blogs, and MOOCs since my college does not offer a course on this subject till the final year.
:- Digital Signal Processing is one of the core subjects of my major. I have been learning it, along with related subjects such as 'Basics of Signals and Systems', for a year now.
:- I have been coding for many years now, and my knowledge of Data Structures and Algorithms is something that I have gained through those years of practice and repeated attempts to optimize my code.
:- I have been coding for many years now, and my knowledge of Algorithms is something that I have gained through those years of practice and repeated attempts to optimize my code.
:- I will be taking a formal course on Algorithms in my next semester.


*'''Why do you want to participate in the Google Summer of Code? What do you hope to gain by doing so?'''
*'''Why do you want to participate in the Google Summer of Code? What do you hope to gain by doing so?'''
Line 62: Line 60:
:'''MATLAB m-scripts:''' I am highly experienced with MATLAB m-scripts. A lot of my coursework assignments involve writing m-scripts. Those assignments have even asked me to write my own implementations of built-in MATLAB functions, and I have been using that experience to contribute to the Octave-Forge signal package. In addition to this, I also implement machine learning in Octave and have gained further experience with m-scripts by doing that. If selected for GSoC 2018, I will approach my project with m-scripts, so I consider this experience to be a big plus point.
:'''MATLAB m-scripts:''' I am highly experienced with MATLAB m-scripts. A lot of my coursework assignments involve writing m-scripts. Those assignments have even asked me to write my own implementations of built-in MATLAB functions, and I have been using that experience to contribute to the Octave-Forge signal package. In addition to this, I also implement machine learning in Octave and have gained further experience with m-scripts by doing that. If selected for GSoC 2018, I will approach my project with m-scripts, so I consider this experience to be a big plus point.


:'''C++:''' I am familiar with C++, but I have never made a formal project using it and I haven't been using it lately. The language will not be new to me and I can quickly revise it if required. I was taught C and C++ in my first year, in a course called "Software Development Fundamentals". At the time I decided to use C and not C++ for my project. All I can currently say about my C++ experience is that I passed that subject with an 'A' grade.
:'''C++:''' I am quite familiar with the C++ programming language and have used it for personal projects and problem solving.


:'''OpenGL and Qt:''' I have never used OpenGL before. I have never used Qt either but I've seen some Qt code of my friends.
:'''OpenGL and Qt:''' I have never used OpenGL before. I have never used Qt either but I've seen some Qt code of my friends.


*'''Please describe your experience with other programming languages.'''
*'''Please describe your experience with other programming languages.'''
:'''Python:''' I am proficient in Python. I use it as my go-to language for various tasks, mostly Machine Learning.


:'''C:''' I am comfortable with C and have made a number of projects using it.
:'''C:''' I am comfortable with C and have made a number of projects using it.


:'''Java:''' I have 4 years of experience with Java, but I haven't used it for making projects. I use this language primarily for problem-solving (competitive programming) questions.
:'''Java:''' I have 4 years of experience with Java, but I haven't used it for making projects. I use this language primarily for problem-solving (competitive programming) questions.
:'''Python:''' I am familiar with the language but I am still learning to use it effectively. I am learning it so that I can use it for Machine Learning problems in the future if required.


*'''Please describe your experience with being in a development team. ''Do you have experience working with open source or free projects?'' '''
*'''Please describe your experience with being in a development team. ''Do you have experience working with open source or free projects?'' '''
Line 168: Line 166:
* '''Did you select a task from our list of proposals and ideas? If yes, what task did you choose? Please describe what part of it you especially want to focus on if you can already provide this information.'''
* '''Did you select a task from our list of proposals and ideas? If yes, what task did you choose? Please describe what part of it you especially want to focus on if you can already provide this information.'''


::Yes, I have decided to work on the '''command line suggestion feature [https://savannah.gnu.org/bugs/?46881]'''. This feature is essentially a complex decision-making problem, and I will therefore approach it using Artificial Neural Networks built in Octave (m-scripts) itself.
::Yes, I have decided to work on the '''command line suggestion feature [https://savannah.gnu.org/bugs/?46881]'''.
 
::''My special focus would be to have a minimal trade-off between the accuracy and speed of the feature.'' Please see the additional 'Project Description' section at the end [https://wiki.octave.org/User:Sudeepam#Project_Description] for technical details, and consider reading it before going through the tentative timeline given below. I apologize for adding this extra part, but it describes some important technicalities of the project that I believe should be included.
 
*'''Please provide a rough estimated timeline for your work on the task.'''
 
:'''Preparations for the project (pre-community bonding)'''
::While this application is being reviewed, I have started working on an m-script which will be used to catch the most common typographic errors that users make. This list of errors could then be...
 
::-Uploaded to a secure server directly.
::-Stored as a text file, and we can ask the users to share this file with us.
 
:'''Community Bonding period:'''
 
::I will use the community bonding period to...
 
::-Persuade the community to use our data extraction script and help us collect training, cross-validation, and test data. This can be done by discussing the benefits of a command line suggestion feature and by sharing this[https://github.com/Sudeepam97/Did_You_Mean] small demonstrative implementation that I have created.
 
::-Ask the community to report issues with the m-scripts containing the demo implementation. I’ll move the demo implementation to Mercurial if required.
 
::-Discuss the project with the mentors in detail.
 
::-Discuss how we should receive the data generated by the users, implement the chosen approach, organize the data as it is received, and divide it into proper training, cross-validation, and test sets for the Neural Network.
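::The division into training, cross-validation, and test sets mentioned above could look roughly like the following. This is only a Python sketch for illustration (the project itself would use Octave m-scripts); the function name <code>split_dataset</code>, the (typed text, intended function) sample format, and the 60/20/20 split are my own assumptions, not decisions that have been made.

```python
import random


def split_dataset(samples, rng=None, train_pct=60, validation_pct=20):
    """Shuffle (typed, intended) pairs and cut them into three disjoint sets.

    The 60/20/20 percentages are illustrative defaults, not project decisions.
    Whatever is left after the training and cross-validation cuts becomes the
    test set.
    """
    rng = rng or random.Random()
    shuffled = list(samples)          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_validation = len(shuffled) * validation_pct // 100
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return training, validation, test
```

::Passing a seeded <code>random.Random</code> makes the split reproducible, which helps when comparing different network parameters on the same data.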
 
:'''Phase 1, May, 14 – June, 10 (4 weeks):'''
 
::'''Week 1 (May, 14 – May, 20):''' I would not be able to do a lot of work in this week as I have my final examinations during this time. I will take this week as an extension of the community bonding period and use it to collect issues, collect more data, and divide it into proper data-sets.
::'''Week 2 and Week 3 (May, 21 – June, 3):''' Most of the Neural Network code will be identical to my demo implementation, so I’ll start by fixing the bugs in the demo (some known issues can be found here: [https://github.com/Sudeepam97/Did_You_Mean/issues]) and by rewriting it to follow the Octave coding standards. I plan to keep collecting user data during these weeks as well, so I’ll leave the data-dependent network parameters, such as the number of hidden layers and the number of neurons per hidden layer, open for now. If all this work is completed ahead of time, I’ll move straight on to the next week’s work.
::'''Week 4 (June, 4 – June, 10):''' By now we will have sufficient data, including the data received from octave-online.net[http://lists.gnu.org/archive/html/octave-maintainers/2018-03/msg00248.html] and from approximately 6 weeks of the extraction script’s usage. I’ll take a final look at the data and start training the Neural Network with it. I will choose values of the data-dependent network parameters that keep the Neural Network fast while fitting its learning parameters (weights) to our data with a high level of accuracy. I will then measure and maximize the accuracy of the network on the cross-validation and test sets and see how well it generalizes to unknown typographic errors. I will also write some additional tests for the set of m-scripts used.
 
::'''Phase 1 evaluations goal:''' A set of working Neural Network m-scripts, which could suggest corrections for typographic errors.
 
:'''Phase 2, June, 11 – July, 8 (4 weeks):'''
 
::'''Week 5 (June, 11 – June, 17):''' I’d like to take this week to work in close connection with the community and perform tests on the newly created m-scripts. Essentially, I’ll be asking the community to try out our m-scripts and see how they work for them. I will work on the issues pointed out by the community and by the mentors as they are reported.
::'''Week 6 (June, 18 – June, 24):''' I’ll fix any remaining issues and proceed to discuss and understand how our Neural Network should be integrated with Octave. I’ll start working on the integration as soon as the approach is decided. It is worth mentioning here that we will merge a '''trained network''' with Octave, so the cost of training is never paid at run time and the feature should not slow Octave down.
::'''Week 7 – Week 8 (June, 25 – July, 8):''' I will integrate our neural network with Octave as discussed, and write and run tests to make sure that everything works the way it should. If this task is completed earlier than expected, I’ll move straight on to the next task.
 
::'''Phase 2 evaluations goal:''' A development version of Octave which has a command line suggestion feature (at this stage there will be no mechanism to easily select the suggested corrections or to easily enable/disable the feature).
 
:'''Phase 3, July, 9 – August, 5 (4 weeks):'''
 
::'''Week 9 (July, 9 – July, 15):''' The development version of Octave, with an inbuilt suggestion feature will be open for error reports. I’ll work on the issues as they are reported and also discuss what an easy enable/disable mechanism and the mechanism to easily select the corrections suggested should be like.
::'''Week 10 (July, 16 – July, 22):''' I’ll create the required mechanisms as discussed, write and perform tests, and push a development version with a complete command line suggestion feature.
::'''Week 11 – Week 12 (July, 23 – August, 5):''' I’ll work in close connection with the community, fix the issues that are reported, and ask for further suggestions on how the command line suggestion feature could be made better.
 
::'''Phase 3 evaluations goal:''' A development version of Octave with a complete and working command line suggestion feature, open to feedback and criticism.
 
:'''Last days of GSoC:'''
::During the last days of GSoC, I’ll try to improve the command line suggestion feature, based on the feedback received.
 
== Project Description ==
 
My special focus is to have ''a minimal trade-off between the speed and accuracy of the feature''. Before talking about that, let me first describe the three kinds of Neural Networks (depending on the training data available) that we could end up making.
 
:'''1) A network trained with only the correct spellings of the inbuilt functions'''
This type of network would be very easy to make because it requires only a list of all the existing functions of GNU Octave and no additional data. With this approach, we would end up with a Neural Network that easily recognizes typographic errors caused by '''letter substitutions''' and '''transpositions of adjacent letters.''' In fact, this network would handle multiple substitutions and transpositions, not just single ones. I am confident about this because I have already made a working neural network of this type [https://github.com/Sudeepam97/Did_You_Mean]. This network would, however, perform poorly if an error is caused by '''accidental inclusion''' or '''accidental deletion of letters.'''
 
:'''2) A network trained with the correct spellings of the functions and self created errors'''
This would be slightly harder to make but should give improved performance. I will '''create some misspellings''' for all the functions by inserting, deleting, substituting, and transposing one or two letters, and then add all these self-created misspellings to the data-set used to train the network. Such a network would learn what '''correct spellings and random typographic errors''' look like. It would handle substitutions and transpositions as easily as the previous network and should also be more accurate on errors caused by additions/deletions. However, it is worth mentioning here that ''we may create errors while creating errors.'' Because the training data for this network will be modified randomly, the Neural Network may, in rare cases, behave unpredictably.
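A rough Python sketch of how such self-created misspellings could be generated is given here for illustration only (the project itself would use Octave m-scripts). The function name <code>make_misspellings</code>, its parameters, and the alphabet are my own assumptions. To reduce the risk of "creating errors while creating errors", the sketch drops any generated string that happens to be another valid function name.

```python
import random
import string


def make_misspellings(name, known_names, count, rng=None,
                      alphabet=string.ascii_lowercase + "_"):
    """Return up to `count` distinct misspellings of `name`.

    Each misspelling is produced by ONE random edit: inserting a letter,
    deleting a letter, substituting a letter, or swapping two adjacent
    letters. Strings that are themselves valid names are discarded.
    """
    rng = rng or random.Random()
    found = set()
    attempts = 0
    while len(found) < count and attempts < 100 * count:
        attempts += 1
        chars = list(name)
        op = rng.choice(("insert", "delete", "substitute", "transpose"))
        if op == "insert":
            chars.insert(rng.randrange(len(chars) + 1), rng.choice(alphabet))
        elif op == "delete" and len(chars) > 1:
            del chars[rng.randrange(len(chars))]
        elif op == "substitute" and chars:
            chars[rng.randrange(len(chars))] = rng.choice(alphabet)
        elif op == "transpose" and len(chars) > 1:
            pos = rng.randrange(len(chars) - 1)
            chars[pos], chars[pos + 1] = chars[pos + 1], chars[pos]
        candidate = "".join(chars)
        # Skip no-op edits and collisions with real function names.
        if candidate != name and candidate not in known_names:
            found.add(candidate)
    return sorted(found)
```

Applying the function a second time to its own output would give misspellings with two edits.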
 
:'''3) A network trained with the correct spellings of the functions and the most common typographic errors'''
To make this kind of Neural Network, we need to know what common typographic errors look like. With that goal in mind, I have already contacted the people behind octave-online.net [https://octave-online.net/], who say that they are happy to support the development of GNU Octave and, as of now (25th March), have shared a list of the top 1000 misspellings with me through email. However, the users of octave-online.net are only one part of the entire user base. '''For best results''', we would require the involvement of the entire Octave community, which also implies that this will be the hardest and the most fun Neural Network to make.
 
By creating a script that catches typographic errors, asking the users of GNU Octave to use this script and share their most common spelling errors with us, and training the network on the data-set thus created, we’ll create a Neural Network which understands what '''correct spellings and the most common typographic errors''' look like. Such a network would give good results almost every time and with all kinds of errors. This is because when our network knows what common errors look like, most of the time it would '''know the answer''' beforehand. For the remaining cases, the network would be able to '''predict the correct answer'''.
 
At a later stage (possibly after GSoC), I could merge the data extraction script with Octave so that the performance of the Network could be improved with time. This could come with an easy disable feature, so that only the users who would like to share their spelling errors would do so.
 
I understand that using Neural Networks may seem like overkill and that one could think about using traditional data structures like tries, or algorithms like 'edit distance', which are made for exactly these kinds of problems.
 
However, edit distance, while accurate, would be the slowest of the three approaches because we would essentially need to calculate the edit distance between the input and '''all the functions''' of Octave. Tries, though fast, would not be able to generalize to unknown typographic errors. Neural networks, when trained with proper data, would be highly accurate and would generalize to unknown typographic errors, and because '''a 'trained' Neural Network''' will ultimately be merged with Octave, this approach will be fast as well.
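To make the cost of the edit distance approach concrete, here is a small Python sketch (for illustration only; it is not part of the proposed implementation). It uses the optimal string alignment variant of edit distance, which counts an adjacent swap as one edit. The helper names <code>edit_distance</code> and <code>closest_function</code> are hypothetical.

```python
def edit_distance(a, b):
    """Optimal-string-alignment distance between two strings.

    Allowed edits, each costing 1: insertion, deletion, substitution,
    and swapping two adjacent characters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


def closest_function(typed, function_names):
    """Compare the input against EVERY known name and return the nearest."""
    return min(function_names, key=lambda name: edit_distance(typed, name))
```

Each call to <code>closest_function</code> fills one table per known function, so the work grows with the number of functions times the product of the string lengths. That full scan is the cost referred to above.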
 
Another disadvantage of a trie that I'd like to mention is that if, say, we are unable to assemble a sufficiently large list of common spelling errors, or if an error is made while typing the first few characters of the function, a trie would fail badly. A neural network, even in that case, would easily identify letter substitutions and transpositions of adjacent letters.
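The following Python sketch shows the trie weakness described above. It is a plain prefix trie written only for illustration; the class name <code>Trie</code> and its methods are hypothetical, and fuzzy-search extensions of tries are deliberately not shown.

```python
class Trie:
    """A plain prefix trie: each node maps one character to a child node."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def completions(self, prefix):
        """Return every stored word that starts with `prefix`, sorted."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []            # the walk dies at the first wrong letter
        results = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(results)
```

With "plot", "plus", and "printf" stored, the prefix "pl" finds both "plot" and "plus", but the mistyped prefix "lp" finds nothing, because the lookup fails on the very first character.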
 
This is why, after due consideration as described above, '''neural networks look to me like the best solution to minimize the trade-off between speed and accuracy of the feature''', and this is the reason I have chosen to use them.
 
Also, when using Neural Networks, we '''have the option''' of not fixing a definition of 'close', because the neural network, through its sigmoid activation, will find the most probable match on its own.
 
However, roughly 1 time in 100, a neural network could make a very ambiguous prediction, so for an even better user experience we could add a 'control' to make sure that a very ambiguous output of the Neural Network is not shown to the user. A simple control could be to require a 'close' edit distance between the output of the neural network and the misspelled input that the user gave. Again, if this extra feature is added for better UX, 'close' will have to be defined.
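Such a control could be sketched in Python as follows (for illustration only). The function names and the threshold <code>max_distance=2</code> are assumptions of mine; the actual definition of 'close' is exactly what would still have to be decided.

```python
def levenshtein(a, b):
    """Classic edit distance (insert, delete, substitute), two-row version."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution or match
        prev = cur
    return prev[-1]


def accept_suggestion(typed, suggestion, max_distance=2):
    """Show the network's suggestion only if it is 'close' to the input.

    `max_distance` is a placeholder value for the undecided threshold.
    """
    return levenshtein(typed, suggestion) <= max_distance
```

Unlike the full scan over all functions, this check computes a single edit distance per suggestion, so it adds very little to the running time.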