Progress: Refine Objectives and Data
- Summary
- What I did
Data specification needed
Problems
- Time-data aspect: Inherent relationship
- Time polling: Disparate data qualities
Solutions
AI-enhanced Value Investing
Objective week_3: Run Matlab Neural Networks
- Specific Goal
- Hypothesis

Progress: Refine Objectives and Data

Summary

I put thought into what kind of data we are examining. I classified the data, gave descriptions of it, compared different types, and brainstormed about how to solve particular problems with these differences. I commented on thoughts into choosing correct outputs, also.

Included in this weeks detailed notes are many ideas and thoughts into how things could be done better. Some of these should eventually be tested, one by one. As you said, precise and small experiments are good methodology. The thougts and ideas of the past 2 weeks, are on top of a ton of other ideas already existing prior. Looking into the future, work needs to be done; specific ideas need to be selected and tested. All of this needs to happen before working on some grand-scheme “genetic algorithm to optimize the topology of the base Neural Net” (which is what I was formerly trying to do in programming VBA stuff). The important part of our work is the Neural Net.

What I did

I focused on rethinking and refining our process, hypothesis, and therefore entire system. I focused on the data requirements, data structure, and data relationships. In particular, I tried to answer how we may handle time-series data.

Among that, I did a number of other things as asides, as listed below:

Handwritten Brainstorm notes that I typed up and organized
Read large excerpts of book “Time Series Prediction: Forecasting the future and understanding the past” by Weigend and Gershenfeld
- this gave insight into pre-implementation best-practices and background knowledge
Read Spanish Markets paper “A hybrid approach based on neural networks and genetic algorithms to the study of profitability in the Spanish Stock Market” by Mariano Matilla-Garcı´a* and Carlos Argu¨ello
- Utilized Evolutionary Algorithm for optimizing the topology of an underlying Neural Network solution
Read current financial news
- investigated its meaning and possible relation to our project
Matlab learning
- studied book
- attended CIRC Matlab seminar
- ran example commands and code

Data specification needed

Roy stated, and I agreed, to think about our data’s relationships, inherent structure (time and company to uniquely identify), organization / portrayal, and how the system might accept future, new, or additional data.

Problems

Time-data aspect: Inherent relationship

Of most immediate importance is the concept of relating data to each instances time aspect... aka when the data was “polled”. Most quantitative financial data has an associated time component with it.

Just like how a Relational Database has to keep track of what attributes determine unique instances of other attributes, the ‘time’ and ‘company/ticker’ attributes may have to depict that primary-key relationship in our case.

If the Neural Net is not aware of this relationship in some way, it may go about randomly associating different data pieces with each other. This may take the network longer, or lead it astray because the solution lanscape is so various when associating random data bits that are not lineated.

Example: Related

time	company	market price	volume	assets	liabilities
8/11	aapl	120	1.4
8/12	aapl	120	1.4
8/13	aapl	122	1.8
8/14	aapl	127	2.0	1,235,841	347,000
8/15	aapl	135	2.4
8/16	aapl	128	1.7

Note: the single day that inclues their assets and liabilities, is the day that Apple released its quarterly report. Fundamentals are on a different polling interval. See below for details.

Example: Not Related

FIXME not correct. The network is being fed input-output pairs for different training instances.

data
8/11
8/12
8/13
8/14
8/15
8/16
aapl
aapl
aapl
aapl
aapl
aapl
120
120
122
127
135
128
1,235,841
347,000

Note: I believe that this is what we were formerly doing.

Time polling: Disparate data qualities

An interesting issue is the vast divide in time-polling intervals for technical (market price) Vs fundamental (balance sheet, etc.)

Technical data is near real time, almost every second you can have a data instance. There also tends to be less number of data attributes.
- Example:
  1. Market Price
  2. Volume
- The numbers are more exact or more-so quantitative. The data can be easily compared on an international standard. There is less room for interpretation??
Fundamental is very sporadic. You get quarterly updates from the actual company. There are many data attributes.
- Example:
  1. cost of goods sold
  2. revenue
  3. taxes
  4. expenditures
  5. assets
  6. owners equity
  7. liabilities
  8. etc., etc., etc.
- The data could have been determined by differing accounting standards or moral decisions. More room for interpretations??

Solutions

What is needed?

If you want to find relationships between data of such differing qualities, it may be beneficial to focus on how a system might associate, compare, relate, and link data on disparate scales.

Future data additions

There are other possible future data types with differing qualities. A solution must try to be broad, so as to possibly be able to include future data types.

Examples:
1. News (parsed and evaluated) (good, neutral, bad)
2. Macro Economic (interest rates, approximate model of system)
3. Sector Specific (other companies, industry knowledge)

Possible Solutions: How to handle time-data relationship

At the very least, it is worth considering that the network be weighted towards a connection between the time and other attributes of the inputs. After being weighted, the system can feel free to associate time instance data to other data instances than its related one.

Examples of inherent data relationships are:

Time polling data → most data attributes
company → most data attributes
~~sector/industry → company~~
- Note: this may be considered to be a calculated or derivative data value. It can easily be looked up. May not be a good example???
Goal Attribute(s) (market value, company, when to buy/sell) → each other

Neural network output-node redefinition

As I made this list, I noticed that in a neural network, it may benefit to associate inherent data relationships. These same relationships also happen to be mysteriously close to what we want as output, or our goal data. So, I added “goal attribute”, which for us has been a future market value.

So, maybe our output nodes should be more specific:

Main goal attribute (usually market value, but not always see note below)
Time wanted (usually in future)
Company of interest (optional)
- You may or may not know which company to focus on beforehand. If you don’t know which company, a.k.a. you didn’t choose companies beforehand to study a.k.a. you may have many companies in the data set, your goal may be to maximize differential market prices. This would forecast best companies of a bunch.

Possible Solutions: How to handle disparate data qualities

Specifically, time-polling data quality differences mentioned above.

Average any missing data

Linearly average any missing values between two time-linked data instances. Remember, with fundamental data you can find (many) dates that data does not exist. See the Example: Related table above for an example. If you average any ‘missing’ data, you can then compare technical and fundamental data on a lower common time interval (real time).

AI-guess any missing data

You could come up with hypothetical AI-based predictions for any non-existent data. For example, instead of linear average from one interval to the next, you could be smarter by using a data pre-processor AI to guess at an up and down during the unknown intervals.

Complete data only / delete incomplete data instances

This is a dumbed down approach where we get rid of all time-intervals where there is incomplete data. For example, you would ignore most market data on a day to day basis and only take market data on days where this is quarterly earnings (fundamental) data.

Deletion with average

Modification of the idea above. You would average or modify the few leftover (not-deleted), more-numerous data types to incorporate the “thrown out” data from the prior intervals. For example, market data on a given day may not represent that the stock averaged much higher over the period of deletion. We could average the not-deleted value toward the entire deleted time interval. Instead of a data value being a snapshot in time, it might be a conglomeration of data over a now-missing time period.

AI-enhanced Value Investing

Associations of past data, in order to predict future movements, is the main premise behind our current project & lines of thought. A large variation has been bouncing in my head for a while.

Value Investing Graham, Dodd, Buffet. Use AI, computing systems, and modern algorithms to enhance and evaluate value positions and decisions. We could use Neural Networks to find associations in past data that netted strong value positions and outcomes. Maybe there are hard to determine patterns.

See AI enhanced value-Investing for more.

Objective week_3: Run Matlab Neural Networks

I do not know how to test the above thoughts and ideas out by hand or as thought experiments. I need quantifiable data and processes to even guess at what the outcome of any of these may be.

What I could do, is run neural nets in Matlab manually, as Matt had been doing for some time. I only did it once. This would likely give me necessary insight into his process and a possible understanding of initial results. I could manually run a handful of Neural Nets with a specific hypothesis or ideas in mind. Roy, it will not allow you to follow the coding details as clearly, because you do not have Matlab knowledge, but I think it is the best way for me to get some results and gain necessary background knowledge.

Specific Goal

Run a Neural Network first with Matt’s original experiment and mark any different or similar results (NNs can provide different results on different runs or setups). My goal is to replicate Matt’s original ‘published’ results, which were apparently positive forecasting abilities.

Then I want to modify the networks topology, specifically the number of hidden layers, and see how these changes in topology change results on the same data. If I have the time or inclination, I could try the same thing but changing the training/learning/initialization algorithms. Matlab supplies many different pre-written functions (with source) for each of the components of a Neural Net run. You can switch any of these in and out, making it very modular. I would like to test that this modular ease does in fact exist, and then see how changing these topology variables effects the results.

Hypothesis

Increasing the number of layers in the neural net of Matt’s experiment will increase the returns (or accuracy) of Matt’s original results. (This entails first a reproduction of Matt’s conditions on Matlab to produce the same results that Matt had).